Commit 3c887e1

committed
responses to most questions, as requested
1 parent 97b7959 commit 3c887e1

9 files changed

+112
-70
lines changed

doc/method/discussion.md

Lines changed: 36 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -10,43 +10,58 @@ include errors such as rank inconsistencies, duplications, spelling mistakes,
1010
and misplaced taxa.
1111

1212
Ultimately there is no fully automated and foolproof test to determine
13-
whether two nodes can be aligned - whether node A and node B are about
13+
whether two nodes can be aligned - whether node A and node B,
14+
from different source taxonomies,
15+
are about
1416
the same taxon. The information to do this is in the
1517
literature and in databases on the Internet, but often it is
1618
(understandably) missing from the source taxonomies.
1719

18-
It is not feasible to curate such problem individually, so the taxonomy
19-
synthesis methods identify and handle thousands of 'special cases' in an
20-
automated way. We currently use only the name-strings (and ranks, in some of the
21-
heuristics) to guide synthesis. Using other information contained in the source
22-
taxonomies (such as authority or nomenclatural information) could be possible in
23-
the future.
20+
It is not feasible to curate such problems individually, so the
21+
taxonomy synthesis methods identify and handle thousands of 'special
22+
cases' in an automated way. We currently use only the name-strings
23+
(and ranks, in some of the heuristics) to guide synthesis. Using other
24+
information contained in the source taxonomies, such as the structure
25+
of species names, authority strings, and other nomenclatural information,
26+
could be very helpful.
2427

2528
* [JAR: something about how messy the inputs are]
2629

2730
## Open Tree Taxonomy as a taxonomy
28-
We have developed the Open Tree Taxonomy (OTT) for the very specific purpose of aligning and synthesizing phylogenetic trees. We do not intend it to be a reference for nomenclature, or to fill the role of expert-curated taxonomic databases. Several features of OTT make it not suitable for taxonomic and nomenclatural purposes. It contains many names that are either not valid or not currently accepted. Some of these come from DNA sequencing via NCBI, which is also not a taxonomic reference, while others come directly from phylogenies submitted to OpenTree curators via our taxonomy curation features. OTT also contains more homonyms as compared to its sources. Many of these duplicated names are artifacts of the synthesis heuristics. For our purposes, these are not of great concern - when mapping OTUs in trees to taxa in OTT, we generally restrict mapping to a specific taxonomic context, and if there are multiple matches to OTT taxa with the same name, a curator can clearly see this situation and choose the taxa with the correct lineage. [JAR: anything else here?]
31+
32+
We have developed the Open Tree Taxonomy (OTT) for the very specific
33+
purpose of aligning and synthesizing phylogenetic trees. We do not
34+
intend it to be a reference for nomenclature, or to substitute for
35+
expert-curated taxonomic databases. Several features of OTT make it
36+
unsuitable for taxonomic and nomenclatural purposes. It contains
37+
many names that are either not valid or not currently accepted. Some
38+
of these come from DNA sequencing via NCBI Taxonomy, which is also not a
39+
taxonomic reference, while others come directly from phylogenies
40+
submitted to Open Tree curators via our taxonomy curation features. OTT
41+
also contains more homonyms than its sources. Many of these
42+
duplicated names are artifacts of the synthesis heuristics. For our
43+
purposes, these are not of great concern - when mapping OTUs in trees
44+
to taxa in OTT, we generally restrict mapping to a specific taxonomic
45+
context, and if there are multiple matches to OTT taxa with the same
46+
name, a curator can clearly see this situation and choose the taxon
47+
with the correct lineage.
2948

3049
## Allowing community curation
50+
3151
We have also developed a system for curators to directly add new taxon records to the
32-
taxonomy from published phylogenies. These taxon additions include provenance
33-
information, including evidence for the taxon and identity of the curator. We
34-
expose this provenance information through the website and the taxonomy API.
52+
taxonomy from published phylogenies, which often contain newly
53+
described species that are not yet present in any source taxonomy.
54+
These taxon additions include provenance
55+
information, including evidence for the taxon and the identity of the curator. We
56+
expose this provenance information through the web site and the taxonomy API.
3557
Most of the feedback on the synthetic tree of life has been about taxonomy, and expanding this
3658
feature to other types of taxonomic information allows users to directly
3759
contribute expertise and allows projects to easily share that information.
38-
39-
## Extinct/extant annotations
40-
41-
One important taxon annotation is 'extinct'. The OTT backbone
42-
is essentially the NCBI Taxonomy, which records very few extinct taxa, so when such taxa are found in other taxonomies, there are often no higher
43-
taxa to put them into [JAR: figure out a better way to explain this]. This had a very negative impact on the subsequent phylogeny synthesis, with most extinct taxa badly placed in the tree - for example, there were many extinct genera and families that appeared as direct children of Mammalia.
44-
We therefore removing from synthesis those taxa in the
45-
taxonomy that are annotated as extinct, leading to a cleaner synthesis.
46-
47-
Incorporating extinct taxa into OTT would be better accomplished by adding a source that explicitly focuses on extinct taxa.
60+
[KC: not sure what you mean by "expanding this feature" - what
61+
feature? how?]
4862

4963
## Comparison to other taxonomies
64+
5065
Given the very different goals of Open Tree Taxonomy in comparison to most other taxonomy projects, it is difficult to compare OTT to other taxonomies in a meaningful way. The Open Tree Taxonomy is most similar to the GBIF taxonomy, in the sense that
5166
both are a synthesis of existing taxonomies rather than a curated taxonomy. The
5267
GBIF method is yet unpublished. Once the GBIF method has been formally

doc/method/introduction.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ taxon as a given name-string occurrence in another. Solving
2424
this equivalence problem requires detecting equivalence when the
2525
name-strings are different (synonym detection), as well as
2626
distinguishing occurrences that only coincidentally have the same
27-
name-string (homonym detection). We have developed a set of heuristics
27+
name-string (homonym sense detection). We have developed a set of heuristics
2828
that scalably address this equivalence problem.
2929

3030
## The Open Tree of Life project

doc/method/method-details.md

Lines changed: 37 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -12,20 +12,16 @@ are performed:
1212

1313
1. Child taxa of "containers" in the source taxonomy are made to be
1414
children of the container's parent. "Containers" are
15-
groupings in the tree that don't represent taxa, for example nodes
15+
groupings in the source that don't represent taxa, for example nodes
1616
named "incertae sedis" or "environmental samples". The members of
1717
a container aren't more closely related to one another than they
1818
are to the container's siblings; the container is only present as
1919
a way to say something about the members. The fact that a node
2020
had originally been in a container is recorded as a flag on the
2121
child node.
22-
1. Monotypic homonym removal - when a node with name N has as its
23-
only child another node with the same name N, the parent is removed.
24-
This is done to avoid ambiguities when aligning taxonomies.
25-
[JAR: get examples by rerunning]
2622
1. Diacritics removal - accents and umlauts are removed in order to improve
2723
name matching, as well as to follow the nomenclatural codes, which prohibit them.
28-
The original name-string is kept, as a synonym.
24+
The original name-string is kept as a synonym.
2925

3026
The normalized versions of the taxonomies then become the input to subsequent
3127
processing phases.
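
The two normalization steps that remain in the list above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the dictionary-based taxonomy representation and the names `flatten_containers`, `strip_diacritics`, and `normalize_name` are hypothetical, and the container name list holds only the two examples mentioned in the text.

```python
import unicodedata

# Hypothetical list; the text names only these two container labels as examples.
CONTAINER_NAMES = {"incertae sedis", "environmental samples"}


def flatten_containers(parent, name):
    """Re-parent children of container nodes onto the container's parent.

    parent: dict mapping node id -> parent node id (assumed representation)
    name:   dict mapping node id -> name-string
    Returns (new_parent, flagged), where flagged is the set of nodes that
    were originally inside a container.
    """
    containers = {n for n, label in name.items()
                  if label.lower() in CONTAINER_NAMES}
    new_parent = {}
    flagged = set()
    for node, p in parent.items():
        if node in containers:
            continue  # the container itself is dropped
        was_in_container = False
        while p in containers:  # climb past nested containers
            was_in_container = True
            p = parent.get(p)
        new_parent[node] = p
        if was_in_container:
            flagged.add(node)
    return new_parent, flagged


def strip_diacritics(text):
    """Remove combining marks (accents, umlauts) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(text):
    """Return (normalized name, synonyms); the original is kept as a synonym."""
    normalized = strip_diacritics(text)
    synonyms = [text] if normalized != text else []
    return normalized, synonyms
```

Note that letters without a canonical decomposition (for example "ø") pass through `strip_diacritics` unchanged, so a real implementation would likely need an explicit substitution table as well.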
@@ -76,7 +72,8 @@ be run as a script. Following are some examples of adjustments.
7672

7773
In the process of assembling the reference taxonomy, 284 ad hoc
7874
adjustments are made to the source taxonomies before they are
79-
aligned to the workspace. [JAR: check numbers for v3.0]
75+
aligned to the workspace.
76+
[JAR: check numbers when v3.0 is final: `python util/count_patches.py adjustments.py` ~= 289]
8077

8178
### Candidate identification
8279

@@ -256,7 +253,14 @@ not needed because it inconsistent or can be 'absorbed', and it is
256253
dropped. If it is not dropped, then there is a troublesome situation
257254
that calls for manual review.
258255

259-
[JAR: Need example here ...]
256+
For example, for GBIF _Katoella pulchra_, the candidates are NCBI
257+
_Davallodes pulchra_ and _Davallodes yunnanensis_. (There is no
258+
_Katoella pulchra_ in the workspace at the time of the alignment and
259+
the two candidates come from synonymies with _Katoella pulchra_
260+
declared by GBIF.)
261+
Neither candidate is preferable to the other, so
262+
_Katoella pulchra_ is left unaligned and
263+
is omitted from the assembly.
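
The outcome in this example, where a tie between candidates leaves the source node unaligned, might be sketched as below. The `score` callback stands in for whatever the heuristics compute and is purely hypothetical, as is the function name `resolve`.

```python
def resolve(candidates, score):
    """Return the unique best-scoring candidate, or None when ambiguous.

    candidates: list of workspace nodes that might match the source node
    score:      hypothetical function mapping a candidate to a comparable value
    """
    if not candidates:
        return None
    scores = [score(c) for c in candidates]
    best = max(scores)
    winners = [c for c, s in zip(candidates, scores) if s == best]
    # A single winner is aligned; a tie leaves the source node unaligned.
    return winners[0] if len(winners) == 1 else None
```

With the two _Davallodes_ candidates scoring equally, `resolve` returns `None`, corresponding to _Katoella pulchra_ being left unaligned.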
260264

261265

262266
## Merging unaligned source nodes into the workspace
@@ -318,6 +322,13 @@ and the new source, we retain the workspace.
318322
here is to add e to z and mark it as _incertae sedis_ (indicated above
319323
by the question mark).
320324

325+
For example, family Melyridae from GBIF has five genera, of which two
326+
(_Trichoceble_, _Danacaea_) are not found in the workspace,
327+
and the other three do not all have the same parent after alignment
328+
- they are in three different subfamilies. _Trichoceble_ and _Danacaea_
329+
are made to be _incertae sedis_ children of Melyridae, because
330+
there is no telling which NCBI subfamily they are supposed to go in.
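
A minimal sketch of this placement rule follows, under stated assumptions. The function name, the dictionary arguments, and the placeholder genus and subfamily names in the usage note are all hypothetical; the branch for aligned siblings that share a single workspace parent is an assumption about the simpler case, not something stated in this passage.

```python
def place_unaligned(children, alignment, workspace_parent, z_image):
    """Decide where the unaligned children of a source node should attach.

    children:         child names of source node z
    alignment:        dict source child -> aligned workspace node
                      (unaligned children are absent)
    workspace_parent: dict workspace node -> its parent in the workspace
    z_image:          workspace node that z itself is aligned to
    Returns a dict child -> (attachment point, incertae_sedis flag).
    """
    aligned = [alignment[c] for c in children if c in alignment]
    parents = {workspace_parent[w] for w in aligned}
    unaligned = [c for c in children if c not in alignment]
    if len(parents) == 1:
        # Assumption: siblings agree on one parent, so join them there.
        target, unplaced = next(iter(parents)), False
    else:
        # Siblings are split across several parents: attach to z, flagged.
        target, unplaced = z_image, True
    return {c: (target, unplaced) for c in unaligned}
```

For the Melyridae case, calling it with three aligned genera whose workspace parents are three different subfamilies yields `("Melyridae", True)` for both _Trichoceble_ and _Danacaea_.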
331+
321332
1. (a,b,c,d,e)z + ((a,b)x,(c,d)y)z = (a,b,c,d,e)z
322333

323334
We don't want to lose the fact from the higher priority taxonomy S
@@ -327,21 +338,6 @@ and the new source, we retain the workspace.
327338

328339
So that we have a term for this situation, say that x is _absorbed_ into z.
329340

330-
1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e)
331-
332-
If the S' topology is incompatible with the S topology,
333-
we throw away the conflicting internal nodes from S' (u and v).
334-
Any leftover taxa (e)
335-
are flagged _incertae sedis_ in the attachment point, which in this case is z.
336-
337-
[JAR: need better example that doesn't involve synonyms] An example of inconsistency: gbif:7919320 = Helotium, contains
338-
Helotium lonicerae and Helotium infarciens, but IF knows Helotium infarciens as a synonym for
339-
Hymenoscyphus infarciens, which isn't in OTT Helotium
340-
341-
[JAR: need better example that doesn't involve synonyms]: GBIF (S') Paludomidae has children
342-
Tiphobia and Bridouxia, but the two children have different parents
343-
in S
344-
345341
## Finishing the assembly
346342

347343
After all source taxonomies are aligned and merged, we apply general ad hoc
@@ -350,19 +346,24 @@ employed with the source taxonomies. Patches are represented in a
350346
variety of formats representing historical accidents of curation (i.e. we
351347
have updated our patch system over time in response to curator feedback). Rather
352348
than convert all patches to
353-
some form already known to the system, we kept it in the original form,
354-
which facilitates further editing.
355-
356-
* give the number of patches [JAR: get number from v3.0; at least 123 - not clear how to count].
357-
358-
The final step is to assign unique, stable identifiers to nodes. As
359-
before [JAR: define what "as before" refers to], some identifiers are assigned on an ad hoc basis [JAR: define which nodes get ad hoc identifiers]. Then,
360-
automated identifier assignment is done by aligning the previous
361-
version of OTT to the new taxonomy. Additional candidates are
362-
found [JAR: clarify that 'are found by' is automated, not manual] by comparing node identifiers used in source taxonomies to
363-
source taxonomy node identifiers stored in the previous OTT version.
364-
After transferring identifiers of aligned nodes, any remaining workspace
365-
nodes are given newly 'minted' identifiers.
349+
some form already known to the system, we kept them in their original form.
350+
This practice facilitates further editing.
351+
352+
* give the number of patches [JAR: get number from v3.0 when final;
353+
`python util/count_patches.py amendments.py` ~= 121,
354+
`tail +2 feed/misc/chromista-spreadsheet.csv | wc` = 239,
355+
`cat amendments/*.json | grep original_label | wc` ~= 106]
356+
357+
The final step is to assign unique, stable identifiers to nodes.
358+
359+
Identifier assignment is done by aligning the previous version of OTT
360+
to the new taxonomy. As with the other alignments, there are scripted
361+
_ad hoc_ adjustments to correct for some errors that would otherwise
362+
be made by automated assignment. For this alignment, the set of
363+
heuristics is extended by adding rules to prefer candidates that have
364+
the same source taxonomy node id as the previous version node being
365+
aligned. After transferring identifiers of aligned nodes, any
366+
remaining workspace nodes are given newly 'minted' identifiers.
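
The transfer-then-mint step could look roughly like this. It is a sketch only: the function name, the dictionary arguments, and the use of integer identifiers minted above the previous maximum are assumptions made for illustration, and the alignment is taken as already computed.

```python
import itertools


def assign_ids(workspace_nodes, alignment, previous_ids):
    """Carry identifiers over from the previous version, minting the rest.

    workspace_nodes: iterable of nodes in the new taxonomy
    alignment:       dict workspace node -> aligned previous-version node
    previous_ids:    dict previous-version node -> integer identifier
    """
    # Assumption: new identifiers start above every previously used one.
    counter = itertools.count(max(previous_ids.values(), default=0) + 1)
    ids = {}
    for node in workspace_nodes:
        prev = alignment.get(node)
        if prev is not None and prev in previous_ids:
            ids[node] = previous_ids[prev]  # transferred identifier
        else:
            ids[node] = next(counter)  # newly minted identifier
    return ids
```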
366367

367368
The previous OTT version is not merged into the new version; the alignment is
368369
computed only for the purpose of assigning identifiers. A node can only persist

doc/method/method-intro.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ benefits as well, such as the ability to add additional sources relatively
1919
easily, and to use the tool for other purposes.
2020

2121
In the following, any definite claims or measurements apply to the
22-
Open Tree reference taxonomy version 2.11.
22+
Open Tree reference taxonomy version 3.0.
2323

2424
## Terminology
2525

doc/method/removed-text.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -681,3 +681,17 @@ _absorbed into larger taxon due to conflict_: [should be described in methods se
681681
[Interesting?: 57 taxa that were unplaced in a higher priority source
682682
get placed by a lower priority source.]
683683

684+
685+
686+
2017-02-18
687+
688+
Removing this case because I cannot find an example of it!
689+
690+
1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e)
691+
692+
If sibling nodes in S' have different parents in S, we throw away their parent node.
693+
In the example, a and c, which are siblings under u, have different parents x and y
694+
in S, and similarly for b and d.
695+
Any unaligned children (e in this case)
696+
are flagged _incertae sedis_ in the attachment point, which in this case is z.
697+

doc/method/results.md

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -62,7 +62,9 @@ reclassification); and in the remaining four cases, either the taxon
6262
was added to OTT after the study was curated, or the curation task was
6363
left incomplete.
6464

65-
[JAR: measure of how many mapped OTUs come from NCBI, i.e. how close NCBI gets us to the mapping requirement]
65+
[JAR: measure of how many mapped OTUs come from NCBI, i.e. how close NCBI
66+
gets us to the mapping requirement: `python doc/method/otus_in_ncbi.py` =
67+
'172440 out of 195675 OTUs are in NCBI' (88%)]
6668

6769
### Taxonomic coverage
6870

@@ -72,9 +74,8 @@ combination of the inputs has greater coverage than CoL, and in part
7274
because OTT has many names that are either not valid or not currently
7375
accepted.
7476

75-
Since the GBIF source we used includes the 2011 edition of CoL [JAR: 2011],
76-
OTT includes everything in that edition of CoL. [JAR: did it get updated
77-
for 2016 GBIF?]
77+
Since the GBIF source we used includes the Catalogue of Life [ref],
78+
OTT includes everything in CoL.
7879

7980
This level of coverage would seem to meet Open Tree's taxonomic
8081
coverage requirement as well as any other available taxonomic source.
@@ -94,9 +95,9 @@ coverage requirement as well as any other available taxonomic source.
9495

9596
### Ongoing update
9697

97-
Building OTT version 2.11 from sources requires 11 minutes 42 second of real time. Our process currently runs on a machine with 16GB of memory, and 8GB is not sufficient.
98+
Building OTT version 3.0 from sources requires 15 minutes of real time. Our process currently runs on a machine with 16GB of memory; 8GB is not sufficient.
9899

99-
In the upgrade from 2.10 to 2.11, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be minimal and incorporating the new version was trivial. In contrast, the version from GBIF represented both a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which requires extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
100+
In the upgrade from 2.10 to 3.0, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be minimal and incorporating the new version was trivial. In contrast, the version from GBIF represented a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which required extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
100101

101102
We estimate that the update from OTT v2.10 to OTT v3.0 required approximately 3 days of development time. This was much greater than previous updates due to the changes required to handle the major changes in GBIF content and format.
102103

doc/method/taxonomy_metrics.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99

1010
def doit(ott, sep, outpath):
1111

12-
do_rug = os.path.isdir('out/ruggiero')
12+
do_rug = False #os.path.isdir('out/ruggiero')
1313

1414
if do_rug:
1515
rug = Taxonomy.getRawTaxonomy('out/ruggiero/', 'rug')

org/opentreeoflife/smasher/AlignmentByName.java

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -192,7 +192,8 @@ public Answer findAlignment(Taxon node) {
192192
anyCandidate.witness);
193193
// Hack for debugging / example selection
194194
Answer winner = winners.get(0);
195-
((UnionTaxonomy)target).choicesMade.add(new Answer[]{winner, anyLoser});
195+
if (target instanceof UnionTaxonomy)
196+
((UnionTaxonomy)target).choicesMade.add(new Answer[]{winner, anyLoser});
196197
} else {
197198
result = new Answer(node, anyCandidate.target, score,
198199
"confirmed",

util/reason_report.py

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Metrics for taxonomy writeup.
22

3-
import sys, os, json
3+
import sys, os, json, csv
44

55
def format_report(taxo_path, metrics_path):
66
print """# Results
@@ -63,12 +63,22 @@ def show_table(table, description):
6363
print
6464
# show_table_ascii(table)
6565
show_table_html(table)
66+
show_table_csv(table)
67+
68+
def show_table_csv(table):
69+
print '-----'
70+
writer = csv.writer(sys.stdout)
71+
for (count, label) in table:
72+
writer.writerow([count, label])
73+
print '-----'
6674

6775
def show_table_ascii(table):
6876
for (count, label) in table:
6977
print fmt % (count, label)
7078
print
7179

80+
# returns (list of (rank, label, count), total)
81+
7282
def prepare_table(summary, label_info, totalize):
7383
for key in summary:
7484
if not key in all_keys:
@@ -219,8 +229,8 @@ def cell(val, atts):
219229
(80, 'Source node absorbed into larger workspace taxon'),
220230
"reject/inconsistent":
221231
(82, 'Source node absorbed into larger taxon due to conflict [fix me]'),
222-
"reject/wayward":
223-
(84, 'Source parent does not descend from nearest aligned [fix me]'),
232+
# "reject/wayward":
233+
# (84, 'Source parent does not descend from nearest aligned [fix me]'),
224234
}
225235

226236
all_keys = {}
