Lines changed: 36 additions & 21 deletions
--- a/doc/method/discussion.md
+++ b/doc/method/discussion.md
@@ -10,43 +10,58 @@ include errors such as rank inconsistencies, duplications, spelling mistakes,
 and misplaced taxa.
 
 Ultimately there is no fully automated and foolproof test to determine
-whether two nodes can be aligned - whether node A and node B are about
+whether two nodes can be aligned - whether node A and node B,
+from different source taxonomies,
+are about
 the same taxon. The information to do this is in the
 literature and in databases on the Internet, but often it is
 (understandably) missing from the source taxonomies.
 
-It is not feasible to curate such problem individually, so the taxonomy
-synthesis methods identify and handle thousands of 'special cases' in an
-automated way. We currently use only the name-strings (and ranks, in some of the
-heuristics) to guide synthesis. Using other information contained in the source
-taxonomies (such as authority or nomenclatural information) could be possible in
-the future.
+It is not feasible to curate such problem individually, so the
+taxonomy synthesis methods identify and handle thousands of 'special
+cases' in an automated way. We currently use only the name-strings
+(and ranks, in some of the heuristics) to guide synthesis. Using other
+information contained in the source taxonomies, such as the structure
+of species names, authority strings, and other nomenclatural information,
+could be very helpful.
 
 *[JAR: something about how messy the inputs are]
 
 ## Open Tree Taxonomy as a taxonomy
-We have developed the Open Tree Taxonomy (OTT) for the very specific purpose of aligning and synthesizing phylogenetic trees. We do not intend it to be a reference for nomenclature, or to fill the role of expert-curated taxonomic databases. Several features of OTT make it not suitable for taxonomic and nomenclatural purposes. It contains many names that are either not valid or not currently accepted. Some of these come from DNA sequencing via NCBI, which is also not a taxonomic reference, while others come directly from phylogenies submitted to OpenTree curators via our taxonomy curation features. OTT also contains more homonyms as compared to its sources. Many of these duplicated names are artifacts of the synthesis heuristics. For our purposes, these are not of great concern - when mapping OTUs in trees to taxa in OTT, we generally restrict mapping to a specific taxonomic context, and if there are multiple matches to OTT taxa with the same name, a curator can clearly see this situation and choose the taxa with the correct lineage. [JAR: anything else here?]
+
+We have developed the Open Tree Taxonomy (OTT) for the very specific
+purpose of aligning and synthesizing phylogenetic trees. We do not
+intend it to be a reference for nomenclature, or to substitute for
+expert-curated taxonomic databases. Several features of OTT make it
+unsuitable for taxonomic and nomenclatural purposes. It contains
+many names that are either not valid or not currently accepted. Some
+of these come from DNA sequencing via NCBI Taxonomy, which is also not a
+taxonomic reference, while others come directly from phylogenies
+submitted to Open Tree curators via our taxonomy curation features. OTT
+also contains more homonyms as compared to its sources. Many of these
+duplicated names are artifacts of the synthesis heuristics. For our
+purposes, these are not of great concern - when mapping OTUs in trees
+to taxa in OTT, we generally restrict mapping to a specific taxonomic
+context, and if there are multiple matches to OTT taxa with the same
+name, a curator can clearly see this situation and choose the taxon
+with the correct lineage.
 
 ## Allowing community curation
+
 We have also developed a system for curators to directly add new taxon records to the
-taxonomy from published phylogenies. These taxon additions include provenance
-information, including evidence for the taxon and identity of the curator. We
-expose this provenance information through the website and the taxonomy API.
+taxonomy from published phylogenies, which often contain newly
+described species that are not yet present in any source taxonomy.
+These taxon additions include provenance
+information, including evidence for the taxon and the identity of the curator. We
+expose this provenance information through the web site and the taxonomy API.
 Most of the feedback on the synthetic tree of life has been about taxonomy, and expanding this
 feature to other types of taxonomic information allows users to directly
 contribute expertise and allows projects to easily share that information.
-
-## Extinct/extant annotations
-
-One important taxon annotation is 'extinct'. The OTT backbone
-is essentially the NCBI Taxonomy, which records very few extinct taxa, so when such taxa are found in other taxonomies, there are often no higher
-taxa to put them into [JAR: figure out a better way to explain this]. This had a very negative impact on the subsequent phylogeny synthesis, with most extinct taxa badly placed in the tree - for example, there were many extinct genera and families that appeared as direct children of Mammalia.
-We therefore removing from synthesis those taxa in the
-taxonomy that are annotated as extinct, leading to a cleaner synthesis.
-
-Incorporating extinct taxa into OTT would be better accomplished by adding a source that explicitly focuses on extinct taxa.
+[KC: not sure what you mean by "expanding this feature" - what
+feature? how?]
 
 ## Comparison to other taxonomies
+
 Given the very different goals of Open Tree Taxonomy in comparison to most other taxonomy project, it is difficult to compare OTT to other taxonomies in a meaningful way. The Open Tree Taxonomy is most similar to the GBIF taxonomy, in the sense that
 both are a synthesis of existing taxonomies rather than a curated taxonomy. The
 GBIF method is yet unpublished. Once the GBIF method has been formally
If the S' topology is incompatible with the S topology,
-we throw away the conflicting internal nodes from S' (u and v).
-Any leftover taxa (e)
-are flagged _incertae sedis_ in the attachment point, which in this case is z.
-
-[JAR: need better example that doesn't involve synonyms] An example of inconsistency: gbif:7919320 = Helotium, contains
-Helotium lonicerae and Helotium infarciens, but IF knows Helotium infarciens as a synonym for
-Hymenoscyphus infarciens, which isn't in OTT Helotium
-
-[JAR: need better example that doesn't involve synonyms]: GBIF (S') Paludomidae has children
-Tiphobia and Bridouxia, but the two children have different parents
-in S
-
 ## Finishing the assembly
 
 After all source taxonomies are aligned and merged, we apply general ad hoc
@@ -350,19 +346,24 @@ employed with the source taxonomies. Patches are represented in a
 variety of formats representing historical accidents of curation (i.e. we
 have updated our patch system over time in response to curator feedback). Rather
 than convert all patches to
-some form already known to the system, we kept it in the original form,
-which facilitates further editing.
-
-* give the number of patches [JAR: get number from v3.0; at least 123 - not clear how to count].
-
-The final step is to assign unique, stable identifiers to nodes. As
-before [JAR: define what "as before" refers to], some identifiers are assigned on an ad hoc basis [JAR: define which nodes get ad hoc identifiers]. Then,
-automated identifier assignment is done by aligning the previous
-version of OTT to the new taxonomy. Additional candidates are
-found [JAR: clarify that 'are found by' is automated, not manual] by comparing node identifiers used in source taxonomies to
-source taxonomy node identifiers stored in the previous OTT version.
-After transferring identifiers of aligned nodes, any remaining workspace
-nodes are given newly 'minted' identifiers.
+some form already known to the system, we kept them in their original form.
+This practice facilitates further editing.
+
+* give the number of patches [JAR: get number from v3.0 when final;
Lines changed: 7 additions & 6 deletions
--- a/doc/method/results.md
+++ b/doc/method/results.md
@@ -62,7 +62,9 @@ reclassification); and in the remaining four cases, either the taxon
 was added to OTT after the study was curated, or the curation task was
 left incomplete.
 
-[JAR: measure of how many mapped OTUs come from NCBI, i.e. how close NCBI gets us to the mapping requirement]
+[JAR: measure of how many mapped OTUs come from NCBI, i.e. how close NCBI
+gets us to the mapping requirement: `python doc/method/otus_in_ncbi.py` =
+'172440 out of 195675 OTUs are in NCBI' (88%)]
 
 ### Taxonomic coverage
 
@@ -72,9 +74,8 @@ combination of the inputs has greater coverage than CoL, and in part
 because OTT has many names that are either not valid or not currently
 accepted.
 
-Since the GBIF source we used includes the 2011 edition of CoL [JAR: 2011],
-OTT includes everything in that edition of CoL. [JAR: did it get updated
-for 2016 GBIF?]
+Since the GBIF source we used includes the Catalogue of Life [ref],
+OTT includes everything in CoL.
 
 This level of coverage would seem to meet Open Tree's taxonomic
 coverage requirement as well as any other available taxonomic source.
@@ -94,9 +95,9 @@ coverage requirement as well as any other available taxonomic source.
 
 ### Ongoing update
 
-Building OTT version 2.11 from sources requires 11 minutes 42 second of real time. Our process currently runs on a machine with 16GB of memory, and 8GB is not sufficient.
+Building OTT version 3.0 from sources requires 15 minutes of real time. Our process currently runs on a machine with 16GB of memory; 8GB is not sufficient.
 
-In the upgrade from 2.10 to 2.11, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be minimal and incorporating the new version was trivial. In contrast, the version from GBIF represented both a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which requires extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
+In the upgrade from 2.10 to 3.0, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be minimal and incorporating the new version was trivial. In contrast, the version from GBIF represented both a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which requires extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
 
 We estimate the the update from OTT v2.10 to OTT v3.0 required approximately 3 days of development time. This was much greater than previous updates due to the changes required to handle the major changes in GBIF content and format.