responses to most questions, as requested

jar398 · jar398 · commit 3c887e182985 · 2017-02-18T23:05:59.000-05:00
diff --git a/doc/method/discussion.md b/doc/method/discussion.md
@@ -10,43 +10,58 @@ include errors such as rank inconsistencies, duplications, spelling mistakes,
 and misplaced taxa.
 
 Ultimately there is no fully automated and foolproof test to determine
-whether two nodes can be aligned - whether node A and node B are about
+whether two nodes can be aligned - whether node A and node B,
+from different source taxonomies,
+are about
 the same taxon. The information to do this is in the
 literature and in databases on the Internet, but often it is
 (understandably) missing from the source taxonomies.
 
-It is not feasible to curate such problem individually, so the taxonomy
-synthesis methods identify and handle thousands of 'special cases' in an
-automated way. We currently use only the name-strings (and ranks, in some of the
-heuristics) to guide synthesis. Using other information contained in the source
-taxonomies (such as authority or nomenclatural information) could be possible in
-the future.
+It is not feasible to curate such problems individually, so the
+taxonomy synthesis methods identify and handle thousands of 'special
+cases' in an automated way. We currently use only the name-strings
+(and ranks, in some of the heuristics) to guide synthesis. Using other
+information contained in the source taxonomies, such as the structure
+of species names, authority strings, and other nomenclatural information,
+could be very helpful.
 
 * [JAR: something about how messy the inputs are]
 
 ## Open Tree Taxonomy as a taxonomy
-We have developed the Open Tree Taxonomy (OTT) for the very specific purpose of aligning and synthesizing phylogenetic trees. We do not intend it to be a reference for nomenclature, or to fill the role of expert-curated taxonomic databases. Several features of OTT make it not suitable for taxonomic and nomenclatural purposes. It contains many names that are either not valid or not currently accepted. Some of these come from DNA sequencing via NCBI, which is also not a taxonomic reference, while others come directly from phylogenies submitted to OpenTree curators via our taxonomy curation features. OTT also contains more homonyms as compared to its sources. Many of these duplicated names are artifacts of the synthesis heuristics. For our purposes, these are not of great concern - when mapping OTUs in trees to taxa in OTT, we generally restrict mapping to a specific taxonomic context, and if there are multiple matches to OTT taxa with the same name, a curator can clearly see this situation and choose the taxa with the correct lineage. [JAR: anything else here?]
+
+We have developed the Open Tree Taxonomy (OTT) for the very specific
+purpose of aligning and synthesizing phylogenetic trees. We do not
+intend it to be a reference for nomenclature, or to substitute for
+expert-curated taxonomic databases. Several features of OTT make it
+unsuitable for taxonomic and nomenclatural purposes. It contains
+many names that are either not valid or not currently accepted. Some
+of these come from DNA sequencing via NCBI Taxonomy, which is also not a
+taxonomic reference, while others come directly from phylogenies
+submitted to Open Tree curators via our taxonomy curation features. OTT
+also contains more homonyms as compared to its sources. Many of these
+duplicated names are artifacts of the synthesis heuristics. For our
+purposes, these are not of great concern - when mapping OTUs in trees
+to taxa in OTT, we generally restrict mapping to a specific taxonomic
+context, and if there are multiple matches to OTT taxa with the same
+name, a curator can clearly see this situation and choose the taxon
+with the correct lineage.
 
 ## Allowing community curation
+
 We have also developed a system for curators to directly add new taxon records to the
-taxonomy from published phylogenies. These taxon additions include provenance
-information, including evidence for the taxon and identity of the curator. We
-expose this provenance information through the website and the taxonomy API.
+taxonomy from published phylogenies, which often contain newly
+described species that are not yet present in any source taxonomy.
+These taxon additions include provenance
+information, including evidence for the taxon and the identity of the curator. We
+expose this provenance information through the web site and the taxonomy API.
 Most of the feedback on the synthetic tree of life has been about taxonomy, and expanding this
 feature to other types of taxonomic information allows users to directly
 contribute expertise and allows projects to easily share that information.
-
-## Extinct/extant annotations
-
-One important taxon annotation is 'extinct'.  The OTT backbone
-is essentially the NCBI Taxonomy, which records very few extinct taxa, so when such taxa are found in other taxonomies, there are often no higher
-taxa to put them into [JAR: figure out a better way to explain this]. This had a very negative impact on the subsequent phylogeny synthesis, with most extinct taxa badly placed in the tree - for example, there were many extinct genera and families that appeared as direct children of Mammalia.  
-We therefore removing from synthesis those taxa in the
-taxonomy that are annotated as extinct, leading to a cleaner synthesis.
-
-Incorporating extinct taxa into OTT would be better accomplished by adding a source that explicitly focuses on extinct taxa. 
+[KC: not sure what you mean by "expanding this feature" - what
+feature? how?]
 
 ## Comparison to other taxonomies
+
 Given the very different goals of Open Tree Taxonomy in comparison to most other taxonomy project, it is difficult to compare OTT to other taxonomies in a meaningful way. The Open Tree Taxonomy is most similar to the GBIF taxonomy, in the sense that
 both are a synthesis of existing taxonomies rather than a curated taxonomy. The
 GBIF method is yet unpublished. Once the GBIF method has been formally
diff --git a/doc/method/introduction.md b/doc/method/introduction.md
@@ -24,7 +24,7 @@ taxon as a given name-string occurrence in another.  Solving
 this equivalence problem requires detecting equivalence when the
 name-strings are different (synonym detection), as well as
 distinguishing occurrences that only coincidentally have the same
-name-string (homonym detection). We have developed a set of heuristics
+name-string (homonym sense detection). We have developed a set of heuristics
 that scalably address this equivalence problem.
 
 ## The Open Tree of Life project
diff --git a/doc/method/method-details.md b/doc/method/method-details.md
@@ -12,20 +12,16 @@ are performed:
 
  1. Child taxa of "containers" in the source taxonomy are made to be
     children of the container's parent.  "Containers" are
-    groupings in the tree that don't represent taxa, for example nodes
+    groupings in the source that don't represent taxa, for example nodes
     named "incertae sedis" or "environmental samples".  The members of
     a container aren't more closely related to one another than they
     are to the container's siblings; the container is only present as
     a way to say something about the members.  The fact that a node
     had originally been in a container is recorded as a flag on the
     child node.
- 1. Monotypic homonym removal - when a node with name N has as its
-    only child another node with the same name N, the parent is removed.
-    This is done to avoid ambiguities when aligning taxonomies.
-    [JAR: get examples by rerunning]
  1. Diacritics removal - accents and umlauts are removed in order to improve
     name matching, as well as to follow the nomenclatural codes, which prohibit them.
-    The original name-string is kept, as a synonym.
+    The original name-string is kept as a synonym.
 
 The normalized versions of the taxonomies then become the input to subsequent
 processing phases.
@@ -76,7 +72,8 @@ be run as a script.  Following are some examples of adjustments.
 
 In the process of assembling the reference taxonomy, 284 ad hoc
 adjustments are made to the source taxonomies before they are
-aligned to the workspace. [JAR: check numbers for v3.0]
+aligned to the workspace.
+[JAR: check numbers when v3.0 is final: `python util/count_patches.py adjustments.py` ~= 289]
 
 ### Candidate identification
 
@@ -256,7 +253,14 @@ not needed because it inconsistent or can be 'absorbed', and it is
 dropped.  If it is not dropped, then there is a troublesome situation
 that calls for manual review.
 
-[JAR: Need example here ...]
+For example, for GBIF _Katoella pulchra_, the candidates are NCBI
+_Davallodes pulchra_ and _Davallodes yunnanensis_.  (There is no
+_Katoella pulchra_ in the workspace at the time of the alignment and
+the two candidates come from synonymies with _Katoella pulchra_
+declared by GBIF.)  
+Neither candidate is preferable to the other, so
+_Katoella pulchra_ is left unaligned and
+is omitted from the assembly.
 
 
 ## Merging unaligned source nodes into the workspace
@@ -318,6 +322,13 @@ and the new source, we retain the workspace.
    here is to add e to z and mark it as _incertae sedis_ (indicated above
    by the question mark).
 
+   For example, family Melyridae from GBIF has five genera, of which two
+   (_Trichoceble_, _Danacaea_) are not found in the workspace,
+   and the other three do not all have the same parent after alignment:
+   they are in three different subfamilies.  _Trichoceble_ and _Danacaea_
+   are made to be _incertae sedis_ children of Melyridae, because
+   there is no telling which NCBI subfamily they are supposed to go in.
+
 1. (a,b,c,d,e)z + ((a,b)x,(c,d)y)z = (a,b,c,d,e)z
 
    We don't want to lose the fact from the higher priority taxonomy S
@@ -327,21 +338,6 @@ and the new source, we retain the workspace.
 
    So that we have a term for this situation, say that x is _absorbed_ into z.
 
-1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e)
-
-   If the S' topology is incompatible with the S topology,
-   we throw away the conflicting internal nodes from S' (u and v).
-   Any leftover taxa (e)
-   are flagged _incertae sedis_ in the attachment point, which in this case is z.
-
-   [JAR: need better example that doesn't involve synonyms] An example of inconsistency: gbif:7919320 = Helotium, contains
-   Helotium lonicerae and Helotium infarciens, but IF knows Helotium infarciens as a synonym for
-   Hymenoscyphus infarciens, which isn't in OTT Helotium
-
-   [JAR: need better example that doesn't involve synonyms]: GBIF (S') Paludomidae has children
-   Tiphobia and Bridouxia, but the two children have different parents
-   in S
-
 ## Finishing the assembly
 
 After all source taxonomies are aligned and merged, we apply general ad hoc
@@ -350,19 +346,24 @@ employed with the source taxonomies.  Patches are represented in a
 variety of formats representing historical accidents of curation (i.e. we
 have updated our patch system over time in response to curator feedback).  Rather
 than convert all patches to
-some form already known to the system, we kept it in the original form,
-which facilitates further editing.
-
-* give the number of patches [JAR: get number from v3.0; at least 123 - not clear how to count].
-
-The final step is to assign unique, stable identifiers to nodes.  As
-before [JAR: define what "as before" refers to], some identifiers are assigned on an ad hoc basis [JAR: define which nodes get ad hoc identifiers].  Then,
-automated identifier assignment is done by aligning the previous
-version of OTT to the new taxonomy. Additional candidates are
-found [JAR: clarify that 'are found by' is automated, not manual] by comparing node identifiers used in source taxonomies to
-source taxonomy node identifiers stored in the previous OTT version.
-After transferring identifiers of aligned nodes, any remaining workspace
-nodes are given newly 'minted' identifiers.
+some form already known to the system, we kept them in their original form.
+This practice facilitates further editing.
+
+* give the number of patches [JAR: get number from v3.0 when final; 
+`python util/count_patches.py amendments.py` ~= 121,
+`tail -n +2 feed/misc/chromista-spreadsheet.csv | wc` = 239,
+`cat amendments/*.json | grep original_label | wc` ~= 106]
+
+The final step is to assign unique, stable identifiers to nodes.  
+
+Identifier assignment is done by aligning the previous version of OTT
+to the new taxonomy.  As with the other alignments, there are scripted
+_ad hoc_ adjustments to correct for some errors that would otherwise
+be made by automated assignment.  For this alignment, the set of
+heuristics is extended by adding rules to prefer candidates that have
+the same source taxonomy node id as the previous version node being
+aligned.  After transferring identifiers of aligned nodes, any
+remaining workspace nodes are given newly 'minted' identifiers.
 
 The previous OTT version is not merged into the new version; the alignment is
 computed only for the purpose of assigning identifiers.  A node can only persist
diff --git a/doc/method/method-intro.md b/doc/method/method-intro.md
@@ -19,7 +19,7 @@ benefits as well, such as the ability to add additional sources relatively
 easily, and to use the tool for other purposes.
 
 In the following, any definite claims or measurements apply to the
-Open Tree reference taxonomy version 2.11.
+Open Tree reference taxonomy version 3.0.
 
 ## Terminology
 
diff --git a/doc/method/removed-text.md b/doc/method/removed-text.md
@@ -681,3 +681,17 @@ _absorbed into larger taxon due to conflict_: [should be described in methods se
 [Interesting?:  57 taxa that were unplaced in a higher priority source
 get placed by a lower priority source.]
 
+
+
+2017-02-18
+
+Removing this case because I cannot find an example of it!
+
+1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e)
+
+   If sibling nodes in S' have different parents in S, we throw away their parent node.
+   In the example, a and c, which are siblings under u, have different parents x and y
+   in S, and similarly for b and d.
+   Any unaligned children (e in this case)
+   are flagged _incertae sedis_ in the attachment point, which in this case is z.
+
diff --git a/doc/method/results.md b/doc/method/results.md
@@ -62,7 +62,9 @@ reclassification); and in the remaining four cases, either the taxon
 was added to OTT after the study was curated, or the curation task was
 left incomplete.
 
-[JAR: measure of how many mapped OTUs come from NCBI, i.e. how close NCBI gets us to the mapping requirement]
+[JAR: measure of how many mapped OTUs come from NCBI, i.e. how close NCBI 
+gets us to the mapping requirement: `python doc/method/otus_in_ncbi.py` = 
+'172440 out of 195675 OTUs are in NCBI' (88%)]
 
 ### Taxonomic coverage
 
@@ -72,9 +74,8 @@ combination of the inputs has greater coverage than CoL, and in part
 because OTT has many names that are either not valid or not currently
 accepted.
 
-Since the GBIF source we used includes the 2011 edition of CoL [JAR: 2011],
-OTT includes everything in that edition of CoL.  [JAR: did it get updated
-for 2016 GBIF?]
+Since the GBIF source we used includes the Catalogue of Life [ref],
+OTT includes everything in CoL.
 
 This level of coverage would seem to meet Open Tree's taxonomic
 coverage requirement as well as any other available taxonomic source.
@@ -94,9 +95,9 @@ coverage requirement as well as any other available taxonomic source.
 
 ### Ongoing update
 
-Building OTT version 2.11 from sources requires 11 minutes 42 second of real time. Our process currently runs on a machine with 16GB of memory, and 8GB is not sufficient.
+Building OTT version 3.0 from sources requires 15 minutes of real time. Our process currently runs on a machine with 16GB of memory; 8GB is not sufficient.
 
-In the upgrade from 2.10 to 2.11, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be minimal and incorporating the new version was trivial. In contrast, the version from GBIF represented both a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which requires extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
+In the upgrade from 2.10 to 3.0, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be minimal and incorporating the new version was trivial. In contrast, the new version from GBIF represented a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which required extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
 
 We estimate the the update from OTT v2.10 to OTT v3.0 required approximately 3 days of development time. This was much greater than previous updates due to the changes required to handle the major changes in GBIF content and format.  
 
diff --git a/doc/method/taxonomy_metrics.py b/doc/method/taxonomy_metrics.py
@@ -9,7 +9,7 @@
 
 def doit(ott, sep, outpath):
 
-    do_rug = os.path.isdir('out/ruggiero')
+    do_rug = False  #os.path.isdir('out/ruggiero')
 
     if do_rug:
         rug = Taxonomy.getRawTaxonomy('out/ruggiero/', 'rug')
diff --git a/org/opentreeoflife/smasher/AlignmentByName.java b/org/opentreeoflife/smasher/AlignmentByName.java
@@ -192,7 +192,8 @@ public Answer findAlignment(Taxon node) {
                                         anyCandidate.witness);
                     // Hack for debugging / example selection
                     Answer winner = winners.get(0);
-                    ((UnionTaxonomy)target).choicesMade.add(new Answer[]{winner, anyLoser});
+                    if (target instanceof UnionTaxonomy)
+                        ((UnionTaxonomy)target).choicesMade.add(new Answer[]{winner, anyLoser});
                 } else {
                     result = new Answer(node, anyCandidate.target, score,
                                         "confirmed",
diff --git a/util/reason_report.py b/util/reason_report.py
@@ -1,6 +1,6 @@
 # Metrics for taxonomy writeup.
 
-import sys, os, json
+import sys, os, json, csv
 
 def format_report(taxo_path, metrics_path):
     print """# Results
@@ -63,12 +63,22 @@ def show_table(table, description):
     print
     # show_table_ascii(table)
     show_table_html(table)
+    show_table_csv(table)
+
+def show_table_csv(table):
+    print '-----'
+    writer = csv.writer(sys.stdout)
+    for (count, label) in table:
+        writer.writerow([count, label])
+    print '-----'
 
 def show_table_ascii(table):
     for (count, label) in table:
         print fmt % (count, label)
     print
 
+# returns (list of (rank, label, count), total)
+
 def prepare_table(summary, label_info, totalize):
     for key in summary:
         if not key in all_keys:
@@ -219,8 +229,8 @@ def cell(val, atts):
     (80, 'Source node absorbed into larger workspace taxon'),
     "reject/inconsistent":
     (82, 'Source node absorbed into larger taxon due to conflict [fix me]'),
-    "reject/wayward":
-    (84, 'Source parent does not descend from nearest aligned [fix me]'),
+    # "reject/wayward":
+    # (84, 'Source parent does not descend from nearest aligned [fix me]'),
 }
 
 all_keys = {}