Skip to content

inconsistent handling of non-ancestral material in non-strict, unpolarised branch stats #2983

@petrelharp

Description

@petrelharp

Consider the following example:

# 0 is a sample, 1 is not
#
# 2.00┊  2  ┊ 2 ┊  2  ┊
#     ┊ ┏┻┓ ┊ ┃ ┊ ┏┻┓ ┊
# 1.00┊ ┃ 1 ┊ 1 ┊ ┃ 1 ┊
#     ┊ ┃   ┊ ┃ ┊ ┃   ┊
# 0.00┊ 0   ┊ 0 ┊ 0   ┊
#     0     1   2     3

tables = tskit.TableCollection(sequence_length=3)

node_times = [0, 1, 2]
samples = [0]
for n, t in enumerate(node_times):
    tables.nodes.add_row(time=t, flags=tskit.NODE_IS_SAMPLE if n in samples else 0)

# p, c, l, r
edges = [
    (1, 0, 1, 2),
    (2, 0, 0, 1),
    (2, 0, 2, 3),
    (2, 1, 0, 3),
]
for p, c, l, r in edges:
    tables.edges.add_row(parent=p, child=c, left=l, right=r)

# this makes it so 'site' mode counts branches
for x in range(int(tables.sequence_length)):
    for n in range(tables.nodes.num_rows - 1):
        offset = n / tables.nodes.num_rows
        s = tables.sites.add_row(position=x+offset, ancestral_state='0')
        tables.mutations.add_row(site=s, node=n, derived_state='1')

ts = tables.tree_sequence()

def f(x):
    return x

for polarised, mode, answer in [
            (True, "branch", 6),
            (True, "site", 4),
            (False, "branch", 8),
            (False, "site", 6),
        ]:
    stat, = ts.sample_count_stat([[0]], f, 1, strict=False, span_normalise=False, polarised=polarised, mode=mode)
    print(polarised, mode, answer, stat)
    # assert stat == answer

which produces

True branch 6 6.0
True site 4 4.0
False branch 8 7.0   # <------ whoops!
False site 6 6.0

This summary function is (for this case, of a single node) "count everything above the sample". So, the polarised branch stat here is 2 + 2 + 2 = 6. Good so far. However, if it is unpolarised, then the summary function produces "count everything either above or not above the sample", i.e., everything. This should thus be 3 + 2 + 3 = 8. However, we produce 7. The reason is that here (or rather, the equivalent place in C), we initialize summaries only for samples, assuming that the summary will be zero for isolated non-samples. This is true for strict summary functions, but is not true for this example. The solution is to initialize for all nodes, not just for samples. This means that when we start, we assume that the summary function for node 1 is zero, when it should be 1. In the next tree, we update the summary function for node 1, and again when transititioning to the next tree. So, the only bit we miss out on is the branch above 1 in the first tree; thus an answer of 7, not 8.

Clearly something is wrong here, because the branch above 1 is treated differently in the left-hand tree and in the right-hand tree. (So, there's not some consistent definition that makes this all make sense that I'm missing here.)

This is unlikely to have affected anyone, ever, since it only affects portions of the tree sequence that is not ancestral to any samples, which only occurs in unsimplified trees (and requires setting their own non-strict summary function).

The fix is, I think, to initialize all nodes, rather than just samples, (and the equivalent place in C).

This does not appear to affect site or node stats.

When we were setting up statistics we debated whether bits of the trees not ancestral to any samples should affect results, and IIRC we had it in at some point so that non-ancestral bits could never affect stats. However, we removed that restriction, which is the right thing to do because (a) it simplifies the implementation and interpretation, (b) you only hit these issues with unpolarised, non-strict stats, and (c) you can easily make a summary function not affected by these regions (see e.g., the summary func for segregating_sites).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions