inconsistent handling of non-ancestral material in non-strict, unpolarised branch stats

Consider the following example:
```
# 0 is a sample, 1 is not
#
# 2.00┊  2  ┊ 2 ┊  2  ┊
#     ┊ ┏┻┓ ┊ ┃ ┊ ┏┻┓ ┊
# 1.00┊ ┃ 1 ┊ 1 ┊ ┃ 1 ┊
#     ┊ ┃   ┊ ┃ ┊ ┃   ┊
# 0.00┊ 0   ┊ 0 ┊ 0   ┊
#     0     1   2     3

tables = tskit.TableCollection(sequence_length=3)

node_times = [0, 1, 2]
samples = [0]
for n, t in enumerate(node_times):
    tables.nodes.add_row(time=t, flags=tskit.NODE_IS_SAMPLE if n in samples else 0)

# p, c, l, r
edges = [
    (1, 0, 1, 2),
    (2, 0, 0, 1),
    (2, 0, 2, 3),
    (2, 1, 0, 3),
]
for p, c, l, r in edges:
    tables.edges.add_row(parent=p, child=c, left=l, right=r)

# this makes it so 'site' mode counts branches
for x in range(int(tables.sequence_length)):
    for n in range(tables.nodes.num_rows - 1):
        offset = n / tables.nodes.num_rows
        s = tables.sites.add_row(position=x+offset, ancestral_state='0')
        tables.mutations.add_row(site=s, node=n, derived_state='1')

ts = tables.tree_sequence()

def f(x):
    return x

for polarised, mode, answer in [
            (True, "branch", 6),
            (True, "site", 4),
            (False, "branch", 8),
            (False, "site", 6),
        ]:
    stat, = ts.sample_count_stat([[0]], f, 1, strict=False, span_normalise=False, polarised=polarised, mode=mode)
    print(polarised, mode, answer, stat)
    # assert stat == answer
```
which produces
```
True branch 6 6.0
True site 4 4.0
False branch 8 7.0   # <------ whoops!
False site 6 6.0
```
This summary function is (for this case, of a single node) "count everything above the sample". So, the polarised branch stat here is `2 + 2 + 2 = 6`. Good so far. However, if it is unpolarised, then the summary function produces "count everything either above or not above the sample", i.e., everything. This *should* thus be `3 + 2 + 3 = 8`. However, we produce `7`. The reason is that [here](https://github.com/tskit-dev/tskit/blob/46578bdc28211230ad7815b116564e57c2ba88e7/python/tests/test_tree_stats.py#L214) (or rather, the equivalent place in C), we initialize summaries *only for samples*, assuming that the summary will be zero for isolated non-samples. This is true for `strict` summary functions, but is not true for this example. The solution is to initialize for all nodes, not just for samples. This means that when we start, we assume that the summary function for node `1` is zero, when it should be 1. In the next tree, we update the summary function for node `1`, and again when transititioning to the next tree. So, the only bit we miss out on is the branch above `1` in the first tree; thus an answer of `7`, not `8`.

Clearly something is wrong here, because the branch above `1` is treated differently in the left-hand tree and in the right-hand tree. (So, there's not some consistent definition that makes this all make sense that I'm missing here.)

This is unlikely to have affected anyone, ever, since it only affects portions of the tree sequence that is not ancestral to any samples, which only occurs in unsimplified trees (and requires setting their own non-strict summary function).

The fix is, I think, to initialize all nodes, rather than just samples, (and the [equivalent place in C](https://github.com/tskit-dev/tskit/blob/46578bdc28211230ad7815b116564e57c2ba88e7/c/tskit/trees.c#L1314)).

This does not appear to affect site or node stats.

When we were setting up statistics we debated whether bits of the trees not ancestral to any samples should affect results, and IIRC we had it in at some point so that non-ancestral bits could never affect stats. However, we removed that restriction, which is the right thing to do because (a) it simplifies the implementation and interpretation, (b) you only hit these issues with unpolarised, non-strict stats, and (c) you can easily make a summary function not affected by these regions (see e.g., the [summary func for `segregating_sites`](https://tskit.dev/tskit/docs/latest/stats.html#summary-functions)).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inconsistent handling of non-ancestral material in non-strict, unpolarised branch stats #2983

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

inconsistent handling of non-ancestral material in non-strict, unpolarised branch stats #2983

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions