Re-use work of getting state for a given `state_group` (`_get_state_groups_from_groups`) by MadLittleMods · Pull Request #15617 · matrix-org/synapse

MadLittleMods · 2023-05-17T23:58:31Z

Re-use work of getting state for a given state_group. This is useful when we _compute_event_context_with_maybe_missing_prevs(...) -> _get_state_groups_from_groups(...)

Part of making /messages faster: #13356

Background (explaining the situation before this PR)

When we have to _compute_event_context_with_maybe_missing_prevs, _get_state_groups_from_groups currently takes up most of the time. It makes the following recursive query for each state group given which goes up the entire state_group chain and returns the complete distinct state which means thousands of membership events at the very least. For example, with a random /messages request in the Matrix HQ room we did 10x of these queries which each took 0.5-2 seconds and return roughly 88k events every time totaling 14s.

WITH RECURSIVE sgs(state_group) AS (
  VALUES 
    (? :: bigint) 
  UNION ALL 
  SELECT 
    prev_state_group 
  FROM 
    state_group_edges e, sgs s 
  WHERE 
    s.state_group = e.state_group
) 
SELECT 
  DISTINCT ON (type, state_key) type, state_key, event_id 
FROM 
  state_groups_state 
WHERE 
  state_group IN (
    SELECT state_group FROM sgs
  ) 
ORDER BY type, state_key, state_group DESC

https://explain.depesz.com/s/OuOe
https://explain.dalibo.com/plan/ef6bc290f2dced17

EXPLAIN ANAYLZE query output

EXPLAIN ANALYZE WITH RECURSIVE sgs(state_group) AS (
  VALUES 
    (739988088 :: bigint) 
  UNION ALL 
  SELECT 
    prev_state_group 
  FROM 
    state_group_edges e, sgs s 
  WHERE 
    s.state_group = e.state_group
) 
SELECT 
  DISTINCT ON (type, state_key) type, state_key, event_id 
FROM 
  state_groups_state 
WHERE 
  state_group IN (
    SELECT state_group FROM sgs
  ) 
ORDER BY type, state_key, state_group DESC;
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=7161.42..7199.29 rows=5050 width=91) (actual time=1084.809..1099.816 rows=88180 loops=1)
   CTE sgs
     ->  Recursive Union  (cost=0.00..284.28 rows=101 width=8) (actual time=0.004..21.247 rows=59 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
           ->  Nested Loop  (cost=0.57..28.23 rows=10 width=8) (actual time=0.358..0.359 rows=1 loops=59)
                 ->  WorkTable Scan on sgs s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.000 rows=1 loops=59)
                 ->  Index Only Scan using state_group_edges_unique_idx on state_group_edges e  (cost=0.57..2.79 rows=1 width=16) (actual time=0.357..0.357 rows=1 loops=59)
                       Index Cond: (state_group = s.state_group)
                       Heap Fetches: 0
   ->  Sort  (cost=6877.14..6889.76 rows=5050 width=91) (actual time=1084.806..1089.100 rows=88221 loops=1)
         Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group DESC
         Sort Method: quicksort  Memory: 14782kB
         ->  Nested Loop  (cost=3.11..6566.51 rows=5050 width=91) (actual time=23.522..890.988 rows=88221 loops=1)
               ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8) (actual time=21.348..21.383 rows=59 loops=1)
                     Group Key: sgs.state_group
                     Batches: 1  Memory Usage: 24kB
                     ->  CTE Scan on sgs  (cost=0.00..2.02 rows=101 width=8) (actual time=0.007..21.282 rows=59 loops=1)
               ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.84..64.48 rows=50 width=91) (actual time=0.959..14.462 rows=1495 loops=59)
                     Index Cond: (state_group = sgs.state_group)
 Planning Time: 4.485 ms
 Execution Time: 1105.539 ms
(21 rows)

Time: 1112.886 ms (00:01.113)

Full Jaeger trace JSON

Before

Say we have state_groups 1-10 where state_group 1 would be the m.room.create, and then onwards, etc.

Currently, when you ask for the state at state_groups = [5, 8], we will fetch state from 5 -> 1 and 8 -> 1 which is pretty wasteful because we're mostly fetching the same thing again and time consuming (imagine this with Matrix HQ which has chains of 88k events).

flowchart RL
  10 --> 9 --> 8 --> 7 --> 6 --> 5 --> 4 --> 3 --> 2 --> 1

After

Now instead when you ask for the state at state_groups = [5, 8], we fetch 5 -> 1 as normal, but then only fetch 8 -> 5 and layer it on top of the work we did for the state at state_group = 5.

In terms of query performance, our first query will take the same amount of time as before but any subsequent state_groups that are shared will take significantly less time.

In the sample request above, simulating how the app would make the queries with the changes in this PR, we don't actually see that much benefit because it turns out that there isn't much state_group sharing amongst events. That 14s SQL query is just fetching the state at each of the prev_events of one event and only one seems to share state_groups.

state_group chains not actually shared that much in that 14s database query 😬

flowchart BT
    subgraph $LrOm7aODe5StbJwD9tFAfrTZ1ClzWJLkIUsMH1BMc-Y
        direction BT
        736242551 --> 736244167 --> 736258375 --> 736271166 --> 736272133 --> 736273053 -->
        736279599 --> 736283383 --> 736285535 --> 736306125 --> 736307774 --> 736315167 -->
        736320311 --> 736321249 --> 736329091 --> 736332354 --> 736349788 --> 736355933 -->
        736356577 --> 736359340 --> 736361356 --> 736364055 --> 736364651 --> 736364778 -->
        736366361 --> 736374000 --> 736374226 --> 736376011 --> 736380262 --> 736381521 -->
        736390487 --> 736392574 --> 736394358 --> 736394537 --> 736402720 --> 736411655 -->
        736412991 --> 736414800 --> 736417133 --> 736417721 --> 736424434 --> 736425468 -->
        736426289 --> 736430049 --> 736430450 --> 736433258 --> 736437043 --> 736438637 -->
        736439262 --> 736439384
    end

    subgraph $06merKAyQAOg4R5KSvvwq-aY8YxmLxBYKJuYh8S2oFg
        direction BT
        738551430 --> 738551567 --> 738551631 --> 738551632 --> 738554279 --> 738555253 -->
        738556945 --> 738563579 --> 738564700 --> 738567178 --> 738567279 --> 738567972 -->
        738568036 --> 738568218 --> 738568518 --> 738569471 --> 738570430 --> 738571084 -->
        738574267 --> 738574570 --> 738574739 --> 738576792 --> 738577181 --> 738579766 -->
        738579767
    end

    subgraph $OXO_pg-0sMtWmY23TeA3XWacwe-rtrtRPsCuuC5nI9E
        direction BT
        739921430 --> 739923253 --> 739926162
    end

    subgraph $d_C3gK6O9cHcEkL4wb1ixzgc5I2Z5YYCaaP4PkS2w8Q
        direction BT
        747450250 --> 747454137 --> 747454497 --> 747457549 --> 747458705 --> 747459083 -->
        747460418 --> 747460615 --> 747460879 --> 747460907 --> 747462327 --> 747463184 -->
        747463210 --> 747463728 --> 747471813 --> 747472497 --> 747472883 --> 747473622 -->
        747474075 --> 747475361 --> 747476403 --> 747477315 --> 747478191 --> 747478527 -->
        747479040 --> 747479405 --> 747487181 --> 747487318 --> 747490575 --> 747490649 -->
        747491096 --> 747493955 --> 747494115 --> 747495543 --> 747496126 --> 747496673 -->
        747500245 --> 747501014 --> 747502386 --> 747503796 --> 747504895 --> 747505981 -->
        747507859 --> 747508081 --> 747508938 --> 747511840 --> 747511898 --> 747511899 -->
        747512157 --> 747512397 --> 747514214 --> 747514428 --> 747515134 --> 747515135 -->
        747515473 --> 747515813 --> 747515814 --> 747517299 --> 747517409 --> 747517726 -->
        747517888 --> 747518621 --> 747522938 --> 747522939 --> 747524253 --> 747524349 -->
        747531289 --> 747532433 --> 747537361 --> 747538391 --> 747538672 --> 747538876 -->
        747539259 --> 747539318 --> 747539721 --> 747540484 --> 747540957 --> 747540985 -->
        747541741 --> 747541853 --> 747543282 --> 747543429 --> 747544548 --> 747545428 -->
        747545527 --> 747546551 --> 747546629 --> 747546765 --> 747546878 --> 747546913 -->
        747547518 --> 747548528 --> 747548884 --> 747549000
    end

    subgraph  $R4RlkxDq6my3GMq4P4kEoSgpZPcFxRm_YZGOZcf7KHk
        direction BT
        747894355 --> 747894387 --> 747894392 --> 747894435 --> 747894456 --> 747894481 -->
        747894520 --> 747894566 --> 747894582 --> 747894595 --> 747894607 --> 747894613 -->
        747894654 --> 747894696 --> 747894728 --> 747894756 --> 747894790
    end

    subgraph $HeKxcdhKnOl-ocKIXEHOzHsOvkL7B-cjTFcw9RHfeeY
        direction BT
        748155154 --> 748155201 --> 748155202 --> 748155210 --> 748156064 --> 748156371 -->
        748157020 --> 748157216 --> 748159224 --> 748161119 --> 748161155 --> 748161391 -->
        748161847 --> 748167002 --> 748167718 --> 748172461 --> 748172864 --> 748172962 -->
        748173425 --> 748174163 --> 748174259 --> 748175893 --> 748176100 --> 748177162 -->
        748178903
    end

    subgraph $1xVp4R7Wh8jldeVFvdNoNi0kV9Mp_zPDrnQ9rESs9Vw
        direction BT
        749290472 --> 749291708 --> 749291709 --> 749292485 --> 749292486 --> 749293705 -->
        749293836 --> 749293941 --> 749294167 --> 749294279 --> 749294292 --> 749294295 -->
        749294297 --> 749294324 --> 749294350
    end

    subgraph $hYIHVcmQoxy_VUIk8GpKWynuSQr_n1A-NHGdwh5Qn3U
        direction BT
        %%749290472 --> 749291708 --> 749291709 --> 749292485 --> 749292486 --> 749293705 -->
        %%749293836 --> 749293941 --> 749294167 --> 749294279 --> 749294292 --> 749294295 -->
        749294297 --> 749294496 --> 749294565 --> 749294632 --> 749294668 --> 749294692 -->
        749294733 --> 749294882
    end

    subgraph $CskkFIQnNzFMhfgaqOijXmSDNcJ1VyEzDCyYlQwN934
        direction BT
        745396521 --> 745396885 --> 745397790 --> 745398057 --> 745400234 --> 745401111 -->
        745401737 --> 745401872 --> 745404304 --> 745404475 --> 745406724 --> 745409854 -->
        745410377 --> 745410535 --> 745413344 --> 745417326 --> 745420130 --> 745420131 -->
        745421885 --> 745422806 --> 745423670 --> 745435724 --> 745438992 --> 745439103 -->
        745441593 --> 745441594 --> 745442004 --> 745442437 --> 745443725 --> 745444001 -->
        745447038 --> 745447167 --> 745447479 --> 745450805 --> 745450841 --> 745454335 -->
        776857245
    end

Initially, I thought the lack of sharing was quite strange but this is because of the state_group snapshotting feature where if an event requires more than 100 hops on state_event_edges, then a Synaspe will create a new state_group with a snapshot of all of that state. It seems like this isn't done very efficiently though. Relevant docs.

And it turns out the event is an org.matrix.dummy_event which Synapse automatically puts in the DAG to resolve outstanding forward extremities and these events aren't even shown to clients so we don't even need to waste time waiting for them to backfill. Tracked by #15632

Generally, I think this PR could bring great gains in conjunction to running some sort of state compressor over the database to get a lot more sharing. In addition to trying to fix the online state_group snapshotting logic to be smarter. I don't know how the existing state_compressors work but I imagine we could create snapshots and bucket for years -> months -> weeks -> days -> hours -> individual events and create new state_group chains which utilize these from biggest to smallest to get maximal sharing.

Dev notes

https://docs.gitlab.com/ee/development/database/understanding_explain_plans.html

WITH RECURSIVE sgs(state_group, state_group_reached) AS (
  VALUES 
    (739988088::bigint, NULL::bigint)
  UNION ALL 
  SELECT 
    prev_state_group,
    CASE
      /* Specify state_groups we have already done the work for */
      WHEN @prev_state_group = 739965897::bigint OR @prev_state_group = 739956035::bigint THEN prev_state_group
      ELSE NULL
    END AS state_group_reached
  FROM 
    state_group_edges e, sgs s 
  WHERE 
    s.state_group = e.state_group
    /* Stop when we connect up to another state_group that we already did the work for */
    AND s.state_group_reached IS NULL
)
(
    SELECT NULL, NULL, NULL, state_group_reached
    FROM sgs
    ORDER BY state_group ASC
    LIMIT 1
) UNION ALL (
    SELECT 
      DISTINCT ON (type, state_key) type, state_key, event_id, state_group
    FROM 
      state_groups_state
    /*INNER JOIN sgs USING (state_group)*/
    WHERE 
      state_group IN (
        SELECT state_group FROM sgs
      ) 
    ORDER BY type, state_key, state_group DESC
);

_compute_event_context_with_maybe_missing_prevs
compute_event_context
calculate_context_info
resolve_state_groups_for_events(prev_events)
get_state_for_groups
_get_state_for_groups
_get_state_groups_from_groups
_get_state_groups_from_groups_txn

SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO python -m twisted.trial tests.test_visibility.FilterEventsForServerTestCase.test_filtering

Pull Request Checklist

Pull request is based on the develop branch
Pull request includes a changelog file. The entry should:
- Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
- Use markdown where necessary, mostly for code blocks.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
~~Pull request includes a sign off~~
Code style is correct
(run the linters)

synapse/storage/databases/state/bg_updates.py

MadLittleMods · 2023-05-18T07:48:22Z

synapse/storage/databases/state/bg_updates.py

+
+                state_groups_we_have_already_fetched.add(group)
+
        else:


To not complicate the diff, I've held off on applying the same treatment to SQLite.

We can iterate on this in another PR or just opt for people to use Postgres in order to see performance

…rom-previous-group

MadLittleMods · 2023-05-18T21:49:39Z

synapse/storage/databases/state/bg_updates.py

+                    results[group] = partial_state_map_for_state_group
+
+                state_groups_we_have_already_fetched.add(group)
+


After

[...]

In the sample request above, simulating how the app would make the queries with the changes in this PR, we don't actually see that much benefit because it turns out that there isn't much state_group sharing amongst events. That 14s SQL query is just fetching the state at each of the prev_events of one event and only one seems to share state_groups.

[...]

Initially, I thought the lack of sharing was quite strange but this is because of the state_group snapshotting feature where if an event requires more than 100 hops on state_event_edges, then a Synaspe will create a new state_group with a snapshot of all of that state. It seems like this isn't done very efficiently though. Relevant docs.

And it turns out the event is an org.matrix.dummy_event which Synapse automatically puts in the DAG to resolve outstanding forward extremities and these events aren't even shown to clients so we don't even need to waste time waiting for them to backfill. Tracked by #15632

Generally, I think this PR could bring great gains in conjunction to running some sort of state compressor over the database to get a lot more sharing. In addition to trying to fix the online state_group snapshotting logic to be smarter. I don't know how the existing state_compressors work but I imagine we could create snapshots and bucket for years -> months -> weeks -> days -> hours -> individual events and create new state_group chains which utilize these from biggest to smallest to get maximal sharing.

-- "After" section of the PR description

This PR hasn't made as big of an impact as I thought it would for that type of request. Are we still interested in a change like this? It may work well for sequential events that we backfill.

It seems like our state_group sharing is realllly sub-par and the way that state_groups can only have a max of 100 hops puts an upper limit on how much gain this PR can give. I didn't anticipate that's how state_groups worked and thought it was one state_group per-state-change which it is until it starts doing snapshots.

Maybe it's more interesting to improve our state_group logic to be much smarter first and we could re-visit something like this. Or look into the state compressor stuff to optimize our backlog which would help for the Matrix Public Archive. I'm not sure if the current state compressors optimize for disk space or sharing or how inter-related those two goals are.

clokep · 2023-09-25T17:14:15Z

It sounds like this didn't make much of an improvement and significantly complicates the code. I'm going to close this due to that, but I did grab the nice improvements to the comments and move them to #16383.

Start of idea to re-use work of getting state for a given state_group

4676e53

MadLittleMods added A-Federation T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) labels May 17, 2023

MadLittleMods added 4 commits May 17, 2023 20:03

Re-use work from previous state_groups

6a19afc

Add changelog

02a9959

Fix lints

5704e3b

Fix empty case

3d80449

MadLittleMods commented May 18, 2023

View reviewed changes

synapse/storage/databases/state/bg_updates.py Show resolved Hide resolved

MadLittleMods added 5 commits May 18, 2023 01:59

Fix when the state_filter prevented us from returning any rows before

ab576b6

Remove debug logs

1f60fcb

Replace sketchy string manip with placeholders

333fc51

More comments

17aeee7

Fix lint

7abb745

MadLittleMods commented May 18, 2023

View reviewed changes

MadLittleMods added 2 commits May 18, 2023 02:49

Fix typo

79e6d61

Merge branch 'develop' into madlittlemods/re-use-work-to-grab-state-f…

4b2a5fe

…rom-previous-group

MadLittleMods mentioned this pull request May 18, 2023

Don't waste time processing backfill events that we never show to the client #15632

Open

MadLittleMods commented May 18, 2023

View reviewed changes

MadLittleMods added the A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db label May 18, 2023

MadLittleMods changed the title ~~Re-use work of getting state for a given state_group~~ Re-use work of getting state for a given state_group (_get_state_groups_from_groups) May 19, 2023

MadLittleMods mentioned this pull request May 23, 2023

Optimize /backfill to respond faster #15653

Open

MadLittleMods mentioned this pull request Jul 6, 2023

Investigate why /messages is slow #13356

Open

clokep mentioned this pull request Sep 25, 2023

Improve comments in StateGroupBackgroundUpdateStore. #16383

Merged

clokep closed this Sep 25, 2023

This was referenced Dec 21, 2023

Don't waste time processing backfill events that we never show to the client element-hq/synapse#15632

Open

Optimize /backfill to respond faster element-hq/synapse#15653

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Re-use work of getting state for a given `state_group` (`_get_state_groups_from_groups`)#15617

Re-use work of getting state for a given `state_group` (`_get_state_groups_from_groups`)#15617
MadLittleMods wants to merge 12 commits intodevelopfrom
madlittlemods/re-use-work-to-grab-state-from-previous-group

MadLittleMods commented May 17, 2023 •

edited

Loading

Uh oh!

Uh oh!

MadLittleMods May 18, 2023

Uh oh!

MadLittleMods May 18, 2023 •

edited

Loading

Uh oh!

clokep commented Sep 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		results[group] = partial_state_map_for_state_group

		state_groups_we_have_already_fetched.add(group)

Uh oh!

Conversation

MadLittleMods commented May 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background (explaining the situation before this PR)

Before

After

Dev notes

Pull Request Checklist

Uh oh!

Uh oh!

MadLittleMods May 18, 2023

Choose a reason for hiding this comment

Uh oh!

MadLittleMods May 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

After

Uh oh!

clokep commented Sep 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MadLittleMods commented May 17, 2023 •

edited

Loading

MadLittleMods May 18, 2023 •

edited

Loading