
fix: demand control actual costs should consider each subgraph fetch#8827

Merged
carodewig merged 24 commits into dev from caroline/demand-control-actuals
Feb 3, 2026

Conversation

@carodewig
Contributor

@carodewig carodewig commented Jan 22, 2026

The demand control feature estimates query costs by summing together the cost of each subgraph operation. This allows it to capture any intermediate work that must be completed to return a complete response.

Right now the actual query cost computation only looks at the final response shape; it does not include any of the intermediate work done in its total. I believe this is a bug; it's not meaningful to compare estimated and actual costs unless they're computed the same way.

This PR fixes that behavior to compute the actual query cost as the sum of all subgraph response costs. It:

  • Computes the cost of each subgraph response at the subgraph_response plugin stage, and sums the results in the execution_response stage
  • Slightly modifies the ResponseCostCalculator::score_response_field function to support calculating the cost of an _entities query
  • Adds a new configuration option demand_control.strategy.static_estimated.actual_cost_mode to disable the new cost calculation behavior; the default value by_subgraph is the new behavior, the other value by_response_shape reverts to the old behavior
  • Adds numerous integration tests to capture estimated costs and both modes of actual costs
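As a sketch of the summed-actuals approach described above (the names `SubgraphCost` and `total_actual_cost` are illustrative, not the router's actual types):

```rust
/// Illustrative cost record for one subgraph fetch, captured at the
/// subgraph_response plugin stage.
struct SubgraphCost {
    subgraph_name: String,
    cost: f64,
}

/// At the execution_response stage, the actual cost is the sum of every
/// subgraph response cost, rather than the cost of the final response shape.
fn total_actual_cost(costs: &[SubgraphCost]) -> f64 {
    costs.iter().map(|c| c.cost).sum()
}
```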

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added³ and documented
  • Tests added and passing⁴
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@apollo-librarian

apollo-librarian bot commented Jan 22, 2026

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

0 new, 1 changed, 0 removed
* graphos/routing/(latest)/security/demand-control.mdx

Build ID: 36b3c7ef67f7e6fb727e58de
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/36b3c7ef67f7e6fb727e58de


@carodewig carodewig marked this pull request as ready for review January 22, 2026 21:07
@carodewig carodewig requested a review from a team January 22, 2026 21:07
@carodewig carodewig requested a review from a team as a code owner January 22, 2026 21:07
Contributor

@morriswchris morriswchris left a comment

I did an initial pass on the logic and have a few questions, mainly for my own comprehension 🙂

#[default]
BySubgraph,

#[deprecated(since = "TBD", note = "use `BySubgraph` instead")]
Contributor

Should since have an actual value? (I'm not sure the pattern we use for deprecated in this repo)

Contributor Author

The only other uses of #[deprecated] in this repo either don't specify a since, or use since = TBD, so I think we're fine from a 'router pattern' standpoint!

I'm happy to specify 3.0 if we're ready to commit to removing this option then; I just used TBD since I wasn't sure if we wanted to commit to it.

// We need to have a field definition for later processing, unless the query is an
// `_entities` query. If the field should be there and isn't, return now.
let is_entities_query = parent_ty == "Query" && field.name == "_entities";
if definition.is_none() && !is_entities_query {
Contributor

I'm unsure if this changes any behaviour, but I wonder if we could end up in a situation where:

  • we don't have a definition
  • is_entities_query is true

and if that should continue with calculating cost? Reading the original code, it would seem like we would not calculate cost of entity queries (though I don't know if it's possible to end up in the state above)

Contributor Author

Computing the cost in that scenario is exactly the situation we needed to support :)

Query._entities is a special operation which doesn't have a definition in the schema. It is used to search for objects by 'primary key' and can have any return type, based on the selections within the query. It's not something a user would call directly and is only used in federated queries.

It was not supported in the original code because this function was used at the execution stage, which has the final GraphQL response (and users couldn't call _entities). But now, this function can be called at the subgraph stage, which means it needs to handle partial GraphQL responses, including _entities queries.

Fortunately, once you go a few levels into an _entities query, you start finding real types which have definitions. So since this function is (highly) recursive, the is_entities_query escape hatch lets us recurse down into the real types and sum up the cost of those.

Let me know if that makes sense or you have other questions - this took me a while to piece together, and I'm not sure I explained it well!
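To illustrate the escape hatch with a toy model (the enum below is made up; the real ResponseCostCalculator walks GraphQL responses, not this type): a field without a definition is normally a dead end, but the synthetic _entities field contributes no cost of its own and simply lets the recursion continue down into real, defined types:

```rust
/// Toy model of a response field for illustrating the recursion.
enum Field {
    /// A field with a schema definition: it has its own cost plus children.
    Defined { cost: f64, children: Vec<Field> },
    /// The synthetic `Query._entities` field: no schema definition and no
    /// cost of its own, but its children are real types with definitions.
    Entities { children: Vec<Field> },
}

fn score_field(field: &Field) -> f64 {
    match field {
        Field::Defined { cost, children } => {
            cost + children.iter().map(score_field).sum::<f64>()
        }
        // Instead of bailing out on the missing definition, recurse into the
        // selections and sum the cost of the real types underneath.
        Field::Entities { children } => children.iter().map(score_field).sum(),
    }
}
```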

static_estimated:
max: 10
list_size: 10
actual_cost_mode: by_subgraph # the default value
Contributor

❤️ for making this a default config, but allowing users to revert.

Contributor

should we make legacy default for now just to have the same behavior for folks? worried that this might break dashboards for folks (or at least make it look like there's some wobble in the router version-over-version)

Contributor Author

@aaronArinder The reason I made the new behavior the default is that the current approach pretty severely undercounts actuals; the current actuals aren't really meaningful. My hope is that we can make the better behavior the default, as long as the change is called out in the changelog and we provide the option to revert. But if you feel strongly otherwise, I could be convinced to swap the defaults!

Contributor

I feel medium strong but not super strong (like a... 6 out of 10?), which isn't coming from the behavior (I think by_subgraph should be the default) but from what our customers might expect--if this makes their graphs or monitors wonky, it'd give them a reason to distrust our releases

I don't really know if that's at stake though because I don't know how folks use this in the wild; maybe a changelog entry is enough for folks? It should be, but I suspect it sometimes gets overlooked

overall, dealer's choice! raising it as food for thought, not as a blocker

Contributor

@aaronArinder aaronArinder left a comment

lgtm! only real question is around whether we can clobber costs

self.cost += score;
}
} else {
tracing::debug!(
Contributor

not sure folks will see the debug, but if they're using query cost to determine anything about the safety of their systems, it might be worth elevating to warn!()

not a blocker, I don't really know the context or actual use of this, but something to consider


.insert(subgraph_name.clone(), demand_controlled_subgraph_schema);
}

init.config.strategy.validate()?;
Contributor

this is the line I squinted at with worry--looks like a new way for insert_cost_result to fail, but really, it won't fail here because this fn doesn't err

Contributor Author

@carodewig carodewig Jan 26, 2026

Yeah, the intent is that it will fail if the config is invalid - it'll cause the plugin creation to fail thereby stopping the router / hot reload. This is something that is done in other plugins, but wasn't previously a part of demand control

Contributor

oh interesting, so the next pr that introduces a failure mode to validate() could blow up the demand control plugin if validation doesn't pass

that's probably fine, I think? I worry that if someone doesn't have hot-reload on (I hate that I'm saying this), they might update themselves into a non-running router and potentially cause a production incident; this is probably more for the next pr, but what do you think about just logging out a warning for a bit until folks have enough time to get their stuff together? (might not matter if validation is easy/straightforward/whatever, but I'm worried that we'll cause a production outage on accident)

Contributor Author

The validation is really straightforward - since the validation is based only on the configuration, it's sufficient to try running the config in a dev/staging environment to see if it'll pass. I don't have a lot of sympathy if you push a new config into prod without trying it in some form of testing environment first?
And to be clear, all the new config options are optional (both in this PR and the next) - so this won't cause errors in existing configs.
IDK - if you still disagree I can change it, but I feel like erroring on startup with a bad config is better than letting bad configs exist quietly - many users might not look at the logs to know to change something, but will react to the router not starting 🤷
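The fail-fast pattern being debated here, sketched with hypothetical names (StrategyConfig and the rules checked below are illustrative; the real plugin validates its own options): returning Err at plugin init aborts startup or hot reload, rather than letting a bad config run quietly.

```rust
/// Hypothetical slice of the demand control strategy configuration.
struct StrategyConfig {
    max: Option<f64>,
    list_size: Option<u32>,
}

/// Validate at plugin creation: an invalid config stops the router's
/// startup / hot reload. Every option is optional, so existing configs
/// continue to validate unchanged.
fn validate(config: &StrategyConfig) -> Result<(), String> {
    if let Some(max) = config.max {
        if max <= 0.0 {
            return Err("demand control: `max` must be positive".to_string());
        }
    }
    if let Some(list_size) = config.list_size {
        if list_size == 0 {
            return Err("demand control: `list_size` must be at least 1".to_string());
        }
    }
    Ok(())
}
```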

Contributor

I don't really understand what's being validated but it sounds simple enough to maybe not be a worry? I think sometimes folks have different configs for different environments (even worse, sometimes they're templated and maybe not totally maintained) and I'm really wary of making something errorable that wasn't in the past

but, it sounds simple and you understand it best; so, you're in the best position to make the decision. I feel like you understand my view and I'm happy to agree with whatever you think is the right move

Contributor

@mabuyo mabuyo left a comment

Docs look good to me, thank you! Left a non-blocking comment

@justindoherty

First off, thanks for the work on this! My org has recently started using demand control for cost limiting, and we've also noticed that estimated and actual costs differ substantially.

We have found that a major source of this discrepancy comes from the _entities queries; they seem to assume the query size matches static_estimated.list_size (we currently default to 20) regardless of how many entities are actually requested. This makes it hard to tune: too low and we let expensive queries through; too high and we over-estimate, blocking valid queries. Sometimes the delta is in the hundreds of thousands.

That said, I wanted to offer a perspective from a public-facing supergraph. Since our consumers have no visibility into our subgraph boundaries (and shouldn't know it's a supergraph), we were thinking it would be preferable for the cost logic to behave as if it were a monolith, as it does today with the actual cost, but not the estimated cost. It is easier for consumers to reason about a cost that ignores the hidden overhead of federation.

Because of this, could I suggest avoiding the name legacy for the existing mode? It implies that the method is outdated or deprecated, whereas for public graphs, it is possibly the preferred behavior. Perhaps by_supergraph or monolith would be more descriptive?

One question: Have you encountered the specific issue with the _entities list size estimation yourself? If that weren't causing such a huge delta, I might actually be able to adopt your new by_subgraph method, as I’d prefer to reflect the real cost if I could make the estimation more accurate.

@carodewig
Contributor Author

Thank you for your comment, @justindoherty! I really appreciate you sharing your experience with this feature and have a few follow-up questions for you.


We have found that a major source of [the difference between estimated and actual costs] comes from the _entities queries; they seem to assume the query size matches static_estimated.list_size regardless of how many entities are actually requested. Have you encountered the specific issue with the _entities list size estimation yourself?

I only learned about this yesterday; it is indeed using the configured list size, regardless of how many entities are requested. That's definitely a bug to fix, although it's out of scope of this PR since it requires changes to the query planner output.
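To make the gap concrete (all numbers below are made up for illustration): the estimator charges list_size entities per _entities fetch, while the actual work scales with the number of entities really requested.

```rust
/// Estimated cost of one `_entities` fetch: the planner-side estimate
/// assumes the configured `list_size` regardless of the actual request.
fn estimated_entities_cost(list_size: u32, per_entity_cost: f64) -> f64 {
    list_size as f64 * per_entity_cost
}

/// The actual cost scales with the number of entities really requested.
fn actual_entities_cost(entities_requested: u32, per_entity_cost: f64) -> f64 {
    entities_requested as f64 * per_entity_cost
}
```

With list_size: 20 but only 3 entities requested, a single fetch is already over-estimated by more than 6x; across a fanned-out query plan the deltas compound.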


That said, I wanted to offer a perspective from a public-facing supergraph. Since our consumers have no visibility into our subgraph boundaries (and shouldn't know it's a supergraph), we were thinking it would be preferable if the cost logic to behave as if it were a monolith, as it does today with the actual cost, but not the estimated cost. It is easier for consumers to reason about a cost that ignores the hidden overhead of federation.

I'd love to hear more from you on this, because it's almost the opposite of how I was thinking about the demand control feature!

I see demand control as a feature for platform operators. They want to protect their infrastructure (the router, subgraphs) from queries which could overwhelm it. If a client writes a query that looks simple, but actually requires significant fan-out, fetches, etc. in a federated system, I think it's reasonable for an operator to reject that query due to the complexity which is not visible to / expected by the client.

The router does have an operation limit feature which rejects operations based on operation depth, height, etc. Does that more closely approximate the 'monolith-like' behavior you mentioned?
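For reference, those operation limits live under the router's limits configuration; a sketch with arbitrary example values (see the router's operation limits docs for the full set of options):

```yaml
limits:
  max_depth: 15        # reject operations nested deeper than this
  max_height: 40       # cap on the number of unique fields
  max_aliases: 20      # cap on field aliases
  max_root_fields: 10  # cap on root-level fields
```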


If [the _entities estimation issue] wasn't causing such a huge delta, I might actually be able to adopt your new by_subgraph method, as I’d prefer to reflect the real cost if I could make the estimation more accurate.

I'm curious why you'd prefer to stick with the existing actuals computation. From my perspective, the existing approach both:

  1. Does not reflect how the estimates are computed (by subgraph), so comparing estimates to actuals isn't meaningful
  2. Does not reflect how much work was done to obtain the result

I recognize that the _entities issue does cause a massive delta - but I'm not sure how that would apply to the actuals here.


Because of this, could I suggest avoiding the name legacy for the existing mode? It implies that the method is outdated or deprecated, whereas for public graphs, it is possibly the preferred behavior. Perhaps by_supergraph or monolith would be more descriptive?

I see what you mean and will make this change - I think I'll go with response_shape to try to suggest that the cost is determined only by the shape of the completed response. (I'm concerned that monolith implies a style we might not be upholding)

@justindoherty

Thanks for the reply @carodewig :)

I'd love to hear more from you on this, because it's almost the opposite of how I was thinking about the demand control feature!

I see demand control as a feature for platform operators. They want to protect their infrastructure (the router, subgraphs) from queries which could overwhelm it. If a client writes a query that looks simple, but actually requires significant fan-out, fetches, etc. in a federated system, I think it's reasonable for an operator to reject that query due to the complexity which is not visible to / expected by the client.

I totally get this point of view, and I may be better off adopting it as my own too. I'm however just a little worried that a user will encounter the error and possibly not really know how to fix their query to be compliant short of trial and error. In the response_shape mode, we can at least attempt to explain how cost is calculated to the consumer and perhaps provide the field costs in the descriptions so they are aware of which fields are more expensive and can perform the calculation themselves if they wished. Really though, our journey is just starting and I'm just making guesses on how I think consumers will react.

The router does have an operation limit feature which rejects operations based on operation depth, height, etc. Does that more closely approximate the 'monolith-like' behavior you mentioned?

These were where we started and they are definitely useful. Our need for more robust cost control primarily comes from a few fields being proportionally very expensive to calculate. This is fine in low quantities but if a user were to request this field in a large list such as a connection from a search query, it would potentially overwhelm our subgraph.

I'm curious why you'd prefer to stick with the existing actuals computation. From my perspective, the existing approach both:

  1. Does not reflect how the estimates are computed (by subgraph), so comparing estimates to actuals isn't meaningful
  2. Does not reflect how much work was done to obtain the result

I recognize that the _entities issue does cause a massive delta - but I'm not sure how that would apply to the actuals here.

  1. I agree on this; primarily, I'd stick with the current response_shape actuals if the estimate were calculated the same way.
  2. I agree on this point as well, but up to now I haven't been terribly concerned about the overhead of the fan-out itself, even though it does incur network costs and requires the overall system to look up the same entity multiple times; the simplicity in explaining the cost to the consumer is worth more to me.

I'm mainly of the mindset that I want predictable cost calculations that a consumer can understand and work with. Having the cost change based on how a query plan pans out from similar but different queries adds another layer of complexity to the explanation.

I see what you mean and will make this change - I think I'll go with response_shape to try to suggest that the cost is determined only by the shape of the completed response. (I'm concerned that monolith implies a style we might not be upholding)

Thanks, much appreciated! One thought on this, though definitely not a deal breaker, because all I wanted was to avoid the legacy term: if the estimated cost mode were made configurable as you've done here, what would we call the configuration options? Would we use the same names, by_subgraph and response_shape? response_shape feels a bit odd for the estimation mode, but having a mismatch between the estimation and actuals mode names also feels weird. Last thought: if response_shape were brought into the estimation mode, would it be a separate config, or a consolidated property like cost_mode with options by_subgraph and response_shape that would apply to both estimation and actuals? Again, I could definitely live with the names chosen here.

@carodewig
Contributor Author

Thanks so much for sharing your perspective, @justindoherty!

I can definitely see where query rejection by cost could be confusing for clients, and how the response_shape mode would be easier for them to understand. If we were to add an option to use response_shape as an estimation strategy, it might work as a variable, or it might be better to make it an entirely separate strategy (ie static_estimated_$alternative.max would exist in the configuration).

I'm going to think a bit more about the name - I agree that it doesn't fit well on the estimation side, but I'm struggling to come up with good alternatives 😅

@carodewig
Contributor Author

NB: ended up going with by_response_shape as that will still apply on the estimation side, if we were to add it as a feature - it's just a predicted response shape based on the operation, rather than an actual.

@carodewig carodewig merged commit 4bdd2ac into dev Feb 3, 2026
15 checks passed
@carodewig carodewig deleted the caroline/demand-control-actuals branch February 3, 2026 23:13
@abernix abernix mentioned this pull request Feb 24, 2026
