
fix: demand control actual costs should consider each subgraph fetch#8827

Merged
carodewig merged 24 commits into dev from caroline/demand-control-actuals
Feb 3, 2026

Conversation

@carodewig
Contributor

@carodewig carodewig commented Jan 22, 2026

The demand control feature estimates query costs by summing together the cost of each subgraph operation. This allows it to capture any intermediate work that must be completed to return a complete response.

Right now the actual query cost computation only looks at the final response shape; it does not include any of the intermediate work done in its total. I believe this is a bug; it's not meaningful to compare estimated and actual costs unless they're computed the same way.

This PR fixes that behavior to compute the actual query cost as the sum of all subgraph response costs. It:

  • Computes the cost of each subgraph response at the subgraph_response plugin stage, and sums the results in the execution_response stage
  • Slightly modifies the ResponseCostCalculator::score_response_field function to support calculating the cost of an _entities query
  • Adds a new configuration option demand_control.strategy.static_estimated.actual_cost_mode to disable the new cost calculation behavior; the default value by_subgraph is the new behavior, the other value by_response_shape reverts to the old behavior
  • Adds numerous integration tests to capture estimated costs and both modes of actual costs
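As a sketch of the summed-actuals approach described above (the names `SubgraphCost` and `total_actual_cost` are illustrative, not the router's actual types):

```rust
/// Illustrative cost record for one subgraph fetch, captured at the
/// subgraph_response plugin stage.
struct SubgraphCost {
    subgraph_name: String,
    cost: f64,
}

/// At the execution_response stage, the actual cost is the sum of every
/// subgraph response cost, rather than the cost of the final response shape.
fn total_actual_cost(costs: &[SubgraphCost]) -> f64 {
    costs.iter().map(|c| c.cost).sum()
}
```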

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added³ and documented
  • Tests added and passing⁴
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@apollo-librarian

apollo-librarian bot commented Jan 22, 2026

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

0 new, 1 changed, 0 removed
* graphos/routing/(latest)/security/demand-control.mdx

Build ID: 36b3c7ef67f7e6fb727e58de
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/36b3c7ef67f7e6fb727e58de


@carodewig carodewig marked this pull request as ready for review January 22, 2026 21:07
@carodewig carodewig requested a review from a team January 22, 2026 21:07
@carodewig carodewig requested a review from a team as a code owner January 22, 2026 21:07
Contributor

@morriswchris morriswchris left a comment

I did an initial pass on the logic and have a few questions, mainly for my own comprehension 🙂

#[default]
BySubgraph,

#[deprecated(since = "TBD", note = "use `BySubgraph` instead")]
Contributor

Should since have an actual value? (I'm not sure the pattern we use for deprecated in this repo)

Contributor Author

The only other uses of #[deprecated] in this repo either don't specify a since, or use since = TBD, so I think we're fine from a 'router pattern' standpoint!

I'm happy to specify 3.0 if we're ready to commit to removing this option then; I just used TBD since I wasn't sure if we wanted to commit to it.

// We need to have a field definition for later processing, unless the query is an
// `_entities` query. If the field should be there and isn't, return now.
let is_entities_query = parent_ty == "Query" && field.name == "_entities";
if definition.is_none() && !is_entities_query {
Contributor

I'm unsure if this changes any behaviour, but I wonder if we could end up in a situation where:

  • we don't have a definition
  • is_entities_query is true

and if that should continue with calculating cost? Reading the original code, it would seem like we would not calculate cost of entity queries (though I don't know if it's possible to end up in the state above)

Contributor Author

Computing the cost in that scenario is exactly the situation we needed to support :)

Query._entities is a special operation which doesn't have a definition in the schema. It is used to search for objects by 'primary key' and can have any return type, based on the selections within the query. It's not something a user would call directly and is only used in federated queries.

It was not supported in the original code because this function was used at the execution stage, which has the final GraphQL response (and users couldn't call _entities). But now, this function can be called at the subgraph stage, which means it needs to handle partial GraphQL responses, including _entities queries.

Fortunately, once you go a few levels into an _entities query, you start finding real types which have definitions. So since this function is (highly) recursive, the is_entities_query escape hatch lets us recurse down into the real types and sum up the cost of those.

Let me know if that makes sense or you have other questions - this took me a while to piece together, and I'm not sure I explained it well!
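To illustrate the escape hatch with a toy model (the enum below is made up; the real ResponseCostCalculator walks GraphQL responses, not this type): a field without a definition is normally a dead end, but the synthetic _entities field contributes no cost of its own and simply lets the recursion continue down into real, defined types:

```rust
/// Toy model of a response field for illustrating the recursion.
enum Field {
    /// A field with a schema definition: it has its own cost plus children.
    Defined { cost: f64, children: Vec<Field> },
    /// The synthetic `Query._entities` field: no schema definition and no
    /// cost of its own, but its children are real types with definitions.
    Entities { children: Vec<Field> },
}

fn score_field(field: &Field) -> f64 {
    match field {
        Field::Defined { cost, children } => {
            cost + children.iter().map(score_field).sum::<f64>()
        }
        // Instead of bailing out on the missing definition, recurse into the
        // selections and sum the cost of the real types underneath.
        Field::Entities { children } => children.iter().map(score_field).sum(),
    }
}
```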

static_estimated:
max: 10
list_size: 10
actual_cost_mode: by_subgraph # the default value
Contributor

❤️ for making this a default config, but allowing users to revert.

Contributor

should we make legacy default for now just to have the same behavior for folks? worried that this might break dashboards for folks (or at least make it look like there's some wobble in the router version-over-version)

Contributor Author

@aaronArinder The reason I made the new behavior the default is that the current approach pretty severely undercounts actuals; the current actuals aren't really meaningful. My hope is that we can make the better behavior the default, as long as the change is called out in the changelog and we provide the option to revert. But if you feel strongly otherwise, I could be convinced to swap the defaults!

Contributor

I feel medium strong but not super strong (like a... 6 out of 10?), which isn't coming from the behavior (I think by_subgraph should be the default) but from what our customers might expect--if this makes their graphs or monitors wonky, it'd give them a reason to distrust our releases

I don't really know if that's at stake though because I don't know how folks use this in the wild; maybe a changelog entry is enough for folks? It should be, but I suspect it sometimes gets overlooked

overall, dealer's choice! raising it as food for thought, not as a blocker

Contributor

@aaronArinder aaronArinder left a comment

lgtm! only real question is around whether we can clobber costs

self.cost += score;
}
} else {
tracing::debug!(
Contributor

not sure folks will see the debug, but if they're using query cost to determine anything about the safety of their systems, it might be worth elevating to warn!()

not a blocker, I don't really know the context or actual use of this, but something to consider


.insert(subgraph_name.clone(), demand_controlled_subgraph_schema);
}

init.config.strategy.validate()?;
Contributor

this is the line I squinted at with worry--looks like a new way for insert_cost_result to fail, but really, it won't fail here because this fn doesn't err

Contributor Author

@carodewig carodewig Jan 26, 2026

Yeah, the intent is that it will fail if the config is invalid - it'll cause the plugin creation to fail thereby stopping the router / hot reload. This is something that is done in other plugins, but wasn't previously a part of demand control

Contributor

oh interesting, so the next pr that introduces a failure mode to validate() could blow up the demand control plugin if validation doesn't pass

that's probably fine, I think? I worry that if someone doesn't have hot-reload on (I hate that I'm saying this), they might update themselves into a non-running router and potentially cause a production incident; this is probably more for the next pr, but what do you think about just logging out a warning for a bit until folks have enough time to get their stuff together? (might not matter if validation is easy/straightforward/whatever, but I'm worried that we'll cause a production outage on accident)

Contributor Author

The validation is really straightforward - since the validation is based only on the configuration, it's sufficient to try running the config in a dev/staging environment to see if it'll pass. I don't have a lot of sympathy if you push a new config into prod without trying it in some form of testing environment first?
And to be clear, all the new config options are optional (both in this PR and the next) - so this won't cause errors in existing configs.
IDK - if you still disagree I can change it, but I feel like erroring on startup with a bad config is better than letting bad configs exist quietly - many users might not look at the logs to know to change something, but will react to the router not starting 🤷
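The fail-fast pattern being debated here, sketched with hypothetical names (StrategyConfig and the rules checked below are illustrative; the real plugin validates its own options): returning Err at plugin init aborts startup or hot reload, rather than letting a bad config run quietly.

```rust
/// Hypothetical slice of the demand control strategy configuration.
struct StrategyConfig {
    max: Option<f64>,
    list_size: Option<u32>,
}

/// Validate at plugin creation: an invalid config stops the router's
/// startup / hot reload. Every option is optional, so existing configs
/// continue to validate unchanged.
fn validate(config: &StrategyConfig) -> Result<(), String> {
    if let Some(max) = config.max {
        if max <= 0.0 {
            return Err("demand control: `max` must be positive".to_string());
        }
    }
    if let Some(list_size) = config.list_size {
        if list_size == 0 {
            return Err("demand control: `list_size` must be at least 1".to_string());
        }
    }
    Ok(())
}
```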

Contributor

I don't really understand what's being validated but it sounds simple enough to maybe not be a worry? I think sometimes folks have different configs for different environments (even worse, sometimes they're templated and maybe not totally maintained) and I'm really wary of making something errorable that wasn't in the past

but, it sounds simple and you understand it best; so, you're in the best position to make the decision. I feel like you understand my view and I'm happy to agree with whatever you think is the right move

Contributor

@mabuyo mabuyo left a comment

Docs look good to me, thank you! Left a non-blocking comment

@justindoherty

First off, thanks for the work on this! My org has recently started using demand control for cost limiting, and we've also noticed that estimated and actual costs differ substantially.

We have found that a major source of this discrepancy comes from the _entities queries; they seem to assume the query size matches static_estimated.list_size (we currently default to 20) regardless of how many entities are actually requested. This makes it hard to tune: too low and we let expensive queries through; too high and we over-estimate, blocking valid queries. Sometimes the delta is in the hundreds of thousands.

That said, I wanted to offer a perspective from a public-facing supergraph. Since our consumers have no visibility into our subgraph boundaries (and shouldn't know it's a supergraph), we were thinking it would be preferable for the cost logic to behave as if it were a monolith, as it does today with the actual cost, but not the estimated cost. It is easier for consumers to reason about a cost that ignores the hidden overhead of federation.

Because of this, could I suggest avoiding the name legacy for the existing mode? It implies that the method is outdated or deprecated, whereas for public graphs, it is possibly the preferred behavior. Perhaps by_supergraph or monolith would be more descriptive?

One question: Have you encountered the specific issue with the _entities list size estimation yourself? If that weren't causing such a huge delta, I might actually be able to adopt your new by_subgraph method, as I’d prefer to reflect the real cost if I could make the estimation more accurate.

@carodewig
Contributor Author

Thank you for your comment, @justindoherty! I really appreciate you sharing your experience with this feature and have a few follow-up questions for you.


We have found that a major source of [the difference between estimated and actual costs] comes from the _entities queries; they seem to assume the query size matches static_estimated.list_size regardless of how many entities are actually requested. Have you encountered the specific issue with the _entities list size estimation yourself?

I only learned about this yesterday; it is indeed using the configured list size, regardless of how many entities are requested. That's definitely a bug to fix, although it's out of scope of this PR since it requires changes to the query planner output.
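To make the gap concrete (all numbers below are made up for illustration): the estimator charges list_size entities per _entities fetch, while the actual work scales with the number of entities really requested.

```rust
/// Estimated cost of one `_entities` fetch: the planner-side estimate
/// assumes the configured `list_size` regardless of the actual request.
fn estimated_entities_cost(list_size: u32, per_entity_cost: f64) -> f64 {
    list_size as f64 * per_entity_cost
}

/// The actual cost scales with the number of entities really requested.
fn actual_entities_cost(entities_requested: u32, per_entity_cost: f64) -> f64 {
    entities_requested as f64 * per_entity_cost
}
```

With list_size: 20 but only 3 entities requested, a single fetch is already over-estimated by more than 6x; across a fanned-out query plan the deltas compound.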


That said, I wanted to offer a perspective from a public-facing supergraph. Since our consumers have no visibility into our subgraph boundaries (and shouldn't know it's a supergraph), we were thinking it would be preferable if the cost logic to behave as if it were a monolith, as it does today with the actual cost, but not the estimated cost. It is easier for consumers to reason about a cost that ignores the hidden overhead of federation.

I'd love to hear more from you on this, because it's almost the opposite of how I was thinking about the demand control feature!

I see demand control as a feature for platform operators. They want to protect their infrastructure (the router, subgraphs) from queries which could overwhelm it. If a client writes a query that looks simple, but actually requires significant fan-out, fetches, etc. in a federated system, I think it's reasonable for an operator to reject that query due to the complexity which is not visible to / expected by the client.

The router does have an operation limit feature which rejects operations based on operation depth, height, etc. Does that more closely approximate the 'monolith-like' behavior you mentioned?
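For reference, those operation limits live under the router's limits configuration; a sketch with arbitrary example values (see the router's operation limits docs for the full set of options):

```yaml
limits:
  max_depth: 15        # reject operations nested deeper than this
  max_height: 40       # cap on the number of unique fields
  max_aliases: 20      # cap on field aliases
  max_root_fields: 10  # cap on root-level fields
```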


If [the _entities estimation issue] wasn't causing such a huge delta, I might actually be able to adopt your new by_subgraph method, as I’d prefer to reflect the real cost if I could make the estimation more accurate.

I'm curious why you'd prefer to stick with the existing actuals computation. From my perspective, the existing approach both:

  1. Does not reflect how the estimates are computed (by subgraph), so comparing estimates to actuals isn't meaningful
  2. Does not reflect how much work was done to obtain the result

I recognize that the _entities issue does cause a massive delta - but I'm not sure how that would apply to the actuals here.


Because of this, could I suggest avoiding the name legacy for the existing mode? It implies that the method is outdated or deprecated, whereas for public graphs, it is possibly the preferred behavior. Perhaps by_supergraph or monolith would be more descriptive?

I see what you mean and will make this change - I think I'll go with response_shape to try to suggest that the cost is determined only by the shape of the completed response. (I'm concerned that monolith implies a style we might not be upholding)

@justindoherty

Thanks for the reply @carodewig :)

I'd love to hear more from you on this, because it's almost the opposite of how I was thinking about the demand control feature!

I see demand control as a feature for platform operators. They want to protect their infrastructure (the router, subgraphs) from queries which could overwhelm it. If a client writes a query that looks simple, but actually requires significant fan-out, fetches, etc. in a federated system, I think it's reasonable for an operator to reject that query due to the complexity which is not visible to / expected by the client.

I totally get this point of view, and I may be better off adopting it as my own too. I'm however just a little worried that a user will encounter the error and possibly not really know how to fix their query to be compliant short of trial and error. In the response_shape mode, we can at least attempt to explain how cost is calculated to the consumer and perhaps provide the field costs in the descriptions so they are aware of which fields are more expensive and can perform the calculation themselves if they wished. Really though, our journey is just starting and I'm just making guesses on how I think consumers will react.

The router does have an operation limit feature which rejects operations based on operation depth, height, etc. Does that more closely approximate the 'monolith-like' behavior you mentioned?

These were where we started and they are definitely useful. Our need for more robust cost control primarily comes from a few fields being proportionally very expensive to calculate. This is fine in low quantities but if a user were to request this field in a large list such as a connection from a search query, it would potentially overwhelm our subgraph.

I'm curious why you'd prefer to stick with the existing actuals computation. From my perspective, the existing approach both:

  1. Does not reflect how the estimates are computed (by subgraph), so comparing estimates to actuals isn't meaningful
  2. Does not reflect how much work was done to obtain the result

I recognize that the _entities issue does cause a massive delta - but I'm not sure how that would apply to the actuals here.

  1. I agree on this; primarily, I'd stick with the current response_shape actuals if the estimate were calculated the same way.
  2. I agree on this point as well, but up to now I haven't been terribly concerned about the overhead of the fan-out itself, even though it does incur network costs and requires the overall system to look up the same entity multiple times; the simplicity in explaining the cost to the consumer is worth more to me.

I'm mainly of the mindset that I want predictable cost calculations that a consumer can understand and work with. Having the cost change based on how a query plan pans out from similar but different queries adds another layer of complexity to the explanation.

I see what you mean and will make this change - I think I'll go with response_shape to try to suggest that the cost is determined only by the shape of the completed response. (I'm concerned that monolith implies a style we might not be upholding)

Thanks, much appreciated! One thought on this, though definitely not a deal breaker, because all I wanted was to avoid the legacy term: if the estimated cost mode were made configurable as you've done here, what would we call the configuration options? Would we use the same names, by_subgraph and response_shape? response_shape feels a bit odd for the estimation mode, but having a mismatch between the estimation and actuals mode names also feels weird. Last thought: if response_shape were brought into the estimation mode, would it be a separate config, or a consolidated property like cost_mode with options by_subgraph and response_shape that would apply to both estimation and actuals? Again, I could definitely live with the names chosen here.

@carodewig
Contributor Author

Thanks so much for sharing your perspective, @justindoherty!

I can definitely see where query rejection by cost could be confusing for clients, and how the response_shape mode would be easier for them to understand. If we were to add an option to use response_shape as an estimation strategy, it might work as a variable, or it might be better to make it an entirely separate strategy (ie static_estimated_$alternative.max would exist in the configuration).

I'm going to think a bit more about the name - I agree that it doesn't fit well on the estimation side, but I'm struggling to come up with good alternatives 😅

@carodewig
Contributor Author

NB: ended up going with by_response_shape as that will still apply on the estimation side, if we were to add it as a feature - it's just a predicted response shape based on the operation, rather than an actual.

@carodewig carodewig merged commit 4bdd2ac into dev Feb 3, 2026
15 checks passed
@carodewig carodewig deleted the caroline/demand-control-actuals branch February 3, 2026 23:13
@abernix abernix mentioned this pull request Feb 24, 2026
