Skip to content
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
24b228e
feat: add mode to support toggling between cost calculation modes
carodewig Jan 21, 2026
4808218
feat: compute cost on each subgraph response when enabled
carodewig Jan 21, 2026
007da11
fix: support cost calculation for `_entities` query
carodewig Jan 21, 2026
bea0e31
test: update snapshots for new actuals
carodewig Jan 21, 2026
23d4e3c
test: add call count to context to assert in tests
carodewig Jan 21, 2026
65ca9c2
test: add integration tests for demand control
carodewig Jan 22, 2026
1f77c2c
test: clean up tests to simplify via #[values]
carodewig Jan 22, 2026
9e23447
test: additional cases based on federated_ships_fragment
carodewig Jan 22, 2026
6641c6b
test: fix snapshot with new cost
carodewig Jan 22, 2026
337f6a5
doc: add new config option to the docs
carodewig Jan 22, 2026
7e84668
doc: create changeset
carodewig Jan 22, 2026
b73e877
Merge branch 'dev' into caroline/demand-control-actuals
carodewig Jan 22, 2026
643a14e
chore: refactor StrategyConfig::validate
carodewig Jan 23, 2026
468b80b
test: test_cache_metrics is super flaky
carodewig Jan 26, 2026
0b25099
test: remove dbg! that can cause test to timeout
carodewig Jan 28, 2026
32285c3
chore: refactor s/legacy/response_shape
carodewig Jan 29, 2026
314bbe1
Merge branch 'dev' into caroline/demand-control-actuals
carodewig Jan 29, 2026
ee06bf2
doc: use html bc <> breaks things
carodewig Jan 29, 2026
944a66a
refactor: s/response_shape/by_response_shape
carodewig Feb 2, 2026
7aee659
Merge branch 'dev' into caroline/demand-control-actuals
carodewig Feb 2, 2026
d2a5a14
doc: appease angry style bot
carodewig Feb 2, 2026
8596ca7
doc: additional rework
carodewig Feb 2, 2026
2efbb5a
test: mv tests to account for new names
carodewig Feb 3, 2026
e57e609
Merge branch 'dev' into caroline/demand-control-actuals
carodewig Feb 3, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions .changesets/fix_caroline_demand_control_actuals.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
### Demand control actual costs should consider each subgraph fetch ([PR #8827](https://github.com/apollographql/router/pull/8827))

The demand control feature estimates query costs by summing together the cost of each subgraph operation. This allows it
to capture any intermediate work that must be completed to return a complete response.

Prior to this version, the actual query cost computation only considered the final response shape; it did not include
any of the intermediate work done in its total.

This version fixes that behavior to compute the actual query cost as the sum of all subgraph response costs. This more
accurately reflects the work done per operation and allows a more meaningful comparison
between actual and estimated costs.

Note: if you would like to disable the new actual cost computation behavior, you should set the router configuration
option `demand_control.strategy.static_estimated.actual_cost_mode` to `legacy`.

```yaml
demand_control:
enabled: true
mode: enforce
strategy:
static_estimated:
max: 10
list_size: 10
actual_cost_mode: by_subgraph # the default value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ for making this a default config, but allowing users to revert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make legacy default for now just to have the same behavior for folks? worried that this might break dashboards for folks (or at least make it looks like there's some wobble in the router version-over-version)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronArinder The reason I made the new behavior the default is that the current approach pretty severely undercounts actuals; the current actuals aren't really meaningful. My hope is that we can make the better behavior the default, as long as the change is called out in the changelog and we provide the option to revert. But if you feel strongly otherwise, I could be convinced to swap the defaults!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel medium strong but not super strong (like a... 6 out of 10?), which isn't coming from the behavior (I think by_subgraph should be the default) but from what our customers might expect--if this makes their graphs or monitors wonky, it'd give them a reason to distrust our releases

I don't really know if that's at stake though because I don't know how folks use this in the wild; maybe a changelog entry is enough for folks? It should be, but I suspect it sometimes gets overlooked

overall, dealer's choice! raising it as food for thought, not as a blocker

# actual_cost_mode: legacy # disable new cost calculation mode
```

By [@carodewig](https://github.com/carodewig) in https://github.com/apollographql/router/pull/8827
1 change: 1 addition & 0 deletions .config/nextest.toml
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,7 @@ or ( binary_id(=apollo-router::integration_tests) & test(=integration::connector
or ( binary_id(=apollo-router::integration_tests) & test(=integration::connectors::authentication::test_aws_sig_v4_signing) )
or ( binary_id(=apollo-router::integration_tests) & test(=integration::coprocessor::test_coprocessor_response_handling) )
or ( binary_id(=apollo-router::integration_tests) & test(=integration::coprocessor::test_error_not_propagated_to_client) )
or ( binary_id(=apollo-router::integration_tests) & test(=integration::entity_cache::test_cache_metrics) )
or ( binary_id(=apollo-router::integration_tests) & test(=integration::file_upload::it_fails_incompatible_query_order) )
or ( binary_id(=apollo-router::integration_tests) & test(=integration::file_upload::it_fails_invalid_file_order) )
or ( binary_id(=apollo-router::integration_tests) & test(=integration::file_upload::it_fails_invalid_multipart_order) )
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,21 @@ expression: "&schema"
}
]
},
"ActualCostMode": {
"oneOf": [
{
"enum": [
"by_subgraph"
],
"type": "string"
},
{
"const": "legacy",
"deprecated": true,
"type": "string"
}
]
},
"All": {
"enum": [
"all"
Expand Down Expand Up @@ -9137,6 +9152,14 @@ expression: "&schema"
"static_estimated": {
"additionalProperties": false,
"properties": {
"actual_cost_mode": {
"allOf": [
{
"$ref": "#/definitions/ActualCostMode"
}
],
"description": "The strategy used to calculate the actual cost incurred by an operation.\n\n* `by_subgraph` (default) computes the cost of each subgraph response and sums them\n to get the total query cost.\n* `legacy` computes the cost based on the final structure of the composed response, not\n including any interim structures from subgraph responses that did not make it to the\n composed response."
},
"list_size": {
"description": "The assumed length of lists returned by the operation.",
"format": "uint32",
Expand Down
18 changes: 18 additions & 0 deletions apollo-router/src/plugins/demand_control/cost_calculator/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,22 @@ mod directives;
pub(in crate::plugins::demand_control) mod schema;
pub(crate) mod static_cost;

use std::collections::HashMap;

use crate::plugins::demand_control::DemandControlError;

#[derive(Clone, Default, Debug, serde::Serialize, serde::Deserialize)]
pub(crate) struct CostBySubgraph(HashMap<String, f64>);
impl CostBySubgraph {
pub(crate) fn add_or_insert(&mut self, subgraph: &str, value: f64) {
if let Some(subgraph_cost) = self.0.get_mut(subgraph) {
*subgraph_cost += value;
} else {
self.0.insert(subgraph.to_string(), value);
}
}

pub(crate) fn total(&self) -> f64 {
self.0.values().sum()
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -553,53 +553,57 @@ impl<'schema> ResponseCostCalculator<'schema> {
if field.name == TYPENAME {
return;
}
if let Some(definition) = self.schema.output_field_definition(parent_ty, &field.name) {
match value {
Value::Null | Value::Bool(_) | Value::Number(_) | Value::String(_) => {
self.cost += definition
.cost_directive()
.map_or(0.0, |cost| cost.weight());
}
Value::Array(items) => {
for item in items {
self.visit_list_item(request, variables, parent_ty, field, item);
}
}
Value::Object(children) => {
self.cost += definition
.cost_directive()
.map_or(1.0, |cost| cost.weight());
self.visit_selections(request, variables, &field.selection_set, children);

let definition = self.schema.output_field_definition(parent_ty, &field.name);

// We need to have a field definition for later processing, unless the query is an
// `_entities` query. If the field should be there and isn't, return now.
let is_entities_query = parent_ty == "Query" && field.name == "_entities";
if definition.is_none() && !is_entities_query {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if this changes any behaviour, but I wonder if we could end up in a situation where:

  • we don't have a definition
  • is_entities_query is true

and if that should continue with calculating cost? Reading the original code, it would seem like we would not calculate cost of entity queries (though I don't know if it's possible to end up in the state above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computing the cost in that scenario is exactly the situation we needed to support :)

Query._entities is a special operation which doesn't have a definition in the schema. It is used to search for objects by 'primary key' and can have any return type, based on the selections within the query. It's not something a user would call directly and is only used in federated queries.

It was not supported in the original code because this function was used at the execution stage, which has the final graphQL response (and users couldn't call _entities). But now, this function can be called at the subgraph stage, which means it needs to handle partial graphQL responses, including _entities queries.

Fortunately, once you go a few levels into an _entities query, you start finding real types which have definitions. So since this function is (highly) recursive, the is_entities_query escape hatch lets us recurse down into the real types and sum up the cost of those.

Let me know if that makes sense or you have other questions - this took me a while to piece together, and I'm not sure I explained it well!

tracing::debug!(
"Failed to get schema definition for field {}.{}. The resulting response cost will be a partial result.",
parent_ty,
field.name,
);
return;
}

match value {
Value::Null | Value::Bool(_) | Value::Number(_) | Value::String(_) => {
self.cost += definition
.and_then(|d| d.cost_directive())
.map_or(0.0, |cost| cost.weight());
}
Value::Array(items) => {
for item in items {
self.visit_list_item(request, variables, parent_ty, field, item);
}
}
Value::Object(children) => {
self.cost += definition
.and_then(|d| d.cost_directive())
.map_or(1.0, |cost| cost.weight());
self.visit_selections(request, variables, &field.selection_set, children);
}
}

if include_argument_score {
for argument in &field.arguments {
if let Some(argument_definition) = definition.argument_by_name(&argument.name) {
if let Ok(score) = score_argument(
&argument.value,
argument_definition,
self.schema,
variables,
) {
self.cost += score;
}
} else {
tracing::debug!(
"Failed to get schema definition for argument {}.{}({}:). The resulting response cost will be a partial result.",
parent_ty,
field.name,
argument.name,
)
if include_argument_score && let Some(definition) = definition {
for argument in &field.arguments {
if let Some(argument_definition) = definition.argument_by_name(&argument.name) {
if let Ok(score) =
score_argument(&argument.value, argument_definition, self.schema, variables)
{
self.cost += score;
}
} else {
tracing::debug!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure folks will see the debug, but if they're using query cost to determine anything about the safety of their systems, it might be worth elevating to warn!()

not a blocker, I don't really know the context or actual use of this, but something to consider

"Failed to get schema definition for argument {}.{}({}:). The resulting response cost will be a partial result.",
parent_ty,
field.name,
argument.name,
)
}
}
} else {
tracing::debug!(
"Failed to get schema definition for field {}.{}. The resulting response cost will be a partial result.",
parent_ty,
field.name,
)
}
}
}
Expand Down
74 changes: 73 additions & 1 deletion apollo-router/src/plugins/demand_control/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ use crate::json_ext::Object;
use crate::layers::ServiceBuilderExt;
use crate::plugin::Plugin;
use crate::plugin::PluginInit;
use crate::plugins::demand_control::cost_calculator::CostBySubgraph;
use crate::plugins::demand_control::cost_calculator::schema::DemandControlledSchema;
use crate::plugins::demand_control::strategy::Strategy;
use crate::plugins::demand_control::strategy::StrategyFactory;
Expand All @@ -51,6 +52,9 @@ pub(crate) const COST_ACTUAL_KEY: &str = "apollo::demand_control::actual_cost";
pub(crate) const COST_RESULT_KEY: &str = "apollo::demand_control::result";
pub(crate) const COST_STRATEGY_KEY: &str = "apollo::demand_control::strategy";

pub(crate) const COST_BY_SUBGRAPH_ACTUAL_KEY: &str =
"apollo::demand_control::actual_cost_by_subgraph";

/// Algorithm for calculating the cost of an incoming query.
#[derive(Clone, Debug, Deserialize, JsonSchema)]
#[serde(deny_unknown_fields, rename_all = "snake_case")]
Expand All @@ -73,6 +77,16 @@ pub(crate) enum StrategyConfig {
list_size: u32,
/// The maximum cost of a query
max: f64,

/// The strategy used to calculate the actual cost incurred by an operation.
///
/// * `by_subgraph` (default) computes the cost of each subgraph response and sums them
/// to get the total query cost.
/// * `legacy` computes the cost based on the final structure of the composed response, not
/// including any interim structures from subgraph responses that did not make it to the
/// composed response.
#[serde(default)]
actual_cost_mode: ActualCostMode,
},

#[cfg(test)]
Expand All @@ -82,6 +96,38 @@ pub(crate) enum StrategyConfig {
},
}

#[derive(Copy, Clone, Debug, Default, Deserialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub(crate) enum ActualCostMode {
#[default]
BySubgraph,

#[deprecated(since = "TBD", note = "use `BySubgraph` instead")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should since have an actual value? (I'm not sure the pattern we use for deprecated in this repo)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other uses of #[deprecated] in this repo either don't specify a since, or use since = TBD, so I think we're fine from a 'router pattern' standpoint!

I'm happy to specify 3.0 if we're ready to commit to removing this option then; I just used TBD since I wasn't sure if we wanted to commit to it.

#[warn(deprecated_in_future)]
Legacy,
}

impl StrategyConfig {
fn validate(&self) -> Result<(), BoxError> {
let actual_cost_mode = match self {
StrategyConfig::StaticEstimated {
actual_cost_mode, ..
} => actual_cost_mode,
#[cfg(test)]
StrategyConfig::Test { .. } => return Ok(()),
};

#[allow(deprecated_in_future)]
if matches!(actual_cost_mode, ActualCostMode::Legacy) {
tracing::warn!(
"Actual cost computation mode `legacy` will be deprecated in the future; migrate to `by_subgraph` when possible",
);
}

Ok(())
}
}

#[derive(Copy, Clone, Debug, Serialize, Deserialize, JsonSchema, Eq, PartialEq)]
#[serde(deny_unknown_fields, rename_all = "snake_case")]
pub(crate) enum Mode {
Expand Down Expand Up @@ -268,6 +314,30 @@ impl Context {
Ok(estimated.zip(actual).map(|(est, act)| est - act))
}

pub(crate) fn get_actual_cost_by_subgraph(
&self,
) -> Result<Option<CostBySubgraph>, DemandControlError> {
self.get::<&str, CostBySubgraph>(COST_BY_SUBGRAPH_ACTUAL_KEY)
.map_err(|e| DemandControlError::ContextSerializationError(e.to_string()))
}

pub(crate) fn update_actual_cost_by_subgraph(
&self,
subgraph: &str,
cost: f64,
) -> Result<(), DemandControlError> {
// combine this cost with the cost that already exists in the context
self.upsert(
COST_BY_SUBGRAPH_ACTUAL_KEY,
|mut existing_cost: CostBySubgraph| {
existing_cost.add_or_insert(subgraph, cost);
existing_cost
},
)
.map_err(|e| DemandControlError::ContextSerializationError(e.to_string()))?;
Ok(())
}

pub(crate) fn insert_cost_result(&self, result: String) -> Result<(), DemandControlError> {
self.insert(COST_RESULT_KEY, result)
.map_err(|e| DemandControlError::ContextSerializationError(e.to_string()))?;
Expand Down Expand Up @@ -348,6 +418,8 @@ impl Plugin for DemandControl {
.insert(subgraph_name.clone(), demand_controlled_subgraph_schema);
}

init.config.strategy.validate()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the line I squinted at with worry--looks like a new way for insert_cost_result to fail, but really, it's won't fail here because this fn doesn't err

Copy link
Contributor Author

@carodewig carodewig Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the intent is that it will fail if the config is invalid - it'll cause the plugin creation to fail thereby stopping the router / hot reload. This is something that is done in other plugins, but wasn't previously a part of demand control

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh interesting, so the next pr that introduces a failure mode to validate() could blow up the demand control plugin if validation doesn't pass

that's probably fine, I think? I worry that if someone doesn't have hot-reload on (I hate that I'm saying this), they might update themselves into a non-running router and potentially cause a production incident; this is probably more for the next pr, but what do you think about just logging out a warning for a bit until folks have enough time to get their stuff together? (might not matter if validation is easy/straightforward/whatever, but I'm worried that we'll cause a production outage on accident)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation is really straight-forward - since the validation is based only on the configuration, it's sufficient to try running the config in a dev/staging environment to see if it'll pass. I don't have a lot of sympathy if you push a new config into prod without trying it in some form of testing environment first?
And to be clear, all the new config options are optional (both in this PR and the next) - so this won't cause errors in existing configs.
IDK - if you still disagree I can change it, but I feel like erroring on startup with a bad config is better than letting bad configs exist quietly - many users might not look at the logs to know to change something, but will react to the router not starting 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what's being validated but it sounds simple enough to maybe not be a worry? I think sometimes folks have different configs for different environments (even worse, sometimes they're templated and maybe not totally maintained) and I'm really wary of making something errorable that wasn't in the past

but, it sounds simple and you understand it best; so, you're in the best position to make the decision. I feel like you understand my view and I'm happy to agree with whatever you think is the right move


Ok(DemandControl {
strategy_factory: StrategyFactory::new(
init.config.clone(),
Expand Down Expand Up @@ -501,7 +573,7 @@ impl Plugin for DemandControl {
|(subgraph_name, req): (String, Arc<Valid<ExecutableDocument>>), fut| async move {
let resp: subgraph::Response = fut.await?;
let strategy = resp.context.get_demand_control_context().map(|c| c.strategy).expect("must have strategy");
Ok(match strategy.on_subgraph_response(req.as_ref(), &resp) {
Ok(match strategy.on_subgraph_response(req.as_ref(), &resp, &subgraph_name) {
Ok(_) => resp,
Err(err) => subgraph::Response::builder()
.errors(
Expand Down
14 changes: 12 additions & 2 deletions apollo-router/src/plugins/demand_control/strategy/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -52,8 +52,12 @@ impl Strategy {
&self,
request: &ExecutableDocument,
response: &subgraph::Response,
subgraph_name: &str,
) -> Result<(), DemandControlError> {
match self.inner.on_subgraph_response(request, response) {
match self
.inner
.on_subgraph_response(request, response, subgraph_name)
{
Err(e) if self.mode == Mode::Enforce => Err(e),
_ => Ok(()),
}
Expand Down Expand Up @@ -93,8 +97,13 @@ impl StrategyFactory {

pub(crate) fn create(&self) -> Strategy {
let strategy: Arc<dyn StrategyImpl> = match &self.config.strategy {
StrategyConfig::StaticEstimated { list_size, max } => Arc::new(StaticEstimated {
StrategyConfig::StaticEstimated {
list_size,
max,
actual_cost_mode,
} => Arc::new(StaticEstimated {
max: *max,
actual_cost_mode: *actual_cost_mode,
cost_calculator: StaticCostCalculator::new(
self.supergraph_schema.clone(),
self.subgraph_schemas.clone(),
Expand Down Expand Up @@ -122,6 +131,7 @@ pub(crate) trait StrategyImpl: Send + Sync {
&self,
request: &ExecutableDocument,
response: &subgraph::Response,
subgraph_name: &str,
) -> Result<(), DemandControlError>;
fn on_execution_response(
&self,
Expand Down
Loading