-
Notifications
You must be signed in to change notification settings - Fork 330
fix: demand control actual costs should consider each subgraph fetch #8827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
24b228e
4808218
007da11
bea0e31
23d4e3c
65ca9c2
1f77c2c
9e23447
6641c6b
337f6a5
7e84668
b73e877
643a14e
468b80b
0b25099
32285c3
314bbe1
ee06bf2
944a66a
7aee659
d2a5a14
8596ca7
2efbb5a
e57e609
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| ### Demand control actual costs should consider each subgraph fetch ([PR #8827](https://github.com/apollographql/router/pull/8827)) | ||
|
|
||
| The demand control feature estimates query costs by summing together the cost of each subgraph operation. This allows it | ||
| to capture any intermediate work that must be completed to return a complete response. | ||
|
|
||
| Prior to this version, the actual query cost computation only considered the final response shape; it did not include | ||
| any of the intermediate work done in its total. | ||
|
|
||
| This version fixes that behavior to compute the actual query cost as the sum of all subgraph response costs. This more | ||
| accurately reflects the work done per operation and allows a more meaningful comparison | ||
| between actual and estimated costs. | ||
|
|
||
| Note: if you would like to disable the new actual cost computation behavior, you should set the router configuration | ||
| option `demand_control.strategy.static_estimated.actual_cost_mode` to `response_shape`. | ||
|
|
||
| ```yaml | ||
| demand_control: | ||
| enabled: true | ||
| mode: enforce | ||
| strategy: | ||
| static_estimated: | ||
| max: 10 | ||
| list_size: 10 | ||
| actual_cost_mode: by_subgraph # the default value | ||
| # actual_cost_mode: response_shape # revert to prior actual cost computation mode | ||
| ``` | ||
|
|
||
| By [@carodewig](https://github.com/carodewig) in https://github.com/apollographql/router/pull/8827 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -553,53 +553,57 @@ impl<'schema> ResponseCostCalculator<'schema> { | |
| if field.name == TYPENAME { | ||
| return; | ||
| } | ||
| if let Some(definition) = self.schema.output_field_definition(parent_ty, &field.name) { | ||
| match value { | ||
| Value::Null | Value::Bool(_) | Value::Number(_) | Value::String(_) => { | ||
| self.cost += definition | ||
| .cost_directive() | ||
| .map_or(0.0, |cost| cost.weight()); | ||
| } | ||
| Value::Array(items) => { | ||
| for item in items { | ||
| self.visit_list_item(request, variables, parent_ty, field, item); | ||
| } | ||
| } | ||
| Value::Object(children) => { | ||
| self.cost += definition | ||
| .cost_directive() | ||
| .map_or(1.0, |cost| cost.weight()); | ||
| self.visit_selections(request, variables, &field.selection_set, children); | ||
|
|
||
| let definition = self.schema.output_field_definition(parent_ty, &field.name); | ||
|
|
||
| // We need to have a field definition for later processing, unless the query is an | ||
| // `_entities` query. If the field should be there and isn't, return now. | ||
| let is_entities_query = parent_ty == "Query" && field.name == "_entities"; | ||
| if definition.is_none() && !is_entities_query { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm unsure if this changes any behaviour, but I wonder if we could end up in a situation where:
and if that should continue with calculating cost? Reading the original code, it would seem like we would not calculate cost of entity queries (though I don't know if it's possible to end up in the state above)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Computing the cost in that scenario is exactly the situation we needed to support :)
It was not supported in the original code because this function was used at the Fortunately, once you go a few levels into an Let me know if that makes sense or you have other questions - this took me a while to piece together, and I'm not sure I explained it well! |
||
| tracing::debug!( | ||
| "Failed to get schema definition for field {}.{}. The resulting response cost will be a partial result.", | ||
| parent_ty, | ||
| field.name, | ||
| ); | ||
| return; | ||
| } | ||
|
|
||
| match value { | ||
| Value::Null | Value::Bool(_) | Value::Number(_) | Value::String(_) => { | ||
| self.cost += definition | ||
| .and_then(|d| d.cost_directive()) | ||
| .map_or(0.0, |cost| cost.weight()); | ||
| } | ||
| Value::Array(items) => { | ||
| for item in items { | ||
| self.visit_list_item(request, variables, parent_ty, field, item); | ||
| } | ||
| } | ||
| Value::Object(children) => { | ||
| self.cost += definition | ||
| .and_then(|d| d.cost_directive()) | ||
| .map_or(1.0, |cost| cost.weight()); | ||
| self.visit_selections(request, variables, &field.selection_set, children); | ||
| } | ||
| } | ||
|
|
||
| if include_argument_score { | ||
| for argument in &field.arguments { | ||
| if let Some(argument_definition) = definition.argument_by_name(&argument.name) { | ||
| if let Ok(score) = score_argument( | ||
| &argument.value, | ||
| argument_definition, | ||
| self.schema, | ||
| variables, | ||
| ) { | ||
| self.cost += score; | ||
| } | ||
| } else { | ||
| tracing::debug!( | ||
| "Failed to get schema definition for argument {}.{}({}:). The resulting response cost will be a partial result.", | ||
| parent_ty, | ||
| field.name, | ||
| argument.name, | ||
| ) | ||
| if include_argument_score && let Some(definition) = definition { | ||
| for argument in &field.arguments { | ||
| if let Some(argument_definition) = definition.argument_by_name(&argument.name) { | ||
| if let Ok(score) = | ||
| score_argument(&argument.value, argument_definition, self.schema, variables) | ||
| { | ||
| self.cost += score; | ||
| } | ||
| } else { | ||
| tracing::debug!( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not sure folks will see the debug, but if they're using query cost to determine anything about the safety of their systems, it might be worth elevating to warn!() not a blocker, I don't really know the context or actual use of this, but something to consider |
||
| "Failed to get schema definition for argument {}.{}({}:). The resulting response cost will be a partial result.", | ||
| parent_ty, | ||
| field.name, | ||
| argument.name, | ||
| ) | ||
| } | ||
| } | ||
| } else { | ||
| tracing::debug!( | ||
| "Failed to get schema definition for field {}.{}. The resulting response cost will be a partial result.", | ||
| parent_ty, | ||
| field.name, | ||
| ) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,6 +34,7 @@ use crate::json_ext::Object; | |
| use crate::layers::ServiceBuilderExt; | ||
| use crate::plugin::Plugin; | ||
| use crate::plugin::PluginInit; | ||
| use crate::plugins::demand_control::cost_calculator::CostBySubgraph; | ||
| use crate::plugins::demand_control::cost_calculator::schema::DemandControlledSchema; | ||
| use crate::plugins::demand_control::strategy::Strategy; | ||
| use crate::plugins::demand_control::strategy::StrategyFactory; | ||
|
|
@@ -51,6 +52,9 @@ pub(crate) const COST_ACTUAL_KEY: &str = "apollo::demand_control::actual_cost"; | |
| pub(crate) const COST_RESULT_KEY: &str = "apollo::demand_control::result"; | ||
| pub(crate) const COST_STRATEGY_KEY: &str = "apollo::demand_control::strategy"; | ||
|
|
||
| pub(crate) const COST_BY_SUBGRAPH_ACTUAL_KEY: &str = | ||
| "apollo::demand_control::actual_cost_by_subgraph"; | ||
|
|
||
| /// Algorithm for calculating the cost of an incoming query. | ||
| #[derive(Clone, Debug, Deserialize, JsonSchema)] | ||
| #[serde(deny_unknown_fields, rename_all = "snake_case")] | ||
|
|
@@ -73,6 +77,16 @@ pub(crate) enum StrategyConfig { | |
| list_size: u32, | ||
| /// The maximum cost of a query | ||
| max: f64, | ||
|
|
||
| /// The strategy used to calculate the actual cost incurred by an operation. | ||
| /// | ||
| /// * `by_subgraph` (default) computes the cost of each subgraph response and sums them | ||
| /// to get the total query cost. | ||
| /// * `by_response_shape` computes the cost based on the final structure of the composed | ||
| /// response, not including any interim structures from subgraph responses that did not | ||
| /// make it to the composed response. | ||
| #[serde(default)] | ||
| actual_cost_mode: ActualCostMode, | ||
| }, | ||
|
|
||
| #[cfg(test)] | ||
|
|
@@ -82,6 +96,41 @@ pub(crate) enum StrategyConfig { | |
| }, | ||
| } | ||
|
|
||
| #[derive(Copy, Clone, Debug, Default, Deserialize, JsonSchema)] | ||
| #[serde(rename_all = "snake_case")] | ||
| pub(crate) enum ActualCostMode { | ||
| /// Computes the cost of each subgraph response and sums them to get the total query cost. | ||
| #[default] | ||
| BySubgraph, | ||
|
|
||
| /// Computes the cost based on the final structure of the composed response, not including any | ||
| /// interim structures from subgraph responses that did not make it to the composed response. | ||
| #[deprecated(since = "TBD", note = "use `BySubgraph` instead")] | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The only other uses of I'm happy to specify |
||
| #[warn(deprecated_in_future)] | ||
| ByResponseShape, | ||
| } | ||
|
|
||
| impl StrategyConfig { | ||
| fn validate(&self) -> Result<(), BoxError> { | ||
| let actual_cost_mode = match self { | ||
| StrategyConfig::StaticEstimated { | ||
| actual_cost_mode, .. | ||
| } => actual_cost_mode, | ||
| #[cfg(test)] | ||
| StrategyConfig::Test { .. } => return Ok(()), | ||
| }; | ||
|
|
||
| #[allow(deprecated_in_future)] | ||
| if matches!(actual_cost_mode, ActualCostMode::ByResponseShape) { | ||
| tracing::warn!( | ||
| "Actual cost computation mode `by_response_shape` will be deprecated in the future; migrate to `by_subgraph` when possible", | ||
| ); | ||
| } | ||
|
|
||
| Ok(()) | ||
| } | ||
| } | ||
carodewig marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| #[derive(Copy, Clone, Debug, Serialize, Deserialize, JsonSchema, Eq, PartialEq)] | ||
| #[serde(deny_unknown_fields, rename_all = "snake_case")] | ||
| pub(crate) enum Mode { | ||
|
|
@@ -268,6 +317,30 @@ impl Context { | |
| Ok(estimated.zip(actual).map(|(est, act)| est - act)) | ||
| } | ||
|
|
||
| pub(crate) fn get_actual_cost_by_subgraph( | ||
| &self, | ||
| ) -> Result<Option<CostBySubgraph>, DemandControlError> { | ||
| self.get::<&str, CostBySubgraph>(COST_BY_SUBGRAPH_ACTUAL_KEY) | ||
| .map_err(|e| DemandControlError::ContextSerializationError(e.to_string())) | ||
| } | ||
|
|
||
| pub(crate) fn update_actual_cost_by_subgraph( | ||
| &self, | ||
| subgraph: &str, | ||
| cost: f64, | ||
| ) -> Result<(), DemandControlError> { | ||
| // combine this cost with the cost that already exists in the context | ||
| self.upsert( | ||
| COST_BY_SUBGRAPH_ACTUAL_KEY, | ||
| |mut existing_cost: CostBySubgraph| { | ||
| existing_cost.add_or_insert(subgraph, cost); | ||
| existing_cost | ||
| }, | ||
| ) | ||
| .map_err(|e| DemandControlError::ContextSerializationError(e.to_string()))?; | ||
| Ok(()) | ||
| } | ||
|
|
||
| pub(crate) fn insert_cost_result(&self, result: String) -> Result<(), DemandControlError> { | ||
| self.insert(COST_RESULT_KEY, result) | ||
| .map_err(|e| DemandControlError::ContextSerializationError(e.to_string()))?; | ||
|
|
@@ -348,6 +421,8 @@ impl Plugin for DemandControl { | |
| .insert(subgraph_name.clone(), demand_controlled_subgraph_schema); | ||
| } | ||
|
|
||
| init.config.strategy.validate()?; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is the line I squinted at with worry--looks like a new way for
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, the intent is that it will fail if the config is invalid - it'll cause the plugin creation to fail thereby stopping the router / hot reload. This is something that is done in other plugins, but wasn't previously a part of demand control
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh interesting, so the next pr that introduces a failure mode to validate() could blow up the demand control plugin if validation doesn't pass that's probably fine, I think? I worry that if someone doesn't have hot-reload on (I hate that I'm saying this), they might update themselves into a non-running router and potentially cause a production incident; this is probably more for the next pr, but what do you think about just logging out a warning for a bit until folks have enough time to get their stuff together? (might not matter if validation is easy/straightforward/whatever, but I'm worried that we'll cause a production outage on accident)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The validation is really straight-forward - since the validation is based only on the configuration, it's sufficient to try running the config in a dev/staging environment to see if it'll pass. I don't have a lot of sympathy if you push a new config into prod without trying it in some form of testing environment first?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really understand what's being validated but it sounds simple enough to maybe not be a worry? I think sometimes folks have different configs for different environments (even worse, sometimes they're templated and maybe not totally maintained) and I'm really wary of making something errorable that wasn't in the past but, it sounds simple and you understand it best; so, you're in the best position to make the decision. I feel like you understand my view and I'm happy to agree with whatever you think is the right move |
||
|
|
||
| Ok(DemandControl { | ||
| strategy_factory: StrategyFactory::new( | ||
| init.config.clone(), | ||
|
|
@@ -501,7 +576,7 @@ impl Plugin for DemandControl { | |
| |(subgraph_name, req): (String, Arc<Valid<ExecutableDocument>>), fut| async move { | ||
| let resp: subgraph::Response = fut.await?; | ||
| let strategy = resp.context.get_demand_control_context().map(|c| c.strategy).expect("must have strategy"); | ||
| Ok(match strategy.on_subgraph_response(req.as_ref(), &resp) { | ||
| Ok(match strategy.on_subgraph_response(req.as_ref(), &resp, &subgraph_name) { | ||
| Ok(_) => resp, | ||
| Err(err) => subgraph::Response::builder() | ||
| .errors( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ for making this a default config, but allowing users to revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make
legacydefault for now just to have the same behavior for folks? worried that this might break dashboards for folks (or at least make it looks like there's some wobble in the router version-over-version)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronArinder The reason I made the new behavior the default is that the current approach pretty severely undercounts actuals; the current actuals aren't really meaningful. My hope is that we can make the better behavior the default, as long as the change is called out in the changelog and we provide the option to revert. But if you feel strongly otherwise, I could be convinced to swap the defaults!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel medium strong but not super strong (like a... 6 out of 10?), which isn't coming from the behavior (I think by_subgraph should be the default) but from what our customers might expect--if this makes their graphs or monitors wonky, it'd give them a reason to distrust our releases
I don't really know if that's at stake though because I don't know how folks use this in the wild; maybe a changelog entry is enough for folks? It should be, but I suspect it sometimes gets overlooked
overall, dealer's choice! raising it as food for thought, not as a blocker