fix: demand control actual costs should consider each subgraph fetch #8827
Conversation
✅ Docs preview ready. The preview is ready to be viewed. File changes: 0 new, 1 changed, 0 removed. Build ID: 36b3c7ef67f7e6fb727e58de. URL: https://www.apollographql.com/docs/deploy-preview/36b3c7ef67f7e6fb727e58de
morriswchris
left a comment
I did an initial pass on the logic and have a few questions, mainly for my own comprehension 🙂
```rust
#[default]
BySubgraph,

#[deprecated(since = "TBD", note = "use `BySubgraph` instead")]
```
Should `since` have an actual value? (I'm not sure of the pattern we use for `deprecated` in this repo.)
The only other uses of `#[deprecated]` in this repo either don't specify a `since`, or use `since = "TBD"`, so I think we're fine from a 'router pattern' standpoint!
I'm happy to specify 3.0 if we're ready to commit to removing this option then; I just used TBD since I wasn't sure if we wanted to commit to it.
```rust
// We need to have a field definition for later processing, unless the query is an
// `_entities` query. If the field should be there and isn't, return now.
let is_entities_query = parent_ty == "Query" && field.name == "_entities";
if definition.is_none() && !is_entities_query {
```
I'm unsure if this changes any behaviour, but I wonder if we could end up in a situation where:
- we don't have a definition
- `is_entities_query` is `true`

and if that should continue with calculating cost? Reading the original code, it seems we would not calculate the cost of entity queries (though I don't know if it's possible to end up in the state above).
Computing the cost in that scenario is exactly the situation we needed to support :)
`Query._entities` is a special operation which doesn't have a definition in the schema. It is used to search for objects by 'primary key' and can have any return type, based on the selections within the query. It's not something a user would call directly and is only used in federated queries.
It was not supported in the original code because this function was used at the execution stage, which has the final GraphQL response (and users couldn't call `_entities`). But now, this function can be called at the subgraph stage, which means it needs to handle partial GraphQL responses, including `_entities` queries.
Fortunately, once you go a few levels into an _entities query, you start finding real types which have definitions. So since this function is (highly) recursive, the is_entities_query escape hatch lets us recurse down into the real types and sum up the cost of those.
Let me know if that makes sense or you have other questions - this took me a while to piece together, and I'm not sure I explained it well!
```yaml
static_estimated:
  max: 10
  list_size: 10
  actual_cost_mode: by_subgraph # the default value
```
❤️ for making this a default config, but allowing users to revert.
should we make legacy the default for now just to keep the same behavior for folks? worried that this might break dashboards for folks (or at least make it look like there's some wobble in the router version-over-version)
@aaronArinder The reason I made the new behavior the default is that the current approach pretty severely undercounts actuals; the current actuals aren't really meaningful. My hope is that we can make the better behavior the default, as long as the change is called out in the changelog and we provide the option to revert. But if you feel strongly otherwise, I could be convinced to swap the defaults!
I feel medium strong but not super strong (like a... 6 out of 10?), which isn't coming from the behavior (I think by_subgraph should be the default) but from what our customers might expect--if this makes their graphs or monitors wonky, it'd give them a reason to distrust our releases
I don't really know if that's at stake though because I don't know how folks use this in the wild; maybe a changelog entry is enough for folks? It should be, but I suspect it sometimes gets overlooked
overall, dealer's choice! raising it as food for thought, not as a blocker
aaronArinder
left a comment
lgtm! only real question is around whether we can clobber costs
```rust
        self.cost += score;
    }
} else {
    tracing::debug!(
```
not sure folks will see the debug, but if they're using query cost to determine anything about the safety of their systems, it might be worth elevating to `warn!()`
not a blocker, I don't really know the context or actual use of this, but something to consider
```rust
    .insert(subgraph_name.clone(), demand_controlled_subgraph_schema);
}

init.config.strategy.validate()?;
```
this is the line I squinted at with worry--looks like a new way for insert_cost_result to fail, but really, it won't fail here because this fn doesn't err
Yeah, the intent is that it will fail if the config is invalid - it'll cause the plugin creation to fail thereby stopping the router / hot reload. This is something that is done in other plugins, but wasn't previously a part of demand control
oh interesting, so the next pr that introduces a failure mode to validate() could blow up the demand control plugin if validation doesn't pass
that's probably fine, I think? I worry that if someone doesn't have hot-reload on (I hate that I'm saying this), they might update themselves into a non-running router and potentially cause a production incident; this is probably more for the next pr, but what do you think about just logging out a warning for a bit until folks have enough time to get their stuff together? (might not matter if validation is easy/straightforward/whatever, but I'm worried that we'll cause a production outage on accident)
The validation is really straight-forward - since the validation is based only on the configuration, it's sufficient to try running the config in a dev/staging environment to see if it'll pass. I don't have a lot of sympathy if you push a new config into prod without trying it in some form of testing environment first?
And to be clear, all the new config options are optional (both in this PR and the next) - so this won't cause errors in existing configs.
IDK - if you still disagree I can change it, but I feel like erroring on startup with a bad config is better than letting bad configs exist quietly - many users might not look at the logs to know to change something, but will react to the router not starting 🤷
I don't really understand what's being validated but it sounds simple enough to maybe not be a worry? I think sometimes folks have different configs for different environments (even worse, sometimes they're templated and maybe not totally maintained) and I'm really wary of making something errorable that wasn't in the past
but, it sounds simple and you understand it best; so, you're in the best position to make the decision. I feel like you understand my view and I'm happy to agree with whatever you think is the right move
mabuyo
left a comment
Docs look good to me, thank you! Left a non-blocking comment
First off, thanks for the work on this! My org has recently started using demand control for cost limiting, and we've also noticed that estimated and actual costs differ substantially. We have found that a major source of this discrepancy comes from the

That said, I wanted to offer a perspective from a public-facing supergraph. Since our consumers have no visibility into our subgraph boundaries (and shouldn't know it's a supergraph), we were thinking it would be preferable for the cost logic to behave as if it were a monolith, as it does today with the actual cost, but not the estimated cost. It is easier for consumers to reason about a cost that ignores the hidden overhead of federation.

Because of this, could I suggest avoiding the name legacy for the existing mode? It implies that the method is outdated or deprecated, whereas for public graphs, it is possibly the preferred behavior. Perhaps

One question: Have you encountered the specific issue with the
Thank you for your comment, @justindoherty! I really appreciate you sharing your experience with this feature and have a few follow-up questions for you.
I only learned about this yesterday; it is indeed using the configured list size, regardless of how many entities are requested. That's definitely a bug to fix, although it's out of scope of this PR since it requires changes to the query planner output.
I'd love to hear more from you on this, because it's almost the opposite of how I was thinking about the demand control feature! I see demand control as a feature for platform operators. They want to protect their infrastructure (the router, subgraphs) from queries which could overwhelm it. If a client writes a query that looks simple, but actually requires significant fan-out, fetches, etc. in a federated system, I think it's reasonable for an operator to reject that query due to the complexity which is not visible to / expected by the client. The router does have an operation limit feature which rejects operations based on operation depth, height, etc. Does that more closely approximate the 'monolith-like' behavior you mentioned?
I'm curious why you'd prefer to stick with the existing actuals computation. From my perspective, the existing approach both:

I recognize that the
I see what you mean and will make this change - I think I'll go with
doc: add more of an explanation to why you'd want each mode
Thanks for the reply @carodewig :)
I totally get this point of view, and I may be better off adopting it as my own too. I'm however just a little worried that a user will encounter the error and possibly not really know how to fix their query to be compliant short of trial and error. In the
These were where we started and they are definitely useful. Our need for more robust cost control primarily comes from a few fields being proportionally very expensive to calculate. This is fine in low quantities but if a user were to request this field in a large list such as a connection from a search query, it would potentially overwhelm our subgraph.
I'm mainly of the mindset that I want predictable cost calculations that a consumer can understand and work with. Having the cost change based on how a query plan pans out from similar but different queries adds another layer of complexity to the explanation.
Thanks, much appreciated! One thought on this, but definitely not a deal breaker because all I wanted was to avoid the
Thanks so much for sharing your perspective, @justindoherty! I can definitely see where query rejection by cost could be confusing for clients, and how the

I'm going to think a bit more about the name - I agree that it doesn't fit well on the estimation side, but I'm struggling to come up with good alternatives 😅
NB: ended up going with
The demand control feature estimates query costs by summing together the cost of each subgraph operation. This allows it to capture any intermediate work that must be completed to return a complete response.
Right now the actual query cost computation only looks at the final response shape; it does not include any of the intermediate work done in its total. I believe this is a bug; it's not meaningful to compare estimated and actual costs unless they're computed the same way.
This PR fixes that behavior to compute the actual query cost as the sum of all subgraph response costs. It:
- computes costs at the `subgraph_response` plugin stage, and sums the results in the `execution_response` stage
- updates the `ResponseCostCalculator::score_response_field` function to support calculating the cost of an `_entities` query
- adds `demand_control.strategy.static_estimated.actual_cost_mode` to disable the new cost calculation behavior; the default value `by_subgraph` is the new behavior, the other value `by_response_shape` reverts to the old behavior

Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices. ↩

Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩