Improve testing of {Scheduler|Worker}MetricCollector#6945
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ±0 15 suites ±0 6h 53m 10s ⏱️ + 10m 13s For more details on these failures, see this check. Results for commit 51f9781. ± Comparison against base commit 817ead3. ♻️ This comment has been updated with latest results. |
|
Failing tests:
New ones that look similar to other mysterious test timeouts we've had:
|
| PrometheusHandler._initialized = True | ||
| PrometheusHandler._collector = WorkerMetricCollector(self.server) | ||
| # Register collector | ||
| prometheus_client.REGISTRY.register(PrometheusHandler._collector) |
There was a problem hiding this comment.
If there are multiple concurrent Worker instances in the same process (very common with async workers in all gen_cluster tests), this means PrometheusHandler._collector will only be set to the WorkerMetricCollector of the last worker to start. Is that okay? Does the prometheus_client.REGISTRY.register(PrometheusHandler._collector) mean that all the collectors will be registered somewhere anywhere?
I don't know what either _collector or prometheus_client.REGISTRY are for, I'm just seeing some global variables getting overwritten in a place where I know there might not just be multiple workers sequentially, but also multiple workers in parallel.
There was a problem hiding this comment.
The way this had been implemented before was that we'd simply set up a PrometheusHandler for the first worker spinning up in the process, set _initialized to True and call it a day. This led to test_prometheus_collect_task_states because that PrometheusHandler did not belong to the newly spun-up worker when running multiple tests. This new pattern surely isn't perfect as we only ever have a single instance of a PrometheusHandler per process talking to the last worker we spun up. However, IMO this is better than what we had before and it would require some more serious thinking to understand the Prometheus client to handle parallel workers gracefully. I'm happy to file a follow-up issue for that.
Regarding the singleton implementation: I've stolen that from https://github.com/hendrikmakait/distributed/blob/13e315c11c7277ba0cadd9f5c2a16364cdeaf14b/distributed/http/scheduler/prometheus/core.py#L79
There was a problem hiding this comment.
This was pretty painful when we started testing this stuff. I'm OK with either solution and don't think we need anything more sophisticated for now.
As was already pointed out, this is merely an artifact for testing and no real world application would ever run multiple workers in the same process AND use prometheus. At the very least, this is something we then choose to not properly support
TLDR As long as the testing works out, I'm happy
| port = s.http_server.port | ||
| response = await http_client.fetch(f"http://localhost:{port}/metrics") | ||
| assert response.code == 200 | ||
| assert response.headers["Content-Type"] == "text/plain; version=0.0.4" |
There was a problem hiding this comment.
Do we need to assert the version? I'd think it's okay if that changes?
There was a problem hiding this comment.
Mostly added this for coherence with other tests assuming someone had a reason to add this. I don't mind dropping this everywhere.
There was a problem hiding this comment.
Dropped during refactoring.
| family.name: family for family in text_string_to_metric_families(txt) | ||
| family.name: family | ||
| for family in text_string_to_metric_families(txt) | ||
| if family.name.startswith("dask_scheduler_") |
There was a problem hiding this comment.
This implies to me that worker metrics are now getting mixed in when you hit the scheduler port in tests? Probably because they're all running on the same async server?
I wonder if we need to use an async-friendly prometheus client, like https://github.com/hynek/prometheus-async or, more likely, https://github.com/claws/aioprometheus (since it seems to support creating multiple Service instances per process)?
There was a problem hiding this comment.
This is less about worker metrics (though I am not 100% sure that's not happening), but more about additional Python-related metrics that we collect but don't want to test for in here.
There was a problem hiding this comment.
I wonder if we need to use an async-friendly prometheus client, like https://github.com/hynek/prometheus-async or, more likely, https://github.com/claws/aioprometheus (since it seems to support creating multiple Service instances per process)?
I'm open to it, but that's outside of the scope of this PR.
There was a problem hiding this comment.
This implies to me that worker metrics are now getting mixed in when you hit the scheduler port in tests? Probably because they're all running on the same async server?
They shouldn't, we have different collectors on different server endpoints. They also use worker/scheduler-specific methods to gather metrics to display.
There was a problem hiding this comment.
Async clients look like an interesting thing but the actual value they provide is a bit unclear to me.
I guess this can make the collection a bit smoother (currently it's a plain generator) such that it doesn't block the event loop during collection. Once we have more experience with this we can revisit this. Last time I was using prometheus I didn't encounter any loop instabilities
| # request data twice since there once was a case where metrics got registered multiple times resulting in | ||
| # prometheus_client errors | ||
| await fetch_metrics() |
There was a problem hiding this comment.
Is this still necessary? We're just making sure it doesn't error the second time?
There was a problem hiding this comment.
This is the behavior we currently have on main. I'm open to changing this but I also lack context as to why this has been necessary at some point in time.
|
|
||
| http_client = AsyncHTTPClient() | ||
|
|
||
| async def fetch_metrics(): |
There was a problem hiding this comment.
Seems like fetch_metrics could be factored out to a utils_test function?
|
@gjoseph92 I leave merging to you |
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
801c4cd to
51f9781
Compare
|
merging once CI is done |
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Closes #6943.
pre-commit run --all-files