Add a new config and metric for reporting ruler query execution wall time.#4317
Conversation
…time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
… than after the if block. Add a unit test to validate we're tracking time correctly. Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
| queryTime := prometheus.NewCounterVec(prometheus.CounterOpts{}, []string{"user"}) | ||
|
|
||
| mockFunc := func(ctx context.Context, q string, t time.Time) (promql.Vector, error) { | ||
| time.Sleep(1 * time.Millisecond) |
There was a problem hiding this comment.
I've seen we use sleep in other unit tests, but it still feels kind of wrong. Is there a better way to simulate time passing?
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
|
Sorry but we have begun the process of cutting a new release; please rebase from |
pracucci
left a comment
There was a problem hiding this comment.
Thanks for working on this! I left few comments.
| startTime = time.Now() | ||
| defer func() { queryTime.WithLabelValues(userID).Add(time.Since(startTime).Seconds()) }() |
There was a problem hiding this comment.
You can use prometheus.NewTimer() for this purpose, instead of tracking manually.
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
cstyan
left a comment
There was a problem hiding this comment.
@treid314 re: your comment about faking time pasing, we have this package: https://github.com/weaveworks/common/blob/master/mtime/mtime.go
I trust you've all agreed there's a reason to add this metric, but just off the top of my head I'm questioning how useful a metric for the total time spent executing queries in the ruler for a user is. What scenario is this metric here to help diagnose? Would a histogram with only a few buckets not be better? Or logging when the ruler finishes evaluating a rule with a field for how long that took?
It's for billing purposes (reason why it's a counter and not an histogram). |
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
pstibrany
left a comment
There was a problem hiding this comment.
Nice job. I would suggest to implement this tracking in separate function, to avoid breaking existing clients of MetricsQueryFunc ouside of Cortex.
…uler query time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
| func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc { | ||
| return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) { | ||
| // If we've been passed a counter we want to record the wall time spent executing this request. | ||
| if queryTime != nil { | ||
| timer := prometheus.NewTimer(nil) | ||
| defer func() { | ||
| querySeconds := timer.ObserveDuration().Seconds() | ||
| queryTime.Add(querySeconds) | ||
|
|
||
| // Log ruler query stats. | ||
| logMessage := append([]interface{}{ | ||
| "msg", "ruler query stats", | ||
| "cortex_ruler_query_seconds_total", querySeconds, | ||
| }, qs) | ||
| level.Info(util_log.WithContext(ctx, logger)).Log(logMessage...) | ||
| }() | ||
| } | ||
|
|
||
| result, err := qf(ctx, qs, t) | ||
| return result, err | ||
| } | ||
| } |
There was a problem hiding this comment.
We can simplify this function:
- we don't need to return new function, if
queryTimeis nil. No need to check it each time. - not sure what's going on with logging here, but we can just call
Log, right?
| func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc { | |
| return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) { | |
| // If we've been passed a counter we want to record the wall time spent executing this request. | |
| if queryTime != nil { | |
| timer := prometheus.NewTimer(nil) | |
| defer func() { | |
| querySeconds := timer.ObserveDuration().Seconds() | |
| queryTime.Add(querySeconds) | |
| // Log ruler query stats. | |
| logMessage := append([]interface{}{ | |
| "msg", "ruler query stats", | |
| "cortex_ruler_query_seconds_total", querySeconds, | |
| }, qs) | |
| level.Info(util_log.WithContext(ctx, logger)).Log(logMessage...) | |
| }() | |
| } | |
| result, err := qf(ctx, qs, t) | |
| return result, err | |
| } | |
| } | |
| func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc { | |
| if queryTime == nil { | |
| return qf | |
| } | |
| return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) { | |
| timer := prometheus.NewTimer(nil) | |
| defer func() { | |
| querySeconds := timer.ObserveDuration().Seconds() | |
| queryTime.Add(querySeconds) | |
| level.Info(util_log.WithContext(ctx, logger)).Log("msg", "ruler query stats", "cortex_ruler_query_seconds_total", querySeconds, "query", qs) | |
| }() | |
| result, err := qf(ctx, qs, t) | |
| return result, err | |
| } | |
| } |
There was a problem hiding this comment.
not sure what's going on with logging here, but we can just call Log, right?
I'm trying to follow what we do for query front-end logging in case a user would also want to record this log message in the ruler and use it for billing instead of the added metric.
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
tomwilkie
left a comment
There was a problem hiding this comment.
LGTM! Not going to merge until @pstibrany has had a chance to see the changes.
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
…time. (cortexproject#4317) * Add a new config and metric for reporting ruler query execution wall time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Spacing and PR number fixup Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Wrap the defer in a function to make it defer after the return rather than after the if block. Add a unit test to validate we're tracking time correctly. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Use seconds for our duration rather than nanoseconds Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Review comment fixes Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Update config flag in the config docs Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Pass counter rather than counter vector for metrics query function Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Fix comment in MetricsQueryFunction Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Move query metric and log to separate function. Add log message for ruler query time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Update config file and change log to show this a per user metric Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * code review fixes Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * update log message for ruler query metrics Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Remove append and just use the array for key values in the log messag Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Add query-frontend component to front end log message Signed-off-by: Tyler Reid <tyler.reid@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Tyler Reid tyler.reid@grafana.com
What this PR does:
This PR adds a new metric for tracking the execution wall time of queries in the ruler. We have a similar metric for query execution time on query front-ends today but we also wish to track query execution time in the rulers. This metric is called
cortex_ruler_query_seconds_totaland has a label ofusercontaining the UserID of the tenant who is tagged in the rule. This metric can be enabled with the-ruler.query-stats-enabledboolean config option.Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]