
If I restart the query-frontend while queriers are running then we can't achieve -querier.max-concurrent #4391

@alvinlin123

Describe the bug
If query-frontend and querier are restarted at the same time, or query-frontend is restarted while queriers are running, then -querier.max-concurrent cannot be achieved.

To Reproduce

  1. Restart only the queriers with a rollout restart; do not restart the query-frontend.
  2. Make sure your system is in a steady state and the queriers can reach -querier.max-concurrent.
  3. Restart the query-frontend.
  4. Hammer all your query-frontends with expensive queries and observe that -querier.max-concurrent is no longer achievable.

Expected behavior
Should still be able to achieve -querier.max-concurrent.

Environment:
We are running on k8s.

Storage Engine

  • Blocks
  • Chunks

Additional Context
My suspicion is that this happens because, in worker.go, AddressRemoved does not call resetConcurrency().

Imagine the following case:

  • You have 1 querier and 3 query-frontends (fe1, fe2, and fe3).
  • Your -querier.max-concurrent is set to 8.
  • Each query-frontend therefore has at least 2 connections to the querier (8 divided by 3, rounded down). Because 8 is not divisible by 3 and 8 modulo 3 is 2, fe1 and fe2 each get one extra connection.
  • So fe1 has 3 connections to the querier, fe2 has 3, and fe3 has 2.
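The split described above can be written as a small function. This is only a sketch of the arithmetic in this report, not the code in worker.go; the name `distribute` and the rule that the first frontends in the list receive the extra connections are assumptions made for illustration.

```go
package main

import "fmt"

// distribute is a hypothetical helper (not taken from Cortex) that splits
// maxConcurrent connections across n query-frontends. Every frontend gets
// the floor share, and the first maxConcurrent % n frontends get one extra.
func distribute(maxConcurrent, n int) []int {
	if n <= 0 {
		return nil
	}
	out := make([]int, n)
	base, extra := maxConcurrent/n, maxConcurrent%n
	for i := range out {
		out[i] = base
		if i < extra {
			out[i]++
		}
	}
	return out
}

func main() {
	fmt.Println(distribute(8, 3)) // [3 3 2]
	fmt.Println(distribute(8, 6)) // [2 2 1 1 1 1]
}
```

The second call shows the split for 6 query-frontends, which is the situation that arises during the rollout described next.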

Now, we restart the query-frontend, and the DNS Watch on the querier (worker.go) will get to work and start adding and removing addresses.

  • During the deployment we will have 6 query-frontends, fe1 to fe6, because we spin up the new pods first.
  • So you get into a state where fe1 has 2 connections to the querier, fe2 has 2, and fe3, fe4, fe5, and fe6 have 1 each (8 in total).
  • Then we spin down the old pods, fe1 to fe3.
  • Because the AddressRemoved method does not call resetConcurrency() to recalculate the load distribution, fe4, fe5, and fe6 are left with 1 connection each. That is a total of 3 instead of 8.

Below is a graph showing achievement of -querier.max-concurrent=8 during different phases.

(Image: Grafana graph)
