ClientPool "zombie client" death spiral prevents connection reuse after gRPC transition #8294

@sieverssj

Please make sure you have searched for information in the following guides.

Library Name

@google-cloud/firestore

A screenshot that you have tested with "Try this API".

N/A - It's an SDK problem

Link to the code that reproduces this issue. A link to a public Github Repository or gist with a minimal reproduction.

https://github.com/sieverssj/firestore-clientpool-zombie-repro

A step-by-step description of how to reproduce the issue, based on the linked reproduction.

Reproduction

Any application that:

  1. Uses the default preferRest: false (or unset) configuration
  2. Performs unary Firestore operations (reads, writes)
  3. Then performs a listen/onSnapshot operation
  4. Then performs more unary operations

After step 3, every unary operation in step 4 creates a new GAPIC client and a new gRPC channel and triggers a new auth token fetch, instead of reusing a pooled client.

This is extremely common in serverless environments where the same function instance handles both document reads/writes and snapshot listeners.
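The four steps can be sketched as a small driver. This is only an illustration, not the linked reproduction: `ReproDocLike` and `runReproSequence` are hypothetical names, and the interface is an assumption about the minimal shape (`get()` and `onSnapshot()`) that a `DocumentReference` from @google-cloud/firestore is expected to satisfy.

```typescript
// Hypothetical minimal view of a document reference. A real
// DocumentReference from @google-cloud/firestore is assumed to fit it.
interface ReproDocLike {
  get(): Promise<unknown>;
  onSnapshot(
    onNext: (snapshot: unknown) => void,
    onError?: (error: Error) => void,
  ): () => void;
}

// Runs: one unary read, one listen (until the first snapshot), then
// `followUpReads` more unary reads. Returns the order of the steps taken.
async function runReproSequence(
  doc: ReproDocLike,
  followUpReads: number,
): Promise<string[]> {
  const steps: string[] = [];

  // Step 2: a unary operation (requiresGrpc = false in the pool).
  await doc.get();
  steps.push("unary");

  // Step 3: a listen operation (requiresGrpc = true in the pool).
  let unsubscribe: () => void = () => {};
  await new Promise<void>((resolve, reject) => {
    unsubscribe = doc.onSnapshot(() => resolve(), reject);
  });
  unsubscribe();
  steps.push("listen");

  // Step 4: more unary operations; per this report, each one now
  // gets a freshly created client instead of a pooled one.
  for (let i = 0; i < followUpReads; i++) {
    await doc.get();
    steps.push("unary");
  }
  return steps;
}
```

Under the assumption above, calling `runReproSequence(firestore.doc("some/doc"), 10)` while watching outbound TCP connects should show roughly one new connection per follow-up read on an affected version.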

Mechanism

  1. ClientPool initializes with this.grpcEnabled = false.

  2. Unary RPCs (getDocument, commit, batchGetDocuments, etc.) call pool.run(tag, requiresGrpc=false, op). This creates Client A with metadata { grpcEnabled: false }.

  3. A listen RPC (triggered by onSnapshot) calls pool.run(tag, requiresGrpc=true, op).

  4. In acquire(), this line permanently sets the pool-level flag to true:

    // pool.ts — acquire()
    this.grpcEnabled = this.grpcEnabled || requiresGrpc;

  5. The next line forces all future operations to require gRPC:

    // pool.ts — acquire()
    requiresGrpc = requiresGrpc || this.grpcEnabled;
  6. Client A becomes a zombie. It can never be reused because the eligibility check in the acquire() loop rejects it:

    // pool.ts — acquire() client selection loop
    (metadata.grpcEnabled || !requiresGrpc)

    Client A has metadata.grpcEnabled = false and requiresGrpc is now always true (from step 5), so the condition evaluates to (false || !true), which is false.

  7. Client A can never be garbage collected. GC only runs in release(), and Client A is never acquired again (step 6), so it's never released. The PoolIsTransitioningToGrpc GC path in shouldGarbageCollectClient only runs on the specific client being released:

    // pool.ts — shouldGarbageCollectClient()
    if (this.grpcEnabled !== clientMetadata.grpcEnabled) {
      // We are transitioning to GRPC. Garbage collect REST clients.
      return new PoolIsTransitioningToGrpc({
        shouldGarbageCollectClient: true,
        // ...
      });
    }

    Since the zombie is never acquired after the transition, it's never released, and this code path never executes for it.

  8. Client A permanently occupies idle capacity slots. With default concurrentOperationLimit = 100, the zombie contributes 100 idle capacity units.

  9. Death spiral. When any new gRPC client (Client B) is created and released, shouldGarbageCollectClient computes:

    // pool.ts — shouldGarbageCollectClient() idle capacity calculation
    let idleCapacityCount = 0;
    for (const [, metadata] of this.activeClients) {
      idleCapacityCount +=
        this.concurrentOperationLimit - metadata.activeRequestCount;
    }
    
    const maxIdleCapacityCount =
      this.maxIdleClients * this.concurrentOperationLimit;
    return new IdleCapacity({
      shouldGarbageCollectClient: idleCapacityCount > maxIdleCapacityCount,
      // ...
    });

    idleCapacityCount = 100 (zombie A) + 100 (new B) = 200. With default maxIdleClients = 1, maxIdleCapacityCount = 100. Since 200 > 100, Client B is immediately garbage collected.

  10. Every subsequent RPC creates a fresh GAPIC client, uses it once, and discards it. Each fresh client creates a new gRPC channel (new TCP connection) and a new GoogleAuth instance (which starts with an expired token, triggering an immediate metadata server token fetch).
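The mechanism above can be checked with a simplified, self-contained model. This is a sketch, not the library's pool.ts: `SketchPool`, `PoolClientMeta`, and `simulateClientIds` are hypothetical names, operations are synchronous, and the model assumes a new client's `grpcEnabled` is set from the (already overridden) `requiresGrpc` value.

```typescript
// Simplified model of the pool behavior described in steps 1-10.
interface PoolClientMeta {
  id: number;
  grpcEnabled: boolean;
  activeRequestCount: number;
}

class SketchPool {
  private grpcEnabled = false; // step 1
  private nextId = 0;
  private readonly activeClients: PoolClientMeta[] = [];
  private readonly concurrentOperationLimit: number;
  private readonly maxIdleClients: number;

  constructor(concurrentOperationLimit: number, maxIdleClients: number) {
    this.concurrentOperationLimit = concurrentOperationLimit;
    this.maxIdleClients = maxIdleClients;
  }

  // Acquires a client, "runs" one operation, releases it, and
  // returns the id of the client that served the operation.
  run(requiresGrpc: boolean): number {
    const client = this.acquire(requiresGrpc);
    this.release(client);
    return client.id;
  }

  private acquire(requiresGrpc: boolean): PoolClientMeta {
    // Steps 4 and 5: the pool-level flag is sticky and overrides the caller.
    this.grpcEnabled = this.grpcEnabled || requiresGrpc;
    requiresGrpc = requiresGrpc || this.grpcEnabled;

    // Step 6: a client created before the transition never passes this check.
    let selected: PoolClientMeta | undefined;
    for (const meta of this.activeClients) {
      const hasCapacity = meta.activeRequestCount < this.concurrentOperationLimit;
      if (hasCapacity && (meta.grpcEnabled || !requiresGrpc)) {
        selected = meta;
        break;
      }
    }
    if (selected === undefined) {
      selected = { id: this.nextId++, grpcEnabled: requiresGrpc, activeRequestCount: 0 };
      this.activeClients.push(selected);
    }
    selected.activeRequestCount++;
    return selected;
  }

  private release(client: PoolClientMeta): void {
    client.activeRequestCount--;
    if (this.shouldGarbageCollect(client)) {
      this.activeClients.splice(this.activeClients.indexOf(client), 1);
    }
  }

  private shouldGarbageCollect(client: PoolClientMeta): boolean {
    if (client.activeRequestCount > 0) return false;
    // Step 7: the transition check only ever sees the client being released.
    if (this.grpcEnabled !== client.grpcEnabled) return true;
    // Step 9: idle capacity is summed over all clients, the zombie included.
    let idleCapacityCount = 0;
    for (const meta of this.activeClients) {
      idleCapacityCount += this.concurrentOperationLimit - meta.activeRequestCount;
    }
    return idleCapacityCount > this.maxIdleClients * this.concurrentOperationLimit;
  }
}

// Two unary ops, one listen, then three more unary ops.
function simulateClientIds(maxIdleClients: number): number[] {
  const pool = new SketchPool(100, maxIdleClients);
  const ids = [pool.run(false), pool.run(false), pool.run(true)];
  for (let i = 0; i < 3; i++) ids.push(pool.run(false));
  return ids;
}

const defaultRun = simulateClientIds(1);    // [0, 0, 1, 2, 3, 4]
const workaroundRun = simulateClientIds(2); // [0, 0, 1, 1, 1, 1]
```

In the default run, client 0 is reused until the listen, after which every operation is served by a brand-new client id. With a limit of two idle clients, client 1 survives and is reused.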

Workaround

Setting maxIdleChannels: 2 breaks the death spiral. With maxIdleCapacityCount = 200, the zombie's 100 idle slots plus one surviving gRPC client's 100 idle slots equals 200, which does not exceed the threshold (strict >), so the gRPC client survives and gets reused.

The zombie still leaks one GAPIC client per Firestore instance but is otherwise harmless.
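The threshold arithmetic behind the workaround, as a small sketch. `exceedsIdleCapacity` is a hypothetical helper that mirrors only the strict `>` comparison quoted in step 9; it assumes every counted client is fully idle. In an application, the setting itself would presumably be supplied as `maxIdleChannels: 2` in the settings object passed to the `Firestore` constructor.

```typescript
// Hypothetical helper: would a just-released client be garbage collected,
// given how many fully idle clients exist (the zombie counts as one)?
function exceedsIdleCapacity(
  idleClientCount: number,
  maxIdleClients: number,
  concurrentOperationLimit: number = 100,
): boolean {
  const idleCapacityCount = idleClientCount * concurrentOperationLimit;
  const maxIdleCapacityCount = maxIdleClients * concurrentOperationLimit;
  // Strict comparison: being exactly at the threshold does not trigger GC.
  return idleCapacityCount > maxIdleCapacityCount;
}

// Zombie plus one gRPC client, default limit of 1: 200 > 100.
const collectedByDefault = exceedsIdleCapacity(2, 1); // true

// Same two clients with the workaround limit of 2: 200 > 200.
const collectedWithWorkaround = exceedsIdleCapacity(2, 2); // false
```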

A clear and concise description of what the bug is, and what you expected to happen.

  • Near 1:1 ratio of Firestore RPCs to new TCP connections (observed via Datadog APM tcp.connect spans)
  • Near 1:1 ratio of Firestore RPCs to metadata server token fetches (GET http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token)
  • ~33x increase in tcp.connect volume after upgrading from @google-cloud/firestore v3 to v7
  • Significant p99 latency regression on all Firestore operations

A clear and concise description WHY you expect this behavior, i.e., was it a recent change, there is documentation that points to this behavior, etc.

I expect clients in the pool to be reused, especially when there has been no REST -> gRPC transition. maxIdleChannels defaults to 1, so the client pool exhibits this problem in its default configuration.

Labels

api: firestore (Issues related to the Firestore API.)
