ClientPool "zombie client" death spiral prevents connection reuse after gRPC transition

### Please make sure you have searched for information in the following guides.

- [x] Search the issues already opened: https://github.com/GoogleCloudPlatform/google-cloud-node/issues
- [x] Search StackOverflow: http://stackoverflow.com/questions/tagged/google-cloud-platform+node.js
- [x] Check our Troubleshooting guide: https://github.com/googleapis/google-cloud-node/blob/main/docs/troubleshooting.md
- [x] Check our FAQ: https://github.com/googleapis/google-cloud-node/blob/main/docs/faq.md
- [x] Check our libraries HOW-TO: https://github.com/googleapis/gax-nodejs/blob/main/client-libraries.md
- [x] Check out our authentication guide: https://github.com/googleapis/google-auth-library-nodejs
- [x] Check out handwritten samples for many of our APIs: https://github.com/GoogleCloudPlatform/nodejs-docs-samples
- [x] Check the API's issue tracker: https://cloud.google.com/support/docs/issue-trackers

### Library Name

@google-cloud/firestore

### A screenshot that you have tested with "Try this API".


N/A - It's an SDK problem

### Link to the code that reproduces this issue. A link to a **public** Github Repository or gist with a minimal reproduction.


https://github.com/sieverssj/firestore-clientpool-zombie-repro

### A step-by-step description of how to reproduce the issue, based on the linked reproduction.


## Reproduction

Any application that:

1. Uses the default `preferRest: false` (or unset) configuration
2. Performs unary Firestore operations (reads, writes)
3. Then performs a `listen`/`onSnapshot` operation
4. Then performs more unary operations

After step 3, every unary operation in step 4 creates a new GAPIC client and gRPC channel and triggers a new auth token fetch, instead of reusing a pooled client.

This usage pattern is extremely common in serverless environments, where the same function instance handles both document reads/writes and snapshot listeners.
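
The four steps above can be sketched as a single call sequence. This is a minimal illustration, not the code from the linked repository: `DocLike` and `FirestoreLike` are hypothetical structural types covering only the methods the sketch touches (a real `Firestore` instance is assumed to be compatible with them), and `reproduce` and its parameters are names made up for this example.

```typescript
// Hypothetical minimal types: only the parts of the Firestore API used below.
interface DocLike {
  get(): Promise<unknown>;
  onSnapshot(
    onNext: (snapshot: unknown) => void,
    onError?: (error: Error) => void
  ): () => void;
}

interface FirestoreLike {
  doc(path: string): DocLike;
}

// Runs the sequence from the list above: unary, listen, then more unary calls.
async function reproduce(
  db: FirestoreLike,
  path: string,
  unaryCallsAfterListen = 5
): Promise<void> {
  const ref = db.doc(path);

  // Step 2: a unary operation before any listener exists.
  await ref.get();

  // Step 3: open a listener, wait for the first snapshot, then unsubscribe.
  let unsubscribe: () => void = () => {};
  await new Promise<void>((resolve, reject) => {
    unsubscribe = ref.onSnapshot(() => resolve(), reject);
  });
  unsubscribe();

  // Step 4: per the report, each of these now creates a fresh client.
  for (let i = 0; i < unaryCallsAfterListen; i++) {
    await ref.get();
  }
}
```

Watching outbound TCP connections or metadata server requests while step 4 runs is one way to confirm that clients are not being reused.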

## Mechanism
 
1. `ClientPool` initializes with `this.grpcEnabled = false`.
 2. Unary RPCs (`getDocument`, `commit`, `batchGetDocuments`, etc.) call `pool.run(tag, requiresGrpc=false, op)`. This creates **Client A** with metadata `{ grpcEnabled: false }`.
 3. A `listen` RPC (triggered by `onSnapshot`) calls `pool.run(tag, requiresGrpc=true, op)`.
 4. In `acquire()`, this line permanently sets the pool-level flag to `true`:

    ```typescript
    // pool.ts — acquire()
    this.grpcEnabled = this.grpcEnabled || requiresGrpc;
    ```
 5. The next line forces all future operations to require gRPC:

    ```typescript
    // pool.ts — acquire()
    requiresGrpc = requiresGrpc || this.grpcEnabled;
    ```
 6. **Client A becomes a zombie.** It can never be reused because the eligibility check in the `acquire()` loop rejects it:

    ```typescript
    // pool.ts — acquire() client selection loop
    (metadata.grpcEnabled || !requiresGrpc)
    ```

    Client A has `metadata.grpcEnabled = false` and `requiresGrpc` is now always `true` (from step 5), so the condition evaluates to `(false || !true)` → `false`.
 7. **Client A can never be garbage collected.** GC only runs in `release()`, and Client A is never acquired again (step 6), so it's never released. The `PoolIsTransitioningToGrpc` GC path in `shouldGarbageCollectClient` only runs on the *specific client being released*:

    ```typescript
    // pool.ts — shouldGarbageCollectClient()
    if (this.grpcEnabled !== clientMetadata.grpcEnabled) {
      // We are transitioning to GRPC. Garbage collect REST clients.
      return new PoolIsTransitioningToGrpc({
        shouldGarbageCollectClient: true,
        // ...
      });
    }
    ```

    Since the zombie is never acquired after the transition, it's never released, and this code path never executes for it.
 8. **Client A permanently occupies idle capacity slots.** With default `concurrentOperationLimit = 100`, the zombie contributes 100 idle capacity units.
 9. **Death spiral.** When any new gRPC client (Client B) is created and released, `shouldGarbageCollectClient` computes:

    ```typescript
    // pool.ts — shouldGarbageCollectClient() idle capacity calculation
    let idleCapacityCount = 0;
    for (const [, metadata] of this.activeClients) {
      idleCapacityCount +=
        this.concurrentOperationLimit - metadata.activeRequestCount;
    }
    
    const maxIdleCapacityCount =
      this.maxIdleClients * this.concurrentOperationLimit;
    return new IdleCapacity({
      shouldGarbageCollectClient: idleCapacityCount > maxIdleCapacityCount,
      // ...
    });
    ```

    `idleCapacityCount = 100 (zombie A) + 100 (new B) = 200`. With default `maxIdleClients = 1`, `maxIdleCapacityCount = 100`. Since `200 > 100`, **Client B is immediately garbage collected.**
10. Every subsequent RPC creates a fresh GAPIC client, uses it once, and discards it. Each fresh client creates a new gRPC channel (new TCP connection) and a new `GoogleAuth` instance (which starts with an expired token, triggering an immediate metadata server token fetch).
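
The steps above can be checked with a small simulation. This is a simplified model, not the real `pool.ts`: `ModelPool` and `simulate` are hypothetical names, every operation is released immediately (a real `listen` stream stays open much longer), and client selection just picks the first eligible client. Only the flag handling, the eligibility check, and the idle-capacity comparison follow the snippets quoted above.

```typescript
interface ClientMetadata {
  activeRequestCount: number;
  grpcEnabled: boolean;
}

// Simplified model of the pool; names mirror the issue, the rest is assumed.
class ModelPool {
  private grpcEnabled = false;
  readonly activeClients = new Map<number, ClientMetadata>();
  clientsCreated = 0;

  constructor(
    private readonly concurrentOperationLimit = 100,
    private readonly maxIdleClients = 1
  ) {}

  // Acquire a client, pretend to run the operation, release the client.
  run(requiresGrpc: boolean): number {
    const id = this.acquire(requiresGrpc);
    this.release(id);
    return id;
  }

  private acquire(requiresGrpc: boolean): number {
    this.grpcEnabled = this.grpcEnabled || requiresGrpc; // step 4
    requiresGrpc = requiresGrpc || this.grpcEnabled; // step 5
    let selected: number | undefined;
    for (const [id, metadata] of this.activeClients) {
      const hasCapacity =
        metadata.activeRequestCount < this.concurrentOperationLimit;
      // Step 6: a REST-era client fails this check once gRPC is required.
      if (hasCapacity && (metadata.grpcEnabled || !requiresGrpc)) {
        selected = id;
        break;
      }
    }
    if (selected === undefined) {
      selected = ++this.clientsCreated;
      this.activeClients.set(selected, {
        activeRequestCount: 0,
        grpcEnabled: requiresGrpc,
      });
    }
    this.activeClients.get(selected)!.activeRequestCount++;
    return selected;
  }

  private release(id: number): void {
    const metadata = this.activeClients.get(id)!;
    metadata.activeRequestCount--;
    // Step 7: GC is only ever considered for the client being released.
    if (metadata.activeRequestCount === 0 && this.shouldCollect(metadata)) {
      this.activeClients.delete(id);
    }
  }

  private shouldCollect(released: ClientMetadata): boolean {
    if (this.grpcEnabled !== released.grpcEnabled) return true;
    let idleCapacityCount = 0;
    for (const [, metadata] of this.activeClients) {
      idleCapacityCount +=
        this.concurrentOperationLimit - metadata.activeRequestCount;
    }
    // Step 9: the zombie's idle slots are counted against the new client.
    return (
      idleCapacityCount > this.maxIdleClients * this.concurrentOperationLimit
    );
  }
}

// 3 unary calls, 1 listen, 5 unary calls; returns how many clients were created.
function simulate(maxIdleClients: number): number {
  const pool = new ModelPool(100, maxIdleClients);
  for (let i = 0; i < 3; i++) pool.run(false);
  pool.run(true);
  for (let i = 0; i < 5; i++) pool.run(false);
  return pool.clientsCreated;
}

console.log(simulate(1)); // 7: one REST client, then a new client per call
console.log(simulate(2)); // 2: the zombie plus one reused gRPC client
```

In this model, the default `maxIdleClients = 1` yields one new client for the `listen` call and one for every unary call after it, while the zombie (client 1) stays in `activeClients` throughout.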

## Workaround

Setting `maxIdleChannels: 2` (the setting that supplies the pool's `maxIdleClients`) breaks the death spiral. This raises `maxIdleCapacityCount` to 200. The zombie's 100 idle slots plus one surviving gRPC client's 100 idle slots total 200, which does not exceed the threshold (the comparison is a strict `>`), so the gRPC client survives and is reused.

The zombie still leaks one GAPIC client per Firestore instance but is otherwise harmless.
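
The threshold arithmetic behind the workaround can be written out as a short check. This is a sketch of the comparison quoted in the Mechanism section, not library code: `exceedsIdleThreshold` and `workaroundSettings` are hypothetical names, and the settings object is only meant to show the shape that would be passed to `new Firestore(...)`.

```typescript
// Same comparison as the idle-capacity check quoted earlier (strict `>`).
function exceedsIdleThreshold(
  idleCapacityCount: number,
  maxIdleClients: number,
  concurrentOperationLimit = 100
): boolean {
  return idleCapacityCount > maxIdleClients * concurrentOperationLimit;
}

// Zombie (100 idle slots) plus one idle gRPC client (100 idle slots).
const idleWithZombie = 100 + 100;

const collectedByDefault = exceedsIdleThreshold(idleWithZombie, 1); // true
const collectedWithWorkaround = exceedsIdleThreshold(idleWithZombie, 2); // false

// Shape of the workaround; assumed usage: new Firestore(workaroundSettings).
const workaroundSettings = {maxIdleChannels: 2};
```

By the same formula, a second idle gRPC client would bring the count to 300, which exceeds 200, so the workaround should still leave only one reusable gRPC client alongside the zombie.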

### A clear and concise description of what the bug is, and what you expected to happen.

* Near 1:1 ratio of Firestore RPCs to new TCP connections (observed via Datadog APM `tcp.connect` spans)
* Near 1:1 ratio of Firestore RPCs to metadata server token fetches (`GET http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token`)
* \~33x increase in `tcp.connect` volume after upgrading from `@google-cloud/firestore` v3 to v7
* Significant p99 latency regression on all Firestore operations

### A clear and concise description WHY you expect this behavior, i.e., was it a recent change, there is documentation that points to this behavior, etc.

I expect clients in the pool to be reused, especially when there has been no REST -> gRPC transition. `maxIdleChannels` defaults to 1, so the client pool exhibits this problem in its default configuration.
