feat(dbapi): add retry_aborts_internally option to disable internal statement-replay retry #16491

@waiho-gumloop

Description

Summary

The Spanner DBAPI layer (spanner_dbapi) always retries aborted transactions internally by replaying all recorded statements and validating checksums. There is no way to disable this behavior. Applications that implement their own transaction retry logic (re-invoking a callable with a fresh session on abort) experience nested retry loops that cause severe contention amplification under concurrent writes.

Background

When commit() receives an Aborted exception from Spanner, the DBAPI enters an internal retry loop in TransactionRetryHelper.retry_transaction(). This loop replays all statements recorded during the transaction and validates checksums of read results to ensure consistency. It retries up to 50 times with exponential backoff.
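The shape of that internal loop can be sketched as follows. This is a simplified stand-in, not the actual `TransactionRetryHelper` code: `Aborted` and `RetryAborted` are local placeholder classes, and `replay` / `validate_checksums` are hypothetical callables representing statement replay and checksum validation.

```python
import time

class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

class RetryAborted(Exception):
    """Stand-in for the DBAPI's RetryAborted exception."""

MAX_INTERNAL_RETRIES = 50  # mirrors the limit described above

def retry_transaction(replay, validate_checksums):
    """Sketch of the internal retry loop: replay all recorded
    statements, validate read checksums, back off on abort."""
    delay = 0.001
    for _ in range(MAX_INTERNAL_RETRIES):
        try:
            results = replay()  # re-execute every recorded statement
            if not validate_checksums(results):
                # Replayed reads returned different data than the
                # original attempt: the transaction cannot be replayed.
                raise RetryAborted("read results changed between attempts")
            return results
        except Aborted:
            time.sleep(delay)   # exponential backoff between replays
            delay *= 2
    raise RetryAborted("exhausted internal retries")
```

Note that a checksum mismatch escapes the loop immediately, while a fresh `Aborted` triggers another full replay.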

This mechanism was designed for Django and other PEP 249 ORMs that build transactions incrementally through individual cursor.execute() calls (original motivation: googleapis/python-spanner-django#34). In this model, the DBAPI layer is the only component that can retry — the ORM has no concept of "re-run this transaction from scratch."

However, many applications use a different pattern: wrapping the entire transaction in a callable and re-invoking it on abort (similar to Session.run_in_transaction). For these applications, the internal retry is unnecessary and harmful.
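A minimal sketch of that application-managed pattern, with a local placeholder `Aborted` exception and a hypothetical `run_with_retries` helper (not part of any library): the whole transaction callable is re-invoked from scratch on abort, rather than having individual statements replayed.

```python
import time

class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

def run_with_retries(work, max_attempts=5, base_delay=0.01):
    """Re-invoke the transaction callable from scratch on abort,
    similar in spirit to Session.run_in_transaction."""
    for attempt in range(max_attempts):
        try:
            return work()  # fresh transaction on every attempt
        except Aborted:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # backoff, then retry
```

For applications structured this way, the DBAPI's statement replay is a second, redundant retry layer underneath this one.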

The nested retry problem

When an application wraps transactions in its own retry loop and the DBAPI also retries internally, the two layers interfere:

  1. Contention amplification (thundering herd): The internal replay re-acquires locks on the same rows that caused the original abort. Under concurrent writes, each replay attempt can abort another thread's replay, leading to exponential retry growth across threads.

  2. Wasted wall-clock time: The internal retry loop accumulates 13–19 seconds of lock wait time (observed in production with 10 concurrent writers) before finally raising RetryAborted. The outer application retry then starts fresh, having wasted all that time.

  3. Checksum mismatches on contended rows: For read-modify-write patterns, replayed reads almost always return different data (because another transaction committed in between), causing _compare_checksums() to fail. The internal retry is structurally unable to succeed in this scenario — it always falls through to RetryAborted after exhausting retries.

Relevant code paths

| File | Function | Role |
| --- | --- | --- |
| connection.py L505-515 | Connection.commit() | Catches Aborted, calls retry_transaction(), then recursively calls commit() |
| transaction_helper.py L165-210 | TransactionRetryHelper.retry_transaction() | The internal retry loop — replays statements, validates checksums |
| checksum.py L64-80 | _compare_checksums() | Raises RetryAborted on checksum mismatch |
| exceptions.py L165-172 | RetryAborted | Exception raised when internal retry fails validation |

Timeline

| Date | Commit / PR | Event |
| --- | --- | --- |
| Oct 2020 | googleapis/python-spanner-django#34 | Original request — Django needs transparent transaction retry |
| Nov 2020 | googleapis/python-spanner#156, googleapis/python-spanner#160, googleapis/python-spanner#168 | DBAPI created with built-in statement replay and checksum validation |
| Feb 2021 | JDBC RETRY_ABORTS_INTERNALLY | JDBC driver adds an opt-out flag for the same reason |
| 2021+ | Go client | Go provides NewReadWriteStmtBasedTransaction (with internal retry) and ReadWriteTransaction (without) as separate APIs |
| Mar 2026 | This issue | Python DBAPI still has no way to disable internal retry |

Proposed Change

Add a retry_aborts_internally parameter to Connection and connect(), following the same pattern used for read_only and request_priority:

  • Default True — preserves existing behavior; no breaking change
  • When False — commit() wraps the Aborted exception in RetryAborted and raises immediately, bypassing the statement-replay loop
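A sketch of the proposed flag check in commit(). The names follow this issue's proposal, but this is a standalone simulation, not the actual connection.py change: `Aborted` and `RetryAborted` are local placeholders, and `_commit_once` simulates a Spanner commit that aborts on its first attempt.

```python
class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

class RetryAborted(Exception):
    """Stand-in for the DBAPI's RetryAborted exception."""

class Connection:
    """Simplified model of the proposed behavior."""

    def __init__(self, retry_aborts_internally=True):
        self.retry_aborts_internally = retry_aborts_internally
        self._attempts = 0

    def _commit_once(self):
        # Simulated Spanner commit: abort on the first attempt only.
        self._attempts += 1
        if self._attempts == 1:
            raise Aborted("transaction aborted by Spanner")

    def _retry_transaction(self):
        """Stand-in for the internal statement replay (elided)."""

    def commit(self):
        try:
            self._commit_once()
        except Aborted as exc:
            if not self.retry_aborts_internally:
                # Proposed: surface the abort immediately so the
                # application's own retry loop can re-run the transaction.
                raise RetryAborted(str(exc)) from exc
            self._retry_transaction()  # existing: replay statements...
            self.commit()              # ...then attempt the commit again
```

With the default, the abort is absorbed by the internal replay; with the flag off, the caller sees RetryAborted on the first abort and can retry the whole transaction itself.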

Files changed

  1. connection.py — Add retry_aborts_internally parameter to __init__ and connect(), add property getter/setter, modify commit() to check the flag
  2. test_connection.py — 8 new unit tests

Usage

from google.cloud.spanner_dbapi import connect

# Default (unchanged) — internal retry enabled
conn = connect(instance_id, database_id, project=project)

# Disable internal retry for application-managed retries
conn = connect(instance_id, database_id, project=project,
               retry_aborts_internally=False)

# SQLAlchemy via connect_args
engine = create_engine("spanner:///...",
                       connect_args={"retry_aborts_internally": False})

Production impact

In our workload (10 concurrent writers updating JSON array columns on the same row):

| Configuration | Success rate | Abort-to-recovery time |
| --- | --- | --- |
| Default (nested retries) | ~55% | 13–19 seconds |
| retry_aborts_internally=False + app retry | 98–100% | 0.01–0.08 seconds |
