feat(dbapi): add retry_aborts_internally option to disable internal statement-replay retry #16491
Summary
The Spanner DBAPI layer (spanner_dbapi) always retries aborted transactions internally by replaying all recorded statements and validating checksums. There is no way to disable this behavior. Applications that implement their own transaction retry logic (re-invoking a callable with a fresh session on abort) experience nested retry loops that cause severe contention amplification under concurrent writes.
Background
When commit() receives an Aborted exception from Spanner, the DBAPI enters an internal retry loop in TransactionRetryHelper.retry_transaction(). This loop replays all statements recorded during the transaction and validates checksums of read results to ensure consistency. It retries up to 50 times with exponential backoff.
This mechanism was designed for Django and other PEP 249 ORMs that build transactions incrementally through individual cursor.execute() calls (original motivation: googleapis/python-spanner-django#34). In this model, the DBAPI layer is the only component that can retry — the ORM has no concept of "re-run this transaction from scratch."
However, many applications use a different pattern: wrapping the entire transaction in a callable and re-invoking it on abort (similar to Session.run_in_transaction). For these applications, the internal retry is unnecessary and harmful.
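A minimal sketch of that application-side pattern, assuming a hypothetical `txn_fn` callable that runs the whole transaction and a generic `Aborted` stand-in for the real exception:

```python
import random
import time

class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

def run_with_retry(txn_fn, max_attempts=10):
    """Re-invoke the entire transaction callable on abort."""
    for attempt in range(max_attempts):
        try:
            return txn_fn()  # a fresh transaction on each attempt
        except Aborted:
            # Back off with jitter before re-running from scratch
            time.sleep(min(0.05 * 2 ** attempt, 1.0) * random.random())
    raise Aborted("transaction kept aborting")
```

Because each attempt starts from scratch, the callable re-reads current data rather than replaying stale statements, which is exactly what the internal replay cannot do.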
The nested retry problem
When an application wraps transactions in its own retry loop and the DBAPI also retries internally, the two layers interfere:
- **Contention amplification (thundering herd):** The internal replay re-acquires locks on the same rows that caused the original abort. Under concurrent writes, each replay attempt can abort another thread's replay, leading to exponential retry growth across threads.
- **Wasted wall-clock time:** The internal retry loop accumulates 13–19 seconds of lock wait time (observed in production with 10 concurrent writers) before finally raising `RetryAborted`. The outer application retry then starts fresh, having wasted all that time.
- **Checksum mismatches on contended rows:** For read-modify-write patterns, replayed reads almost always return different data (because another transaction committed in between), causing `_compare_checksums()` to fail. The internal retry is structurally unable to succeed in this scenario; it always falls through to `RetryAborted` after exhausting retries.
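A toy model of the last point, with an illustrative in-memory "row" standing in for Spanner state: a concurrent writer commits between replay attempts, so every replayed read diverges from the original checksum.

```python
import hashlib

def checksum(value):
    # Same idea as the DBAPI's result checksums: hash what was read
    return hashlib.sha256(repr(value).encode()).hexdigest()

row = {"counter": 0}
original = checksum(row["counter"])  # what the aborted attempt read

mismatches = 0
for _ in range(5):
    row["counter"] += 1              # another transaction wins the race
    if checksum(row["counter"]) != original:
        mismatches += 1              # the replayed read can never match
```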
Relevant code paths
| File | Function | Role |
|---|---|---|
| `connection.py` L505-515 | `Connection.commit()` | Catches `Aborted`, calls `retry_transaction()`, then recursively calls `commit()` |
| `transaction_helper.py` L165-210 | `TransactionRetryHelper.retry_transaction()` | The internal retry loop: replays statements, validates checksums |
| `checksum.py` L64-80 | `_compare_checksums()` | Raises `RetryAborted` on checksum mismatch |
| `exceptions.py` L165-172 | `RetryAborted` | Exception raised when internal retry fails validation |
Timeline
| Date | Commit / PR | Event |
|---|---|---|
| Oct 2020 | googleapis/python-spanner-django#34 | Original request — Django needs transparent transaction retry |
| Nov 2020 | PR googleapis/python-spanner#156, googleapis/python-spanner#160, googleapis/python-spanner#168 | DBAPI created with built-in statement replay and checksum validation |
| Feb 2021 | JDBC `RETRY_ABORTS_INTERNALLY` | JDBC driver adds an opt-out flag for the same reason |
| 2021+ | Go client | Go provides NewReadWriteStmtBasedTransaction (with internal retry) vs ReadWriteTransaction (without) as separate APIs |
| Mar 2026 | This issue | Python DBAPI still has no way to disable internal retry |
Proposed Change
Add a `retry_aborts_internally` parameter to `Connection` and `connect()`, following the same pattern used for `read_only` and `request_priority`:
- Default `True`: preserves existing behavior; no breaking change
- When `False`: `commit()` wraps `Aborted` in `RetryAborted` and raises immediately, bypassing the statement-replay loop
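A hedged sketch of how `commit()` might check the flag. This is simplified for illustration; the real `Connection` carries far more state, and `_commit_rpc` / `_replay_statements` are hypothetical placeholders:

```python
class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

class RetryAborted(Aborted):
    """Stand-in for the DBAPI's RetryAborted exception."""

class Connection:
    def __init__(self, retry_aborts_internally=True):
        self.retry_aborts_internally = retry_aborts_internally

    def _commit_rpc(self):
        # Placeholder for the actual commit RPC to Spanner
        raise NotImplementedError

    def _replay_statements(self):
        # Placeholder for the internal statement-replay loop
        raise NotImplementedError

    def commit(self):
        try:
            self._commit_rpc()
        except Aborted as exc:
            if not self.retry_aborts_internally:
                # Surface immediately so the application's retry runs
                raise RetryAborted(str(exc)) from exc
            self._replay_statements()
            self.commit()  # retry the commit after a successful replay
```

Defaulting the flag to `True` keeps the existing recursive retry path untouched, so the change is opt-in only.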
Files changed
- `connection.py`: add `retry_aborts_internally` parameter to `__init__` and `connect()`, add a property getter/setter, modify `commit()` to check the flag
- `test_connection.py`: 8 new unit tests
Usage
```python
from google.cloud.spanner_dbapi import connect
from sqlalchemy import create_engine

# Default (unchanged): internal retry enabled
conn = connect(instance_id, database_id, project=project)

# Disable internal retry for application-managed retries
conn = connect(instance_id, database_id, project=project,
               retry_aborts_internally=False)

# SQLAlchemy via connect_args
engine = create_engine("spanner:///...",
                       connect_args={"retry_aborts_internally": False})
```

Production impact
In our workload (10 concurrent writers updating JSON array columns on the same row):
| Configuration | Success rate | Abort-to-recovery time |
|---|---|---|
| Default (nested retries) | ~55% | 13–19 seconds |
| `retry_aborts_internally=False` + app retry | 98–100% | 0.01–0.08 seconds |
Related
- PR: feat(dbapi): add `retry_aborts_internally` option to Connection python-spanner#1538
- JDBC equivalent: `RETRY_ABORTS_INTERNALLY`
- Go equivalent: `NewReadWriteStmtBasedTransaction` vs `ReadWriteTransaction`
- Django original motivation: googleapis/python-spanner-django#34