Description
get_engine() in pygeoapi/provider/sql.py creates the SQLAlchemy engine with:
engine = create_engine(conn_str, connect_args=connect_args, pool_pre_ping=True)
No pool parameters are passed, so SQLAlchemy uses the default QueuePool (pool_size=5, max_overflow=10) with pool_recycle=-1. The pool_size connections are therefore held open for the entire life of the worker process
and are never recycled.
In a multi-process deployment this is significant. With N replicas and W gunicorn workers, the connection floor is N * W * pool_size connections that sit IDLE on the database indefinitely, plus burst overflow. On a deployment with 4 replicas x 4 workers we observed connections idle on the PostgreSQL server for several days, eventually exhausting max_connections (shared
with other services) and producing:
org.postgresql.util.PSQLException: FATAL: remaining connection slots are
reserved for roles with the SUPERUSER attribute
There is currently no way to bound or recycle the pool from configuration.
Proposal
Allow the connection pool to be configured per provider via the existing options: block, exposing pool_size, max_overflow, pool_recycle, pool_timeout and pool_pre_ping. These would be parsed in store_db_parameters(), kept out of the DBAPI connect_args, and passed to create_engine(). The values are stored as a hashable tuple so get_engine() remains @functools.cache-able (engine sharing per process is preserved).
The same change must be applied to the PostgreSQL process manager (pygeoapi/process/manager/postgresql.py), which also calls get_engine().
Open question for the PSC / maintainers
Should the default pool_recycle remain -1 (no behaviour change; pure opt-in feature) or change to a finite value such as 3600 (recycle hourly; removes the "idle forever" footgun for everyone, but is a behaviour change for all existing users)?
I propose merging with pool_recycle=-1 as the default (zero behaviour change) and treating a finite default as a separate follow-up decision, but I'm happy to go whichever way the PSC prefers.
Environment
- pygeoapi: master
- Python: 3.12
- Driver:
postgresql+psycopg2, SQLAlchemy QueuePool
I'd like to contribute a fix (code + test + docs) for this.
Description
get_engine()inpygeoapi/provider/sql.pycreates the SQLAlchemy engine with:No pool parameters are passed, so SQLAlchemy uses the default
QueuePool(pool_size=5,max_overflow=10) withpool_recycle=-1. Thepool_sizeconnections are therefore held open for the entire life of the worker processand are never recycled.
In a multi-process deployment this is significant. With N replicas and W gunicorn workers, the connection floor is
N * W * pool_sizeconnections that sit IDLE on the database indefinitely, plus burst overflow. On a deployment with 4 replicas x 4 workers we observed connections idle on the PostgreSQL server for several days, eventually exhaustingmax_connections(sharedwith other services) and producing:
There is currently no way to bound or recycle the pool from configuration.
Proposal
Allow the connection pool to be configured per provider via the existing
options:block, exposingpool_size,max_overflow,pool_recycle,pool_timeoutandpool_pre_ping. These would be parsed instore_db_parameters(), kept out of the DBAPIconnect_args, and passed tocreate_engine(). The values are stored as a hashable tuple soget_engine()remains@functools.cache-able (engine sharing per process is preserved).The same change must be applied to the PostgreSQL process manager (
pygeoapi/process/manager/postgresql.py), which also callsget_engine().Open question for the PSC / maintainers
Should the default
pool_recycleremain-1(no behaviour change; pure opt-in feature) or change to a finite value such as3600(recycle hourly; removes the "idle forever" footgun for everyone, but is a behaviour change for all existing users)?I propose merging with
pool_recycle=-1as the default (zero behaviour change) and treating a finite default as a separate follow-up decision, but I'm happy to go whichever way the PSC prefers.Environment
postgresql+psycopg2, SQLAlchemy QueuePoolI'd like to contribute a fix (code + test + docs) for this.