Critical database connection issues under high alert load
Description
Keep backend has multiple critical database connection issues that make it unusable in high-throughput production environments (1000+ alerts/minute). These issues occur both with direct PostgreSQL connections AND through PgBouncer.
Environment
- Keep version: 0.48.0
- PostgreSQL: 15.x (direct connection, VM)
- Python: 3.13
- psycopg2: latest
- SQLAlchemy: 2.0.36
- Deployment: 1 pod, 4 gunicorn workers + 4 ARQ workers each
Configuration tested
DATABASE_POOL_SIZE: "20"
DATABASE_MAX_OVERFLOW: "40"
KEEP_DB_PRE_PING_ENABLED: "true"
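For reference, a minimal sketch of how these settings presumably map onto SQLAlchemy's standard create_engine() pool arguments (the DSN env var name and fallback URL are assumptions, not Keep's verified configuration):

import os
from sqlalchemy import create_engine

# Assumed DSN env var name; Keep's actual variable may differ.
url = os.getenv("DATABASE_CONNECTION_STRING", "postgresql+psycopg2://keep:***@db:5432/keep")

engine = create_engine(
    url,
    pool_size=int(os.getenv("DATABASE_POOL_SIZE", "20")),            # base pool
    max_overflow=int(os.getenv("DATABASE_MAX_OVERFLOW", "40")),      # burst connections
    pool_pre_ping=os.getenv("KEEP_DB_PRE_PING_ENABLED", "false") == "true",
    pool_timeout=30,  # matches the "timeout 30.00" in the error below
)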
Issue 1: Connection pool exhaustion (Direct PostgreSQL)
Symptoms
Under moderate load (~50 alerts/second), the connection pool becomes exhausted within minutes:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 20 overflow 40 reached,
connection timed out, timeout 30.00
Analysis
Connections are acquired but never returned to the pool. Even with pool_pre_ping=True, dead connections accumulate. PostgreSQL shows connections in idle state that are never reused by the application.
Expected: all 60 connections (20 base + 40 overflow) are recycled and reused
Actual: all 60 connections become stuck and new requests time out
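For illustration, a minimal sketch (not Keep's actual code; the table name is made up) of the leak pattern consistent with these symptoms, next to the safe variant:

from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

def handle_alert_leaky(engine, alert):
    # BUG pattern: if execute() or commit() raises, close() never runs
    # and the connection stays checked out of the pool forever.
    session = Session(engine)
    session.execute(text("INSERT INTO alert (payload) VALUES (:p)"), {"p": alert})
    session.commit()
    session.close()

def handle_alert_safe(engine, alert):
    # The context manager returns the connection to the pool even on exceptions.
    with Session(engine) as session:
        session.execute(text("INSERT INTO alert (payload) VALUES (:p)"), {"p": alert})
        session.commit()

At runtime, engine.pool.status() reports checkout counts and can confirm the exhaustion, e.g. "Pool size: 20 Connections in pool: 0 Current Overflow: 40 Current Checked out connections: 60".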
Issue 2: Deferred loader failure after INSERT
Error
KeyError: "Deferred loader for attribute 'id' failed to populate correctly"
File "keep/api/core/db.py", line 2716, in create_deduplication_event
"deduplication_event_id": deduplication_event.id,
Root cause
In db.py:2716, the code accesses deduplication_event.id immediately after session.add() but before session.flush() or session.commit(), so the ORM has not yet populated the auto-generated ID.
Fix required
# Current (broken):
session.add(deduplication_event)
return {"deduplication_event_id": deduplication_event.id} # ID not populated!
# Should be:
session.add(deduplication_event)
session.flush() # Force INSERT to get ID
return {"deduplication_event_id": deduplication_event.id}
Issue 3: PgBouncer incompatibility
When using PgBouncer (even in session mode), psycopg2 produces protocol-level errors:
psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq
sqlalchemy.exc.NoSuchColumnError: Could not locate column in row for column 'workflow.id'
IndexError: tuple index out of range
PgBouncer config tested
pool_mode = session
server_reset_query = DISCARD ALL
ignore_startup_parameters = extra_float_digits
Result: Still fails. psycopg2's wire-protocol handling appears to be fundamentally incompatible with PgBouncer in this setup.
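One workaround commonly suggested for stacked-pooling setups, sketched here but not verified against Keep or against these specific protocol errors: let PgBouncer own all pooling and disable SQLAlchemy's client-side pool entirely with NullPool (the DSN is a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# SQLAlchemy opens and closes a real (PgBouncer-brokered) connection per
# checkout instead of stacking a second pool on top of PgBouncer's.
engine = create_engine(
    "postgresql+psycopg2://keep:***@pgbouncer:6432/keep",  # placeholder DSN
    poolclass=NullPool,
)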
Issue 4: Stale connections not detected
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
PostgreSQL closes idle connections (via idle_session_timeout), but the SQLAlchemy pool does not detect this despite pool_pre_ping=True.
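Note that pool_pre_ping only validates a connection at checkout time, so a connection killed while checked out, or between the ping and the query, still surfaces this error. One mitigation, assuming idle_session_timeout is the culprit (a sketch; the DSN is a placeholder and 300s is an illustrative value that must stay below the server's idle timeout):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://keep:***@db:5432/keep",  # placeholder DSN
    pool_pre_ping=True,   # still useful, but only checks at checkout
    pool_recycle=300,     # proactively replace connections older than 5 min,
                          # before the server's idle timeout can kill them
)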
Summary table
| Setup | Result | Primary error |
| --- | --- | --- |
| Direct PostgreSQL | ❌ Fails | Pool exhaustion + deferred loader |
| Direct + pre_ping | ❌ Fails | Same issues |
| PgBouncer session mode | ❌ Fails | Protocol corruption |
| PgBouncer + DISCARD ALL | ❌ Fails | Same protocol errors |
This blocks enterprise adoption where connection pooling is mandatory.