[Improvement](unlock)optimize the place of unlock which deadlock occurs when database tables are frequently created or deleted#62391
Conversation
…rs when database tables are frequently created or deleted
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
/review |
There was a problem hiding this comment.
Found 1 blocking issue.
InsertIntoTableCommand.initPlan()now callsinsertExecutor.onFail(e)before releasingnewestTargetTableIf.readLock(). For the OLAP insert path,OlapInsertExecutor.onFail()aborts the transaction throughGlobalTransactionMgr.abortTransaction(), which acquires the target table write lock viaMetaLockUtils.tryWriteLockTablesOrMetaException(...). Because the current thread still holds the read lock, the abort path can only time out after 5s instead of cleaning the transaction up immediately. This regresses the failure-path locking behavior that the existing comment is explicitly protecting.
Critical checkpoints
- Goal of the task: simplify unlock handling around
initPlan(). The current code does not preserve the failure-path behavior, and there is no test proving the intended deadlock fix. - Scope/minimality: the diff is small and focused, but the new
finallychanges lock ordering on the error path. - Concurrency: not safe. The insert thread now holds a table read lock while invoking a failure handler that upgrades to table write locking.
- Lifecycle: no additional lifecycle issues beyond the lock-ownership regression above.
- Configuration: not applicable.
- Compatibility: not applicable.
- Parallel code paths: the generic
initPlan()path serves all insert executors, but the regression is observable on OLAP insert executors whoseonFail()aborts a transaction. - Conditional checks: the schema/id recheck logic is still fine.
- Test coverage: no regression or unit test was added for
beginTransaction()/finalizeSink()failures under concurrent metadata changes. - Observability: existing WARN logs are enough to diagnose once reproduced.
- Transaction/persistence: not safe. The abort path may fail to take the required table write lock, delaying cleanup of a PREPARE transaction.
- Data writes/modifications: the failure path no longer guarantees prompt transactional cleanup.
- FE/BE variable passing: not applicable.
- Performance: the regression adds a 5s write-lock timeout on the failure path.
- Other issues: none beyond the blocking lock-order regression above.
| } | ||
| Throwables.throwIfInstanceOf(e, RuntimeException.class); | ||
| throw new IllegalStateException(e.getMessage(), e); | ||
| } finally { |
There was a problem hiding this comment.
Calling insertExecutor.onFail(e) before this readUnlock() reintroduces the lock-order problem the old code was explicitly avoiding. For the normal OLAP path, onFail() goes through OlapInsertExecutor.abortTransactionOnFail() -> GlobalTransactionMgr.abortTransaction() -> MetaLockUtils.tryWriteLockTablesOrMetaException(...), which needs the same table's write lock. Because the current thread is still holding newestTargetTableIf.readLock(), that write-lock acquisition times out instead of aborting the txn immediately.
Please keep the unlock ahead of onFail() on the exception path, and use finally only for the success / retry branches if you still want to deduplicate the non-error unlocks.
…rs when database tables are frequently created or deleted
What problem does this PR solve?
Issue Number: close #62390
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)