
Add skipDuplicates() scope guard to createDocuments#852

Open
premtsd-code wants to merge 26 commits into main from csv-import-upsert-v2

Conversation

@premtsd-code
Contributor

@premtsd-code premtsd-code commented Apr 8, 2026

Summary

Adds a Database::skipDuplicates() scope guard so callers can wrap a createDocuments() call and have duplicate-by-id rows silently skipped instead of throwing DuplicateException. Used by CSV import / re-import flows and database migration workers where re-running the same batch should be idempotent rather than fatal.

The shape follows the existing silent() / skipRelationships() pattern — instance-level scope guard, not a flag parameter — so callers compose naturally:

$database->skipDuplicates(fn () => $database->createDocuments($collection, $batch));
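The save-and-restore shape of such a scope guard can be sketched as follows (a minimal, hypothetical class; the real Database carries far more state, and isSkippingDuplicates() is an illustrative accessor, not the library's API):

```php
<?php
// Minimal sketch of an instance-level scope guard, following the
// silent()/skipRelationships() pattern described above.
class Database
{
    private bool $skipDuplicates = false;

    // Run $callback with duplicate-skipping enabled, restoring the
    // previous value afterwards so nested scopes compose correctly.
    public function skipDuplicates(callable $callback): mixed
    {
        $previous = $this->skipDuplicates;
        $this->skipDuplicates = true;
        try {
            return $callback();
        } finally {
            $this->skipDuplicates = $previous; // restored even on exception
        }
    }

    public function isSkippingDuplicates(): bool
    {
        return $this->skipDuplicates;
    }
}

$db = new Database();
$inside = $db->skipDuplicates(fn () => $db->isSkippingDuplicates());
$after = $db->isSkippingDuplicates();
```

The try/finally restore (rather than a plain reset to false) is what makes the nested-scope test below pass: an inner guard exiting must not clear an outer guard's flag.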

Design

Duplicate handling happens at the adapter layer via dialect-specific no-op inserts. Database::createDocuments sets a scope-guarded flag on the adapter before each chunk's INSERT; the adapter emits the right SQL (or Mongo equivalent) and returns the batch. No pre-filter, no intra-batch dedup — adapter-level row-level dedup is sufficient and avoids redundant queries.

Per-adapter implementation

Adapter          | Strategy
MariaDB / MySQL  | INSERT IGNORE INTO
Postgres         | INSERT INTO ... ON CONFLICT (...) DO NOTHING
SQLite           | INSERT OR IGNORE INTO
Mongo            | upsert + $setOnInsert, bypassing the transaction wrap to avoid txn-abort-on-duplicate
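The SQL side of this table can be illustrated with a small sketch (hypothetical function; the real adapters split this across getInsertKeyword()/getInsertSuffix() hooks, and the Postgres conflict target shown here is an assumption):

```php
<?php
// Illustrative only: assemble the dialect-specific no-op insert
// for a fixed two-column statement.
function buildInsert(string $dialect, string $table, bool $skipDuplicates): string
{
    $keyword = 'INSERT INTO';
    $suffix  = '';
    if ($skipDuplicates) {
        switch ($dialect) {
            case 'mariadb':
            case 'mysql':
                $keyword = 'INSERT IGNORE INTO';
                break;
            case 'sqlite':
                $keyword = 'INSERT OR IGNORE INTO';
                break;
            case 'postgres':
                // Conflict target assumed; real code derives it from the schema.
                $suffix = ' ON CONFLICT (_uid) DO NOTHING';
                break;
        }
    }
    return "{$keyword} `{$table}` (_uid, name) VALUES (:uid, :name){$suffix}";
}
```

Note the asymmetry: MySQL-family dialects change the keyword, Postgres appends a clause, which is why two separate hooks are needed rather than one.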

Hot-path cost

One property read + one boolean branch per chunk. Callers that don't use skipDuplicates pay no closure allocation and no scope-guard setup — the feature cost is paid only when the feature is actually used. This addresses review comment #1 (don't penalize the common case with an $enable flag pattern).

Mirror dual-write correctness

Mirror::createDocuments pre-filters against source to identify which input ids already exist, then after the source insert forwards only the genuinely-new docs to destination. Pre-existing ids are silently no-op'd by the adapter's INSERT IGNORE — they represent no user write, so nothing propagates to destination.

This replaces an earlier captureOnNext-based approach flagged by Greptile (#3084293974): that approach wrapped $onNext to capture "what source inserted", but SQL adapters return the full input batch from createDocuments regardless of rowCount(), causing $onNext to fire for every doc — including skipped duplicates — so Mirror forwarded would-be values to destination, diverging from source. Pre-filter-then-insert matches the idempotent-write semantics: a skipped duplicate is not a user write, so it does not mirror.

Cost: one extra SELECT against source per batch when skipDuplicates is active and Mirror has an upgraded destination. Zero cost on the non-skip path. Skip and non-skip paths are unified into a single method body with branches only at the three points where behavior actually differs (source call, forward filter, destination call).
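The forward-filter step can be sketched in isolation (hypothetical function and array-shaped docs; the real code works on Document objects and chunks the pre-filter SELECT by maxQueryValues):

```php
<?php
// Sketch: given the ids that existed in source before the insert,
// keep only the docs that represent genuine new writes.
/**
 * @param string[] $existingSourceIds ids found by the pre-filter SELECT
 * @param array<array{id:string}> $batch caller input
 * @return array<array{id:string}> docs to forward to destination
 */
function filterForDestination(array $existingSourceIds, array $batch): array
{
    $existing = array_flip($existingSourceIds); // O(1) membership checks
    return array_values(array_filter(
        $batch,
        fn (array $doc) => !isset($existing[$doc['id']])
    ));
}

// 'dup' was already in source, so only 'fresh' propagates.
$forward = filterForDestination(['dup'], [['id' => 'dup'], ['id' => 'fresh']]);
```

This is the property the Mirror divergence test asserts: destination.dup stays at its original value while destination.fresh receives the new row.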

upsertDocumentsWithIncrease tenant grouping (regression fix)

Restored per-tenant grouping for the existing-doc lookup in sharedTables + tenantPerDocument mode. An earlier review round (#2 — inline find() instead of a helper) was correct for single-tenant use cases but ran under the session tenant context (null for platform workers), silently missing rows from other tenants. This broke appwrite's StatsUsage worker, which accumulates stats across many projects and flushes via $dbForLogs->setTenant(null)->setTenantPerDocument(true) + upsertDocumentsWithIncrease() — every input doc looked "new" and went down the create path instead of the update path.

The fix: branch on tenantPerDocument mode and run one find() per unique tenant wrapped in withTenant($tenant, ...). K queries total (K = unique tenants in the batch). Single-tenant callers (the common case, including CSV import and database migration) still hit the fast one-query batched path.

Helper: tenantKey(Document) — used only within upsertDocumentsWithIncrease to keep the in-memory lookup map and the intra-batch dedup check consistent. Returns "<tenant>:<id>" in tenantPerDocument mode, just <id> otherwise. Without it, a legitimate cross-tenant batch like [{id: x, tenant: 1}, {id: x, tenant: 2}] would falsely fail with DuplicateException.
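The helper's behavior is easy to show standalone (free-function sketch with scalar arguments; the real helper takes a Document and reads mode from instance state):

```php
<?php
// Sketch of tenantKey: tenant-qualify the lookup key only in
// tenantPerDocument mode, so cross-tenant batches sharing an id
// are not misreported as intra-batch duplicates.
function tenantKey(string $id, ?int $tenant, bool $tenantPerDocument): string
{
    return $tenantPerDocument ? "{$tenant}:{$id}" : $id;
}

// A legitimate cross-tenant batch: same id, different tenants.
$a = tenantKey('x', 1, true);
$b = tenantKey('x', 2, true);
```

Keyed by id alone, $a and $b would collide and the second doc would falsely trip the DuplicateException path; tenant-qualified, they are distinct.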

Tests

End-to-end coverage across all adapter variants (MariaDB / MySQL / Postgres / SQLite / MongoDB × Pool / Mirror / SharedTables):

  • testCreateDocumentsIgnoreDuplicates — mix of existing + new docs, non-existing inserted, existing no-op'd
  • testCreateDocumentsIgnoreAllDuplicates — every doc already exists, nothing newly inserted
  • testCreateDocumentsSkipDuplicatesEmptyBatch — empty batch is a no-op
  • testCreateDocumentsSkipDuplicatesNestedScope — scope guard saves & restores through nested calls
  • testCreateDocumentsSkipDuplicatesLargeBatch — 300 docs / 50 pre-existing, exercises the chunk loop
  • testCreateDocumentsSkipDuplicatesSecondCallSkipsAll — idempotent re-invocation preserves first-write values
  • testCreateDocumentsSkipDuplicatesRelationships — relationship children aren't duplicated on re-run
  • testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination (Mirror) — source-seeded dup=Original with a dup=WouldBe caller input, asserts destination.dup stays empty and destination.fresh gets the newly-inserted row (Greptile's exact assertion from #3084293974)

Test plan

  • skipDuplicates tests pass on every adapter variant
  • Full Mirror suite: 635 tests / 11,035 assertions green
  • Greptile divergence assertion holds
  • composer analyze (PHPStan) clean at the project's configured level
  • No regressions in existing testCreateDocuments* and testUpsert* suites

Summary by CodeRabbit

  • New Features

    • Add scoped "skip duplicates" mode for batch document creation that safely ignores duplicates, preserves existing attributes, and avoids aborting operations.
  • Refactor

    • Batch lookups and tenant-aware identity handling improve upsert/dedup detection and large-batch performance; duplicate-skipping integrates across storage backends and mirror flows.
  • Tests

    • Added comprehensive end-to-end tests covering nested scopes, large batches, empty batches, mirrors, and relationship propagation.

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

Warning

Rate limit exceeded

@premtsd-code has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 30 minutes and 43 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 30 minutes and 43 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5fe84cc5-10e1-46d9-9f87-37a1ce5e5c7d

📥 Commits

Reviewing files that changed from the base of the PR and between fa0e373 and 52b189b.

📒 Files selected for processing (2)
  • src/Database/Database.php
  • src/Database/Mirror.php
📝 Walkthrough

Walkthrough

Adds a scoped skipDuplicates(callable $callback) mode across Database, Pool, Mirror, and adapters that changes create/insert flows to avoid overwriting existing documents by using adapter-specific INSERT-IGNORE/ON CONFLICT or conditional upserts, and adds e2e tests for the behavior.

Changes

Cohort / File(s) Summary
Core adapter & Database state
src/Database/Adapter.php, src/Database/Database.php
Introduce protected skipDuplicates flag and public skipDuplicates(callable $callback): mixed. Database adds tenantKey(...), batches existing-document lookups, and conditionally routes create flows through adapter skip scope.
Pool delegation
src/Database/Adapter/Pool.php
Pool now wraps delegated adapter calls (pinned or borrowed) and transactional pinned-adapter withTransaction in the adapter's skipDuplicates(...) when pool skip flag is set.
Mongo adapter
src/Database/Adapter/Mongo.php
withTransaction short-circuits when skipping duplicates. createDocuments() gains a skip-duplicates path implemented via conditional upserts ($setOnInsert, _uid/_tenant filters) instead of insertMany.
SQL / Postgres / SQLite adapters
src/Database/Adapter/SQL.php, src/Database/Adapter/Postgres.php, src/Database/Adapter/SQLite.php
Added protected hooks getInsertKeyword(), getInsertSuffix(string $table), getInsertPermissionsSuffix(); SQL generation now emits INSERT IGNORE / ON CONFLICT DO NOTHING when skip mode is enabled.
Mirror behavior
src/Database/Mirror.php
When skipping duplicates, Mirror prefetches existing IDs from source, uses adapter skipDuplicates(...) for source/destination writes, and filters out source-existing docs before forwarding to destination.
Tests (e2e)
tests/e2e/Adapter/Scopes/DocumentTests.php, tests/e2e/Adapter/MirrorTest.php
Added comprehensive e2e tests covering duplicate-skipping semantics: mixed/all-duplicate/empty batches, nested scopes, large chunked batches, repeated calls, relationships, and mirror divergence prevention.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Database
    participant Adapter
    participant Storage as DB

    rect rgba(100,180,120,0.5)
    Note over Client,Storage: New flow with skipDuplicates enabled
    Client->>Database: createDocuments(docs) within skipDuplicates scope
    Database->>Database: compute tenantKey, dedupe, chunk ids
    Database->>Adapter: find(existing ids) [chunked]
    Adapter->>DB: SELECT/_find by _uid/_id
    DB-->>Adapter: existing ids
    Adapter-->>Database: existing ids
    Database->>Database: filter out pre-existing docs, prepare inserts
    Database->>Adapter: adapter->skipDuplicates(fn => createDocuments(filtered))
    Adapter->>DB: INSERT IGNORE / ON CONFLICT DO NOTHING or conditional upserts (Mongo)
    DB-->>Adapter: success (duplicates skipped)
    Adapter-->>Database: inserted result info
    Database->>Database: create relationships for actually inserted docs
    Database-->>Client: return inserted count / docs
    end
sequenceDiagram
    participant Client
    participant Database
    participant Adapter
    participant Storage as DB

    rect rgba(200,100,150,0.5)
    Note over Client,Storage: Regular flow without skipDuplicates
    Client->>Database: createDocuments(docs)
    Database->>Adapter: createDocuments(docs)
    Adapter->>DB: INSERT INTO ...
    DB-->>Adapter: may error on duplicates
    Adapter-->>Database: propagate exception or handle normally
    Database-->>Client: result or exception
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Feat mongodb #721: Overlaps Mongo adapter transaction and createDocuments changes that touch skip-duplicate behavior.
  • Fix update documents #646: Modifies Database::createDocuments and SQL insert handling; relevant to the insert-path changes.
  • Feat mongo tmp #647: Related Mongo adapter changes and upsert/transaction behavior affecting the same area.

Suggested reviewers

  • abnegate
  • fogelito
  • ArnabChatterjee20k

Poem

🐇 I hopped through rows and checked each id,
I nudged the duplicates softly to the side,
Scoped my skip so nested hops restore,
Fresh carrots stay, old roots keep core,
A tiny rabbit guard — no overwrites tonight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 47.06%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately summarizes the main change: adding a skipDuplicates() scope guard to createDocuments functionality across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@greptile-apps

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR adds a skipDuplicates() scope guard to createDocuments across all adapter backends (MySQL INSERT IGNORE, Postgres ON CONFLICT DO NOTHING, SQLite INSERT OR IGNORE, Mongo upsert + $setOnInsert), fixes per-tenant batching in upsertDocumentsWithIncrease, and hardens the Mirror flow to avoid forwarding skipped duplicates to the destination.

  • P1 — M2M junction duplication: createDocumentRelationships runs unconditionally for every input document before the adapter's dedup step. For MANY_TO_MANY relationships, relateDocuments always calls createDocument on the junction collection without checking for an existing entry; the junction table has no composite UNIQUE constraint, so each re-run silently appends a duplicate row for the same (key, twoWayKey) pair. The existing relationship test uses ONE_TO_MANY only and does not catch this.
  • P2 — Mirror pre-filter not tenant-scoped: In tenantPerDocument mode, the find() pre-filter in Mirror::createDocuments runs under the session tenant and misses documents from other tenants, potentially forwarding them to destination as if they were new writes.

Confidence Score: 3/5

One P1 defect (M2M junction duplication on re-run) should be resolved before merging for any re-import use cases involving MANY_TO_MANY relationships.

The core mechanism — adapter flags, Pool propagation, Mongo transaction bypass, Mirror divergence prevention — is well-designed and tests are thorough for covered cases. However, createDocumentRelationships runs before the dedup step with no guard for known duplicates, meaning MANY_TO_MANY junction entries accumulate silently on every re-run of a batch containing existing parent documents — directly undermining the idempotent-re-import guarantee the PR was built to deliver.

src/Database/Database.php (createDocuments pre-processing loop around line 5720), src/Database/Mirror.php (skipDuplicates pre-filter around line 614)

Important Files Changed

Filename Overview
src/Database/Database.php Adds skipDuplicates scope guard and tenantKey helper; upsertDocumentsWithIncrease batch-fetch is well structured, but createDocumentRelationships is called unconditionally for all documents — including known duplicates — causing MANY_TO_MANY junction entries to accumulate silently on re-runs.
src/Database/Mirror.php Pre-filter correctly chunks by maxQueryValues and prevents duplicate forwarding to destination; find is not tenant-scoped in tenantPerDocument mode, which can leak documents to destination in cross-tenant batches.
src/Database/Adapter/SQL.php Pluggable getInsertKeyword / getInsertSuffix pattern is clean; returns the full input batch regardless of actual row insertions, a previously-discussed known trade-off.
src/Database/Adapter/Postgres.php Overrides getInsertSuffix and getInsertPermissionsSuffix with correct ON CONFLICT … DO NOTHING clauses; conflict targets are tenant-aware.
src/Database/Adapter/SQLite.php Minimal override — provides INSERT OR IGNORE INTO for the skip-duplicates path.
src/Database/Adapter/Mongo.php Bypasses transaction and uses upsert + $setOnInsert to avoid write-conflict errors; return value of upsert() is discarded (full input batch returned), a trade-off explicitly acknowledged in a prior thread reply.
src/Database/Adapter/Pool.php Correctly propagates skipDuplicates to the pinned and pooled underlying adapters, including the withTransaction path.
src/Database/Adapter.php Adds $skipDuplicates flag and skipDuplicates(callable) scope guard with save/restore; clean implementation following the existing silent/skipRelationships pattern.
tests/e2e/Adapter/Scopes/DocumentTests.php Good breadth of new tests covering empty batch, nested scopes, large batches, and Mirror divergence; relationship test covers ONE_TO_MANY only, leaving MANY_TO_MANY junction duplication untested.
tests/e2e/Adapter/MirrorTest.php Adds testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination covering the Mirror divergence scenario from the prior review discussion; assertions are precise.

Comments Outside Diff (1)

  1. src/Database/Database.php, line 5720-5722 (link)

    P1 MANY_TO_MANY junction entries duplicated on re-run with skipDuplicates

    createDocumentRelationships is called unconditionally for every document in the pre-processing loop — including documents whose IDs are known duplicates and will be silently skipped by the adapter. For MANY_TO_MANY relationships, relateDocuments always creates a new junction entry via createDocument without checking whether one already exists. The junction collection is created with plain INDEX_KEY indexes, not a composite UNIQUE constraint, so each re-run silently appends a second (third, …) row for the same (key, twoWayKey) pair.

    Concrete failure: call skipDuplicates + createDocuments with a batch that contains existingParent having M2M children [existingChild]. The first call creates the junction row {parentId: existingParent, childId: existingChild}. The second call skips the parent insert but still calls createDocumentRelationships → relateDocuments → createDocument(junction, …) → a second identical junction row is inserted. After two calls existingChild appears twice in every query on the relationship.

    The existing test testCreateDocumentsSkipDuplicatesRelationships uses ONE_TO_MANY only (child stores FK, no junction table), so it does not exercise this path.

    A minimal guard is to skip createDocumentRelationships for documents whose ID is already in $preExistingIds when $this->skipDuplicates is true, which requires fetching pre-existing IDs before this loop (similar to what upsertDocumentsWithIncrease does at line 7153–7193).
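The proposed guard can be sketched as follows (variable names like $preExistingIds come from the review text; the surrounding Database code and createDocumentRelationships itself are not reproduced, so this is shape only):

```php
<?php
// Sketch of the minimal guard: when skipDuplicates is active, skip the
// relationship-creation step for documents whose ids are known to
// pre-exist, since the adapter will no-op their insert anyway.
$preExistingIds = ['existingParent' => true]; // from a lookup before this loop
$skipDuplicates = true;

$batch = [['id' => 'existingParent'], ['id' => 'newParent']];
$related = [];
foreach ($batch as $doc) {
    if ($skipDuplicates && isset($preExistingIds[$doc['id']])) {
        continue; // don't re-create M2M junction rows for a skipped duplicate
    }
    $related[] = $doc['id']; // stand-in for createDocumentRelationships(...)
}
```

Under this guard, re-running the batch leaves the existing parent's junction rows untouched and only the genuinely new parent gets relationship writes.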

Reviews (29): Last reviewed commit: "Chunk id lookups by \$this->maxQueryValu..." | Re-trigger Greptile

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/e2e/Adapter/Scopes/DocumentTests.php (1)

7815-7818: Prefer __FUNCTION__ for collection naming consistency.

Using __FUNCTION__ here (as in adjacent tests) improves isolation and avoids accidental name reuse across reruns.

♻️ Suggested tweak
-        $col = 'createDocsIgnoreIntraBatch';
+        $col = __FUNCTION__;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/Adapter/Scopes/DocumentTests.php` around lines 7815 - 7818, The
collection name is hard-coded as 'createDocsIgnoreIntraBatch' which can collide
across test runs; change the $col assignment to use the current function name
(replace the literal with __FUNCTION__) so the calls to
$database->createCollection($col) and $database->createAttribute($col, 'name',
Database::VAR_STRING, 128, true) use a unique, test-local collection name
consistent with adjacent tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/Mongo.php`:
- Around line 1501-1524: The current dedupe prefetch uses only _uid and a single
tenant filter, causing cross-tenant collisions for shared tables; update the
logic that builds existingUids (and the prefetch loop around
$this->client->find/$this->client->getMore) to be tenant-aware by using a
composite key (e.g., "$tenant:$uid") when populating $existingUids, include the
tenant in the $filters returned by $this->getTenantFilters($collection->getId())
per chunk, and prefetch in per-tenant chunks using array_chunk($uidValues,
max(1, $this->maxQueryValues)) (use the chunk size when calling getMore instead
of count($uidValues)); then when filtering $records check the composite key
($record['_tenant'] . ':' . ($record['_uid'] ?? '')) against $existingUids so
duplicate detection respects tenant boundaries.
- Around line 1504-1516: The prefetch block calls $this->client->find() and
$this->client->getMore() without exception mapping; wrap each driver call in a
try/catch that catches \Throwable and rethrows the normalized exception via
processException($e) (same pattern used elsewhere in this class), i.e. surround
the $this->client->find(...) call and the $this->client->getMore(...) call with
try { ... } catch (\Throwable $e) { throw $this->processException($e); } so any
raw driver errors from the $response/$more handling are mapped consistently.

In `@src/Database/Database.php`:
- Around line 7210-7255: The upsertDocumentsWithIncrease() flow builds
$existingDocs keyed as tenant:id when $upsertTenantPerDocument is true, but
later duplicate-input detection still uses only $document->getId(), causing
false DuplicateException across tenants; update the duplicate-input check and
any intermediate maps (e.g., $seenIds/$inputKeys) to use the same composite
lookup key (use the existing $lookupKey logic: $tenant . ':' .
$document->getId() when $upsertTenantPerDocument is true) so the tenant-aware
key is reused consistently across the whole upsert path.

---

Nitpick comments:
In `@tests/e2e/Adapter/Scopes/DocumentTests.php`:
- Around line 7815-7818: The collection name is hard-coded as
'createDocsIgnoreIntraBatch' which can collide across test runs; change the $col
assignment to use the current function name (replace the literal with
__FUNCTION__) so the calls to $database->createCollection($col) and
$database->createAttribute($col, 'name', Database::VAR_STRING, 128, true) use a
unique, test-local collection name consistent with adjacent tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f1924e-3ed0-43a8-8859-6e6c299aadb5

📥 Commits

Reviewing files that changed from the base of the PR and between 72ee161 and b9c7f4d.

📒 Files selected for processing (9)
  • src/Database/Adapter.php
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/Pool.php
  • src/Database/Adapter/Postgres.php
  • src/Database/Adapter/SQL.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Database.php
  • src/Database/Mirror.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
src/Database/Database.php (2)

5682-5726: ⚠️ Potential issue | 🟠 Major

Don’t let the outer prefetch snapshot suppress writes.

The adapters already do the authoritative duplicate filtering inside the transaction/session, but this code still drops $chunk entries from a pre-transaction find(). If a matching row is deleted after that snapshot but before the insert transaction starts, ignore: true will incorrectly skip a document that should now insert, and $modified / $onNext will be low.

Also applies to: 5805-5815

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5682 - 5726, The current prefetch
logic builds $preExistingIds (when $ignore is true, in the tenantPerDocument and
non-tenant branches using find()/authorization->skip/etc.) and later uses that
snapshot to skip writes, which can wrongly suppress inserts if a matching row
was deleted after the prefetch; remove or stop using this prefetch-based
suppression. Instead, eliminate reliance on $preExistingIds to skip document
inserts/relationship writes (and the related checks that reference it around the
relationship-resolution code and at the later block noted ~5805-5815), and
either rely on the adapters’ authoritative duplicate filtering inside the
transaction/session or re-check existence inside the transaction. Keep the
prefetch only if it’s used for optimizations that do not affect whether an
insert occurs, or remove the prefetch entirely if it’s unused for
non-authoritative optimizations.

5788-5790: ⚠️ Potential issue | 🟠 Major

Use the same tenant-aware key for deferred relationship writes.

In tenant-per-document mode, two inserted documents can legitimately share the same $id. This map is keyed only by $id, so the later payload overwrites the earlier one, and unset() after the first insert prevents the second tenant’s relationship write entirely.

♻️ Proposed fix
                 if (!empty($relationshipData)) {
-                    $deferredRelationships[$document->getId()] = $relationshipData;
+                    $relationshipKey = $tenantPerDocument
+                        ? $document->getTenant() . ':' . $document->getId()
+                        : $document->getId();
+                    $deferredRelationships[$relationshipKey] = $relationshipData;
                 }
@@
                 foreach ($batch as $insertedDoc) {
-                    $docId = $insertedDoc->getId();
-                    if (\array_key_exists($docId, $deferredRelationships)) {
+                    $relationshipKey = $tenantPerDocument
+                        ? $insertedDoc->getTenant() . ':' . $insertedDoc->getId()
+                        : $insertedDoc->getId();
+                    if (\array_key_exists($relationshipKey, $deferredRelationships)) {
                         $relDoc = clone $insertedDoc;
-                        foreach ($deferredRelationships[$docId] as $key => $value) {
+                        foreach ($deferredRelationships[$relationshipKey] as $key => $value) {
                             $relDoc->setAttribute($key, $value);
                         }
                         $this->silent(fn () => $this->createDocumentRelationships($collection, $relDoc));
-                        unset($deferredRelationships[$docId]);
+                        unset($deferredRelationships[$relationshipKey]);
                     }
                 }

Also applies to: 5823-5832

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5788 - 5790, The deferred
relationship map ($deferredRelationships) is keyed only by $document->getId(),
which causes collisions in tenant-per-document mode; change the key to be
tenant-aware (e.g., combine tenant id and document id or use the document's
tenant identifier method) wherever entries are set and removed so each
tenant/document pair is unique—update the code that assigns
$deferredRelationships[$document->getId()] = $relationshipData; and the matching
unset() logic (and the same pattern around the related block at lines referenced
5823-5832) to use the composite tenant-aware key instead of just getId().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2482-2527: The pre-filter SELECT that checks for existing _uid
rows (the block building $sql, preparing via $this->getPDO()->prepare, binding
values and calling $stmt->execute()/fetchAll()) can throw a raw PDOException
because it runs outside the method's main try/catch; wrap the entire
prepare/execute/fetch block in the adapter's processException(...) wrapper so
exceptions are mapped consistently (i.e., call $this->processException(...)
around the closure that prepares, binds, executes and fetches, returning $rows),
keeping the logic inside createDocuments() and referencing the existing $sql,
$binds, $stmt and $rows variables.
- Around line 2665-2713: The reconciliation currently keys timestamps only by
_uid and uses a single :_vfy_tenant bind when $this->sharedTables is true, which
corrupts mixed-tenant batches; update the verification to be tenant-aware by
binding the tenant per document and keying by a composite tenant+uid.
Concretely, in the block that builds $verifyPlaceholders/$verifyBinds and
$verifySql (the code using $verifyPlaceholders, $verifyBinds, $verifySql and
$verifyStmt), include the tenant value for each document (use the document's
tenant when tenant-per-document applies), change the WHERE to match pairs (e.g.
use row-value tuples or explicit (_uid = :uX AND _tenant = :tX) clauses), and
then build $actualTimestamps keyed by "$tenant|$uid" (and compare against
$expectedTimestamps keyed the same way) so the subsequent loop that populates
$insertedDocs/$insertedUids only matches rows with the correct tenant+uid.
- Around line 2650-2652: Replace the use of "INSERT IGNORE" (returned by
getInsertKeyword) with a plain "INSERT" and remove any IGNORE-specific behavior
in getInsertSuffix so that constraint violations are not globally suppressed;
instead let the DB raise errors and update the insert/execute path that catches
SQL exceptions to inspect the database error/constraint name (e.g., index or key
name such as _uid) and map errors to DuplicateException only when the violated
constraint corresponds to the _uid duplicate, otherwise map to UniqueException
(or other specific exceptions) for PRIMARY/UNIQUE/NOT NULL/truncation
violations—use the existing functions getInsertKeyword and getInsertSuffix to
locate changes and adjust the exception-mapping in the code that performs the
insert and error handling.

---

Duplicate comments:
In `@src/Database/Database.php`:
- Around line 5682-5726: The current prefetch logic builds $preExistingIds (when
$ignore is true, in the tenantPerDocument and non-tenant branches using
find()/authorization->skip/etc.) and later uses that snapshot to skip writes,
which can wrongly suppress inserts if a matching row was deleted after the
prefetch; remove or stop using this prefetch-based suppression. Instead,
eliminate reliance on $preExistingIds to skip document inserts/relationship
writes (and the related checks that reference it around the
relationship-resolution code and at the later block noted ~5805-5815), and
either rely on the adapters’ authoritative duplicate filtering inside the
transaction/session or re-check existence inside the transaction. Keep the
prefetch only if it’s used for optimizations that do not affect whether an
insert occurs, or remove the prefetch entirely if it’s unused for
non-authoritative optimizations.
- Around line 5788-5790: The deferred relationship map ($deferredRelationships)
is keyed only by $document->getId(), which causes collisions in
tenant-per-document mode; change the key to be tenant-aware (e.g., combine
tenant id and document id or use the document's tenant identifier method)
wherever entries are set and removed so each tenant/document pair is
unique—update the code that assigns $deferredRelationships[$document->getId()] =
$relationshipData; and the matching unset() logic (and the same pattern around
the related block at lines referenced 5823-5832) to use the composite
tenant-aware key instead of just getId().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d2ab610b-bb64-4dcb-979c-f3880d9e99df

📥 Commits

Reviewing files that changed from the base of the PR and between 20846d3 and 242c9a8.

📒 Files selected for processing (3)
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/SQL.php
  • src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Adapter/Mongo.php

Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
src/Database/Adapter/SQL.php (3)

2482-2527: ⚠️ Potential issue | 🟠 Major

Wrap the ignore-mode pre-filter in processException() handling.

This query runs before the method’s main try/catch, so SQL or connection failures here still escape as raw PDOException while the rest of createDocuments() is adapter-mapped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2482 - 2527, The pre-filter block
in createDocuments() that prepares and executes the SELECT (using getSQLTable(),
getPDO(), getPDOType(), preparing $sql and executing $stmt) should be wrapped
inside a call to processException() so any PDOException is adapter-mapped
instead of escaping; move the existing logic that builds $placeholders/$binds,
$tenantFilter, $sql, $stmt->bindValue(), $stmt->execute(), $rows =
$stmt->fetchAll() and $stmt->closeCursor() into a closure passed to
processException(), and return/assign the result from that closure (or rethrow
via processException) so behavior matches the rest of createDocuments().

2782-2785: ⚠️ Potential issue | 🟠 Major

Don’t use INSERT IGNORE as the base implementation for ignore=true.

For adapters that inherit this default, ignore=true stops meaning “skip duplicate document IDs” and starts suppressing a broader class of insert failures on both the document and permission insert paths. The base hook should stay plain INSERT INTO, and adapter-specific duplicate handling should be expressed with targeted conflict clauses or constraint-aware exception mapping.

Does MariaDB/MySQL `INSERT IGNORE` suppress constraint violations beyond duplicate-key conflicts, such as PRIMARY KEY, other UNIQUE, NOT NULL, truncation, or conversion errors?

Based on learnings, in src/Database/Adapter/MariaDB.php only duplicate _uid violations should throw DuplicateException; other unique or PRIMARY collisions should still throw UniqueException.
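
A quick dialect comparison of why the blanket keyword is broader than a targeted conflict clause (table and column names invented for the example):

```sql
-- MariaDB/MySQL: IGNORE downgrades a whole class of errors to warnings,
-- not just duplicate keys — e.g. inserting NULL into a NOT NULL column
-- proceeds with the column's implicit default instead of failing.
INSERT IGNORE INTO my_collection (_uid, name) VALUES ('doc1', NULL);

-- Postgres: the conflict target names the exact columns, so only a
-- duplicate-key conflict on that unique index is suppressed; NOT NULL,
-- check, and other unique violations still raise errors.
INSERT INTO my_collection (_uid, name)
VALUES ('doc1', 'first')
ON CONFLICT (_uid) DO NOTHING;
```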

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2782 - 2785, The base method
getInsertKeyword in class SQL should not return 'INSERT IGNORE INTO' for
ignore=true; change it to always return the plain 'INSERT INTO' and move any
ignore semantics into adapter-specific code (e.g., Adapter/MariaDB.php) by using
targeted conflict clauses or by mapping constraint/SQLSTATE errors to the
correct exceptions; specifically, in MariaDB adapter ensure only duplicate
violations on the document _uid map to DuplicateException while other
UNIQUE/PRIMARY/constraint failures map to UniqueException (and do not rely on
INSERT IGNORE in SQL::getInsertKeyword).

2664-2728: ⚠️ Potential issue | 🔴 Critical

Keep expectedTimestamps tenant-aware too.

$actualTimestamps is keyed by tenant:uid now, but $expectedTimestamps is still keyed only by _uid. In tenantPerDocument mode, two documents with the same _uid in different tenants can overwrite each other here, causing an inserted document to be dropped from $documents and miss its permission insert.

💡 Proposed fix
-                $expectedTimestamps = [];
-                foreach ($documents as $doc) {
-                    $expectedTimestamps[$doc->getId()] = $doc->getCreatedAt();
-                }
+                $expectedTimestamps = [];
+                foreach ($documents as $doc) {
+                    $key = ($this->sharedTables && $this->tenantPerDocument)
+                        ? $doc->getTenant() . ':' . $doc->getId()
+                        : $doc->getId();
+                    $expectedTimestamps[$key] = $doc->getCreatedAt();
+                }
@@
-                    if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$doc->getId()]) {
+                    if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$key]) {
                         $insertedDocs[] = $doc;
                     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2664 - 2728, The
expectedTimestamps map is not tenant-aware while actualTimestamps is, so when
sharedTables && tenantPerDocument you must key expectedTimestamps the same way
as actualTimestamps (tenant + ':' + uid) during the initial foreach over
$documents and also use that key when comparing later (instead of
$expectedTimestamps[$doc->getId()]); update the loop that builds
$expectedTimestamps and the comparison that checks equality to construct/look up
keys as ($doc->getTenant() . ':' . $doc->getId()) when $this->sharedTables &&
$this->tenantPerDocument so entries don't collide across tenants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2482-2527: The pre-filter block in createDocuments() that prepares
and executes the SELECT (using getSQLTable(), getPDO(), getPDOType(), preparing
$sql and executing $stmt) should be wrapped inside a call to processException()
so any PDOException is adapter-mapped instead of escaping; move the existing
logic that builds $placeholders/$binds, $tenantFilter, $sql, $stmt->bindValue(),
$stmt->execute(), $rows = $stmt->fetchAll() and $stmt->closeCursor() into a
closure passed to processException(), and return/assign the result from that
closure (or rethrow via processException) so behavior matches the rest of
createDocuments().
- Around line 2782-2785: The base method getInsertKeyword in class SQL should
not return 'INSERT IGNORE INTO' for ignore=true; change it to always return the
plain 'INSERT INTO' and move any ignore semantics into adapter-specific code
(e.g., Adapter/MariaDB.php) by using targeted conflict clauses or by mapping
constraint/SQLSTATE errors to the correct exceptions; specifically, in MariaDB
adapter ensure only duplicate violations on the document _uid map to
DuplicateException while other UNIQUE/PRIMARY/constraint failures map to
UniqueException (and do not rely on INSERT IGNORE in SQL::getInsertKeyword).
- Around line 2664-2728: The expectedTimestamps map is not tenant-aware while
actualTimestamps is, so when sharedTables && tenantPerDocument you must key
expectedTimestamps the same way as actualTimestamps (tenant + ':' + uid) during
the initial foreach over $documents and also use that key when comparing later
(instead of $expectedTimestamps[$doc->getId()]); update the loop that builds
$expectedTimestamps and the comparison that checks equality to construct/look up
keys as ($doc->getTenant() . ':' . $doc->getId()) when $this->sharedTables &&
$this->tenantPerDocument so entries don't collide across tenants.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 892638bd-ac20-414b-9059-7f7e20295028

📥 Commits

Reviewing files that changed from the base of the PR and between 242c9a8 and 455da70.

📒 Files selected for processing (2)
  • src/Database/Adapter/SQL.php
  • src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Database.php

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2676-2682: The verification query is binding composite keys from
$expectedTimestamps into $verifyBinds/placeholders and then comparing them
against the _uid column, but when tenantPerDocument is enabled _uid was set to
the raw document ID ($attributes['_uid'] = $document->getId()), so the WHERE
_uid IN (...) must use the document IDs not the composite keys. Fix by
extracting the document ID portion from each composite key before creating $ph
and adding to $verifyBinds (or if keys are already plain IDs, leave unchanged);
update the loop that builds $verifyPlaceholders/$verifyBinds (the foreach over
array_values(array_unique(array_keys($expectedTimestamps)))) to derive $uid =
<documentIdFromComposite($uid)> and bind that instead so the verification query
matches rows by actual _uid values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43af96a2-d479-4354-86c9-7d68abad7b56

📥 Commits

Reviewing files that changed from the base of the PR and between 455da70 and deedfda.

📒 Files selected for processing (1)
  • src/Database/Adapter/SQL.php

Adapters:
- MariaDB/MySQL: INSERT IGNORE INTO
- PostgreSQL: ON CONFLICT DO NOTHING
- SQLite: INSERT OR IGNORE INTO
- MongoDB: session-scoped pre-filter before insertMany

Database.php:
- Intra-batch dedup by ID (tenant-aware, first occurrence wins)
- Pre-fetch existing IDs to skip known duplicates
- Deferred relationship creation for ignore mode (no orphans)
- Race-condition reconciliation via _createdAt timestamp verification

upsertDocuments:
- Batch-fetch existing docs with find() instead of per-row getDocument()
- Tenant-aware composite keys for seenIds duplicate check

All paths are tenant-per-document aware with null-safe array_filter.
- testCreateDocumentsIgnoreDuplicates: mixed batch with onNext assertions
- testCreateDocumentsIgnoreIntraBatchDuplicates: first-wins, no ACL drift
- testCreateDocumentsIgnoreAllDuplicates: zero inserts, empty onNext
@premtsd-code premtsd-code force-pushed the csv-import-upsert-v2 branch from f2b08dc to bee42d4 Compare April 9, 2026 10:28
…x _uid mapping

- Pass ignoreDuplicates option to insertMany for race-condition safety
- Extract _uid from raw array before replaceChars (excluded from transformation)
- Map inserted records back to original Document objects by _uid
- Add skipDuplicates(callable) scope guard on Database, following
  existing pattern (skipRelationships, skipValidation, etc.)
- Remove bool $ignore parameter from createDocuments signature
- Mirror propagates skipDuplicates state to source and destination
- Update tests to use $database->skipDuplicates(function() { ... })
- Add $skipDuplicates property + skipDuplicates() scope guard to Adapter
- Remove bool $ignore parameter from all adapter createDocuments signatures
  (Adapter, Pool, Mongo, SQL)
- Remove bool $ignore from SQL helper methods (getInsertKeyword,
  getInsertSuffix, getInsertPermissionsSuffix) and Postgres/SQLite overrides
- Pool delegate() and withTransaction() propagate skipDuplicates state to
  pooled/pinned adapter via the scope guard
- Database::createDocuments() wraps adapter call in adapter->skipDuplicates()
  when the flag is set, drops the local $ignore variable
- skipDuplicates() takes optional $enable so Pool/Mirror/Database can
  forward state in a single call instead of branching
- Database: tenantKey(Document) helper replaces ~8 inline ternaries
- Database: fetchExistingByIds() helper replaces the tenant-grouped
  batched find() in createDocuments and upsertDocuments pre-fetch
- Database: drop dead guard on non-defer relationship path
- Mongo: findExistingUids() helper replaces the duplicated cursor walk
  in the skipDuplicates pre-filter (tenant and non-tenant variants)
- SQL: buildUidTenantLookup() helper replaces the duplicated
  placeholder-and-binds builder shared by pre-filter and reconciliation
When a batch has only null-ID documents but non-null tenants, the
previous code emitted WHERE _uid IN () AND _tenant IN (...) — invalid
SQL that the !empty(binds) call-site guard failed to catch because the
tenant binds kept the binds array non-empty.

Return 'where' => '1=0' with empty binds when there are no UIDs, so
the query matches nothing and leaves no dangling placeholders.
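
The shape of the guard, as a hedged sketch (the helper was later removed in this PR; names and array structure are illustrative, not the actual code):

```php
<?php
// Illustrative sketch of the fixed lookup builder: with no UIDs there is
// nothing to match, so emit an always-false predicate rather than the
// invalid "_uid IN ()" fragment.
function buildUidTenantLookup(array $uids, array $tenants): array
{
    if (empty($uids)) {
        return ['where' => '1=0', 'binds' => []];
    }

    $where = '_uid IN (' . \implode(',', \array_fill(0, \count($uids), '?')) . ')';
    $binds = \array_values($uids);

    if (!empty($tenants)) {
        $where .= ' AND _tenant IN (' . \implode(',', \array_fill(0, \count($tenants), '?')) . ')';
        $binds = \array_merge($binds, \array_values($tenants));
    }

    return ['where' => $where, 'binds' => $binds];
}
```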
- testCreateDocumentsSkipDuplicatesEmptyBatch: no-op with empty array
- testCreateDocumentsSkipDuplicatesNestedScope: scope guard save/restore
  through nested skipDuplicates() calls
- testCreateDocumentsSkipDuplicatesLargeBatch: 300-doc batch with 50
  pre-existing, exercising the chunk loop
- testCreateDocumentsSkipDuplicatesSecondCallSkipsAll: second identical
  batch should preserve First values, not overwrite with Second
- testCreateDocumentsSkipDuplicatesRelationships: skipDuplicates with a
  one-to-many relationship — verifies deferred relationship logic does
  not create orphan child rows for ignored parents
INSERT IGNORE / ON CONFLICT DO NOTHING / INSERT OR IGNORE handle
duplicates at the DB level, and the existing reconciliation block
identifies actually-inserted rows via the _createdAt timestamp. The
pre-filter was a wire-data optimization that ran a SELECT before the
INSERT, then trimmed the batch — but the same SELECT (with timestamp
comparison) already happens in the reconciliation path on demand.

Net: ~45 lines removed, single code path for skipDuplicates instead
of pre-filter + fallback. Race-condition handling is unchanged
(reconciliation path covers it).
Previously the Database orchestrator carried a pre-fetch + chunk-filter
pass to compensate for Mongo's plain upsert() returning matched +
upserted combined ('n') — without that compensation, $modified would
over-count the number of pre-existing docs in the batch.

This pushes the responsibility into the adapter, where it belongs:
- composer bump to utopia-php/mongo dev-upsert-return-upserted-count
  (PR #37, adds Client::upsertWithCounts() returning matched/modified/
  upserted[] separately)
- Mongo adapter calls upsertWithCounts() and maps the response's
  upserted[].index entries back to the originating Document objects,
  returning ONLY the actually-inserted docs (race-safe)
- Database::createDocuments drops $preExistingIds pre-fetch and the
  chunk filter; the adapter is now responsible for accurate counts

Net effect on Database.php: ~17 lines removed. Net effect on Mongo.php:
~80 lines net deletion (replaced the inline pre-filter with a single
upsertWithCounts call + index mapping).
The SQL adapter previously carried a pre-filter SELECT plus a _createdAt
reconciliation pass to compensate for INSERT IGNORE not telling us which
rows were actually inserted vs skipped. The Mongo adapter pulled the same
trick via Client::upsertWithCounts() (mongo PR #37). Both layers were
solving the same problem twice.

Move the pre-filter back to Database::createDocuments where it ran
originally, and let each adapter's INSERT IGNORE / ON CONFLICT DO NOTHING
/ upsert + $setOnInsert serve only as a race-safety net. The adapter just
returns its input as the inserted set.

- Database.php: restore $preExistingIds pre-fetch + chunk filter
- SQL.php: drop pre-filter, drop _createdAt reconciliation, drop
  buildUidTenantLookup helper (-173 lines)
- Mongo.php: revert to plain Client::upsert(), return input docs
  (withTransaction skipDuplicates bypass stays — still needed to avoid
  WriteConflict under snapshot isolation)
- composer.json/lock: back to stable utopia-php/mongo 1.*

Trade-off: in the race window between the orchestrator's pre-fetch
SELECT and the adapter INSERT, $modified over-reports by N and onNext
fires for N skipped docs. Common case stays correct.
…en comments

The deferred-relationships mechanism was useful when the adapter could
report which docs were actually inserted (via reconciliation/
upsertWithCounts). With those gone, $batch always equals $chunk, so the
defer-then-match dance is dead weight. The orchestrator already pre-
filters known duplicates upfront, so any survivor is expected to insert
and relationships can be created inline like the non-skipDuplicates path.

- Database.php: pre-filter $documents once, before encode/validate,
  instead of filtering each chunk inline. Drop $deferredRelationships
  scaffolding and the per-chunk match loop. Collapse if/elseif into
  one inline createDocumentRelationships call.
- Mongo.php: tighten withTransaction skipDuplicates comment (the
  "relationship writes are deferred" rationale is no longer accurate).
- Mirror.php: tighten the captureOnNext comment.
- composer.lock: revert unrelated symfony-polyfill-php85 / mongo 1.0.0→
  1.0.2 bumps that crept in from an earlier composer update.

Race window unchanged: a doc that passes pre-filter but loses to a
concurrent insert still produces an orphan relationship and an
over-counted $modified — same as before this commit.
 * @param bool $enable
 * @return T
 */
public function skipDuplicates(callable $callback, bool $enable = true): mixed
Member

Let's remove the $enable flag and check if skipDuplicates instead; otherwise we add the overhead of calling an extra callable when it's false

 * @param bool $idsOnly When true, uses Query::select(['$id']) for a lighter fetch
 * @return array<string, Document>
 */
private function fetchExistingByIds(Document $collection, array $documents, bool $idsOnly = false): array
Member

We can use find

Comment on lines +5751 to +5776
// skipDuplicates: dedupe intra-batch (first wins) then drop already-existing rows.
// Adapter INSERT IGNORE / ON CONFLICT / upsert is the race-safety net.
if ($this->skipDuplicates) {
    $seenIds = [];
    $deduplicated = [];
    foreach ($documents as $document) {
        if ($document->getId() !== '') {
            $dedupeKey = $this->tenantKey($document);
            if (isset($seenIds[$dedupeKey])) {
                continue;
            }
            $seenIds[$dedupeKey] = true;
        }
        $deduplicated[] = $document;
    }
    $documents = $deduplicated;

    $preExistingIds = $this->fetchExistingByIds($collection, $documents, idsOnly: true);
    if (!empty($preExistingIds)) {
        $documents = \array_values(\array_filter(
            $documents,
            fn (Document $doc) => !isset($preExistingIds[$this->tenantKey($doc)])
        ));
    }
}

Member

I don't think we need to do this if using insert ignore

Contributor Author

@abnegate The child collection is getting orphan rows when a batch has multiple rows with the same parent but different children: the parent is ignored while the child is not.

if ($this->resolveRelationships) {
    $document = $this->silent(fn () => $this->createDocumentRelationships($collection, $document));
}

Comment on lines +607 to +613
$insertedFromSource = [];
$captureOnNext = function (Document $doc) use (&$insertedFromSource, $onNext): void {
    $insertedFromSource[] = $doc;
    if ($onNext !== null) {
        $onNext($doc);
    }
};
Member

We can never capture like this internally; if there were many docs, this loop could easily OOM

Comment on lines +2614 to +2640
/**
 * Returns the INSERT keyword, optionally with IGNORE for duplicate handling.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertKeyword(): string
{
    return $this->skipDuplicates ? 'INSERT IGNORE INTO' : 'INSERT INTO';
}

/**
 * Returns a suffix appended after the VALUES clause for duplicate handling.
 * Override in adapter subclasses (e.g., Postgres uses ON CONFLICT DO NOTHING).
 */
protected function getInsertSuffix(string $table): string
{
    return '';
}

/**
 * Returns a suffix for the permissions INSERT statement when ignoring duplicates.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertPermissionsSuffix(): string
{
    return '';
}

Member

Let's group logically with other getter methods

#1 Drop $enable flag on skipDuplicates() scope guard
The $enable param made every non-skipDuplicates createDocuments call pay
for a closure allocation + extra function call. Branch at the call site
instead so the cost only applies when the flag is actually set.

  - Adapter::skipDuplicates(callable, bool) → skipDuplicates(callable)
  - Database::skipDuplicates(callable, bool) → skipDuplicates(callable)
  - Database::createDocuments, Mirror::createDocuments, Pool::delegate,
    Pool::withTransaction now branch inline.

#2 Drop fetchExistingByIds helper, inline find()
The helper's per-tenant grouping defended a hypothetical multi-tenant
batching scenario that no caller exercises (relationships are intra-
tenant, callers always batch per tenant). Existing patterns in the same
file (refetchDocuments, relationship loading) just call find() directly.
Match that idiom and drop ~70 lines.

#4 Mirror: only capture inserted docs in skipDuplicates mode
The captureOnNext accumulator paid the cost (closure + per-doc array
push) on every createDocuments call, including the common non-skip path.
Branch at the entry of Mirror::createDocuments so the capture only
happens when skipDuplicates is set; the non-skip path passes through
to source/destination unchanged.

#5 Move getInsertKeyword/Suffix/PermissionsSuffix to getters cluster
Were sitting next to createDocuments(); moved to the getSupport*
cluster around line 1030 where other adapter-capability shims live.

Not addressed:
- #2 partial: the existing patterns (refetchDocuments etc.) don't handle
  tenant-per-document multi-tenant batches either, so this is consistent.
- #3 (drop the pre-filter): rejected. createDocumentRelationships runs
  in the encoding loop BEFORE the adapter's INSERT IGNORE no-ops the
  parent, so dropping the pre-filter would deterministically duplicate
  child rows on every CSV re-import of a collection with relationships
  (not a race window — every call). The relationships test verifies
  this. Reverting would require reintroducing the deferred-relationships
  scaffolding we just removed, and the adapter still couldn't tell us
  which parents were actually inserted (SQL INSERT IGNORE has no per-row
  reporting). Pre-filter stays.
Per Jake's #4 review comment, the captureOnNext accumulator could grow
unbounded with the full inserted set. Replace it with a bounded buffer
that flushes to the destination every $batchSize docs as the source
streams them through onNext, plus a tail flush after the source returns.

Memory peak in skipDuplicates mode is now O($batchSize) (~1000 docs)
regardless of how many docs the source inserts. Same destination batching
efficiency — destination still gets multi-doc createDocuments calls,
just one per chunk instead of one giant call at the end.

Eligibility checks (SOURCE_ONLY_COLLECTIONS, upgrade status) moved
upfront so the flush closure can capture the decision and short-circuit
cleanly. Non-skip path is unchanged: direct passthrough, no buffer.

The flush-on-fill is load-bearing for the memory bound — see the
in-method comment.
…erDocument

When inlining find() per Jake's review, the per-tenant grouping the
deleted fetchExistingByIds helper used to do was lost. In tenantPerDocument
mode with a multi-tenant batch, the inlined find() runs under the session
tenant context (often null for platform workers) and silently misses rows
belonging to other tenants — every input doc looks "new" and goes down the
create path instead of the update path.

This actively breaks appwrite's StatsUsage worker, which accumulates stats
across many projects (each tagged with its own $tenant) and flushes them
via $dbForLogs->setTenant(null)->setTenantPerDocument(true) followed by
upsertDocumentsWithIncrease(...). With the old per-doc getDocument loop,
each lookup ran under withTenant($doc->getTenant()) and was correct. The
batch find() inlining lost that scoping.

Fix per Greptile's suggestion: branch on tenantPerDocument mode and run
one find() per unique tenant value under withTenant, merging results into
the same lookup hash. K queries (K = unique tenants in the batch) instead
of N (per-doc) or 1 (broken). Common single-tenant case still hits the
fast batched path.

Applied at both call sites:
- Database::createDocuments skipDuplicates pre-filter
- Database::upsertDocumentsWithIncrease existing-docs lookup

Fixes greptile flag at PR #852 discussion r3077481595.
Per Jake's review feedback, the pre-filter in Database::createDocuments
was redundant with the adapter's INSERT IGNORE / ON CONFLICT DO NOTHING
/ upsert + \$setOnInsert path. The adapter already handles duplicates
atomically; the upfront SELECT duplicated that work.

The semantic is now row-level dedup:
- pre-existing rows are silently no-op'd by the adapter
- new rows go through
- retries are idempotent
- source data is never overwritten

Tests updated to match:
- 5 skipDuplicates tests adjusted to expect full-chunk counts, with
  data-level assertions unchanged (pre-existing values preserved, new
  rows inserted).
- testCreateDocumentsSkipDuplicatesRelationships encodes the retry
  scenario: missing nested child gets created via relateDocuments'
  idempotent path when the parent already exists.
- testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination relaxed
  to source-authoritative semantics only. Destination may transiently
  hold would-be values during the migration window; backfill reconciles.

All 105 skipDuplicates tests across 11 adapter variants pass, full
Mirror suite (636 tests) passes, PHPStan level 7 clean.
Follow-up to Jake's #5 review comment — the fix was only applied to
SQL.php. Propagate the same grouping to the other adapters and fix one
cosmetic byproduct from the Mirror refactor.

- Postgres.php: move getInsertKeyword/Suffix/PermissionsSuffix from
  between createDocuments() and getUpsertStatement() down to the
  getSupportFor* cluster (matches SQL.php:1036 layout).
- SQLite.php: move getInsertKeyword override from the top of the class
  to after the getSupportFor* cluster.
- Adapter.php: skipDuplicates() method was inserted between protected
  property declarations, breaking the property cluster. Move the method
  down next to withTransaction() — the other callable-taking scope
  guard. $skipDuplicates property stays with its peers.
- Mirror.php: revert the incidental fn () => collapse on the non-skip
  destination createDocuments call back to the original two-line form,
  minimizing unrelated diff noise.

Pure relocation / style. No behavior change.
Addresses Greptile #3084293974. The captureOnNext-based forwarding
was structurally broken post-pre-filter-removal: SQL adapters return
the full input batch from createDocuments regardless of how many rows
INSERT IGNORE actually inserted, so onNext fired for every document
including skipped duplicates. captureOnNext buffered all of them and
flushed the full buffer (carrying the caller's would-be values) to
destination, diverging destination from source whenever source had a
pre-existing row the caller tried to overwrite.

Fix: instead of trying to distinguish inserted vs skipped at insert
time, re-fetch source's authoritative state via find() after the
adapter write settles, then forward that to destination. Race-free
regardless of concurrent writers — destination always receives a
faithful snapshot of whatever source holds, because the query runs
after source's write has resolved.

Cost: one extra SELECT per batch against source when skipDuplicates
is active AND Mirror has an upgraded destination. Zero cost on the
non-skip path. Memory remains O(batch_size).

Net ~47 lines smaller than the previous capture-based approach:
- captureOnNext closure removed
- flushBuffer closure removed
- bounded-buffer machinery removed
- shouldMirror upfront-snapshot removed (inlined as eligibility check)

Test: testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination now
asserts destination holds source's 'Original' value rather than the
caller's 'WouldBe' input, per Greptile's suggested assertion.
…p paths

Addresses Greptile #3084293974. The captureOnNext-based forwarding
was structurally broken: SQL adapters return the full input batch from
createDocuments regardless of what INSERT IGNORE actually inserted, so
onNext fired for every doc including skipped duplicates. captureOnNext
then forwarded the caller's would-be values to destination, diverging
destination from source whenever source had a pre-existing row the
caller tried to re-insert.

Fix: pre-filter against source before the insert to identify which
input ids already exist, then forward only the genuinely-new docs to
destination. A source-skipped duplicate is a no-op (no user write
happened), so nothing should propagate. This matches Greptile's
semantics: destination.dup.isEmpty() after skipping a duplicate.

While the logic was touched, the skip and non-skip paths (which shared
eligibility check, upgrade check, clone loop, destination write, and
after-filter loop) are unified into a single flow with $this->
skipDuplicates branches only at the points where behavior actually
differs:

  - source call: wrapped in source->skipDuplicates when active
  - $toForward: filter out pre-existing ids when active
  - destination call: wrapped in destination->skipDuplicates when active

Non-skip hot path pays zero new closure allocations; the pre-filter
block is gated behind if ($this->skipDuplicates) and short-circuits
to an empty $existingIds map.

Net: -46 lines in createDocuments (~170 → ~95), same test coverage,
Greptile's assertion now holds.
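
The pre-filter-then-forward shape can be sketched as follows (a minimal sketch assuming single-tenant batches; the find/Query usage mirrors patterns quoted elsewhere in this thread, not the exact Mirror code):

```php
<?php
// Before the source insert: snapshot which input ids already exist in source.
$existingIds = [];
if ($this->skipDuplicates) {
    $ids = \array_map(fn (Document $doc) => $doc->getId(), $documents);
    foreach ($this->source->find($collection->getId(), [Query::equal('$id', $ids)]) as $existing) {
        $existingIds[$existing->getId()] = true;
    }
}

// After the source insert: pre-existing ids were no-op'd by the adapter's
// INSERT IGNORE — no user write happened, so forward only the new docs.
$toForward = \array_values(\array_filter(
    $documents,
    fn (Document $doc) => !isset($existingIds[$doc->getId()])
));
```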
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Database/Mirror.php (1)

604-656: ⚠️ Potential issue | 🟠 Major

Pre-filter lookup is not tenant-aware for tenantPerDocument mode.

When tenantPerDocument is enabled, documents with the same $id but different tenants are distinct. The current implementation:

  1. Calls find() with all IDs at once without grouping by tenant (line 616)
  2. Keys $existingIds by $id alone (line 622)
  3. Filters $toForward by $id alone (line 650)

If the batch contains documents for multiple tenants, this incorrectly excludes new documents whose ID exists for a different tenant, or fails to detect true duplicates for the correct tenant.

Use the same tenant-aware pattern as Database::upsertDocumentsWithIncrease(): when tenantPerDocument is enabled, group documents by tenant and call find() per-tenant using withTenant(), then use composite keys ($tenant:$id) for tracking existing IDs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Mirror.php` around lines 604 - 656, The pre-filter that builds
$existingIds and filters $toForward must be made tenant-aware when
tenantPerDocument is enabled: instead of calling $this->source->find(...) once
for all IDs and keying by $id, group the $documents by tenant (use the same
pattern as Database::upsertDocumentsWithIncrease()), and for each tenant call
$this->source->withTenant($tenant)->silent(fn() => $this->source->find(...)) to fetch only
that tenant's existing IDs; store keys in $existingIds as composite
"$tenant:$id" and update the filtering lambda for $toForward (and the earlier
array_map/array_filter that builds $ids) to use the same composite key so only
documents with matching tenant+id are considered duplicates.
♻️ Duplicate comments (1)
src/Database/Database.php (1)

5728-5732: ⚠️ Potential issue | 🔴 Critical

skipDuplicates() still leaves relationship side effects behind.

By the time execution reaches Line 5728, every input document has already gone through createDocumentRelationships(). If the adapter then ignores a duplicate parent row, those child/junction writes are still committed. That can orphan rows or mutate relationships on an existing document even though the parent create was skipped. Relationship writes need to be driven from the successfully inserted $batch, not from the pre-insert input list.
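The reordering the reviewer asks for can be shown with a toy flow. The adapter call here is a stand-in closure that drops a pretend duplicate, mimicking INSERT IGNORE; none of these names come from the actual codebase:

```php
<?php
// Sketch: insert first, then drive relationship writes from the rows the
// adapter actually accepted, so skipped duplicates leave no side effects.
$input = [['id' => 'new1'], ['id' => 'dup'], ['id' => 'new2']];

// Stand-in for the adapter: pretend 'dup' already exists and is ignored.
$adapterInsert = fn (array $chunk): array =>
    array_values(array_filter($chunk, fn ($d) => $d['id'] !== 'dup'));

$relationshipWrites = [];
$writeRelationships = function (array $doc) use (&$relationshipWrites): void {
    $relationshipWrites[] = $doc['id'];
};

// 1) Insert first; $batch is the authoritative list of created rows.
$batch = $adapterInsert($input);

// 2) Only then write relationships, and only for inserted rows.
foreach ($batch as $doc) {
    $writeRelationships($doc);
}

print_r($relationshipWrites); // 'dup' never triggers a relationship write
```

Running relationships before the insert (the flow being criticized) would have recorded all three IDs, including the skipped one.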

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5728 - 5732, The current flow runs
createDocumentRelationships() before calling the adapter and then uses
adapter->skipDuplicates() which can ignore parent inserts while relationship
writes have already been applied; change the flow so the adapter call (the
$insert closure -> $this->withTransaction(fn () =>
$this->adapter->createDocuments($collection, $chunk))) executes first to produce
the actual inserted $batch (ensure adapter->skipDuplicates(...) returns the real
inserted batch when used), and only after $batch is obtained call
createDocumentRelationships() and any child/junction writes against $batch (not
the original input). Adjust skipDuplicates() usage so it returns the inserted
documents batch and ensure relationship-writing code references $batch instead
of the pre-insert input list.
🧹 Nitpick comments (2)
src/Database/Adapter/Postgres.php (1)

2358-2358: Silence the unused override parameter warning.

$table is intentionally unused here. If the parent signature must stay as-is, rename it to $_table or add a local suppression so PHPMD stops flagging this override.

♻️ Small cleanup
-    protected function getInsertSuffix(string $table): string
+    protected function getInsertSuffix(string $_table): string
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/Postgres.php` at line 2358, The override of
getInsertSuffix(string $table) triggers an unused-parameter warning; rename the
unused parameter to $_table (or prefix it with an underscore) in the
getInsertSuffix method signature or add a local PHPMD suppression comment inside
that method so the linter stops flagging it; update the method signature in the
Postgres class (getInsertSuffix) and keep the parent signature unchanged.
tests/e2e/Adapter/Scopes/DocumentTests.php (1)

7930-7933: Clarify whether these assertions validate “processed” vs “created” semantics.

At Line 7930, Line 7984, Line 8110, and Line 8155, the tests assert counts and onNext emissions include duplicate IDs. That encodes a “processed inputs” contract, not “newly created rows.” Please make this explicit (e.g., rename to $processedCount / add a short comment) or add before/after count assertions to separately validate created-row deltas.

Also applies to: 7984-7986, 8110-8112, 8155-8157

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/Adapter/Scopes/DocumentTests.php` around lines 7930 - 7933, The
assertions using $count and $emittedIds currently validate “processed inputs”
semantics (they expect duplicate IDs for repeated processing) rather than newly
created rows; update the test to make this explicit by renaming variables (e.g.,
$processedCount, $processedEmittedIds) or adding a short comment above the block
clarifying “processed inputs may include duplicates,” and if you also need to
assert created-row deltas add before/after row-count checks (capture $beforeRows
= ... then run operation then assert $afterRows - $beforeRows equals expected
new creations) around the assertions that reference $count and $emittedIds (the
same pattern appears for the other occurrences currently asserting duplicates).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Database.php`:
- Around line 7153-7184: The pre-read ID queries in the branches that build
$idsByTenant/$docIds must be chunked to respect the Query::equal('$id', ...)
limit controlled by $this->maxQueryValues; update the tenant loop (where you
build $idsByTenant and call $this->find(... Query::equal('$id', $tenantIds)
...)) and the else branch (where $docIds is used in $this->find(...
Query::equal('$id', $docIds) ...)) to split each array with array_chunk(...,
max(1, $this->maxQueryValues)) and call find for each chunk (using the same
authorization->skip/withTenant/silent wrappers), then merge all returned
documents into $existingDocs (use the same key format $tenant . ':' .
$doc->getId() in the tenant branch) or $existing as before so downstream code
sees the combined results. Ensure you preserve unique IDs before chunking and
maintain limit counts per chunk.

In `@src/Database/Mirror.php`:
- Around line 614-624: The prefetch call in Mirror.php uses Query::equal('$id',
$ids) with an unchunked $ids array which can exceed DocumentsValidator limits;
change the logic that builds $existing by chunking $ids with array_chunk($ids,
max(1, $this->maxQueryValues)) (same pattern used in createDocuments() and
upsertDocumentsWithIncrease()), call $this->source->silent(fn() =>
$this->source->find($collection, [Query::equal('$id', $chunk),
Query::limit(count($chunk))])) for each chunk, and merge results into
$existingIds[$doc->getId()] = true to preserve current behavior. Ensure you
reference $this->maxQueryValues and use Query::equal('$id', $chunk) and
Query::limit(count($chunk)) per chunk.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06d46ba1-a7eb-478b-ad90-a27f372a0327

📥 Commits

Reviewing files that changed from the base of the PR and between 455da70 and 934ec04.

📒 Files selected for processing (10)
  • src/Database/Adapter.php
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/Pool.php
  • src/Database/Adapter/Postgres.php
  • src/Database/Adapter/SQL.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Database.php
  • src/Database/Mirror.php
  • tests/e2e/Adapter/MirrorTest.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/Database/Adapter.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Adapter/Mongo.php

Addresses CodeRabbit #3084994641 and #3084994657 (+ Greptile #3084466265).
Three call sites were passing unchunked id arrays into Query::equal('$id',
...), which throws QueryException from DocumentsValidator once the array
exceeds 5000 values.

Mirror::createDocuments (skipDuplicates pre-filter): defensive — Mirror's
public API accepts arbitrary-size input, and while current callers pass
batches below the limit, the code shouldn't throw on a theoretical
direct caller with >5000 input ids.

Database::upsertDocumentsWithIncrease (both tenantPerDocument and single-
tenant branches): real regression. The batched pre-read was introduced
in ae929db to restore per-tenant grouping for StatsUsage. It replaced
a per-doc getDocument() loop on main (which is safe from validator
limits because each call queries one id). Passing the full input array
to Query::equal now breaks at >5000 input docs — a path that previously
worked slowly on main.

All three sites use self::RELATION_QUERY_CHUNK_SIZE (5000) and
Query::limit(PHP_INT_MAX), matching the existing pattern used by the
five populateDocumentsRelationshipsBatch chunking sites. Accumulator
shape is adapted to our map-only use case: we build the existence set /
lookup map directly inside the chunk loop instead of carrying a flat
list we don't need.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/Database/Database.php (2)

7160-7187: ⚠️ Potential issue | 🟠 Major

Respect maxQueryValues when chunking the pre-read lookups.

These find() calls still validate Query::equal('$id', $chunk) against $this->maxQueryValues. If a caller lowers that limit below 5000, chunking by RELATION_QUERY_CHUNK_SIZE will still fail before the upsert starts.

✂️ Suggested change
-                foreach (\array_chunk($tenantIds, self::RELATION_QUERY_CHUNK_SIZE) as $chunk) {
+                foreach (\array_chunk($tenantIds, \max(1, $this->maxQueryValues)) as $chunk) {
@@
-                foreach (\array_chunk($docIds, self::RELATION_QUERY_CHUNK_SIZE) as $chunk) {
+                foreach (\array_chunk($docIds, \max(1, $this->maxQueryValues)) as $chunk) {

Based on learnings, both createDocuments() and upsertDocumentsWithIncrease() prefetch existing document IDs in chunks using array_chunk(..., max(1, $this->maxQueryValues)) to satisfy DocumentsValidator limits on Query::equal('$id', ...).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 7160 - 7187, The pre-read find()
calls chunk IDs using self::RELATION_QUERY_CHUNK_SIZE but Query::equal('$id',
$chunk) is validated against $this->maxQueryValues, so change the chunking in
both loops (the tenant branch that iterates over $idsByTenant and the else
branch that iterates over $docIds) to use array_chunk(..., max(1,
$this->maxQueryValues)) instead of self::RELATION_QUERY_CHUNK_SIZE so the chunk
size never exceeds the DocumentsValidator limit when calling find().

5728-5732: ⚠️ Potential issue | 🔴 Critical

skipDuplicates() still lets skipped rows mutate relationship state.

By the time this guard runs, every input document has already gone through createDocumentRelationships() at Lines 5720-5722. A row ignored here can still create/update children or junction rows, and duplicate child/junction inserts triggered during that earlier relationship pass still go through the normal createDocument() path. The insert result needs to be authoritative for relationship work, or those writes need to be deferred until after you know which rows were actually inserted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5728 - 5732, Summary:
duplicate-skipping allows skipped rows to still mutate relationships because
createDocumentRelationships runs before the adapter decides which rows are
inserted. Fix: change the control flow so relationship work is driven by the
authoritative insert result — either defer createDocumentRelationships until
after $batch is computed or make adapter->skipDuplicates return per-row
insertion outcomes and use that to gate relationship writes; specifically update
the sequence around the $insert closure, withTransaction,
adapter->skipDuplicates and createDocumentRelationships (and any calls to
createDocument) so relationship/junction creation only runs for rows confirmed
inserted by createDocuments (or is skipped per the adapter-provided insertion
map).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c7a16723-e265-42e8-8fd4-3a629176caf0

📥 Commits

Reviewing files that changed from the base of the PR and between 934ec04 and fa0e373.

📒 Files selected for processing (2)
  • src/Database/Database.php
  • src/Database/Mirror.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Mirror.php

…SIZE

CodeRabbit #3084994641 + #3084994657 follow-up: the earlier chunking
fix used self::RELATION_QUERY_CHUNK_SIZE (5000), which matches the
existing populateDocumentsRelationshipsBatch pattern but doesn't
respect the user-configurable validator ceiling.

Query::equal('$id', $chunk) is validated by DocumentsValidator
against $this->maxQueryValues, not RELATION_QUERY_CHUNK_SIZE. A
caller doing $db->setMaxQueryValues(100) followed by any of the
three chunked pre-reads would still throw QueryException because
chunks of 5000 exceed the configured 100.

Switch to array_chunk(..., max(1, $this->maxQueryValues)) at all
three sites so chunk size always respects the actual validator
limit. In the default case (maxQueryValues = 5000) this produces
identical chunks to the old code; under customization it correctly
scales down.

Note: the 5 existing relationship-population chunking sites still
use RELATION_QUERY_CHUNK_SIZE and have the same latent bug under
customized maxQueryValues. Leaving that for a follow-up since
fixing the pattern globally is out of scope for this PR.
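The chunked pre-read pattern this commit settles on is easy to demonstrate standalone. The find() call is replaced by a stand-in lookup, and the lowered limit is invented for illustration:

```php
<?php
// Sketch of the chunked pre-read: chunk size tracks the configurable
// validator ceiling, never a fixed constant like RELATION_QUERY_CHUNK_SIZE.
$maxQueryValues = 3; // imagine a caller lowered it via setMaxQueryValues()
$ids = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];

$existingIds = [];
foreach (array_chunk($ids, max(1, $maxQueryValues)) as $chunk) {
    // Stand-in for: $this->find($collection, [Query::equal('$id', $chunk),
    //                                          Query::limit(count($chunk))])
    $found = array_intersect($chunk, ['b', 'e', 'g']); // pretend these exist
    foreach ($found as $id) {
        // Build the existence map directly inside the chunk loop.
        $existingIds[$id] = true;
    }
}

print_r(array_keys($existingIds)); // ['b', 'e', 'g'] across 3 small chunks
```

No chunk ever exceeds `$maxQueryValues`, so the DocumentsValidator limit holds regardless of input size; `max(1, ...)` guards against a pathological limit of zero.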
premtsd-code added a commit to utopia-php/migration that referenced this pull request Apr 15, 2026
CSV / JSON / appwrite-to-appwrite imports that re-run on the same
batch (e.g. user re-uploads the same file, or a worker retries a
failed chunk) currently throw DuplicateException and abort the whole
batch. Wrap the row-buffer flush in the new skipDuplicates() scope
guard so duplicate-by-id rows are silently no-op'd at the adapter
layer (INSERT IGNORE / ON CONFLICT DO NOTHING / $setOnInsert), letting
the rest of the batch proceed.

The existing skipRelationshipsExistCheck() wrapper is preserved
(FK-target guard); skipDuplicates composes around it.
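The composition claim above can be sketched with two toy guards sharing one mechanism. Guard internals here are invented for illustration; the real methods live in utopia-php/database and utopia-php/migration:

```php
<?php
// Sketch: two scope guards composing, as in the row-buffer flush described
// above. Each guard sets a flag for the duration of its callback only.
class Db
{
    public array $flags = [];

    private function scoped(string $flag, callable $cb): mixed
    {
        $this->flags[$flag] = true;
        try {
            return $cb();
        } finally {
            unset($this->flags[$flag]);
        }
    }

    public function skipRelationshipsExistCheck(callable $cb): mixed
    {
        return $this->scoped('skipRelCheck', $cb);
    }

    public function skipDuplicates(callable $cb): mixed
    {
        return $this->scoped('skipDuplicates', $cb);
    }
}

$db = new Db();
// skipDuplicates wraps around the existing FK-target guard; both flags
// are active inside the innermost callback, neither afterwards.
$active = $db->skipDuplicates(
    fn () => $db->skipRelationshipsExistCheck(
        fn () => array_keys($db->flags)
    )
);
print_r($active);    // ['skipDuplicates', 'skipRelCheck']
print_r($db->flags); // [] — both guards unwound
```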

Feature-branch note: depends on utopia-php/database's skipDuplicates()
scope guard from PR utopia-php/database#852. composer.json is
temporarily pinned to dev-csv-import-upsert-v2 with a 5.99.0 alias so
composer can resolve the 5.* constraint transitively. Must be reset
to the proper release version (^5.X.Y) once PR #852 merges and
utopia-php/database ships.
premtsd-code added a commit to appwrite/appwrite that referenced this pull request Apr 15, 2026
Exposes two new optional boolean params on the three migration
creation endpoints so CSV / JSON / appwrite-to-appwrite imports can
choose how to handle rows whose IDs already exist at the destination.

Endpoints updated (app/controllers/api/migrations.php):
- POST /v1/migrations/appwrite
- POST /v1/migrations/csv/imports
- POST /v1/migrations/json/imports

Parameter semantics:
- overwrite=true  -> destination uses upsertDocuments instead of
                     createDocuments; existing rows are replaced
                     with imported values
- skip=true       -> destination wraps createDocuments in
                     skipDuplicates; existing rows are preserved
                     unchanged, duplicate-id rows silently no-op
- both false      -> default; fails fast on DuplicateException
                     (original behavior, unchanged)
- both true       -> overwrite wins (upsert subsumes skip)
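The precedence table above reduces to a small resolver. The function name is hypothetical; only the stated semantics are taken from the commit message:

```php
<?php
// Sketch of the overwrite/skip precedence: overwrite wins when both are set,
// skip is next, and the default remains fail-fast on DuplicateException.
function resolveImportMode(bool $overwrite, bool $skip): string
{
    if ($overwrite) {
        return 'upsert';         // upsertDocuments; subsumes skip
    }
    if ($skip) {
        return 'skipDuplicates'; // createDocuments inside skipDuplicates()
    }
    return 'failFast';           // original behavior, unchanged
}

var_dump(
    resolveImportMode(true, true),   // "upsert" — overwrite wins
    resolveImportMode(false, true),  // "skipDuplicates"
    resolveImportMode(false, false), // "failFast"
);
```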

Both params are stored in the migration Document's options array
(matches the existing pattern for destination behavior config like
path, size, delimiter, bucketId, etc.) and read back in the worker's
processDestination() to instantiate DestinationAppwrite with the
new constructor params.

Feature-branch note: depends on utopia-php/migration#feat/skip-duplicates
(DestinationAppwrite constructor params) which in turn depends on
utopia-php/database#852 (skipDuplicates scope guard). composer.json is
temporarily pinned to dev-feat/skip-duplicates and
dev-csv-import-upsert-v2 respectively; both must be reset to proper
release versions once the upstream PRs merge.
