
Production hardening: scheduling, P2P transfers, directory cache, TLS/QUIC#2243

Open
rejuvenile wants to merge 310 commits into TraceMachina:main from rejuvenile:pr/all-changes

Conversation


@rejuvenile rejuvenile commented Mar 24, 2026

Summary

Comprehensive production hardening from running NativeLink at scale with 10 Mac Mini workers and a ZFS-backed server. 156 commits across the following areas:

Store reliability & I/O performance

  • Fix store pipeline races causing Bazel "Lost inputs" errors
  • Fix ExistenceCacheStore false positives and LRU promotion race
  • Fix BatchUpdateBlobs missing results for duplicate digests
  • Retry hardlinks on cache eviction, hold file path read-lock during hardlink
  • Fire-and-forget eviction unrefs + CAS dedup in FilesystemStore
  • POSIX_FADV_SEQUENTIAL for kernel readahead, rayon-based blake3 hashing
  • Fix LRU eviction ordering at startup (sort by atime, not directory order)
  • Fix zero-digest file handling in FilesystemStore and worker pipeline

MokaEvictingMap — lock-free TinyLFU cache (NEW)

  • Replace Mutex+LRU EvictingMap with moka::sync::Cache (lock-free reads)
  • TinyLFU admission with frequency bump on insert (prevents single-access rejection)
  • Pinning via DashMap with 120s auto-timeout, 25% cap
  • Bounded eviction channel (4096) with inline fallback
  • Startup atime ordering preserved via insert_startup (no frequency bump)
  • KB-scaled weigher for items up to 4TB
  • All 4 stores migrated: ExistenceCacheStore, MemoryStore, FilesystemStore, MemoryAwaitedActionDb
  • Old EvictingMap/ShardedEvictingMap removed (-2,177 lines)
  • 18 integration tests covering full API
  • Production result: lock-contention events 391 (406ms worst case) → 0
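The KB-scaled weigher mentioned above can be sketched in a few lines. This is an illustrative sketch with a hypothetical function name, not the PR's actual code: moka weighers return `u32`, so byte sizes are scaled to KiB granularity to keep multi-terabyte items in range.

```rust
// Hypothetical sketch of a KB-granularity weigher: scaling bytes to KiB
// lets items up to ~4TB fit in the u32 that moka's weigher must return.
fn weigh_kb(len_bytes: u64) -> u32 {
    // Round up so even a 1-byte item has nonzero weight.
    let kb = len_bytes.div_ceil(1024).max(1);
    u32::try_from(kb).unwrap_or(u32::MAX) // saturate rather than truncate
}

fn main() {
    assert_eq!(weigh_kb(0), 1);
    assert_eq!(weigh_kb(4096), 4);
    println!("4TiB weighs {} KiB units", weigh_kb(4u64 << 40));
}
```

The cache capacity would be scaled by the same factor (capacity / 1024) so weights and limits stay in the same units.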

Redis pipelining (NEW)

  • has_with_results: batch STRLEN+EXISTS into single pipeline (N keys → 1 round-trip)
  • batch_get_part_unchunked: pipelined GETRANGE+EXISTS across store chain
  • BatchReadBlobs: delegates to batch pipeline instead of N individual reads
  • RENAME+PUBLISH pipelined (saves 1 RTT per write with pub_sub)
  • Pipeline chunking (5000 max) prevents unbounded response buffering
  • CrossSlot fallback for Redis cluster mode
  • Production result: Redis slow commands 10,400 → 0
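The pipeline-chunking behavior above (5000-key cap) amounts to splitting a key list into bounded batches, one pipeline round-trip each. A minimal sketch, with hypothetical names:

```rust
// Sketch: cap each Redis pipeline at MAX_PIPELINE_BATCH keys so a single
// pipeline response never buffers an unbounded amount of data. Each batch
// would be issued as one pipeline (e.g. STRLEN+EXISTS per key).
const MAX_PIPELINE_BATCH: usize = 5000;

fn pipeline_batches(keys: &[String]) -> Vec<&[String]> {
    keys.chunks(MAX_PIPELINE_BATCH).collect()
}

fn main() {
    let keys: Vec<String> = (0..12_001).map(|i| format!("key{i}")).collect();
    let batches = pipeline_batches(&keys);
    // 12,001 keys -> three pipelines: 5000 + 5000 + 2001.
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2].len(), 2001);
}
```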

Batch store reads (NEW)

  • batch_get_part_unchunked trait method across full store chain
  • MemoryStore override: direct get_many() + BytesWrapper::to_contiguous(), zero buf_channel
  • FilesystemStore override: parallel FuturesUnordered without buf_channel overhead
  • FastSlowStore: fast-first with slow fallback for NotFound
  • SizePartitioningStore: concurrent lower/upper partition batches
  • Production result: GetTree BFS 100-600ms → sub-millisecond (warm cache)

GetTree caching & coalescing (NEW)

  • Tree result cache: MokaEvictingMap (512MiB, 10K entries, 5min TTL)
  • Subtree directory cache: per-directory MokaEvictingMap (256MiB, 50K, 5min TTL)
  • Thundering herd coalescing: watch channel prevents redundant BFS (54 calls → 1 BFS + 53 waiters)
  • Arc<Vec> for zero-copy cache sharing
  • insert_many batch fix: 1 run_pending_tasks instead of N+1
  • 8 integration tests for caching, coalescing, subtree overlap
  • Production result: slow BFS levels 642 → 0, tree cache hit 6µs vs 26ms BFS
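The thundering-herd coalescing above uses a tokio watch channel; the core idea can be shown with a std-only sketch (hypothetical types, blocking primitives instead of async): the first caller for a key becomes the leader and runs the expensive BFS, later callers wait and reuse the leader's result.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Minimal std-only coalescing sketch. `None` marks an in-flight compute.
struct Coalescer {
    state: Mutex<HashMap<String, Option<Arc<Vec<u8>>>>>,
    cv: Condvar,
}

impl Coalescer {
    fn new() -> Self {
        Self { state: Mutex::new(HashMap::new()), cv: Condvar::new() }
    }

    fn get_or_compute(&self, key: &str, compute: impl FnOnce() -> Vec<u8>) -> Arc<Vec<u8>> {
        let mut st = self.state.lock().unwrap();
        match st.get(key) {
            Some(Some(v)) => return v.clone(), // cache hit
            Some(None) => loop {
                // Leader in flight: wait for it to publish the result.
                st = self.cv.wait(st).unwrap();
                if let Some(Some(v)) = st.get(key) {
                    return v.clone();
                }
            },
            None => {} // become the leader
        }
        st.insert(key.to_string(), None);
        drop(st);
        let v = Arc::new(compute()); // the expensive BFS would run here
        self.state.lock().unwrap().insert(key.to_string(), Some(v.clone()));
        self.cv.notify_all();
        v
    }
}

fn main() {
    let c = Arc::new(Coalescer::new());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let c = c.clone();
            thread::spawn(move || c.get_or_compute("tree", || vec![1u8, 2, 3]))
        })
        .collect();
    for h in handles {
        assert_eq!(*h.join().unwrap(), vec![1u8, 2, 3]);
    }
}
```

The `Arc<Vec<u8>>` return mirrors the PR's zero-copy sharing: all waiters clone the Arc, not the tree bytes.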

Graceful shutdown (NEW)

  • Server: HTTP/2 GOAWAY drain via hyper_util GracefulShutdown, 30s per-listener timeout
  • Server: QUIC serve_with_shutdown on accept_stop signal
  • Worker: CAS TCP server serve_with_shutdown with cas_shutdown_tx watch channel
  • SIGTERM sequence: stop accept → drain (35s) → flush writes (30s) → shutdown schedulers (20s) → exit
  • Systemd: TimeoutStopSec=90, KillMode=mixed
  • Production result: clean shutdown, no more stale AC entries after restart
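The systemd settings above would sit in the service's unit file, roughly like this (illustrative fragment; the actual unit file isn't part of this diff):

```ini
[Service]
# Budget for the full SIGTERM sequence: stop accept -> drain (35s) ->
# flush writes (30s) -> shutdown schedulers (20s), before SIGKILL.
TimeoutStopSec=90
# mixed: SIGTERM goes to the main process only, so child processes can be
# wound down in order; on timeout, SIGKILL hits the whole control group.
KillMode=mixed
```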

Streaming blob fixes (NEW)

  • Fix errored in-flight entries poisoning all reads for that digest
  • has_error() check falls through to CAS store on poisoned entries
  • Immediate cleanup of poisoned InFlightBlobMap entries

gRPC protocol & transport

  • Fix 4MB default message size limit (configurable per-listener)
  • Fix ByteStream resume returning wrong error code
  • Add gRPC status detail propagation for FAILED_PRECONDITION
  • Parallel chunked ByteStream reads for large blobs
  • HTTP/2 send buffer and flow control tuning
  • Fix dual_transport: skip QUIC endpoints in TCP ConnectionManager

Worker improvements

  • DirectoryCache: spawn_blocking for filesystem ops, parallel blob downloads
  • Fix subtree race condition with download fallback
  • Worker proxy store with blob mirroring and locality-aware scheduling
  • Targeted prefetch of missing blobs at action dispatch
  • BlobsInStableStorage lifecycle for durability
  • Fix worker reconnect on scheduler eviction / server restart

Observability

  • Runtime watchdog + TCP keepalive
  • Store operation stall detector with thread dumps
  • pprof auto-capture on CPU threshold
  • Comprehensive write path and eviction logging
  • EvictingMap lock instrumentation
  • Non-blocking stdout logging via tracing-appender

QUIC/HTTP3 transport

  • QUIC connection pool with SO_REUSEPORT
  • BBR congestion control, tuned ACK delay
  • TLS support for worker-to-server and server-to-worker connections
  • Dual TCP+QUIC transport with intelligent routing

Test plan

  • 18 MokaEvictingMap integration tests (insert/get/remove, eviction, TTL, pinning, callbacks, concurrent stress)
  • 8 GetTree cache tests (tree_cache_hit, subtree_overlap, coalescing_concurrent, leader_failure, paginated_bypass)
  • Existing CAS server tests pass
  • cargo check --bin nativelink --features quic,pprof compiles clean
  • Production deployment: 10 Mac Mini workers, 48GB MemoryStore, 800GB FilesystemStore
  • macOS benchmark: 39s → 11s (3.5x faster)

🤖 Generated with Claude Code



rejuvenile and others added 30 commits March 23, 2026 09:39
inner_store returned self, preventing callers (like LocalWorker) from
downcasting through the chain to find FastSlowStore. Delegate to inner
store instead — optimized_for override is independent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EvictingMap: warn! for items evicted within 120s of insertion (age + size
in log), debug! for older items. Helps diagnose Bazel "lost inputs" errors.

Worker: append .local to bare hostnames for mDNS resolution so the server
can connect to worker CAS endpoints for peer blob sharing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
upload_results writes to fast_store() only (FilesystemStore), deferring
remote CAS upload to spawn_upload_to_remote. But that function only
collected tree_digest blobs, not the individual file blobs inside output
directory trees (dep-graph.bin, query-cache.bin, etc). This caused
"Missing digest" / "lost inputs" errors when Bazel tried to download
action outputs that were never pushed to the server.

Fix: decode each Tree proto from fast_store and extract all file digests
for inclusion in the background upload. Also add success/fail counters
and tree file count to upload logging for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers now race a peer fetch (via locality map) in parallel with the server
fetch when peers are known. Whichever responds first wins; the loser is
cancelled. This gives LAN peers a chance to win when they're closer/faster.

Server-side behavior is unchanged — IS_WORKER_REQUEST detection ensures the
sequential path (with redirect generation) is used for server-side requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit raced on both server and worker sides because
IS_WORKER_REQUEST isn't set for Bazel client requests to the server.
Add an explicit race_peers flag (default false) that only workers
enable, preventing the server from wastefully racing against its
own workers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AS, register output digests

- Parallelize server-side GetTree BFS (FuturesUnordered per tree level)
- GrpcStore: report LazyExistenceOnSync for CAS stores (skip FindMissingBlobs before get_part)
- WORKER_BACKLOG 8→64 to reduce backpressure during burst patterns
- Worker peer CAS connections 4→64
- Include tree digests in BlobsAvailableNotification from worker
- Register output digests from ExecuteResult in server locality map
- Fix existence_store_test: yield for async eviction callbacks
- Fix bytestream_server_test: tonic Status format change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…check, parallel mkdir

- Bound hardlink phase to 64 concurrent tasks (was unbounded 4000+)
- Split has_with_results into 500-key chunks to release Mutex between batches
- Level-parallel BFS for directory creation (siblings concurrent, parents first)
- Log CAS server exit errors in local_worker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log requested/missing counts at info level, and list missing digests at
debug level. Needed to diagnose "Lost inputs" build failures where blobs
exist on disk but Bazel reports them missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tribution

Uses WorkerProxyStore locality map to route batch blob reads to peers that
already have the data, falling back to server for unknown digests. All peer
and server batches execute in parallel via join_all, eliminating the previous
server-only bottleneck where 10 workers competed for the same blobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…concurrency to 64

- Round-robin digest assignment across peers that have the blob, preventing
  hotspots when one peer has most blobs
- Retry path is now best-effort: individual failures are logged and skipped
  instead of aborting the entire batch operation
- BYTESTREAM_CONCURRENCY increased from 16 to 64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
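The round-robin assignment described above can be sketched as follows (hypothetical names; the real code routes through WorkerProxyStore's locality map):

```rust
use std::collections::HashMap;

// Sketch: rotate digest assignments across the peers that hold each blob,
// so one well-stocked peer doesn't become a hotspot. Digests no peer has
// fall back to the server.
fn assign_round_robin<'a>(
    digests: &[&'a str],
    peers_with: impl Fn(&str) -> Vec<&'a str>,
) -> HashMap<&'a str, Vec<&'a str>> {
    let mut rr = 0usize;
    let mut out: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &d in digests {
        let peers = peers_with(d);
        let target = if peers.is_empty() {
            "server" // unknown digest: server is the only source
        } else {
            let p = peers[rr % peers.len()];
            rr += 1;
            p
        };
        out.entry(target).or_default().push(d);
    }
    out
}

fn main() {
    let a = assign_round_robin(&["d1", "d2", "d3"], |_| vec!["p1", "p2"]);
    assert_eq!(a["p1"], vec!["d1", "d3"]);
    assert_eq!(a["p2"], vec!["d2"]);
}
```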
Workers report cpu_load_pct (load_avg_1m / num_cpus * 100) piggybacked
on KeepAliveRequest (~2.5s), BlobsAvailableNotification (~500ms), and
ExecuteComplete (per action). The scheduler stores this per-worker and
prefers lightly-loaded workers when selecting candidates:

- LRU/MRU fallback path: picks lightest-loaded viable worker
- Locality scoring tiebreaker: when scores are within 10%, lower CPU
  load wins before timestamp
- Workers reporting 0 (unknown/old) are sorted last among known loads

Backward compatible: old workers send 0 (proto default), treated as
unknown and sorted last.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
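The load metric and ordering described above reduce to a small amount of arithmetic; a sketch with hypothetical names:

```rust
// cpu_load_pct = load_avg_1m / num_cpus * 100; 0 means "unknown" (an old
// worker that never reports), and unknowns sort after every known load.
fn cpu_load_pct(load_avg_1m: f64, num_cpus: u32) -> u32 {
    (load_avg_1m / num_cpus as f64 * 100.0).round() as u32
}

fn sort_by_load(workers: &mut Vec<(&str, u32)>) {
    // Known loads ascending; unknown (0) mapped to MAX so it sorts last.
    workers.sort_by_key(|&(_, load)| if load == 0 { u32::MAX } else { load });
}

fn main() {
    assert_eq!(cpu_load_pct(4.0, 8), 50);
    let mut ws = vec![("a", 0), ("b", 140), ("c", 35)];
    sort_by_load(&mut ws);
    assert_eq!(ws[0].0, "c"); // lightest known load first
    assert_eq!(ws[2].0, "a"); // unknown sorted last
}
```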
Three bugs in inner_get_tree:

1. FuturesUnordered returned directories in completion order, not BFS
   discovery order, making paging tokens nondeterministic. Fixed by
   collecting into a HashMap and iterating in original order.

2. page_size=0 (no paging) triggered `len >= 0` which is always true,
   breaking after the first BFS level. Fixed by treating 0 as MAX.

3. When page was filled mid-level, remaining unprocessed items were
   dropped, producing empty next_page_token. Fixed by copying remaining
   items back to the deque front.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e fast path

- GetTree BFS validation: use position assignment instead of re-hashing
  directory protos (fixes 77% validation failure rate with Java-serialized protos)
- EvictingMap::get_many(): batch retrieval in single lock acquisition
- FilesystemStore::get_file_entries_batch(): batch file entry lookup
- Hardlink pre-fetch: pre-fetch entries in batches of 500 before concurrent
  hardlink loop, eliminating per-file EvictingMap lock contention
- Blob eviction race fix: eager pre-read of small blobs (<=1MiB) before
  background upload to prevent eviction race in spawn_upload_to_remote
- Directory cache: use download_to_directory for cache-miss construction
  instead of serial per-file RPCs (2.5s -> 50-200ms)
- Combined set_readonly_recursive + calculate_directory_size into single walk
- GetTree BFS dedup logging: per-level timing, dedup stats, slow level warnings
- Input fetch logging: tree resolution, materialization, hardlink stats,
  blob fetch throughput, slow operation warnings
- CPU load logging downgraded to debug level (worker + server)
- Load-aware selection logging downgraded to debug level
- Fix mkdir_depth_levels log field (was using dirs_created instead of depth)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Streaming pipeline: producer fetches missing blobs in batches of 64,
  consumer hardlinks concurrently (64-wide) as blobs arrive via mpsc channel.
  Both overlap via futures::future::join for minimum wall-clock time.
- Raise MAX_PEER_HINTS from 1000 to 16384 to cover large actions.
- FastSlowStore: per-leg timing (fast_ms, slow_ms, slower_leg).
- FilesystemStore: per-phase timing (temp create, write, emplace) >50ms.
- EvictingMap: warn! on lock contention >1ms with operation name.
- StallGuard: with_context() for dynamic digest/size in stall dumps.
- DirectoryCache: comprehensive hit/miss/eviction/timing logging.
- MemoryStore config: 16GB→32GB per tier (64GB total) in server configs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlinks as blobs arrive

Three-future pipeline: fetcher (all blobs at once, bounded at 128), producer
(sends files to channel as blobs land via Notify), consumer (hardlinks at 64
concurrency). Eliminates serial per-batch round-trips that bottlenecked at
47-60 MB/s despite 10GbE capacity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…info

- directory_cache: set_readonly_and_calculate_size was setting all files
  to 0o444, stripping the execute bit from executables. Now preserves
  execute permission (0o555 for executables, 0o444 for non-executables).
  This caused EPERM failures for cargo build scripts (runner, ring, etc).

- api_worker_scheduler: promote "Load-aware worker selection" log from
  debug! to info! so CPU-load scheduling decisions are visible in
  production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tests verifying load-aware worker selection:
- cpu_load_update_worker_load_stores_correctly
- cpu_load_lightest_loaded_worker_gets_picked
- cpu_load_unknown_zero_sorted_last
- cpu_load_falls_back_to_lru_when_no_load_data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers report cached input_root_digest values to the scheduler via
BlobsAvailableNotification. Scheduler gives highest routing priority
to workers with a directory cache hit for the action's input_root_digest.

Fix EPERM: set_readonly_and_calculate_size now strips write bits only
(& 0o555) instead of guessing executable status. Also removes the
skip-when-0o555 chmod optimization in hardlink_and_set_metadata which
was unsafe because concurrent hardlinks sharing CAS inodes can corrupt
file permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
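The permission fix above is a single bit-mask: stripping write bits (`& 0o555`) preserves whatever execute bits the file already had, with no guessing. A sketch:

```rust
// Strip write bits only: executables stay executable, nothing is writable.
fn readonly_mode(mode: u32) -> u32 {
    mode & 0o555
}

fn main() {
    assert_eq!(readonly_mode(0o755), 0o555); // executable stays executable
    assert_eq!(readonly_mode(0o644), 0o444); // plain file becomes read-only
}
```

Contrast with the earlier 0o444 hardcode, which silently dropped the execute bit from every cached script.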
On cache MISS, resolve the full merkle tree and check if any subtrees
are already cached from other root entries. Cached subtrees are reused
via directory symlinks (APFS-compatible), skipping download of already-
materialized portions. BFS traversal ensures maximum (top-down) subtree
matching.

- Store .merkle_tree_meta alongside each cached directory entry
- In-memory subtree_index maps every directory digest to its disk path
- Rebuild subtree index from disk metadata on startup
- Clean up subtree index entries on eviction
- Made resolve_directory_tree public for cache access
- 6 new tests for merkle metadata, subtree index, and cache reload

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS requires write permission on the source directory for rename(2),
unlike POSIX/Linux which only checks the parent. Temporarily restore
0o755 on the temp dir before rename, then lock down to 0o555 after.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er subtree scoring

Workers report cached subtree digests via delta encoding (added/removed)
instead of full snapshots every 500ms. First notification sends full
snapshot, subsequent ones send only changes. Scheduler scores workers
by both root directory cache hits and subtree cache hits.

Fix CAS inode corruption: always provide explicit unix_mode (0o444 for
non-executable, 0o555 for executable) to prevent concurrent hardlinks
from corrupting shared inode permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… not just directory count

When no worker has an exact root match, the scheduler now scores workers by
the total file bytes under their cached subtree digests. A worker caching a
subtree with 10GB of files scores higher than one with 100 bytes. The tree
resolver computes per-subtree byte totals via bottom-up aggregation during
BFS resolution, cached alongside file digests.

Scoring tiers: exact root match > weighted subtree coverage > blob locality > LRU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e cache

Bazel actions create output directories inside the input tree. Directory
symlinks to cached subtrees caused EPERM because either:
- Cached dirs were 0o555 (can't mkdir inside), or
- If made writable, actions would mutate the cache

Fix: use hardlink_directory_tree for subtree cache hits — creates fresh
writable directories and hardlinks only the files. Cache integrity is
preserved (0o555 dirs, read-only files) since actions never access the
cache directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ach path component

When create_dir_all fails for an output directory, walk up the path and
log the mode, is_dir/is_file/is_symlink status of each component to
identify whether the failure is due to a read-only parent, a file
blocking a directory, or a symlink to a read-only cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlink

The Bazel input tree contains SymlinkNodes (e.g., bazel-out) that, when
recreated as symlinks in the work directory, point to the read-only
directory cache (0o555 directories). create_dir_all then fails with
EPERM when trying to create output directories through these symlinks.

Fix: hardlink_directory_tree_recursive now resolves directory symlinks
to real directories with fresh writable permissions, while preserving
file symlinks and dangling/looping symlinks as-is.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…opies

The Bazel input tree contains SymlinkNodes (e.g., bazel-out -> .) that
point to directories in the read-only directory cache (0o555). When
create_dir_all tries to create output directories through these symlinks,
it fails with EPERM because the resolved target directories are read-only.

Previous approach (resolving ALL directory symlinks in hardlink_directory_tree)
caused infinite recursion for self-referential symlinks like bazel-out -> .

New approach: prepare_output_directories now handles this surgically:
1. Fast path: try create_dir_all (usually works)
2. On failure, walk the output path component by component
3. For each symlink that resolves to a read-only directory:
   - Replace the symlink with a real writable directory
   - Create absolute symlinks to all entries in the original target
   - Skip self-referential entries (e.g., bazel-out pointing to itself)
4. For read-only work dirs (0o555): chmod writable
5. Retry create_dir_all after each fix

This preserves access to all input tree files while making the specific
output path writable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_dir_all

create_dir_all succeeds when the output directory already exists (it's
part of the input tree), even if the directory is read-only (0o555)
through a symlink chain to the cached directory. Then rustc fails with
Permission denied when writing output files (.d, binary) into it.

Fix: after create_dir_all succeeds, check if the parent directory is
actually writable (mode & 0o200). If not, fall through to the slow path
that replaces symlinks-to-read-only-dirs with writable shallow copies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- set_readonly_recursive: preserve execute bits (& !0o222 instead of
  hardcoded 0o444) so cached shell scripts remain executable
- prepare_output_directories: serialize slow-path symlink replacement
  with a tokio Mutex to prevent concurrent EEXIST/ENOENT races
- Cargo.toml: lto="thin", codegen-units=16 for faster release builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old cache entries had files at 0o444 (no execute bit) from before the
set_readonly_and_calculate_size fix. Since the cache persists on disk
across restarts, these stale entries kept serving non-executable files
like cargo_build_script_runner and cc_wrapper.sh.

Add CACHE_FORMAT_VERSION (currently 2) with a version file check on
startup. When the version is missing or stale, all entries are cleared
and the version file is written. This ensures format changes (like
permission semantics) automatically invalidate old entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rejuvenile and others added 28 commits April 14, 2026 21:52
StreamingBlob core primitive (nativelink-util/src/streaming_blob.rs):
- Single writer, multiple concurrent readers with independent cursors
- RwLock<VecDeque<Bytes>> for O(1) append + O(1) front eviction
- Sliding window eviction with configurable max_buffer_bytes
- Writer Drop safety: sets terminal-error if EOF never sent
- InFlightBlobMap with Arc::ptr_eq on removal (prevents race)
- 10 tests: data flow, multi-reader, error propagation, window
  eviction, reader blocking, EOF gating, map operations
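The sliding-window buffer at the core of the primitive can be sketched like this (simplified: `Vec<u8>` instead of `Bytes`, no locking or reader cursors):

```rust
use std::collections::VecDeque;

// O(1) append at the back, O(1) eviction at the front once buffered bytes
// exceed max_buffer_bytes. earliest_chunk_idx lets a reader detect that
// chunks it never saw have been evicted.
struct SlidingWindow {
    chunks: VecDeque<Vec<u8>>,
    buffered: usize,
    earliest_chunk_idx: usize, // index of chunks.front() within the blob
    max_buffer_bytes: usize,
}

impl SlidingWindow {
    fn new(max_buffer_bytes: usize) -> Self {
        Self { chunks: VecDeque::new(), buffered: 0, earliest_chunk_idx: 0, max_buffer_bytes }
    }

    fn push(&mut self, chunk: Vec<u8>) {
        self.buffered += chunk.len();
        self.chunks.push_back(chunk);
        // Evict from the front, but always keep the newest chunk.
        while self.buffered > self.max_buffer_bytes && self.chunks.len() > 1 {
            let evicted = self.chunks.pop_front().unwrap();
            self.buffered -= evicted.len();
            self.earliest_chunk_idx += 1;
        }
    }
}

fn main() {
    let mut w = SlidingWindow::new(8);
    w.push(vec![0; 4]);
    w.push(vec![0; 4]);
    w.push(vec![0; 4]); // 12 bytes > 8: front chunk evicted
    assert_eq!(w.earliest_chunk_idx, 1);
    assert_eq!(w.buffered, 8);
}
```

A later commit in this PR uses exactly that `earliest_chunk_idx > 0` signal to fall back to the slow store rather than serve a partial blob.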

Design documents:
- streaming-blob-pipeline-design.md: 5-phase migration plan for
  concurrent read-while-write across server, workers, and P2P.
  Incorporates code + perf reviewer feedback (13 open issues).
- resumable-writes-design.md: ByteStream write resumption after
  client disconnect, with cross-references to streaming pipeline.
- latency-reduction-opportunities.md: 10 protocol-level enhancements
  with latency estimates, complexity ratings, and priority order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log three new conditions:
- BytesWrapper total_len != sum of chunk lengths (corrupt entry)
- send() failure mid-stream (with chunk count, bytes sent, remaining)
- Existing incomplete-read error now includes actual_data_len

These diagnostics will identify whether the 6,496 hash mismatches
are caused by corrupt BytesWrapper entries, send failures, or
something else in the store chain above MemoryStore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log when fast store get_part() returns Ok but bytes_written doesn't
match the digest's expected size. Two paths instrumented:
- Direct fast store hit (has() + get_part)
- Waiter path after loader populated the fast store

Combined with the MemoryStore diagnostics (which showed zero issues),
this will pinpoint whether the corruption happens in FastSlowStore's
tee/forwarding or in a layer above it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major features:
- Streaming read-while-write: InFlightBlobMap allows readers to stream from
  in-progress uploads before store commit. Bounded at 128 concurrent blobs
  with 64MiB sliding window per blob.
- Partial-read false-positive fix: FastSlowStore size checks and
  LoggingReadStream hash verification now correctly skip validation for
  partial reads (worker parallel_chunk_count=64 splits blobs into 64 RPCs).
- Missing digest hints: server resolves GetTree inline (200ms timeout),
  sends missing_digests in StartExecute so workers skip has() checks.
- Zstd compression: opt-in transport-level compression on GrpcStore.
- MATCH_CONCURRENCY 32: up from 8 for parallel worker matching.
- Memory pressure management: idle stream tracking with saturating counters,
  sweeper evicts oldest idle streams when over budget (default 256MiB).

Fixes from code review:
- atomic_saturating_sub prevents partial_write_bytes counter underflow
- Sweeper re-checks maybe_idle.is_some() before removing (double-lock race)
- InFlightBlobMap capped at 128 entries (8GiB worst case)
- FastSlowStore has() size comparison uses < instead of != (block alignment)
- inner_prepare_action BoxFuture prevents stack overflow in tests
- Demoted hot-path scheduler logs to debug! (tree cache hits, scoring)

35 new tests across streaming_blob, bytestream_server, grpc_store,
running_actions_manager, and simple_scheduler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
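The `atomic_saturating_sub` fix above (preventing counter underflow) can be sketched with `fetch_update`; the function name comes from the commit, the body is an assumed implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A compare-and-swap loop that clamps at zero, so concurrent decrements
// can never wrap the partial_write_bytes counter around to u64::MAX.
fn atomic_saturating_sub(counter: &AtomicU64, amount: u64) -> u64 {
    counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |v| {
            Some(v.saturating_sub(amount))
        })
        .unwrap() // the closure always returns Some, so this never fails
}

fn main() {
    let c = AtomicU64::new(10);
    atomic_saturating_sub(&c, 4);
    assert_eq!(c.load(Ordering::Acquire), 6);
    atomic_saturating_sub(&c, 100); // plain fetch_sub would underflow here
    assert_eq!(c.load(Ordering::Acquire), 0);
}
```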
The old-format bytestream config (cas_stores map) didn't carry the
streaming_read_while_write and max_streaming_blob_buffer_bytes fields
through to the new ByteStreamConfig. This caused deny_unknown_fields
rejection when the config used the old format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When multiple readers request the same blob that isn't in the fast store,
the first reader populates from the slow store while subsequent readers
(waiters) previously blocked on the OnceCell until populate completed,
then read from the fast store. This added latency equal to the full
populate time plus a TOCTOU risk where the blob could be evicted between
populate and the waiter's read.

Now the populate thread tees data into a StreamingBlobInner (64MiB sliding
window). Waiters get a StreamingBlobReader and consume data concurrently
as it arrives from disk, with no blocking and no eviction race.

Also makes StreamingBlobInner::new, StreamingBlobWriter::new, and
StreamingBlobReader::new public for cross-crate use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously waiters blocked on OnceCell::get_or_try_init until populate
completed, then read from the streaming buffer. This was just TOCTOU
protection, not a latency improvement.

Now waiters bypass the OnceCell entirely using the is_new flag from
get_loader. When is_new=false (another thread is populating), the waiter
immediately creates a StreamingBlobReader and consumes data as chunks
arrive from disk. Time-to-first-byte for waiters drops from full
populate time (~100ms-10s) to first chunk arrival (~1-2ms).

The populator still uses get_or_try_init to ensure exactly one populate
runs, and tees data into both the client writer and the streaming buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For blobs > 64 MiB (the sliding window size), the waiter's
StreamingBlobReader could start at a non-zero chunk index after
eviction, but the offset math assumed pos=0 was the blob start.
This would silently serve partial data.

Fix: check earliest_chunk_idx before starting to stream. If chunks
have already been evicted (earliest > 0), fall back to slow store
directly instead of serving a partial blob.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- info! log when waiter uses concurrent streaming path
- Detect zero bytes streamed for full reads and fall back to slow store
  (handles race where populate completes before waiter subscribes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The waiter path entered the streaming reader even when the populate had
already finished (is_terminal=true), causing zero-byte reads from a
drained buffer. Now three distinct paths:

1. is_waiter && !is_terminal: stream concurrently from populate buffer
2. is_waiter && is_terminal: read from fast store (populate done)
3. !is_waiter (populator): populate_and_maybe_stream with tee

Removes the zero-byte fallback hack which was masking this bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When is_new=false (waiter), copy_slow_to_fast created a
StreamingBlobWriter that was never used. Its Drop impl would log
a spurious "dropped without eof" warning, and in the rare
get_or_try_init retry case, could poison the StreamingBlobInner.

Fix: populate_and_maybe_stream now takes Option<StreamingBlobWriter>.
Waiters pass None, populators pass Some(writer). The tee and EOF
calls are gated behind if let Some.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…reduced channels

10 optimizations for the read/write hot paths:

1. ShardedEvictingMap: parallel shard iteration for sizes_for_keys,
   get_many, insert_many using FuturesUnordered (independent locks)
2. FastSlowStore: skip has() + get_part() double lookup, attempt
   get_part() directly and handle NotFound as cache miss
3. CompletenessCheckingStore: fetch AC entry once, decode for
   completeness check, serve bytes directly (eliminates double fetch)
4. VerifyStore: reduce buf_channel from 256 to 4 slots (hasher is
   memory-speed, deep buffering is wasteful)
5. store_trait: reduce get_part_unchunked/update_oneshot channel from
   1024 to 4 slots (collecting into single buffer, not streaming)
6. LoggingReadStream: remove redundant SHA256 hashing (VerifyStore
   already verifies on reads)
7. FastSlowStore mirror: pre-allocate BytesMut with digest size
8. ExistenceCacheStore: Vec::with_capacity for cache miss collection
9. DedupStore: FuturesOrdered → FuturesUnordered for has_with_results
10. GetTree BFS: capacity hints for deque, seen set, directories vec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ction

When the server restarts, QUIC connections enter a half-open state.
The old 60s idle timeout meant dead connections weren't detected for
up to 60s, causing RPC timeouts on workers. The H3Connection layer
has built-in reconnection (detects driver closure in poll_ready),
but it only triggers after the connection is marked dead.

With 2s keepalives and 15s idle timeout, dead connections are
detected within ~4-6s, allowing automatic reconnection before
the 120s RPC timeout expires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server graceful shutdown: HTTP/2 GOAWAY drain on SIGTERM via hyper_util
GracefulShutdown. QUIC serve_with_shutdown. SIGTERM sequence: stop
accept → drain (30s) → flush writes (30s) → shutdown schedulers (20s).

Worker CAS server: serve_with_shutdown with cas_shutdown_tx watch
channel triggered on worker SIGTERM.

Streaming blob fix: errored in-flight entries in InFlightBlobMap were
poisoning reads. Now checks has_error() and falls through to store.

MokaEvictingMap: lock-free TinyLFU cache replacing Mutex+LRU. Fixes:
pin race (DashMap-first ordering), replaced-item unref, bounded
eviction channel, BTree cleanup on eviction, batched pin_keys,
atomic fast-path for pinned check.

Phase 2: ExistenceCacheStore and MemoryStore migrated to Moka.
EvictingMap shards reduced 64→16.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
has_with_results() now sends all N keys' STRLEN+EXISTS commands in
one pipeline round-trip instead of N individual round-trips. With
64 connections and 16K permits, this eliminates the head-of-line
blocking that caused 5.5s Redis latency from 3 connections.

CrossSlot fallback for Redis cluster mode preserves the original
per-key concurrent behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 2.3 + Phase 3: all stores now use Moka. FilesystemStore
uses pinning, async unref, callbacks, insert_with_time.
MemoryAwaitedActionDb is TTL-only.

FsEvictingMap type alias simplified: lifetime parameter removed
(MokaEvictingMap requires Q: 'static). Lookups still work with
any lifetime via Borrow trait.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Startup atime ordering: insert_with_time() uses insert_startup()
   which skips frequency bump and run_pending_tasks(). Items enter
   at freq=1, preserving FIFO window ordering (oldest evicted first).

2. u32 weigher: scale to KB granularity (capacity/1024, weight/1024)
   so items up to 4TB fit in u32 without truncation.

3. Unpin re-admission: unpin_key() bumps frequency after re-inserting
   into Moka so TinyLFU doesn't immediately reject the item.

4. Channel-full callback skip: log a warning when the eviction channel
   is full and ItemCallbacks are skipped in the tokio::spawn fallback.

5. Startup batch perf: insert_startup() defers run_pending_tasks()
   to bulk processing when the first real operation triggers it.
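The arithmetic behind point 2 above, as a sketch (function names are illustrative, not NativeLink's API; the u32 constraint and KB scaling come from the commit):

```rust
/// Moka's weigher returns u32, so byte-granular weights would cap usable
/// capacity near 4GiB. Scaling both the configured capacity and each
/// item's weight to KiB admits items up to ~4TiB without truncation.
fn weigh_kb(len_bytes: u64) -> u32 {
    // Round up so even tiny items cost at least 1 KiB of capacity.
    let kb = len_bytes.div_ceil(1024).max(1);
    u32::try_from(kb).unwrap_or(u32::MAX) // saturate rather than truncate
}

/// The configured byte capacity, expressed in the same KiB units.
fn capacity_kb(max_bytes: u64) -> u64 {
    max_bytes / 1024
}

fn main() {
    // A 1GiB blob weighs 2^20 KiB units against a KiB-scaled capacity.
    println!("1GiB -> {} KiB units", weigh_kb(1 << 30));
    println!("64GiB cache -> {} KiB capacity", capacity_kb(64 << 30));
}
```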

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds batch_get_part_unchunked to StoreDriver trait with pipelined
Redis implementation. GetTree BFS now fetches all directories per
level in a single Redis pipeline (1 round-trip for 130+ dirs)
instead of 130 individual GETRANGE commands.

Store chain support: RedisStore (true pipeline), FastSlowStore
(fast-first with slow fallback), SizePartitioningStore (split by
threshold, concurrent), ExistenceCacheStore (delegates + updates
cache), VerifyStore (pass-through). Default impl fans out via
FuturesUnordered for non-Redis backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redis update(): RENAME + PUBLISH combined into single pipeline
(saves 1 RTT per write when pub_sub enabled).

BatchReadBlobs gRPC handler: delegates to batch_get_part_unchunked
instead of N individual get_part_unchunked calls. Single Redis
pipeline for all requested blobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redis pipeline chunking (MAX_PIPELINE_BATCH=5000): prevents unbounded
response buffering for very large batches. CrossSlot fallback logging
added for cluster-mode observability.
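The batch cap can be sketched as a plain chunking pass (the GETRANGE command string is illustrative; MAX_PIPELINE_BATCH=5000 comes from the commit):

```rust
const MAX_PIPELINE_BATCH: usize = 5000;

/// Split a large key set into bounded pipelines so no single round-trip
/// buffers more than MAX_PIPELINE_BATCH replies in memory.
fn chunked_pipelines(keys: &[&str]) -> Vec<Vec<String>> {
    keys.chunks(MAX_PIPELINE_BATCH)
        .map(|chunk| {
            chunk
                .iter()
                .map(|key| format!("GETRANGE {key} 0 -1"))
                .collect()
        })
        .collect()
}

fn main() {
    let keys: Vec<String> = (0..12_000).map(|i| format!("blob:{i}")).collect();
    let key_refs: Vec<&str> = keys.iter().map(String::as_str).collect();
    let batches = chunked_pipelines(&key_refs);
    // 12,000 keys split into batches of 5000 / 5000 / 2000.
    println!("{} batches", batches.len());
}
```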

BatchReadBlobs: 64MiB per-blob size cap prevents unbounded memory
from malicious/misconfigured clients.

ExistenceCacheStore: batch path now caches digest.size_bytes()
instead of data.len(), matching single-key get_part behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evicting_map.rs: 1593 → 84 lines (LenEntry, ItemCallback, NoopCallback
retained). Removed EvictingMap, ShardedEvictingMap, State, EvictionItem,
LockMetrics, lock_with_metrics!, SerializedLRU, and all constants.

Deleted evicting_map_test.rs (668 lines). Removed lru dependency.
Updated doc comments referencing old type names.

Net: -2,177 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
18 integration tests covering: insert/get/remove, max_bytes/max_count
eviction, TTL expiration, pin/unpin/pin_cap, replaced items, startup
insert, sizes_for_keys, range queries, concurrent stress, callbacks.

VerifyStore: documented that batch_get_part_unchunked bypasses hash
verification. Audited all store wrappers — completeness_checking and
dedup correctly fall through to default (no bypass).

Expiry investigation: Moka's Expiry trait does NOT help with size-based
eviction ordering (TTL and size eviction are independent). Corrected
doc: window deque is unused in Moka 0.12, entries go to MainProbation.
Current mitigation (insert_startup skips freq bump, FIFO ordering in
MainProbation) correctly preserves atime-based eviction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MemoryStore: direct evicting_map.get_many() + BytesWrapper::to_contiguous().
Zero buf_channel allocations, zero async task pairs. Single-chunk blobs
are zero-copy (Arc bump). ~50us for 500 keys vs ~2-5ms before.
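The single-chunk zero-copy path can be modeled like this (to_contiguous is named in the commit; the enum layout and chunk representation are an illustrative guess):

```rust
use std::sync::Arc;

/// Toy stand-in for BytesWrapper: either one contiguous chunk or several.
enum BytesWrapper {
    Single(Arc<Vec<u8>>),
    Chunked(Vec<Arc<Vec<u8>>>),
}

impl BytesWrapper {
    fn to_contiguous(&self) -> Arc<Vec<u8>> {
        match self {
            // Zero-copy: returning a clone of the Arc only bumps a refcount.
            BytesWrapper::Single(chunk) => chunk.clone(),
            // Multi-chunk: one allocation to flatten into contiguous bytes.
            BytesWrapper::Chunked(chunks) => {
                Arc::new(chunks.iter().flat_map(|c| c.iter().copied()).collect())
            }
        }
    }
}

fn main() {
    let blob = BytesWrapper::Single(Arc::new(vec![0u8; 1024]));
    let contiguous = blob.to_contiguous();
    println!("single-chunk read returned {} bytes with no copy", contiguous.len());
}
```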

FilesystemStore: FuturesUnordered for parallel I/O but skips buf_channel.
Each task does evicting_map.get() → read_file_entry_bytes() → Bytes.
Preserves FD semaphore, stale-entry cleanup, length cap. 256MiB safety
bound for unlimited reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dual_transport code was passing ALL endpoints (including those
with use_http3=true) to the TCP ConnectionManager. This caused TCP
connection attempts to the QUIC-only UDP port (50072), generating
persistent ConnectionRefused errors and log floods on all workers.

Fix: filter out use_http3 endpoints when building tcp_endpoints.
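The fix is a simple predicate over the endpoint list (the Endpoint struct and addresses here are hypothetical; the use_http3 flag and port 50072 come from the commit):

```rust
/// Hypothetical endpoint config mirroring the dual-transport setup.
#[derive(Clone)]
struct Endpoint {
    address: String,
    use_http3: bool,
}

/// Only non-HTTP/3 endpoints reach the TCP ConnectionManager; QUIC-only
/// (UDP) endpoints would otherwise draw ConnectionRefused floods.
fn tcp_endpoints(all: &[Endpoint]) -> Vec<Endpoint> {
    all.iter().filter(|e| !e.use_http3).cloned().collect()
}

fn main() {
    let eps = vec![
        Endpoint { address: "10.0.0.1:50071".to_string(), use_http3: false },
        Endpoint { address: "10.0.0.1:50072".to_string(), use_http3: true }, // QUIC/UDP only
    ];
    let tcp = tcp_endpoints(&eps);
    println!("{} TCP endpoint(s): {}", tcp.len(), tcp[0].address);
}
```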

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree: MokaEvictingMap cache (512MiB, 10K entries, 5min TTL) for
assembled tree results. 54 concurrent identical GetTree calls → 1 BFS
+ 53 cache hits. Eliminates ~15,900 redundant directory lookups.

ExistenceCacheStore: batch_get_part_unchunked now uses insert_many()
instead of 552 sequential insert() calls. Reduces ~2200 Moka ops +
552 run_pending_tasks() to ~552 inserts + 1 maintenance pass.

get_many(): documented why sequential is correct (Moka lock-free
reads at 100ns each, parallelism overhead exceeds benefit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree coalescing: DashMap + watch channel prevents thundering herd.
54 concurrent same-root calls → 1 BFS + 53 waiters. TOCTOU race
fixed with Entry API (single lock scope for check+register).
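A std-only single-flight sketch of the coalescing pattern (the real code uses DashMap plus a tokio watch channel; here a Mutex'd HashMap and OnceLock stand in, and followers spin-wait instead of awaiting). The Entry API does check+register under one lock, closing the TOCTOU window where two callers could both become leader:

```rust
use std::collections::HashMap;
use std::collections::hash_map::Entry;
use std::sync::{Arc, Mutex, OnceLock};

struct Coalescer {
    in_flight: Mutex<HashMap<String, Arc<OnceLock<u64>>>>,
}

impl Coalescer {
    fn get_or_compute(&self, root: &str, bfs: impl FnOnce() -> u64) -> u64 {
        let (cell, leader) = {
            let mut map = self.in_flight.lock().unwrap();
            match map.entry(root.to_string()) {
                // Follower: someone is already running this BFS; share its cell.
                Entry::Occupied(e) => (e.get().clone(), false),
                // Leader: register in the SAME lock scope as the check.
                Entry::Vacant(e) => (e.insert(Arc::new(OnceLock::new())).clone(), true),
            }
        }; // lock released before the (possibly slow) BFS runs
        if leader {
            let _ = cell.set(bfs());
            self.in_flight.lock().unwrap().remove(root);
        }
        // Followers wait for the leader's result (toy busy-wait version).
        loop {
            if let Some(v) = cell.get() {
                return *v;
            }
            std::thread::yield_now();
        }
    }
}

fn main() {
    let coalescer = Coalescer { in_flight: Mutex::new(HashMap::new()) };
    let dirs = coalescer.get_or_compute("root-digest", || 131);
    println!("tree has {dirs} directories");
}
```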

Subtree cache (level 2): per-directory MokaEvictingMap (256MiB, 50K,
5min TTL). BFS checks subtree cache before store fetch. Different
roots with 90% overlap → only ~10% dirs fetched from store.

Arc<Vec<Directory>>: CachedTree wraps directories in Arc for cheap
cache clones within moka. Response construction still deep-clones
(required by protobuf ownership).

insert_many: uses insert_batch (defers run_pending_tasks) instead
of insert_inner (called it per-item). Now 1 maintenance pass per
batch instead of N+1.
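A toy model of that batching win (ToyCache is illustrative; the N+1-vs-1 maintenance-pass accounting comes from the commit, with maintenance_runs standing in for run_pending_tasks):

```rust
/// Toy cache counting maintenance passes (the analogue of
/// Moka's run_pending_tasks).
struct ToyCache {
    items: Vec<(String, u64)>,
    maintenance_runs: usize,
}

impl ToyCache {
    /// Old path: maintenance after every single insert.
    fn insert_inner(&mut self, key: String, weight: u64) {
        self.items.push((key, weight));
        self.maintenance_runs += 1;
    }

    /// New path: insert the whole batch, then one maintenance pass.
    fn insert_batch(&mut self, batch: Vec<(String, u64)>) {
        for (key, weight) in batch {
            self.items.push((key, weight)); // no per-item maintenance
        }
        self.maintenance_runs += 1;
    }
}

fn main() {
    let mut cache = ToyCache { items: Vec::new(), maintenance_runs: 0 };
    cache.insert_batch((0..552).map(|i| (format!("digest-{i}"), 1)).collect());
    println!("{} items, {} maintenance pass(es)", cache.items.len(), cache.maintenance_runs);
}
```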

CLAUDE.md: added a code-review requirement before committing.

Level 3 Tree proto lookup deferred (no root→tree digest mapping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 integration tests: tree_cache_hit, tree_cache_miss_different_root,
subtree_cache_overlap, coalescing_concurrent, coalescing_leader_failure,
paginated_bypasses_cache, subtree_cache_deduplication, next_page_token.

Arc optimization: BFS result moved into Arc (zero-copy), cache gets
Arc clone (refcount bump), response gets one deep clone. Eliminates
transient double materialization (~5000 heap allocations saved).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FindMissingBlobs (120K/min), ByteStream read/write completed,
mirror blob streamed, AC read/write, BlobsAvailable registration,
streaming populate, connection creation: all downgraded from
info! to debug!. With release_max_level_info, these debug! calls
are compiled out of release builds, reducing log volume ~60x
under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>