
Production hardening: scheduling, P2P transfers, directory cache, TLS/QUIC#2243

Open
rejuvenile wants to merge 310 commits into TraceMachina:main from rejuvenile:pr/all-changes

Conversation


@rejuvenile rejuvenile commented Mar 24, 2026

Summary

Comprehensive production hardening from running NativeLink at scale with 10 Mac Mini workers and a ZFS-backed server. 156 commits across the following areas:

Store reliability & I/O performance

  • Fix store pipeline races causing Bazel "Lost inputs" errors
  • Fix ExistenceCacheStore false positives and LRU promotion race
  • Fix BatchUpdateBlobs missing results for duplicate digests
  • Retry hardlinks on cache eviction, hold file path read-lock during hardlink
  • Fire-and-forget eviction unrefs + CAS dedup in FilesystemStore
  • POSIX_FADV_SEQUENTIAL for kernel readahead, rayon-based blake3 hashing
  • Fix LRU eviction ordering at startup (sort by atime, not directory order)
  • Fix zero-digest file handling in FilesystemStore and worker pipeline

MokaEvictingMap — lock-free TinyLFU cache (NEW)

  • Replace Mutex+LRU EvictingMap with moka::sync::Cache (lock-free reads)
  • TinyLFU admission with frequency bump on insert (prevents single-access rejection)
  • Pinning via DashMap with 120s auto-timeout, 25% cap
  • Bounded eviction channel (4096) with inline fallback
  • Startup atime ordering preserved via insert_startup (no frequency bump)
  • KB-scaled weigher for items up to 4TB
  • All 4 stores migrated: ExistenceCacheStore, MemoryStore, FilesystemStore, MemoryAwaitedActionDb
  • Old EvictingMap/ShardedEvictingMap removed (-2,177 lines)
  • 18 integration tests covering full API
  • Production result: lock-contention events 391 (406ms worst case) → 0
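The KB-scaled weigher mentioned above can be sketched in a few lines. This is an illustrative sketch with a hypothetical function name, not the PR's actual code: moka weighers return `u32`, so byte sizes are scaled to KiB granularity to keep multi-terabyte items in range.

```rust
// Hypothetical sketch of a KB-granularity weigher: scaling bytes to KiB
// lets items up to ~4TB fit in the u32 that moka's weigher must return.
fn weigh_kb(len_bytes: u64) -> u32 {
    // Round up so even a 1-byte item has nonzero weight.
    let kb = len_bytes.div_ceil(1024).max(1);
    u32::try_from(kb).unwrap_or(u32::MAX) // saturate rather than truncate
}

fn main() {
    assert_eq!(weigh_kb(0), 1);
    assert_eq!(weigh_kb(4096), 4);
    println!("4TiB weighs {} KiB units", weigh_kb(4u64 << 40));
}
```

The cache capacity would be scaled by the same factor (capacity / 1024) so weights and limits stay in the same units.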

Redis pipelining (NEW)

  • has_with_results: batch STRLEN+EXISTS into single pipeline (N keys → 1 round-trip)
  • batch_get_part_unchunked: pipelined GETRANGE+EXISTS across store chain
  • BatchReadBlobs: delegates to batch pipeline instead of N individual reads
  • RENAME+PUBLISH pipelined (saves 1 RTT per write with pub_sub)
  • Pipeline chunking (5000 max) prevents unbounded response buffering
  • CrossSlot fallback for Redis cluster mode
  • Production result: Redis slow commands 10,400 → 0
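The pipeline-chunking behavior above (5000-key cap) amounts to splitting a key list into bounded batches, one pipeline round-trip each. A minimal sketch, with hypothetical names:

```rust
// Sketch: cap each Redis pipeline at MAX_PIPELINE_BATCH keys so a single
// pipeline response never buffers an unbounded amount of data. Each batch
// would be issued as one pipeline (e.g. STRLEN+EXISTS per key).
const MAX_PIPELINE_BATCH: usize = 5000;

fn pipeline_batches(keys: &[String]) -> Vec<&[String]> {
    keys.chunks(MAX_PIPELINE_BATCH).collect()
}

fn main() {
    let keys: Vec<String> = (0..12_001).map(|i| format!("key{i}")).collect();
    let batches = pipeline_batches(&keys);
    // 12,001 keys -> three pipelines: 5000 + 5000 + 2001.
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2].len(), 2001);
}
```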

Batch store reads (NEW)

  • batch_get_part_unchunked trait method across full store chain
  • MemoryStore override: direct get_many() + BytesWrapper::to_contiguous(), zero buf_channel
  • FilesystemStore override: parallel FuturesUnordered without buf_channel overhead
  • FastSlowStore: fast-first with slow fallback for NotFound
  • SizePartitioningStore: concurrent lower/upper partition batches
  • Production result: GetTree BFS 100-600ms → sub-millisecond (warm cache)

GetTree caching & coalescing (NEW)

  • Tree result cache: MokaEvictingMap (512MiB, 10K entries, 5min TTL)
  • Subtree directory cache: per-directory MokaEvictingMap (256MiB, 50K, 5min TTL)
  • Thundering herd coalescing: watch channel prevents redundant BFS (54 calls → 1 BFS + 53 waiters)
  • Arc<Vec> for zero-copy cache sharing
  • insert_many batch fix: 1 run_pending_tasks instead of N+1
  • 8 integration tests for caching, coalescing, subtree overlap
  • Production result: slow BFS levels 642 → 0, tree cache hit 6µs vs 26ms BFS
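The thundering-herd coalescing above uses a tokio watch channel; the core idea can be shown with a std-only sketch (hypothetical types, blocking primitives instead of async): the first caller for a key becomes the leader and runs the expensive BFS, later callers wait and reuse the leader's result.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Minimal std-only coalescing sketch. `None` marks an in-flight compute.
struct Coalescer {
    state: Mutex<HashMap<String, Option<Arc<Vec<u8>>>>>,
    cv: Condvar,
}

impl Coalescer {
    fn new() -> Self {
        Self { state: Mutex::new(HashMap::new()), cv: Condvar::new() }
    }

    fn get_or_compute(&self, key: &str, compute: impl FnOnce() -> Vec<u8>) -> Arc<Vec<u8>> {
        let mut st = self.state.lock().unwrap();
        match st.get(key) {
            Some(Some(v)) => return v.clone(), // cache hit
            Some(None) => loop {
                // Leader in flight: wait for it to publish the result.
                st = self.cv.wait(st).unwrap();
                if let Some(Some(v)) = st.get(key) {
                    return v.clone();
                }
            },
            None => {} // become the leader
        }
        st.insert(key.to_string(), None);
        drop(st);
        let v = Arc::new(compute()); // the expensive BFS would run here
        self.state.lock().unwrap().insert(key.to_string(), Some(v.clone()));
        self.cv.notify_all();
        v
    }
}

fn main() {
    let c = Arc::new(Coalescer::new());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let c = c.clone();
            thread::spawn(move || c.get_or_compute("tree", || vec![1u8, 2, 3]))
        })
        .collect();
    for h in handles {
        assert_eq!(*h.join().unwrap(), vec![1u8, 2, 3]);
    }
}
```

The `Arc<Vec<u8>>` return mirrors the PR's zero-copy sharing: all waiters clone the Arc, not the tree bytes.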

Graceful shutdown (NEW)

  • Server: HTTP/2 GOAWAY drain via hyper_util GracefulShutdown, 30s per-listener timeout
  • Server: QUIC serve_with_shutdown on accept_stop signal
  • Worker: CAS TCP server serve_with_shutdown with cas_shutdown_tx watch channel
  • SIGTERM sequence: stop accept → drain (35s) → flush writes (30s) → shutdown schedulers (20s) → exit
  • Systemd: TimeoutStopSec=90, KillMode=mixed
  • Production result: clean shutdown, no more stale AC entries after restart
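The systemd settings above would sit in the service's unit file, roughly like this (illustrative fragment; the actual unit file isn't part of this diff):

```ini
[Service]
# Budget for the full SIGTERM sequence: stop accept -> drain (35s) ->
# flush writes (30s) -> shutdown schedulers (20s), before SIGKILL.
TimeoutStopSec=90
# mixed: SIGTERM goes to the main process only, so child processes can be
# wound down in order; on timeout, SIGKILL hits the whole control group.
KillMode=mixed
```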

Streaming blob fixes (NEW)

  • Fix errored in-flight entries poisoning all reads for that digest
  • has_error() check falls through to CAS store on poisoned entries
  • Immediate cleanup of poisoned InFlightBlobMap entries

gRPC protocol & transport

  • Fix 4MB default message size limit (configurable per-listener)
  • Fix ByteStream resume returning wrong error code
  • Add gRPC status detail propagation for FAILED_PRECONDITION
  • Parallel chunked ByteStream reads for large blobs
  • HTTP/2 send buffer and flow control tuning
  • Fix dual_transport: skip QUIC endpoints in TCP ConnectionManager

Worker improvements

  • DirectoryCache: spawn_blocking for filesystem ops, parallel blob downloads
  • Fix subtree race condition with download fallback
  • Worker proxy store with blob mirroring and locality-aware scheduling
  • Targeted prefetch of missing blobs at action dispatch
  • BlobsInStableStorage lifecycle for durability
  • Fix worker reconnect on scheduler eviction / server restart

Observability

  • Runtime watchdog + TCP keepalive
  • Store operation stall detector with thread dumps
  • pprof auto-capture on CPU threshold
  • Comprehensive write path and eviction logging
  • EvictingMap lock instrumentation
  • Non-blocking stdout logging via tracing-appender

QUIC/HTTP3 transport

  • QUIC connection pool with SO_REUSEPORT
  • BBR congestion control, tuned ACK delay
  • TLS support for worker-to-server and server-to-worker connections
  • Dual TCP+QUIC transport with intelligent routing

Test plan

  • 18 MokaEvictingMap integration tests (insert/get/remove, eviction, TTL, pinning, callbacks, concurrent stress)
  • 8 GetTree cache tests (tree_cache_hit, subtree_overlap, coalescing_concurrent, leader_failure, paginated_bypass)
  • Existing CAS server tests pass
  • cargo check --bin nativelink --features quic,pprof compiles clean
  • Production deployment: 10 Mac Mini workers, 48GB MemoryStore, 800GB FilesystemStore
  • macOS benchmark: 39s → 11s (3.5x faster)

🤖 Generated with Claude Code



rejuvenile and others added 30 commits March 23, 2026 09:39
inner_store returned self, preventing callers (like LocalWorker) from
downcasting through the chain to find FastSlowStore. Delegate to inner
store instead — optimized_for override is independent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EvictingMap: warn! for items evicted within 120s of insertion (age + size
in log), debug! for older items. Helps diagnose Bazel "lost inputs" errors.

Worker: append .local to bare hostnames for mDNS resolution so the server
can connect to worker CAS endpoints for peer blob sharing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
upload_results writes to fast_store() only (FilesystemStore), deferring
remote CAS upload to spawn_upload_to_remote. But that function only
collected tree_digest blobs, not the individual file blobs inside output
directory trees (dep-graph.bin, query-cache.bin, etc). This caused
"Missing digest" / "lost inputs" errors when Bazel tried to download
action outputs that were never pushed to the server.

Fix: decode each Tree proto from fast_store and extract all file digests
for inclusion in the background upload. Also add success/fail counters
and tree file count to upload logging for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers now race a peer fetch (via locality map) in parallel with the server
fetch when peers are known. Whichever responds first wins; the loser is
cancelled. This gives LAN peers a chance to win when they're closer/faster.

Server-side behavior is unchanged — IS_WORKER_REQUEST detection ensures the
sequential path (with redirect generation) is used for server-side requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit raced on both server and worker sides because
IS_WORKER_REQUEST isn't set for Bazel client requests to the server.
Add an explicit race_peers flag (default false) that only workers
enable, preventing the server from wastefully racing against its
own workers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AS, register output digests

- Parallelize server-side GetTree BFS (FuturesUnordered per tree level)
- GrpcStore: report LazyExistenceOnSync for CAS stores (skip FindMissingBlobs before get_part)
- WORKER_BACKLOG 8→64 to reduce backpressure during burst patterns
- Worker peer CAS connections 4→64
- Include tree digests in BlobsAvailableNotification from worker
- Register output digests from ExecuteResult in server locality map
- Fix existence_store_test: yield for async eviction callbacks
- Fix bytestream_server_test: tonic Status format change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…check, parallel mkdir

- Bound hardlink phase to 64 concurrent tasks (was unbounded 4000+)
- Split has_with_results into 500-key chunks to release Mutex between batches
- Level-parallel BFS for directory creation (siblings concurrent, parents first)
- Log CAS server exit errors in local_worker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log requested/missing counts at info level, and list missing digests at
debug level. Needed to diagnose "Lost inputs" build failures where blobs
exist on disk but Bazel reports them missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tribution

Uses WorkerProxyStore locality map to route batch blob reads to peers that
already have the data, falling back to server for unknown digests. All peer
and server batches execute in parallel via join_all, eliminating the previous
server-only bottleneck where 10 workers competed for the same blobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…concurrency to 64

- Round-robin digest assignment across peers that have the blob, preventing
  hotspots when one peer has most blobs
- Retry path is now best-effort: individual failures are logged and skipped
  instead of aborting the entire batch operation
- BYTESTREAM_CONCURRENCY increased from 16 to 64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
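The round-robin assignment described above can be sketched as follows (hypothetical names; the real code routes through WorkerProxyStore's locality map):

```rust
use std::collections::HashMap;

// Sketch: rotate digest assignments across the peers that hold each blob,
// so one well-stocked peer doesn't become a hotspot. Digests no peer has
// fall back to the server.
fn assign_round_robin<'a>(
    digests: &[&'a str],
    peers_with: impl Fn(&str) -> Vec<&'a str>,
) -> HashMap<&'a str, Vec<&'a str>> {
    let mut rr = 0usize;
    let mut out: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &d in digests {
        let peers = peers_with(d);
        let target = if peers.is_empty() {
            "server" // unknown digest: server is the only source
        } else {
            let p = peers[rr % peers.len()];
            rr += 1;
            p
        };
        out.entry(target).or_default().push(d);
    }
    out
}

fn main() {
    let a = assign_round_robin(&["d1", "d2", "d3"], |_| vec!["p1", "p2"]);
    assert_eq!(a["p1"], vec!["d1", "d3"]);
    assert_eq!(a["p2"], vec!["d2"]);
}
```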
Workers report cpu_load_pct (load_avg_1m / num_cpus * 100) piggybacked
on KeepAliveRequest (~2.5s), BlobsAvailableNotification (~500ms), and
ExecuteComplete (per action). The scheduler stores this per-worker and
prefers lightly-loaded workers when selecting candidates:

- LRU/MRU fallback path: picks lightest-loaded viable worker
- Locality scoring tiebreaker: when scores are within 10%, lower CPU
  load wins before timestamp
- Workers reporting 0 (unknown/old) are sorted last among known loads

Backward compatible: old workers send 0 (proto default), treated as
unknown and sorted last.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
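The load metric and ordering described above reduce to a small amount of arithmetic; a sketch with hypothetical names:

```rust
// cpu_load_pct = load_avg_1m / num_cpus * 100; 0 means "unknown" (an old
// worker that never reports), and unknowns sort after every known load.
fn cpu_load_pct(load_avg_1m: f64, num_cpus: u32) -> u32 {
    (load_avg_1m / num_cpus as f64 * 100.0).round() as u32
}

fn sort_by_load(workers: &mut Vec<(&str, u32)>) {
    // Known loads ascending; unknown (0) mapped to MAX so it sorts last.
    workers.sort_by_key(|&(_, load)| if load == 0 { u32::MAX } else { load });
}

fn main() {
    assert_eq!(cpu_load_pct(4.0, 8), 50);
    let mut ws = vec![("a", 0), ("b", 140), ("c", 35)];
    sort_by_load(&mut ws);
    assert_eq!(ws[0].0, "c"); // lightest known load first
    assert_eq!(ws[2].0, "a"); // unknown sorted last
}
```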
Three bugs in inner_get_tree:

1. FuturesUnordered returned directories in completion order, not BFS
   discovery order, making paging tokens nondeterministic. Fixed by
   collecting into a HashMap and iterating in original order.

2. page_size=0 (no paging) triggered `len >= 0` which is always true,
   breaking after the first BFS level. Fixed by treating 0 as MAX.

3. When page was filled mid-level, remaining unprocessed items were
   dropped, producing empty next_page_token. Fixed by copying remaining
   items back to the deque front.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e fast path

- GetTree BFS validation: use position assignment instead of re-hashing
  directory protos (fixes 77% validation failure rate with Java-serialized protos)
- EvictingMap::get_many(): batch retrieval in single lock acquisition
- FilesystemStore::get_file_entries_batch(): batch file entry lookup
- Hardlink pre-fetch: pre-fetch entries in batches of 500 before concurrent
  hardlink loop, eliminating per-file EvictingMap lock contention
- Blob eviction race fix: eager pre-read of small blobs (<=1MiB) before
  background upload to prevent eviction race in spawn_upload_to_remote
- Directory cache: use download_to_directory for cache-miss construction
  instead of serial per-file RPCs (2.5s -> 50-200ms)
- Combined set_readonly_recursive + calculate_directory_size into single walk
- GetTree BFS dedup logging: per-level timing, dedup stats, slow level warnings
- Input fetch logging: tree resolution, materialization, hardlink stats,
  blob fetch throughput, slow operation warnings
- CPU load logging downgraded to debug level (worker + server)
- Load-aware selection logging downgraded to debug level
- Fix mkdir_depth_levels log field (was using dirs_created instead of depth)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Streaming pipeline: producer fetches missing blobs in batches of 64,
  consumer hardlinks concurrently (64-wide) as blobs arrive via mpsc channel.
  Both overlap via futures::future::join for minimum wall-clock time.
- Raise MAX_PEER_HINTS from 1000 to 16384 to cover large actions.
- FastSlowStore: per-leg timing (fast_ms, slow_ms, slower_leg).
- FilesystemStore: per-phase timing (temp create, write, emplace) >50ms.
- EvictingMap: warn! on lock contention >1ms with operation name.
- StallGuard: with_context() for dynamic digest/size in stall dumps.
- DirectoryCache: comprehensive hit/miss/eviction/timing logging.
- MemoryStore config: 16GB→32GB per tier (64GB total) in server configs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlinks as blobs arrive

Three-future pipeline: fetcher (all blobs at once, bounded at 128), producer
(sends files to channel as blobs land via Notify), consumer (hardlinks at 64
concurrency). Eliminates serial per-batch round-trips that bottlenecked at
47-60 MB/s despite 10GbE capacity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…info

- directory_cache: set_readonly_and_calculate_size was setting all files
  to 0o444, stripping the execute bit from executables. Now preserves
  execute permission (0o555 for executables, 0o444 for non-executables).
  This caused EPERM failures for cargo build scripts (runner, ring, etc).

- api_worker_scheduler: promote "Load-aware worker selection" log from
  debug! to info! so CPU-load scheduling decisions are visible in
  production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tests verifying load-aware worker selection:
- cpu_load_update_worker_load_stores_correctly
- cpu_load_lightest_loaded_worker_gets_picked
- cpu_load_unknown_zero_sorted_last
- cpu_load_falls_back_to_lru_when_no_load_data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers report cached input_root_digest values to the scheduler via
BlobsAvailableNotification. Scheduler gives highest routing priority
to workers with a directory cache hit for the action's input_root_digest.

Fix EPERM: set_readonly_and_calculate_size now strips write bits only
(& 0o555) instead of guessing executable status. Also removes the
skip-when-0o555 chmod optimization in hardlink_and_set_metadata which
was unsafe because concurrent hardlinks sharing CAS inodes can corrupt
file permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
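The permission fix above is a single bit-mask: stripping write bits (`& 0o555`) preserves whatever execute bits the file already had, with no guessing. A sketch:

```rust
// Strip write bits only: executables stay executable, nothing is writable.
fn readonly_mode(mode: u32) -> u32 {
    mode & 0o555
}

fn main() {
    assert_eq!(readonly_mode(0o755), 0o555); // executable stays executable
    assert_eq!(readonly_mode(0o644), 0o444); // plain file becomes read-only
}
```

Contrast with the earlier 0o444 hardcode, which silently dropped the execute bit from every cached script.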
On cache MISS, resolve the full merkle tree and check if any subtrees
are already cached from other root entries. Cached subtrees are reused
via directory symlinks (APFS-compatible), skipping download of already-
materialized portions. BFS traversal ensures maximum (top-down) subtree
matching.

- Store .merkle_tree_meta alongside each cached directory entry
- In-memory subtree_index maps every directory digest to its disk path
- Rebuild subtree index from disk metadata on startup
- Clean up subtree index entries on eviction
- Made resolve_directory_tree public for cache access
- 6 new tests for merkle metadata, subtree index, and cache reload

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS requires write permission on the source directory for rename(2),
unlike POSIX/Linux which only checks the parent. Temporarily restore
0o755 on the temp dir before rename, then lock down to 0o555 after.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er subtree scoring

Workers report cached subtree digests via delta encoding (added/removed)
instead of full snapshots every 500ms. First notification sends full
snapshot, subsequent ones send only changes. Scheduler scores workers
by both root directory cache hits and subtree cache hits.

Fix CAS inode corruption: always provide explicit unix_mode (0o444 for
non-executable, 0o555 for executable) to prevent concurrent hardlinks
from corrupting shared inode permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… not just directory count

When no worker has an exact root match, the scheduler now scores workers by
the total file bytes under their cached subtree digests. A worker caching a
subtree with 10GB of files scores higher than one with 100 bytes. The tree
resolver computes per-subtree byte totals via bottom-up aggregation during
BFS resolution, cached alongside file digests.

Scoring tiers: exact root match > weighted subtree coverage > blob locality > LRU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e cache

Bazel actions create output directories inside the input tree. Directory
symlinks to cached subtrees caused EPERM because either:
- Cached dirs were 0o555 (can't mkdir inside), or
- If made writable, actions would mutate the cache

Fix: use hardlink_directory_tree for subtree cache hits — creates fresh
writable directories and hardlinks only the files. Cache integrity is
preserved (0o555 dirs, read-only files) since actions never access the
cache directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ach path component

When create_dir_all fails for an output directory, walk up the path and
log the mode, is_dir/is_file/is_symlink status of each component to
identify whether the failure is due to a read-only parent, a file
blocking a directory, or a symlink to a read-only cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dlink

The Bazel input tree contains SymlinkNodes (e.g., bazel-out) that, when
recreated as symlinks in the work directory, point to the read-only
directory cache (0o555 directories). create_dir_all then fails with
EPERM when trying to create output directories through these symlinks.

Fix: hardlink_directory_tree_recursive now resolves directory symlinks
to real directories with fresh writable permissions, while preserving
file symlinks and dangling/looping symlinks as-is.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…opies

The Bazel input tree contains SymlinkNodes (e.g., bazel-out -> .) that
point to directories in the read-only directory cache (0o555). When
create_dir_all tries to create output directories through these symlinks,
it fails with EPERM because the resolved target directories are read-only.

Previous approach (resolving ALL directory symlinks in hardlink_directory_tree)
caused infinite recursion for self-referential symlinks like bazel-out -> .

New approach: prepare_output_directories now handles this surgically:
1. Fast path: try create_dir_all (usually works)
2. On failure, walk the output path component by component
3. For each symlink that resolves to a read-only directory:
   - Replace the symlink with a real writable directory
   - Create absolute symlinks to all entries in the original target
   - Skip self-referential entries (e.g., bazel-out pointing to itself)
4. For read-only work dirs (0o555): chmod writable
5. Retry create_dir_all after each fix

This preserves access to all input tree files while making the specific
output path writable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_dir_all

create_dir_all succeeds when the output directory already exists (it's
part of the input tree), even if the directory is read-only (0o555)
through a symlink chain to the cached directory. Then rustc fails with
Permission denied when writing output files (.d, binary) into it.

Fix: after create_dir_all succeeds, check if the parent directory is
actually writable (mode & 0o200). If not, fall through to the slow path
that replaces symlinks-to-read-only-dirs with writable shallow copies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- set_readonly_recursive: preserve execute bits (& !0o222 instead of
  hardcoded 0o444) so cached shell scripts remain executable
- prepare_output_directories: serialize slow-path symlink replacement
  with a tokio Mutex to prevent concurrent EEXIST/ENOENT races
- Cargo.toml: lto="thin", codegen-units=16 for faster release builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old cache entries had files at 0o444 (no execute bit) from before the
set_readonly_and_calculate_size fix. Since the cache persists on disk
across restarts, these stale entries kept serving non-executable files
like cargo_build_script_runner and cc_wrapper.sh.

Add CACHE_FORMAT_VERSION (currently 2) with a version file check on
startup. When the version is missing or stale, all entries are cleared
and the version file is written. This ensures format changes (like
permission semantics) automatically invalidate old entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rejuvenile and others added 28 commits April 14, 2026 21:52
StreamingBlob core primitive (nativelink-util/src/streaming_blob.rs):
- Single writer, multiple concurrent readers with independent cursors
- RwLock<VecDeque<Bytes>> for O(1) append + O(1) front eviction
- Sliding window eviction with configurable max_buffer_bytes
- Writer Drop safety: sets terminal-error if EOF never sent
- InFlightBlobMap with Arc::ptr_eq on removal (prevents race)
- 10 tests: data flow, multi-reader, error propagation, window
  eviction, reader blocking, EOF gating, map operations
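The sliding-window buffer at the core of the primitive can be sketched like this (simplified: `Vec<u8>` instead of `Bytes`, no locking or reader cursors):

```rust
use std::collections::VecDeque;

// O(1) append at the back, O(1) eviction at the front once buffered bytes
// exceed max_buffer_bytes. earliest_chunk_idx lets a reader detect that
// chunks it never saw have been evicted.
struct SlidingWindow {
    chunks: VecDeque<Vec<u8>>,
    buffered: usize,
    earliest_chunk_idx: usize, // index of chunks.front() within the blob
    max_buffer_bytes: usize,
}

impl SlidingWindow {
    fn new(max_buffer_bytes: usize) -> Self {
        Self { chunks: VecDeque::new(), buffered: 0, earliest_chunk_idx: 0, max_buffer_bytes }
    }

    fn push(&mut self, chunk: Vec<u8>) {
        self.buffered += chunk.len();
        self.chunks.push_back(chunk);
        // Evict from the front, but always keep the newest chunk.
        while self.buffered > self.max_buffer_bytes && self.chunks.len() > 1 {
            let evicted = self.chunks.pop_front().unwrap();
            self.buffered -= evicted.len();
            self.earliest_chunk_idx += 1;
        }
    }
}

fn main() {
    let mut w = SlidingWindow::new(8);
    w.push(vec![0; 4]);
    w.push(vec![0; 4]);
    w.push(vec![0; 4]); // 12 bytes > 8: front chunk evicted
    assert_eq!(w.earliest_chunk_idx, 1);
    assert_eq!(w.buffered, 8);
}
```

A later commit in this PR uses exactly that `earliest_chunk_idx > 0` signal to fall back to the slow store rather than serve a partial blob.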

Design documents:
- streaming-blob-pipeline-design.md: 5-phase migration plan for
  concurrent read-while-write across server, workers, and P2P.
  Incorporates code + perf reviewer feedback (13 open issues).
- resumable-writes-design.md: ByteStream write resumption after
  client disconnect, with cross-references to streaming pipeline.
- latency-reduction-opportunities.md: 10 protocol-level enhancements
  with latency estimates, complexity ratings, and priority order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log three new conditions:
- BytesWrapper total_len != sum of chunk lengths (corrupt entry)
- send() failure mid-stream (with chunk count, bytes sent, remaining)
- Existing incomplete-read error now includes actual_data_len

These diagnostics will identify whether the 6,496 hash mismatches
are caused by corrupt BytesWrapper entries, send failures, or
something else in the store chain above MemoryStore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log when fast store get_part() returns Ok but bytes_written doesn't
match the digest's expected size. Two paths instrumented:
- Direct fast store hit (has() + get_part)
- Waiter path after loader populated the fast store

Combined with the MemoryStore diagnostics (which showed zero issues),
this will pinpoint whether the corruption happens in FastSlowStore's
tee/forwarding or in a layer above it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major features:
- Streaming read-while-write: InFlightBlobMap allows readers to stream from
  in-progress uploads before store commit. Bounded at 128 concurrent blobs
  with 64MiB sliding window per blob.
- Partial-read false-positive fix: FastSlowStore size checks and
  LoggingReadStream hash verification now correctly skip validation for
  partial reads (worker parallel_chunk_count=64 splits blobs into 64 RPCs).
- Missing digest hints: server resolves GetTree inline (200ms timeout),
  sends missing_digests in StartExecute so workers skip has() checks.
- Zstd compression: opt-in transport-level compression on GrpcStore.
- MATCH_CONCURRENCY 32: up from 8 for parallel worker matching.
- Memory pressure management: idle stream tracking with saturating counters,
  sweeper evicts oldest idle streams when over budget (default 256MiB).

Fixes from code review:
- atomic_saturating_sub prevents partial_write_bytes counter underflow
- Sweeper re-checks maybe_idle.is_some() before removing (double-lock race)
- InFlightBlobMap capped at 128 entries (8GiB worst case)
- FastSlowStore has() size comparison uses < instead of != (block alignment)
- inner_prepare_action BoxFuture prevents stack overflow in tests
- Demoted hot-path scheduler logs to debug! (tree cache hits, scoring)

35 new tests across streaming_blob, bytestream_server, grpc_store,
running_actions_manager, and simple_scheduler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
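The `atomic_saturating_sub` fix above (preventing counter underflow) can be sketched with `fetch_update`; the function name comes from the commit, the body is an assumed implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A compare-and-swap loop that clamps at zero, so concurrent decrements
// can never wrap the partial_write_bytes counter around to u64::MAX.
fn atomic_saturating_sub(counter: &AtomicU64, amount: u64) -> u64 {
    counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |v| {
            Some(v.saturating_sub(amount))
        })
        .unwrap() // the closure always returns Some, so this never fails
}

fn main() {
    let c = AtomicU64::new(10);
    atomic_saturating_sub(&c, 4);
    assert_eq!(c.load(Ordering::Acquire), 6);
    atomic_saturating_sub(&c, 100); // plain fetch_sub would underflow here
    assert_eq!(c.load(Ordering::Acquire), 0);
}
```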
The old-format bytestream config (cas_stores map) didn't carry the
streaming_read_while_write and max_streaming_blob_buffer_bytes fields
through to the new ByteStreamConfig. This caused deny_unknown_fields
rejection when the config used the old format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When multiple readers request the same blob that isn't in the fast store,
the first reader populates from the slow store while subsequent readers
(waiters) previously blocked on the OnceCell until populate completed,
then read from the fast store. This added latency equal to the full
populate time plus a TOCTOU risk where the blob could be evicted between
populate and the waiter's read.

Now the populate thread tees data into a StreamingBlobInner (64MiB sliding
window). Waiters get a StreamingBlobReader and consume data concurrently
as it arrives from disk, with no blocking and no eviction race.

Also makes StreamingBlobInner::new, StreamingBlobWriter::new, and
StreamingBlobReader::new public for cross-crate use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously waiters blocked on OnceCell::get_or_try_init until populate
completed, then read from the streaming buffer. This was just TOCTOU
protection, not a latency improvement.

Now waiters bypass the OnceCell entirely using the is_new flag from
get_loader. When is_new=false (another thread is populating), the waiter
immediately creates a StreamingBlobReader and consumes data as chunks
arrive from disk. Time-to-first-byte for waiters drops from full
populate time (~100ms-10s) to first chunk arrival (~1-2ms).

The populator still uses get_or_try_init to ensure exactly one populate
runs, and tees data into both the client writer and the streaming buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For blobs > 64 MiB (the sliding window size), the waiter's
StreamingBlobReader could start at a non-zero chunk index after
eviction, but the offset math assumed pos=0 was the blob start.
This would silently serve partial data.

Fix: check earliest_chunk_idx before starting to stream. If chunks
have already been evicted (earliest > 0), fall back to slow store
directly instead of serving a partial blob.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- info! log when waiter uses concurrent streaming path
- Detect zero bytes streamed for full reads and fall back to slow store
  (handles race where populate completes before waiter subscribes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The waiter path entered the streaming reader even when the populate had
already finished (is_terminal=true), causing zero-byte reads from a
drained buffer. Now three distinct paths:

1. is_waiter && !is_terminal: stream concurrently from populate buffer
2. is_waiter && is_terminal: read from fast store (populate done)
3. !is_waiter (populator): populate_and_maybe_stream with tee

Removes the zero-byte fallback hack which was masking this bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When is_new=false (waiter), copy_slow_to_fast created a
StreamingBlobWriter that was never used. Its Drop impl would log
a spurious "dropped without eof" warning, and in the rare
get_or_try_init retry case, could poison the StreamingBlobInner.

Fix: populate_and_maybe_stream now takes Option<StreamingBlobWriter>.
Waiters pass None, populators pass Some(writer). The tee and EOF
calls are gated behind if let Some.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…reduced channels

10 optimizations for the read/write hot paths:

1. ShardedEvictingMap: parallel shard iteration for sizes_for_keys,
   get_many, insert_many using FuturesUnordered (independent locks)
2. FastSlowStore: skip has() + get_part() double lookup, attempt
   get_part() directly and handle NotFound as cache miss
3. CompletenessCheckingStore: fetch AC entry once, decode for
   completeness check, serve bytes directly (eliminates double fetch)
4. VerifyStore: reduce buf_channel from 256 to 4 slots (hasher is
   memory-speed, deep buffering is wasteful)
5. store_trait: reduce get_part_unchunked/update_oneshot channel from
   1024 to 4 slots (collecting into single buffer, not streaming)
6. LoggingReadStream: remove redundant SHA256 hashing (VerifyStore
   already verifies on reads)
7. FastSlowStore mirror: pre-allocate BytesMut with digest size
8. ExistenceCacheStore: Vec::with_capacity for cache miss collection
9. DedupStore: FuturesOrdered → FuturesUnordered for has_with_results
10. GetTree BFS: capacity hints for deque, seen set, directories vec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ction

When the server restarts, QUIC connections enter a half-open state.
The old 60s idle timeout meant dead connections weren't detected for
up to 60s, causing RPC timeouts on workers. The H3Connection layer
has built-in reconnection (detects driver closure in poll_ready),
but it only triggers after the connection is marked dead.

With 2s keepalives and 15s idle timeout, dead connections are
detected within ~4-6s, allowing automatic reconnection before
the 120s RPC timeout expires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server graceful shutdown: HTTP/2 GOAWAY drain on SIGTERM via hyper_util
GracefulShutdown. QUIC serve_with_shutdown. SIGTERM sequence: stop
accept → drain (30s) → flush writes (30s) → shutdown schedulers (20s).

Worker CAS server: serve_with_shutdown with cas_shutdown_tx watch
channel triggered on worker SIGTERM.

Streaming blob fix: errored in-flight entries in InFlightBlobMap were
poisoning reads. Now checks has_error() and falls through to store.

MokaEvictingMap: lock-free TinyLFU cache replacing Mutex+LRU. Fixes:
pin race (DashMap-first ordering), replaced-item unref, bounded
eviction channel, BTree cleanup on eviction, batched pin_keys,
atomic fast-path for pinned check.

Phase 2: ExistenceCacheStore and MemoryStore migrated to Moka.
EvictingMap shards reduced 64→16.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
has_with_results() now sends all N keys' STRLEN+EXISTS commands in
one pipeline round-trip instead of N individual round-trips. With
64 connections and 16K permits, this eliminates the head-of-line
blocking that caused 5.5s Redis latency from 3 connections.

CrossSlot fallback for Redis cluster mode preserves the original
per-key concurrent behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 2.3 + Phase 3: all stores now use Moka. FilesystemStore
uses pinning, async unref, callbacks, insert_with_time.
MemoryAwaitedActionDb is TTL-only.

FsEvictingMap type alias simplified: lifetime parameter removed
(MokaEvictingMap requires Q: 'static). Lookups still work with
any lifetime via Borrow trait.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Startup atime ordering: insert_with_time() uses insert_startup()
   which skips frequency bump and run_pending_tasks(). Items enter
   at freq=1, preserving FIFO window ordering (oldest evicted first).

2. u32 weigher: scale to KB granularity (capacity/1024, weight/1024)
   so items up to 4TB fit in u32 without truncation.

3. Unpin re-admission: unpin_key() bumps frequency after re-inserting
   into Moka so TinyLFU doesn't immediately reject the item.

4. Channel-full callback skip: log a warning when the eviction channel
   is full and ItemCallbacks are skipped in the tokio::spawn fallback.

5. Startup batch perf: insert_startup() defers run_pending_tasks()
   to bulk processing when the first real operation triggers it.
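The arithmetic behind point 2 above, as a sketch (function names are illustrative, not NativeLink's API; the u32 constraint and KB scaling come from the commit):

```rust
/// Moka's weigher returns u32, so byte-granular weights would cap usable
/// capacity near 4GiB. Scaling both the configured capacity and each
/// item's weight to KiB admits items up to ~4TiB without truncation.
fn weigh_kb(len_bytes: u64) -> u32 {
    // Round up so even tiny items cost at least 1 KiB of capacity.
    let kb = len_bytes.div_ceil(1024).max(1);
    u32::try_from(kb).unwrap_or(u32::MAX) // saturate rather than truncate
}

/// The configured byte capacity, expressed in the same KiB units.
fn capacity_kb(max_bytes: u64) -> u64 {
    max_bytes / 1024
}

fn main() {
    // A 1GiB blob weighs 2^20 KiB units against a KiB-scaled capacity.
    println!("1GiB -> {} KiB units", weigh_kb(1 << 30));
    println!("64GiB cache -> {} KiB capacity", capacity_kb(64 << 30));
}
```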

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds batch_get_part_unchunked to StoreDriver trait with pipelined
Redis implementation. GetTree BFS now fetches all directories per
level in a single Redis pipeline (1 round-trip for 130+ dirs)
instead of 130 individual GETRANGE commands.

Store chain support: RedisStore (true pipeline), FastSlowStore
(fast-first with slow fallback), SizePartitioningStore (split by
threshold, concurrent), ExistenceCacheStore (delegates + updates
cache), VerifyStore (pass-through). Default impl fans out via
FuturesUnordered for non-Redis backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redis update(): RENAME + PUBLISH combined into single pipeline
(saves 1 RTT per write when pub_sub enabled).

BatchReadBlobs gRPC handler: delegates to batch_get_part_unchunked
instead of N individual get_part_unchunked calls. Single Redis
pipeline for all requested blobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redis pipeline chunking (MAX_PIPELINE_BATCH=5000): prevents unbounded
response buffering for very large batches. CrossSlot fallback logging
added for cluster-mode observability.
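The batch cap can be sketched as a plain chunking pass (the GETRANGE command string is illustrative; MAX_PIPELINE_BATCH=5000 comes from the commit):

```rust
const MAX_PIPELINE_BATCH: usize = 5000;

/// Split a large key set into bounded pipelines so no single round-trip
/// buffers more than MAX_PIPELINE_BATCH replies in memory.
fn chunked_pipelines(keys: &[&str]) -> Vec<Vec<String>> {
    keys.chunks(MAX_PIPELINE_BATCH)
        .map(|chunk| {
            chunk
                .iter()
                .map(|key| format!("GETRANGE {key} 0 -1"))
                .collect()
        })
        .collect()
}

fn main() {
    let keys: Vec<String> = (0..12_000).map(|i| format!("blob:{i}")).collect();
    let key_refs: Vec<&str> = keys.iter().map(String::as_str).collect();
    let batches = chunked_pipelines(&key_refs);
    // 12,000 keys split into batches of 5000 / 5000 / 2000.
    println!("{} batches", batches.len());
}
```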

BatchReadBlobs: 64MiB per-blob size cap prevents unbounded memory
from malicious/misconfigured clients.

ExistenceCacheStore: batch path now caches digest.size_bytes()
instead of data.len(), matching single-key get_part behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evicting_map.rs: 1593 → 84 lines (LenEntry, ItemCallback, NoopCallback
retained). Removed EvictingMap, ShardedEvictingMap, State, EvictionItem,
LockMetrics, lock_with_metrics!, SerializedLRU, and all constants.

Deleted evicting_map_test.rs (668 lines). Removed lru dependency.
Updated doc comments referencing old type names.

Net: -2,177 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
18 integration tests covering: insert/get/remove, max_bytes/max_count
eviction, TTL expiration, pin/unpin/pin_cap, replaced items, startup
insert, sizes_for_keys, range queries, concurrent stress, callbacks.

VerifyStore: documented that batch_get_part_unchunked bypasses hash
verification. Audited all store wrappers — completeness_checking and
dedup correctly fall through to default (no bypass).

Expiry investigation: Moka's Expiry trait does NOT help with size-based
eviction ordering (TTL and size eviction are independent). Corrected
doc: window deque is unused in Moka 0.12, entries go to MainProbation.
Current mitigation (insert_startup skips freq bump, FIFO ordering in
MainProbation) correctly preserves atime-based eviction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MemoryStore: direct evicting_map.get_many() + BytesWrapper::to_contiguous().
Zero buf_channel allocations, zero async task pairs. Single-chunk blobs
are zero-copy (Arc bump). ~50us for 500 keys vs ~2-5ms before.
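The single-chunk zero-copy path can be modeled like this (to_contiguous is named in the commit; the enum layout and chunk representation are an illustrative guess):

```rust
use std::sync::Arc;

/// Toy stand-in for BytesWrapper: either one contiguous chunk or several.
enum BytesWrapper {
    Single(Arc<Vec<u8>>),
    Chunked(Vec<Arc<Vec<u8>>>),
}

impl BytesWrapper {
    fn to_contiguous(&self) -> Arc<Vec<u8>> {
        match self {
            // Zero-copy: returning a clone of the Arc only bumps a refcount.
            BytesWrapper::Single(chunk) => chunk.clone(),
            // Multi-chunk: one allocation to flatten into contiguous bytes.
            BytesWrapper::Chunked(chunks) => {
                Arc::new(chunks.iter().flat_map(|c| c.iter().copied()).collect())
            }
        }
    }
}

fn main() {
    let blob = BytesWrapper::Single(Arc::new(vec![0u8; 1024]));
    let contiguous = blob.to_contiguous();
    println!("single-chunk read returned {} bytes with no copy", contiguous.len());
}
```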

FilesystemStore: FuturesUnordered for parallel I/O but skips buf_channel.
Each task does evicting_map.get() → read_file_entry_bytes() → Bytes.
Preserves FD semaphore, stale-entry cleanup, length cap. 256MiB safety
bound for unlimited reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dual_transport code was passing ALL endpoints (including those
with use_http3=true) to the TCP ConnectionManager. This caused TCP
connection attempts to the QUIC-only UDP port (50072), generating
persistent ConnectionRefused errors and log floods on all workers.

Fix: filter out use_http3 endpoints when building tcp_endpoints.
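The fix is a simple predicate over the endpoint list (the Endpoint struct and addresses here are hypothetical; the use_http3 flag and port 50072 come from the commit):

```rust
/// Hypothetical endpoint config mirroring the dual-transport setup.
#[derive(Clone)]
struct Endpoint {
    address: String,
    use_http3: bool,
}

/// Only non-HTTP/3 endpoints reach the TCP ConnectionManager; QUIC-only
/// (UDP) endpoints would otherwise draw ConnectionRefused floods.
fn tcp_endpoints(all: &[Endpoint]) -> Vec<Endpoint> {
    all.iter().filter(|e| !e.use_http3).cloned().collect()
}

fn main() {
    let eps = vec![
        Endpoint { address: "10.0.0.1:50071".to_string(), use_http3: false },
        Endpoint { address: "10.0.0.1:50072".to_string(), use_http3: true }, // QUIC/UDP only
    ];
    let tcp = tcp_endpoints(&eps);
    println!("{} TCP endpoint(s): {}", tcp.len(), tcp[0].address);
}
```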

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree: MokaEvictingMap cache (512MiB, 10K entries, 5min TTL) for
assembled tree results. 54 concurrent identical GetTree calls → 1 BFS
+ 53 cache hits. Eliminates ~15,900 redundant directory lookups.

ExistenceCacheStore: batch_get_part_unchunked now uses insert_many()
instead of 552 sequential insert() calls. Reduces ~2200 Moka ops +
552 run_pending_tasks() to ~552 inserts + 1 maintenance pass.

get_many(): documented why sequential is correct (Moka lock-free
reads at 100ns each, parallelism overhead exceeds benefit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GetTree coalescing: DashMap + watch channel prevents thundering herd.
54 concurrent same-root calls → 1 BFS + 53 waiters. TOCTOU race
fixed with Entry API (single lock scope for check+register).
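A std-only single-flight sketch of the coalescing pattern (the real code uses DashMap plus a tokio watch channel; here a Mutex'd HashMap and OnceLock stand in, and followers spin-wait instead of awaiting). The Entry API does check+register under one lock, closing the TOCTOU window where two callers could both become leader:

```rust
use std::collections::HashMap;
use std::collections::hash_map::Entry;
use std::sync::{Arc, Mutex, OnceLock};

struct Coalescer {
    in_flight: Mutex<HashMap<String, Arc<OnceLock<u64>>>>,
}

impl Coalescer {
    fn get_or_compute(&self, root: &str, bfs: impl FnOnce() -> u64) -> u64 {
        let (cell, leader) = {
            let mut map = self.in_flight.lock().unwrap();
            match map.entry(root.to_string()) {
                // Follower: someone is already running this BFS; share its cell.
                Entry::Occupied(e) => (e.get().clone(), false),
                // Leader: register in the SAME lock scope as the check.
                Entry::Vacant(e) => (e.insert(Arc::new(OnceLock::new())).clone(), true),
            }
        }; // lock released before the (possibly slow) BFS runs
        if leader {
            let _ = cell.set(bfs());
            self.in_flight.lock().unwrap().remove(root);
        }
        // Followers wait for the leader's result (toy busy-wait version).
        loop {
            if let Some(v) = cell.get() {
                return *v;
            }
            std::thread::yield_now();
        }
    }
}

fn main() {
    let coalescer = Coalescer { in_flight: Mutex::new(HashMap::new()) };
    let dirs = coalescer.get_or_compute("root-digest", || 131);
    println!("tree has {dirs} directories");
}
```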

Subtree cache (level 2): per-directory MokaEvictingMap (256MiB, 50K,
5min TTL). BFS checks subtree cache before store fetch. Different
roots with 90% overlap → only ~10% dirs fetched from store.

Arc<Vec<Directory>>: CachedTree wraps directories in Arc for cheap
cache clones within moka. Response construction still deep-clones
(required by protobuf ownership).

insert_many: uses insert_batch (defers run_pending_tasks) instead
of insert_inner (called it per-item). Now 1 maintenance pass per
batch instead of N+1.
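A toy model of that batching win (ToyCache is illustrative; the N+1-vs-1 maintenance-pass accounting comes from the commit, with maintenance_runs standing in for run_pending_tasks):

```rust
/// Toy cache counting maintenance passes (the analogue of
/// Moka's run_pending_tasks).
struct ToyCache {
    items: Vec<(String, u64)>,
    maintenance_runs: usize,
}

impl ToyCache {
    /// Old path: maintenance after every single insert.
    fn insert_inner(&mut self, key: String, weight: u64) {
        self.items.push((key, weight));
        self.maintenance_runs += 1;
    }

    /// New path: insert the whole batch, then one maintenance pass.
    fn insert_batch(&mut self, batch: Vec<(String, u64)>) {
        for (key, weight) in batch {
            self.items.push((key, weight)); // no per-item maintenance
        }
        self.maintenance_runs += 1;
    }
}

fn main() {
    let mut cache = ToyCache { items: Vec::new(), maintenance_runs: 0 };
    cache.insert_batch((0..552).map(|i| (format!("digest-{i}"), 1)).collect());
    println!("{} items, {} maintenance pass(es)", cache.items.len(), cache.maintenance_runs);
}
```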

CLAUDE.md: added a code-review requirement before committing.

Level 3 Tree proto lookup deferred (no root→tree digest mapping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 integration tests: tree_cache_hit, tree_cache_miss_different_root,
subtree_cache_overlap, coalescing_concurrent, coalescing_leader_failure,
paginated_bypasses_cache, subtree_cache_deduplication, next_page_token.

Arc optimization: BFS result moved into Arc (zero-copy), cache gets
Arc clone (refcount bump), response gets one deep clone. Eliminates
transient double materialization (~5000 heap allocations saved).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FindMissingBlobs (120K/min), ByteStream read/write completed,
mirror blob streamed, AC read/write, BlobsAvailable registration,
streaming populate, connection creation: all downgraded from
info! to debug!. With release_max_level_info, these debug! calls
are compiled out of release builds, reducing log volume ~60x
under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>