01 Issue
Feature · Shipped · Extensions
Assignees: stack72

#436 Datastore extensions: only write dirty partition files on push

Opened by stack72 · 5/24/2026 · Shipped 5/25/2026

Problem

The partitioned index dual-write in both @swamp/s3-datastore and @swamp/gcs-datastore rewrites ALL partition files on every push, regardless of which models changed. With the partition-per-instance design, a repo with 1000 model instances produces 1000 partition file PUTs on every push — even if only 1 model's data changed.

The parallelization fix (#435 follow-up) makes this viable from a latency perspective (~170ms vs ~52s sequential), but the cost and request volume are still proportional to total model count, not to change count:

  • API cost: 1000 S3/GCS PUT requests per push, at ~$0.005 per 1000 requests, is about half a cent per push, which adds up over frequent pushes
  • Rate limits: New S3 buckets that haven't auto-scaled may hit 503 SlowDown with 1000 concurrent PUTs
  • Unnecessary work: If 1 model changed, writing 999 unchanged partition files is wasted I/O

Proposed Solution

Only write partition files for models whose data actually changed during this push. The dirty sidecar already knows which paths changed — use dirtyPaths to determine which partition keys need rewriting.

Approach

  1. After the scoped push walk determines toPush (files to upload), derive the set of affected partition keys from those file paths using groupEntriesByPartition or partitionKeyFromModel
  2. Only write partition files for keys in that set
  3. Update _meta.json to include ALL known partition keys (not just dirty ones) — the meta file is a directory listing, not a change log
  4. On full walk (bulkInvalidated), write all partitions as today (safe fallback)
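
The four steps above can be sketched roughly as follows. This is an illustration only, not the extensions' actual code: `planPartitionWrites`, `PushPlan`, and `partitionKeyOf` are hypothetical names, and the path layout `<modelType>/<instanceId>/<file>` is an assumption. `partitionKeyOf` stands in for `partitionKeyFromModel`, whose real signature is not shown in this issue.

```typescript
// Hypothetical shapes; the real extension types are not shown in this issue.
type PartitionKey = string;

interface PushPlan {
  partitionsToWrite: PartitionKey[]; // partition files to PUT on this push
  metaKeys: PartitionKey[]; // complete listing to store in _meta.json
}

// Assumed path layout: "<modelType>/<instanceId>/<file>".
// Stand-in for partitionKeyFromModel (real signature unknown).
function partitionKeyOf(path: string): PartitionKey {
  const parts = path.split("/");
  return `${parts[0]}/${parts[1]}`;
}

function planPartitionWrites(
  toPush: string[], // files selected by the scoped push walk
  allKnownKeys: PartitionKey[], // every partition key known before this push
  bulkInvalidated: boolean, // true when a full walk was required
): PushPlan {
  // Step 1: derive the affected partition keys from the pushed file paths.
  const dirtyKeys = Array.from(new Set(toPush.map(partitionKeyOf)));

  // Step 3: _meta.json lists ALL keys (known plus newly dirty), not just dirty ones.
  const metaKeys = Array.from(new Set(allKnownKeys.concat(dirtyKeys))).sort();

  // Step 2 and step 4: write only dirty partitions, unless this is a full walk.
  const partitionsToWrite = bulkInvalidated ? metaKeys : dirtyKeys.sort();

  return { partitionsToWrite, metaKeys };
}
```

Keeping the plan as a pure function, separate from the upload itself, would make the dirty-versus-full-walk decision easy to unit test without touching S3 or GCS.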

Expected impact

  • Repo with 1000 instances, 1 model changed: 1 partition PUT + 1 _meta.json PUT + 1 monolithic PUT = 3 PUTs (down from 1002)
  • Repo with 1000 instances, full walk: unchanged (1002 PUTs — correct for bulk operations)
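
The arithmetic behind these two figures can be written as a small helper. `putsPerPush` is a hypothetical name used only for illustration; it assumes one partition file per instance plus one `_meta.json` PUT and one monolithic index PUT per push, as described above.

```typescript
// PUT requests issued by a single push under the proposed scheme.
function putsPerPush(
  totalInstances: number,
  dirtyInstances: number,
  fullWalk: boolean,
): number {
  // A full walk rewrites every partition; a scoped push rewrites only dirty ones.
  const partitionPuts = fullWalk ? totalInstances : dirtyInstances;
  const metaPut = 1; // _meta.json
  const monolithicPut = 1; // monolithic index
  return partitionPuts + metaPut + monolithicPut;
}

console.log(putsPerPush(1000, 1, false)); // 3
console.log(putsPerPush(1000, 1, true)); // 1002
```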

Applies to

  • @swamp/s3-datastore
  • @swamp/gcs-datastore

Both extensions have the same writePartitionedIndex pattern and should get the same fix.

Backward Compatibility

No impact. Partition files are an optimization — readers fall back to monolithic when a partition is missing. Writing fewer partitions means some may be stale, but the monolithic index (always fully written) remains the source of truth. A reader doing scoped pull against a stale partition gets slightly outdated data, then the next full pull corrects it.

Actually, this needs careful thought: a scoped pull reads the partition file, not the monolithic index. If that partition file is stale (it hasn't been rewritten since the model's data changed), the reader misses the change, even though the monolithic index has the correct data.

Mitigation: On push, if a partition file exists but its entries are a subset of what the monolithic index has for that prefix, rewrite it. Or: always rewrite partitions for dirty models + verify non-dirty partitions are consistent with the monolithic index.
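
One possible shape for the consistency check, as a sketch only since the design is still open. `IndexEntry`, its `path`/`hash` fields, and `isPartitionStale` are hypothetical; the real index entry format is not shown in this issue. The check here is slightly stricter than the subset test described above: it flags any mismatch between the partition and the monolithic entries under the same prefix.

```typescript
// Hypothetical entry shape; the real index format is not given in this issue.
interface IndexEntry {
  path: string;
  hash: string;
}

// Returns true when the partition file disagrees with the monolithic index
// for the given prefix and therefore should be rewritten on push.
function isPartitionStale(
  partition: IndexEntry[],
  monolithic: IndexEntry[],
  prefix: string, // e.g. "model/a/" (trailing slash avoids matching "model/ab/")
): boolean {
  const expected = monolithic.filter((e) => e.path.startsWith(prefix));

  // Missing or extra entries mean the partition is out of date.
  if (expected.length !== partition.length) return true;

  const partitionHashes = new Map<string, string>();
  for (const e of partition) partitionHashes.set(e.path, e.hash);

  // Any path that is absent or has a different hash marks the partition stale.
  return expected.some((e) => partitionHashes.get(e.path) !== e.hash);
}
```

Running this for every non-dirty partition requires reading each partition file, so it trades PUT requests for GET requests; whether that is acceptable is part of the design question raised here.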

This needs design work — filing as a follow-up rather than a quick fix.

02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED (+1 more) · ASSIGNED (+5 more) · REVIEW (+3 more) · PR_MERGED · COMPLETE

Shipped

5/25/2026, 4:22:26 PM

03 Sludge Pulse
stack72 assigned stack72 · 5/25/2026, 2:26:54 PM
