#440 Datastore: lazy hydration with metadata-only setup and transparent content download
Opened by stack72 · 5/25/2026 · Shipped 5/25/2026
Problem
When a new team member clones a repo with an S3/GCS datastore, or CI checks out a fresh copy, the first sync downloads the entire datastore contents — every model's data, outputs, workflow runs, audit logs, telemetry. If the repo has hundreds of models and gigabytes of data, this takes minutes even when the user only needs to run one model.
Key Insight
The data catalog (_catalog.db) backfill reads ONLY metadata.yaml files — never raw content files. Downloading metadata without content gives full catalog visibility (data list, data query, CEL expressions all work) while skipping the expensive bulk content download.
Data file structure per version:
data/{type}/{model-id}/{data-name}/{version}/
  metadata.yaml ← ~200 bytes, contains all tags/owner/lifecycle info
  raw           ← variable size, the actual content (could be megabytes)
data/{type}/{model-id}/{data-name}/latest ← few bytes, version pointer
Proposed Solution
Selective pull on setup
When hydrationStrategy: "lazy" is configured, the setup/initial pull downloads only:
- All metadata.yaml files (catalog needs them)
- All latest marker files (catalog needs them for is_latest flag)
- All partition index files (_index/)
- Skips all raw content files
The catalog rebuilds normally from the downloaded metadata. data list, data query, and CEL expressions work immediately.
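The selection rule above can be sketched as a small path classifier. This is a minimal illustration, not the extension's actual code: classifyForLazyPull and planLazyPull are hypothetical names, and treating any path with an _index segment as a partition index file is an assumption, since the issue does not say where _index/ lives.

```typescript
// Hypothetical helpers; only the file names (metadata.yaml, latest, raw, _index/)
// come from the issue text.
type PullDecision = "download" | "skip";

function classifyForLazyPull(relativePath: string): PullDecision {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];

  // Assumption: partition index files are anything under an _index/ directory.
  if (segments.includes("_index")) return "download";
  // The catalog backfill reads metadata.yaml; latest drives the is_latest flag.
  if (fileName === "metadata.yaml" || fileName === "latest") return "download";
  // Raw content is deferred until a command actually reads it.
  if (fileName === "raw") return "skip";
  // Assumption: unrecognized files are downloaded rather than silently dropped.
  return "download";
}

function planLazyPull(remotePaths: string[]): { download: string[]; skipped: string[] } {
  const download: string[] = [];
  const skipped: string[] = [];
  for (const path of remotePaths) {
    (classifyForLazyPull(path) === "download" ? download : skipped).push(path);
  }
  return { download, skipped };
}

// Example listing with made-up type, model, and data names.
const plan = planLazyPull([
  "data/state/m1/out/3/metadata.yaml",
  "data/state/m1/out/3/raw",
  "data/state/m1/out/latest",
  "_index/part-0",
]);
```

In this example only the raw path ends up in plan.skipped; the other three paths are downloaded.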
Transparent content hydration
When a command needs a raw file that wasn't downloaded:
- Model runs / workflow runs: acquireModelLocks → scoped pull → partition file lists all files → pull sees raw missing locally → downloads automatically. Already works with Phase 1+2 machinery, no new code needed.
- data get (read-only, no sync): a hydrateFile hook on UnifiedDataRepository. When getContent() can't find the raw file, it calls the hook to download just that file from the remote, then retries the read. Same injection pattern as markDirty.
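The read-then-hydrate-then-retry flow for data get might look roughly like the sketch below. It is an illustration under assumptions: SketchDataRepository is a stand-in that uses an in-memory Map instead of the local filesystem, and the single-argument hook shape is simplified from the hydrateFile signature in the issue.

```typescript
// Hook shape assumed from the issue: resolves to true when the file was fetched.
type HydrateFileHook = (relativePath: string) => Promise<boolean>;

// In-memory stand-in for the local datastore cache; not the real repository class.
class SketchDataRepository {
  private readonly localFiles: Map<string, string>;
  private readonly hydrateFile: HydrateFileHook | undefined;

  constructor(localFiles: Map<string, string>, hydrateFile?: HydrateFileHook) {
    this.localFiles = localFiles;
    this.hydrateFile = hydrateFile;
  }

  async getContent(rawPath: string): Promise<string | undefined> {
    const cached = this.localFiles.get(rawPath);
    if (cached !== undefined) return cached;

    // Raw file is missing locally; without a hook there is nothing to retry.
    if (this.hydrateFile === undefined) return undefined;

    const hydrated = await this.hydrateFile(rawPath);
    // Retry the read only when the hook reports a successful download.
    return hydrated ? this.localFiles.get(rawPath) : undefined;
  }
}

// Usage: the hook copies one file from a fake remote into the local cache.
const remote = new Map<string, string>([["data/state/m1/out/3/raw", "payload"]]);
const local = new Map<string, string>();
let downloads = 0;
const repo = new SketchDataRepository(local, async (path) => {
  const body = remote.get(path);
  if (body === undefined) return false;
  local.set(path, body);
  downloads += 1;
  return true;
});
```

On a fully hydrated repo the first lookup always succeeds, so the hook is never invoked.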
Framework changes
- SyncCapabilities gains lazyHydration?: boolean
- DatastoreSyncService gains optional hydrateFile?(relativePath: string, options?: DatastoreSyncOptions): Promise<boolean>
- CustomDatastoreConfig gains hydrationStrategy?: "full" | "lazy" (defaults to "full")
- UnifiedDataRepository gains optional HydrateFileHook constructor parameter, called in getContent() when the raw file is missing
- repo_context.ts wires the hook from sync service to data repository (same pattern as buildMarkDirtyHook)
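As a rough sketch of the wiring step, the function below builds a hook only when the sync service implements hydrateFile. The hydrateFile signature is taken from the list above; buildHydrateFileHook is a hypothetical name modeled on buildMarkDirtyHook, and DatastoreSyncOptions is reduced to a placeholder because its fields are not given in the issue.

```typescript
// Placeholder: the real option fields are not specified in the issue.
type DatastoreSyncOptions = Record<string, unknown>;

interface DatastoreSyncService {
  // Optional, so extensions written before this change still satisfy the interface.
  hydrateFile?(relativePath: string, options?: DatastoreSyncOptions): Promise<boolean>;
}

type HydrateFileHook = (relativePath: string) => Promise<boolean>;

// Hypothetical wiring helper: returns undefined when the extension cannot hydrate.
function buildHydrateFileHook(service: DatastoreSyncService): HydrateFileHook | undefined {
  const hydrate = service.hydrateFile;
  if (hydrate === undefined) return undefined;
  // Preserve `this` in case the extension implements hydrateFile as a method.
  return (relativePath) => hydrate.call(service, relativePath);
}

// Usage: a legacy service yields no hook; a newer one yields a callable hook.
const legacyService: DatastoreSyncService = {};
const hydratedPaths: string[] = [];
const lazyService: DatastoreSyncService = {
  async hydrateFile(relativePath) {
    hydratedPaths.push(relativePath);
    return true;
  },
};
const legacyHook = buildHydrateFileHook(legacyService);
const lazyHook = buildHydrateFileHook(lazyService);
```

Returning undefined rather than a no-op hook lets the repository tell "cannot hydrate" apart from "hydration failed".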
Extension changes (S3 + GCS)
- pullChanged with lazy strategy: filter files by suffix (download metadata.yaml and latest, skip raw), create directory structure for skipped files
- hydrateFile implementation: download a single file from remote by key
- capabilities() returns { scopedSync: true, lazyHydration: true }
Backward Compatibility
- hydrationStrategy defaults to "full": zero change for existing users
- "lazy" is opt-in via .swamp.yaml config
- hydrateFile is optional on DatastoreSyncService: old extensions unaffected
- Read-only commands on full-hydration repos: hook never fires (file always exists)
Benchmarking Requirements
The following benchmarks must be run before and after the change to validate performance and correctness.
Setup benchmarks
- Full setup (baseline): total wall time, file count downloaded
- Lazy setup: total wall time, file count downloaded (should be dramatically faster — metadata only)
Data command benchmarks (run AFTER setup, before any model runs)
- data list after lazy setup: wall time, result count (must match full setup count)
- data list after full setup (baseline): wall time, result count
- data query --where "tags.type == 'state'" after lazy setup: wall time, result count (must match full setup)
- data get @type/model dataName after lazy setup (first access): wall time (includes transparent hydration)
- data get @type/model dataName after lazy setup (second access): wall time (cached, should match full setup)
- data get after full setup (baseline): wall time
Model run benchmarks (CI pipeline scenario)
- Full setup → model run: setup + first run total time (baseline)
- Lazy setup → model run (same model): setup + first run total time (should be faster overall)
- Lazy setup → second different model run: run time (slight overhead on first run of each new model)
Cross-repo benchmarks (team collaboration)
- Writer pushes 50 models, reader does full setup + data list: total time (baseline)
- Writer pushes 50 models, reader does lazy setup + data list: total time (faster setup, same results)
- Writer pushes 50 models, reader does lazy setup + runs 1 model: total time (much faster — only 1 model hydrated)
Scale parameters
- 50 models, ~20 data artifacts each (1000 total)
- Each raw file: ~5KB
- Each metadata.yaml: ~200 bytes
- Total raw content: ~5MB, total metadata: ~200KB
- Lazy setup downloads ~200KB instead of ~5.2MB
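The totals above follow directly from the scale parameters. The short calculation below reproduces them, assuming one version per artifact, decimal units (1KB = 1000 bytes), and ignoring the few bytes of latest markers and partition index files.

```typescript
// Scale parameters from the list above.
const models = 50;
const artifactsPerModel = 20;
const rawBytesPerFile = 5_000;      // ~5KB per raw file
const metadataBytesPerFile = 200;   // ~200 bytes per metadata.yaml

const artifacts = models * artifactsPerModel;             // 1000 artifacts
const totalRawBytes = artifacts * rawBytesPerFile;        // 5,000,000 bytes (~5MB)
const totalMetadataBytes = artifacts * metadataBytesPerFile; // 200,000 bytes (~200KB)

// Full setup pulls raw plus metadata; lazy setup pulls metadata only.
const fullSetupBytes = totalRawBytes + totalMetadataBytes;   // 5,200,000 bytes (~5.2MB)
const lazySetupBytes = totalMetadataBytes;
const reductionFactor = fullSetupBytes / lazySetupBytes;     // 26
```

At this scale the lazy setup transfers roughly 1/26 of the bytes; the ratio grows with raw file size because metadata size stays fixed.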
Dependencies
- Phase 1 (#378) — framework contracts (shipped)
- Phase 2 (#379, #434) — S3/GCS extension overhaul with partitioned index and scoped sync (shipped)
- #436 — dirty-only partition writes (shipped)
Supersedes #380.
stack72 commented 5/25/2026, 6:29:06 PM
Design update: dropping the hydrationStrategy config option entirely. Lazy hydration should be the default behavior, not opt-in.
If the implementation works correctly — catalog populated from metadata, model runs hydrate via scoped pull, data get hydrates transparently — there's no reason to offer "full" as an alternative. The upfront full download was never a deliberate design choice, just the only option before on-demand hydration existed.
Changes from the original issue:
- No hydrationStrategy config field
- No lazyHydration capability flag (it's just how the extension works)
- Setup always downloads metadata.yaml + latest only, skips raw
- Content hydrates transparently on first access
- Users who want everything upfront can run swamp datastore sync --pull after setup
This simplifies the implementation: no config parsing, no capability negotiation, no branching on strategy. The extension just does the right thing.
stack72 commented 5/25/2026, 10:48:58 PM
Bug: push after lazy pull tombstones un-hydrated files
Found by @keeb while implementing lazy hydration for the MongoDB extension. Applies to S3/GCS too.
The problem: After a lazy pull (metadataOnly=true), raw files don't exist locally. On the next push, the full walk compares local files against the remote index. Files in the remote index that aren't on local disk are interpreted as "deleted locally" and removed from the index (S3/GCS) or tombstoned (MongoDB). This is silent data loss — other repos pulling afterward won't see those files.
Root cause: The push path cannot distinguish "file was never downloaded (lazy hydration skipped it)" from "file was deleted by the user." Both look the same: file exists on remote, missing locally.
The fix: The sidecar must record that a lazy pull happened and which files were skipped. The push path must treat skipped-but-not-hydrated files as "leave remote alone" — not "delete from remote." Options:
1. Sidecar records skipped paths: on lazy pull, record the set of paths that were intentionally skipped. Push checks this set before tombstoning. Downside: set could be large.
2. Sidecar records lazy pull flag: simpler, a boolean lazyPullActive: true in the sidecar. When true, push only uploads dirty paths and NEVER removes anything from the remote index that isn't in the dirty set. Cleared when a full pull runs.
3. Push merges with remote index, never removes: after a lazy pull, the push only ADDS entries to the remote index, never removes. Deletions are only processed after a full pull confirms the file was intentionally removed.
Option 2 is simplest and safest. The lazyPullActive flag is cleared after the first full (non-lazy) pull, at which point the local cache is complete and the push can safely tombstone.
This blocks shipping lazy hydration for all extensions (S3, GCS, MongoDB).
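The guard described in option 2 can be sketched as a push planner. Everything here is illustrative: the Sidecar shape, planPush, and the path sets are hypothetical, and uploading only dirty paths in both modes is an assumption based on the dirty-only writes mentioned in #436.

```typescript
// Hypothetical sidecar shape; only the lazyPullActive flag is named in the comment above.
interface Sidecar {
  lazyPullActive: boolean;
  dirtyPaths: Set<string>;
}

interface PushPlan {
  upload: string[];
  removeFromRemoteIndex: string[];
}

function planPush(
  sidecar: Sidecar,
  localPaths: Set<string>,
  remoteIndexPaths: Set<string>,
): PushPlan {
  // Assumption: only dirty paths that still exist locally are uploaded.
  const upload = Array.from(sidecar.dirtyPaths).filter((p) => localPaths.has(p)).sort();

  // Paths the remote index knows about but the local disk does not have.
  const missingLocally = Array.from(remoteIndexPaths).filter((p) => !localPaths.has(p));

  // After a lazy pull, "missing locally" usually means "never downloaded", so only
  // paths in the dirty set may be removed. After a full pull the cache is complete
  // and every missing path is a genuine local deletion.
  const removeFromRemoteIndex = sidecar.lazyPullActive
    ? missingLocally.filter((p) => sidecar.dirtyPaths.has(p))
    : missingLocally;

  return { upload, removeFromRemoteIndex: removeFromRemoteIndex.sort() };
}

// Usage: models a and b were lazily pulled (no raw on disk); model c was run locally.
const remoteIndex = new Set(["a/metadata.yaml", "a/raw", "b/metadata.yaml", "b/raw"]);
const onDisk = new Set(["a/metadata.yaml", "b/metadata.yaml", "c/metadata.yaml", "c/raw"]);
const dirty = new Set(["c/metadata.yaml", "c/raw"]);

const lazyPlan = planPush({ lazyPullActive: true, dirtyPaths: dirty }, onDisk, remoteIndex);
const unguardedPlan = planPush({ lazyPullActive: false, dirtyPaths: dirty }, onDisk, remoteIndex);
```

With the flag set, lazyPlan removes nothing, while unguardedPlan would drop a/raw and b/raw from the remote index, which is the data loss described in this comment.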