paths.base: manifest is not honored for workflows: — bundled workflows only resolve from repo root, blocking self-contained subdir layouts (sibling to #459)

When a new team member clones a repo with an S3/GCS datastore, or CI checks out a fresh copy, the first sync downloads the entire datastore contents — every model's data, outputs, workflow runs, audit logs, telemetry. If the repo has hundreds of models and gigabytes of data, this takes minutes even when the user only needs to run one model.

Key Insight

The data catalog (_catalog.db) backfill reads ONLY metadata.yaml files — never raw content files. Downloading metadata without content gives full catalog visibility (data list, data query, CEL expressions all work) while skipping the expensive bulk content download.

Data file structure per version:

data/{type}/{model-id}/{data-name}/{version}/
  metadata.yaml   ← ~200 bytes, contains all tags/owner/lifecycle info
  raw             ← variable size, the actual content (could be megabytes)
data/{type}/{model-id}/{data-name}/latest  ← few bytes, version pointer

Proposed Solution

Selective pull on setup

When hydrationStrategy: "lazy" is configured, the setup/initial pull downloads only:

All metadata.yaml files (catalog needs them)
All latest marker files (catalog needs them for is_latest flag)
All partition index files (_index/)
No raw content files (all are skipped)

The catalog rebuilds normally from the downloaded metadata. data list, data query, and CEL expressions work immediately.
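The selection rule above can be sketched as a path filter. This is a minimal illustration, not the extension's code: `lazyPullDecision`, `planLazyPull`, and the example index file name are hypothetical, and the sketch assumes partition index files are identifiable by an `_index` path segment.

```typescript
// Sketch only: these names are hypothetical, not part of the swamp framework.
type PullDecision = "download" | "skip";

function lazyPullDecision(relativePath: string): PullDecision {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];
  // Partition index files: assumed to live under an `_index` path segment.
  if (segments.indexOf("_index") !== -1) return "download";
  // Catalog inputs: per-version metadata and the version pointer.
  if (fileName === "metadata.yaml" || fileName === "latest") return "download";
  // Everything else (notably `raw`) stays on the remote until first access.
  return "skip";
}

interface LazyPullPlan {
  download: string[];
  skipped: string[];
}

function planLazyPull(remotePaths: string[]): LazyPullPlan {
  const plan: LazyPullPlan = { download: [], skipped: [] };
  for (const path of remotePaths) {
    if (lazyPullDecision(path) === "download") {
      plan.download.push(path);
    } else {
      plan.skipped.push(path);
    }
  }
  return plan;
}

// Example input following the data/{type}/{model-id}/{data-name}/{version}/ layout.
const examplePlan = planLazyPull([
  "data/state/model-a/config/1/metadata.yaml",
  "data/state/model-a/config/1/raw",
  "data/state/model-a/config/latest",
  "_index/partition-0.json", // hypothetical index file name
]);
console.log(examplePlan.download.length, examplePlan.skipped);
```

An allowlist is used here because the proposal says setup downloads only these three kinds of file. A denylist on `raw` alone would also match the "skips all raw content files" wording, so which one the extension should use is an open choice.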

Transparent content hydration

When a command needs a raw file that wasn't downloaded:

Model runs / workflow runs: acquireModelLocks → scoped pull → partition file lists all files → pull sees raw missing locally → downloads automatically. This already works with the Phase 1+2 machinery; no new code is needed.
data get (read-only, no sync): a hydrateFile hook on UnifiedDataRepository. When getContent() can't find the raw file, it calls the hook to download just that file from the remote, then retries the read. This uses the same injection pattern as markDirty.
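The read-miss, hydrate, retry flow for data get might look roughly like the sketch below. It stands in for the real classes with in-memory maps: `SketchDataRepository` and `makeRemoteHook` are hypothetical names, and the actual `UnifiedDataRepository` reads from disk rather than from a `Map`.

```typescript
// Sketch only: in-memory stand-ins for the local cache and the remote bucket.
type HydrateFileHook = (relativePath: string) => Promise<boolean>;

class SketchDataRepository {
  private readonly localFiles: Map<string, string>;
  private readonly hydrateFile: HydrateFileHook | undefined;

  constructor(localFiles: Map<string, string>, hydrateFile?: HydrateFileHook) {
    this.localFiles = localFiles;
    this.hydrateFile = hydrateFile;
  }

  async getContent(relativePath: string): Promise<string | undefined> {
    // Fast path: the file is already in the local cache.
    const local = this.localFiles.get(relativePath);
    if (local !== undefined) return local;
    // No hook injected (for example, an older extension): report a plain miss.
    if (this.hydrateFile === undefined) return undefined;
    // Ask the sync service for just this file, then retry the read once.
    const downloaded = await this.hydrateFile(relativePath);
    return downloaded ? this.localFiles.get(relativePath) : undefined;
  }
}

// Hypothetical hook: copies one file from the "remote" map into the local cache.
function makeRemoteHook(
  remote: Map<string, string>,
  local: Map<string, string>,
): HydrateFileHook {
  return async (relativePath) => {
    const content = remote.get(relativePath);
    if (content === undefined) return false;
    local.set(relativePath, content);
    return true;
  };
}
```

Because the hook writes into the same local cache the repository reads from, a second `getContent()` call for the same path takes the fast path and never reaches the remote.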

Framework changes

SyncCapabilities gains lazyHydration?: boolean
DatastoreSyncService gains optional hydrateFile?(relativePath: string, options?: DatastoreSyncOptions): Promise<boolean>
CustomDatastoreConfig gains hydrationStrategy?: "full" | "lazy" (defaults to "full")
UnifiedDataRepository gains optional HydrateFileHook constructor parameter — called in getContent() when raw file is missing
repo_context.ts wires the hook from sync service to data repository (same pattern as buildMarkDirtyHook)
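As a rough illustration of the wiring step, the sketch below builds a hook only when the sync service actually implements `hydrateFile`. The `hydrateFile` signature is taken from the list above. `buildHydrateFileHook` is a hypothetical name modeled on `buildMarkDirtyHook`, and `DatastoreSyncOptions` is reduced to a placeholder because its real fields are not given here.

```typescript
// Placeholder: the real DatastoreSyncOptions shape is not specified in this issue.
type DatastoreSyncOptions = Record<string, unknown>;

// Only the member relevant to hydration is modeled.
interface DatastoreSyncService {
  hydrateFile?(
    relativePath: string,
    options?: DatastoreSyncOptions,
  ): Promise<boolean>;
}

type HydrateFileHook = (relativePath: string) => Promise<boolean>;

// Hypothetical wiring helper: returns undefined when the extension predates
// hydrateFile, so the data repository behaves exactly as before.
function buildHydrateFileHook(
  syncService: DatastoreSyncService | undefined,
): HydrateFileHook | undefined {
  if (syncService === undefined || syncService.hydrateFile === undefined) {
    return undefined;
  }
  const hydrate = syncService.hydrateFile.bind(syncService);
  return (relativePath) => hydrate(relativePath);
}
```

Returning `undefined` for services without `hydrateFile` matches the backward-compatibility goal that old extensions are unaffected: the repository then gets no hook at all.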

Extension changes (S3 + GCS)

pullChanged with lazy strategy: filter files by suffix — download metadata.yaml and latest, skip raw, create directory structure for skipped files
hydrateFile implementation: download a single file from remote by key
capabilities() returns { scopedSync: true, lazyHydration: true }

Backward Compatibility

hydrationStrategy defaults to "full" — zero change for existing users
"lazy" is opt-in via .swamp.yaml config
hydrateFile is optional on DatastoreSyncService — old extensions unaffected
Read-only commands on full-hydration repos: hook never fires (file always exists)

Benchmarking Requirements

The following benchmarks must be run before and after the change to validate performance and correctness.

Setup benchmarks

Full setup (baseline): total wall time, file count downloaded
Lazy setup: total wall time, file count downloaded (should be dramatically faster — metadata only)

Data command benchmarks (run AFTER setup, before any model runs)

data list after lazy setup: wall time, result count (must match full setup count)
data list after full setup (baseline): wall time, result count
data query --where "tags.type == 'state'" after lazy setup: wall time, result count (must match full setup)
data get @type/model dataName after lazy setup (first access): wall time (includes transparent hydration)
data get @type/model dataName after lazy setup (second access): wall time (cached, should match full setup)
data get after full setup (baseline): wall time

Model run benchmarks (CI pipeline scenario)

Full setup → model run: setup + first run total time (baseline)
Lazy setup → model run (same model): setup + first run total time (should be faster overall)
Lazy setup → second different model run: run time (slight overhead on first run of each new model)

Cross-repo benchmarks (team collaboration)

Writer pushes 50 models, reader does full setup + data list: total time (baseline)
Writer pushes 50 models, reader does lazy setup + data list: total time (faster setup, same results)
Writer pushes 50 models, reader does lazy setup + runs 1 model: total time (much faster — only 1 model hydrated)

Scale parameters

50 models, ~20 data artifacts each (1000 total)
Each raw file: ~5KB
Each metadata.yaml: ~200 bytes
Total raw content: ~5MB, total metadata: ~200KB
Lazy setup downloads ~200KB instead of ~5.2MB
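The totals above follow from simple arithmetic, reproduced in the sketch below. `estimateSetupBytes` is a hypothetical helper. It assumes decimal units (1 KB = 1000 bytes) and ignores the latest markers and _index files, which add only a small amount to both figures.

```typescript
interface ScaleParameters {
  models: number;
  artifactsPerModel: number;
  rawBytesPerArtifact: number;
  metadataBytesPerArtifact: number;
}

interface SetupEstimate {
  artifacts: number;
  fullBytes: number; // metadata + raw, downloaded by a full setup
  lazyBytes: number; // metadata only, downloaded by a lazy setup
  ratio: number; // fullBytes / lazyBytes
}

function estimateSetupBytes(p: ScaleParameters): SetupEstimate {
  const artifacts = p.models * p.artifactsPerModel;
  const rawBytes = artifacts * p.rawBytesPerArtifact;
  const metadataBytes = artifacts * p.metadataBytesPerArtifact;
  const fullBytes = rawBytes + metadataBytes;
  return {
    artifacts,
    fullBytes,
    lazyBytes: metadataBytes,
    ratio: fullBytes / metadataBytes,
  };
}

// The benchmark scale from this issue: 50 models x 20 artifacts.
const estimate = estimateSetupBytes({
  models: 50,
  artifactsPerModel: 20,
  rawBytesPerArtifact: 5000,
  metadataBytesPerArtifact: 200,
});
// artifacts = 1000, fullBytes = 5,200,000 (~5.2MB), lazyBytes = 200,000 (~200KB), ratio = 26
console.log(estimate);
```

At this scale the lazy setup transfers about 1/26 of the bytes. With larger raw files the ratio grows, because metadata size stays roughly constant per artifact.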

Dependencies

Phase 1 (#378) — framework contracts (shipped)
Phase 2 (#379, #434) — S3/GCS extension overhaul with partitioned index and scoped sync (shipped)
#436 — dirty-only partition writes (shipped)

Supersedes #380.

02 Bog Flow

Shipped

5/25/2026, 11:59:55 PM

03 Sludge Pulse

stack72 assigned stack72 5/25/2026, 6:24:44 PM

stack72 commented 5/25/2026, 6:29:06 PM

Design update: dropping the hydrationStrategy config option entirely. Lazy hydration should be the default behavior, not opt-in.

If the implementation works correctly — catalog populated from metadata, model runs hydrate via scoped pull, data get hydrates transparently — there's no reason to offer "full" as an alternative. The upfront full download was never a deliberate design choice, just the only option before on-demand hydration existed.

Changes from the original issue:

No hydrationStrategy config field
No lazyHydration capability flag (it's just how the extension works)
Setup always downloads metadata.yaml + latest only, skips raw
Content hydrates transparently on first access
Users who want everything upfront can run swamp datastore sync --pull after setup

This simplifies the implementation: no config parsing, no capability negotiation, no branching on strategy. The extension just does the right thing.

stack72 commented 5/25/2026, 10:48:58 PM

Bug: push after lazy pull tombstones un-hydrated files

Found by @keeb while implementing lazy hydration for the MongoDB extension. Applies to S3/GCS too.

The problem: After a lazy pull (metadataOnly=true), raw files don't exist locally. On the next push, the full walk compares local files against the remote index. Files in the remote index that aren't on local disk are interpreted as "deleted locally" and removed from the index (S3/GCS) or tombstoned (MongoDB). This is silent data loss — other repos pulling afterward won't see those files.

Root cause: The push path cannot distinguish "file was never downloaded (lazy hydration skipped it)" from "file was deleted by the user." Both look the same: file exists on remote, missing locally.

The fix: The sidecar must record that a lazy pull happened and which files were skipped. The push path must treat skipped-but-not-hydrated files as "leave remote alone" — not "delete from remote." Options:

1. Sidecar records skipped paths: on lazy pull, record the set of paths that were intentionally skipped. Push checks this set before tombstoning. Downside: the set could be large.
2. Sidecar records lazy pull flag: simpler, a boolean lazyPullActive: true in the sidecar. When true, push only uploads dirty paths and NEVER removes anything from the remote index that isn't in the dirty set. Cleared when a full pull runs.
3. Push merges with remote index, never removes: after a lazy pull, the push only ADDS entries to the remote index, never removes. Deletions are only processed after a full pull confirms the file was intentionally removed.

Option 2 is simplest and safest. The lazyPullActive flag is cleared after the first full (non-lazy) pull, at which point the local cache is complete and the push can safely tombstone.
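A minimal sketch of the flag-guarded push, assuming the sidecar exposes the flag and the dirty set as plain fields. `SidecarState`, `PushPlan`, and `planPush` are hypothetical names, and only the removal decision is modeled, not the upload or index-write mechanics.

```typescript
interface SidecarState {
  lazyPullActive: boolean; // set by a lazy pull, cleared by a full pull
  dirtyPaths: string[]; // paths written or deleted locally since the last push
}

interface PushPlan {
  upload: string[];
  removeFromIndex: string[];
}

function planPush(
  sidecar: SidecarState,
  localPaths: ReadonlySet<string>,
  remoteIndexPaths: string[],
): PushPlan {
  const dirty = new Set(sidecar.dirtyPaths);
  // Dirty paths that still exist locally are uploaded.
  const upload = sidecar.dirtyPaths.filter((p) => localPaths.has(p));
  // Paths in the remote index with no local file: deleted OR never hydrated.
  const missingLocally = remoteIndexPaths.filter((p) => !localPaths.has(p));
  // While the flag is set, only paths the user actually touched may be removed.
  const removeFromIndex = sidecar.lazyPullActive
    ? missingLocally.filter((p) => dirty.has(p))
    : missingLocally;
  return { upload, removeFromIndex };
}
```

With the flag set, an un-hydrated raw file is absent locally but not in the dirty set, so it stays in the remote index. With the flag cleared, the plan falls back to the full-walk comparison, which is safe only once the local cache is complete.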

This blocks shipping lazy hydration for all extensions (S3, GCS, MongoDB).
