The MURE LOG #4: Git breaks under agents, context becomes infra
The data systems log, curated by humans
Git collapses under agent workloads; LanceDB's multi-base and MotherDuck snapshots solve versioning elegantly. Context File Systems crystallize procedures into reusable operations (roughly 90% token reduction on replay). Semantic layers return as trust infrastructure for AI. Google shows task alignment matters more than agent count. Data teams shift from scaling agents to building runbook libraries agents execute autonomously. Governance becomes the new infrastructure.
Data Reads
01 / Towards a science of scaling agent systems: When and why agent systems work - Yubin Kim et al.
Multi-agent crushes parallelizable tasks (+81% finance) but tanks sequential ones (-70% planning). Centralized orchestrators cut error amplification 17.2x→4.4x. Predictive model nails optimal architecture for 87% of unseen tasks. More agents ≠ better; alignment with task properties is everything.
02 / Your agents need runbooks, not bigger context windows - Ben Lorica
Context File System separates thinking from doing: first execution pays exploration cost, subsequent runs replay procedures (90% token reduction). Agents stay trapped in perpetual improvisation unless they crystallize solutions into versioned, reusable procedures. Self-healing skill stores beat context bloat.
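A minimal sketch of the explore-once, replay-later idea, not the article's implementation. RunbookStore, explore_stub, the task name, and the token costs (1000 vs 100, picked only to mirror the 90% figure) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RunbookStore:
    """Hypothetical in-memory store of crystallized procedures, keyed by task."""
    procedures: Dict[str, List[str]] = field(default_factory=dict)
    tokens_spent: int = 0

    def run(self, task: str, explore: Callable[[str], List[str]],
            explore_cost: int = 1000, replay_cost: int = 100) -> List[str]:
        if task not in self.procedures:
            # First execution: pay the exploration cost and save the steps.
            self.procedures[task] = explore(task)
            self.tokens_spent += explore_cost
        else:
            # Later executions: replay the saved steps at a fraction of the cost.
            self.tokens_spent += replay_cost
        return self.procedures[task]


def explore_stub(task: str) -> List[str]:
    # Stand-in for an agent improvising a solution; not a real agent call.
    return [f"inspect inputs for {task}", f"execute {task}", f"verify {task}"]


store = RunbookStore()
first = store.run("refresh_sales_mart", explore_stub)
cost_after_first = store.tokens_spent
second = store.run("refresh_sales_mart", explore_stub)
cost_of_replay = store.tokens_spent - cost_after_first
```

The second call never invokes explore_stub; it returns the stored steps, which is where the savings would come from in a real system.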
03 / Git for Data Applied: Comparing Git-like Tools That Separate Metadata from Data - Simon Späti
LakeFS, Dolt, Nessie, Bauplan, MotherDuck, DuckLake solve git-for-data three ways: metadata pointers, copy-on-write, differential storage. Zero-copy branching avoids petabyte duplication. Dagster branch deployments extend the pattern to orchestration, cloning entire environments including data.
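To make the metadata-pointer idea concrete, here is a toy sketch under simple assumptions: a branch is just a map from table name to an immutable list of file paths. The Catalog class and the s3:// paths are made up for illustration and do not reflect any of the tools' actual APIs.

```python
from typing import Dict, Tuple

# A snapshot is an immutable tuple of data-file paths (paths are made up).
Snapshot = Tuple[str, ...]


class Catalog:
    """Hypothetical catalog: each branch maps table names to a snapshot."""

    def __init__(self) -> None:
        self.branches: Dict[str, Dict[str, Snapshot]] = {"main": {}}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Zero-copy: only the small table-to-snapshot map is copied.
        self.branches[name] = dict(self.branches[source])

    def append(self, branch: str, table: str, new_file: str) -> None:
        # Copy-on-write: a new snapshot reuses existing files and adds one.
        old = self.branches[branch].get(table, ())
        self.branches[branch][table] = old + (new_file,)


catalog = Catalog()
catalog.append("main", "orders", "s3://bucket/orders/part-0.parquet")
catalog.create_branch("dev")
catalog.append("dev", "orders", "s3://bucket/orders/part-1.parquet")

main_files = catalog.branches["main"]["orders"]
dev_files = catalog.branches["dev"]["orders"]
```

Writing to dev leaves main untouched, and both branches still point at the same part-0 file, so no data is duplicated.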
04 / Duck, Dive, and Answer - Jordan Tigani
Dives: agent-generated interactive SQL visualizations that stay live across sessions. 95% text-to-SQL accuracy with proper context vs 80% benchmark. Business users catch hallucinations better than analysts. Semantic explanations + charts together answer follow-up questions. Non-technical teams (sales) now query data without SQL.
05 / First Context Engineering study - are semantic layers worth it? - Claire Gouze
Schema+sample+rules.md hits 45% reliability on 40 unit tests. Semantic layers flopped: barely any answers, 4x tool overhead, 3x slower. Rules.md alone beats comprehensive context, suggesting markdown discipline outscores exhaustive docs for smaller datasets. Profiling beats sampling. The dbt repo hurt performance.
06 / Apache Gluten Engine Benefit to an Ingestion Workload - Binwei Yang
Gluten compiles Spark SQL to native code (FPGA/x86), pushes down to vectorized engines. Columnar optimization + predicate pushdown cuts ingestion latency, eliminates serialization overhead. ETL pipelines see 2-5x speedup on commodity hardware without rewriting code.
07 / dbt with Fabric Spark in Production - Raki Rahman
Spark Streams handle schema-on-read JSON with idempotency checkpoints; dbt manages silver-to-gold; Declarative Pipelines box in custom logic. Local devcontainer+Livy mirrors Fabric exactly; CI mounts OneLake for identical behavior everywhere. Humans architect and approve AI specs before execution.
08 / Building an Obsidian RAG with DuckDB and MotherDuck - Simon Späti
8,963 markdown notes → BGE-M3 embeddings → DuckDB VSS index. Web app: Next.js+MotherDuck WASM, FastAPI embedding service. Graph-boosted search surfaces backlink-connected notes, uncovering hidden connections and forgotten ideas across months. Built UI in 3-4 hours; human architect prevented vibe-code chaos.
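One way graph-boosted search could work, sketched in plain Python. This is an assumption, not the post's actual scoring: each note's cosine similarity gets a bonus proportional to the best-scoring note it shares a backlink with. The note names, the two-dimensional vectors, and the boost weight are invented.

```python
import math
from typing import Dict, List, Set


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def graph_boosted_search(query: List[float],
                         embeddings: Dict[str, List[float]],
                         backlinks: Dict[str, Set[str]],
                         boost: float = 0.7) -> List[str]:
    # Plain vector similarity for every note.
    base = {note: cosine(query, vec) for note, vec in embeddings.items()}
    boosted = {}
    for note, score in base.items():
        neighbors = backlinks.get(note, set())
        # Assumed boost rule: add a share of the best linked note's score.
        best = max((base[n] for n in neighbors if n in base), default=0.0)
        boosted[note] = score + boost * best
    return sorted(boosted, key=boosted.get, reverse=True)


embeddings = {
    "note_a": [1.0, 0.0],   # close to the query
    "note_b": [0.0, 1.0],   # unrelated by content, but linked to note_a
    "note_c": [0.6, 0.8],   # moderately similar, no links
}
backlinks = {"note_a": {"note_b"}, "note_b": {"note_a"}}
query = [1.0, 0.0]

plain = graph_boosted_search(query, embeddings, backlinks, boost=0.0)
ranked = graph_boosted_search(query, embeddings, backlinks)
```

With the boost on, note_b moves ahead of note_c purely because it links to the top hit, which is the "forgotten but connected" effect the post describes.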
09 / Branching and Shallow Cloning in Lance: Towards a "Git for AI Data" - Jack Ye
Lance unifies Iceberg branching (traceability) + Delta shallow clone (isolation) via multi-base. Branches track by root, not head, which eliminates commit coupling and enables per-branch cost attribution and full per-branch time travel. Tags are immutable global refs. Path forward: lance-git maps fetch/push/pull to dataset versioning.
10 / Mitchell Hashimoto’s new way of writing code - Gergely Orosz
Agents run in the background while you code, research, or review. Git needs a Gmail-like redesign for the agentic era: merge queues become untenable, branches explode. OSS shifts to “default deny” as AI floods repos with plausible garbage. Best engineers: invisible backgrounds, no GitHub, no social media; context-switching is the hidden tax.
What’s New
Apache Iceberg - file format API
Apache Parquet - native geospatial types support
Apache Polaris - graduated to top level project
Apple - acquired Kuzu, graph DB
ClickHouse - February 2026 newsletter, including $400M Series D
Databricks - custom agents available to deploy as Databricks Apps
Microsoft - Fabric February 2026 feature summary is out
OneHouse - announced LakeBase
Tools & demos
Alibaba - zvec, in-process vector database
AWS Labs - benchmarks, TPC-DS, Graviton and Spark acceleration
Databricks - skills, experimental
Google - TimesFM, Time Series Foundation Model for time-series forecasting
Hugging Face - skills
Tobias Müller - polyglot, Rust/Wasm-powered SQL transpiler
Xiangpeng Hao - parquet-linter, towards a better Parquet
Yazhou Li - rimio, write-back cache that accelerates object-native systems
Upcoming Events
Big Data & AI World London · March 4 / London
FabCon - Microsoft Fabric Community Conference · March 16 / Atlanta
QCon London · March 16 / London
underCurrent · March 26 / SF
Iceberg Summit · April 8 / SF
Data Engineering Open Forum at Netflix · April 16 / SF
Google Cloud Next · April 22 / Las Vegas
SQLBits · April 22 / Newport, Wales
ODSC · April 28 / Boston
OpenXData · April 29 / Virtual
AI Council · May 12 / SF
PyCon US · May 13 / Long Beach, CA
Snowflake Summit · June 1 / SF
Databricks Data + AI Summit · June 15 / SF
That’s all for now — we’ll be back in your inbox in two weeks.

