The MURE LOG #2: Multimodal at scale, agent-native infra, query engines
A data systems log, curated by humans
This edition tracks the convergence of multimodal data, agentic systems, and query optimization at scale. PostgreSQL handles millions of QPS without sharding, while Delta Lake and LanceDB redefine storage for PDFs, images, and video without re-encoding. The 2026 inflection comes into view: a shift from data volume to freshness, the taming of unstructured data, streaming-first architectures, and agents. Also inside: deep dives on extensible query engines, orchestration trade-offs, and disciplined agent patterns.
Data Reads
01 / xBound: Join Size Lower Bounds - Mihail Stoian et al.
Database optimizers underestimate join cardinality more often than they overestimate it, a critical problem that xBound addresses with provable lower bounds. Tested on DuckDB 1.4, PostgreSQL 18, and Fabric DW, it corrected 17.5% of DuckDB underestimates and up to 36.1% in Fabric.
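To make the idea of a provable lower bound concrete, here is a minimal sketch. It is not xBound's algorithm, which the blurb above does not describe. It assumes a hypothetical setup where each side of an equi-join exposes exact counts for its most common values (MCVs). The helper names `mcv_list`, `join_lower_bound`, and `clamp_estimate` are illustrative only. Values found in both MCV lists contribute a known number of rows, and every other value contributes zero or more, so the sum can never exceed the true join size.

```python
from collections import Counter


def mcv_list(values, k):
    """Exact counts for the k most common values (stand-in for optimizer statistics)."""
    return dict(Counter(values).most_common(k))


def join_lower_bound(mcv_left, mcv_right):
    """Sum count products over values present in both MCV lists.

    Values outside the lists can only add rows, so this is a safe lower bound.
    """
    return sum(c * mcv_right[v] for v, c in mcv_left.items() if v in mcv_right)


def clamp_estimate(estimate, lower_bound):
    """Raise an optimizer estimate that falls below the provable lower bound."""
    return max(estimate, lower_bound)


def true_join_size(left, right):
    """Exact equi-join size, used here only to check the bound."""
    cl, cr = Counter(left), Counter(right)
    return sum(c * cr[v] for v, c in cl.items())


# Hypothetical join-key columns for two tables.
left = [1] * 50 + [2] * 30 + [3] * 5 + [4] * 5
right = [1] * 40 + [2] * 2 + [5] * 20 + [3] * 1

bound = join_lower_bound(mcv_list(left, 2), mcv_list(right, 2))
actual = true_join_size(left, right)
corrected = clamp_estimate(500, bound)  # 500 is a made-up underestimate

print(bound, actual, corrected)
```

With these made-up columns the bound is 2000 against a true size of 2065, so an underestimate of 500 is lifted to 2000 without ever overshooting the real cardinality.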
02 / Multimodal with Delta Lake - R. Tyler Croy
Storing PDFs, images, and video in Delta Lake without re-encoding requires Parquet Anchors (metadata in binary protocol) and Virtual Delta Tables (on-demand artifact encoding). Keeps petabyte-scale data untouched while exposing structured schemas for seamless query integration.
03 / Slides-to-Speech: Turn your presentations into narrated content with CocoIndex and LanceDB - Linghua Jin and Prashanth Rao
CocoIndex + LanceDB automates slide conversion: extracts speaker notes via Gemini Vision, synthesizes narration with Piper TTS, stores in LanceDB for semantic search with auto-refresh on file changes. One end-to-end pipeline unlocks knowledge trapped in presentation decks.
04 / Scaling PostgreSQL to power 800 million ChatGPT users - Bohan Zhang
PostgreSQL handles millions of QPS reliably without sharding—OpenAI serves 800M ChatGPT users on single-primary + 50 read replicas. Bottleneck is write amplification via MVCC; fixes include query optimization, connection pooling, cache locking, and cascading replication for 10-50ms p99 latency and five-nines availability.
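One of the fixes named above, cache locking, can be sketched as a single-flight pattern: when many requests miss on the same key at once, only one runs the database query and the rest wait for its result. The sketch below is an in-process illustration under stated assumptions, not OpenAI's implementation. `SingleFlightCache` and `slow_query` are hypothetical names, and a production setup would more likely hold the lock or lease in a shared cache tier.

```python
import threading
import time


class SingleFlightCache:
    """Cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key gets past this point at a time.
        with lock:
            with self._guard:
                if key in self._values:  # filled while we were waiting
                    return self._values[key]
            value = self._loader(key)  # the expensive database read
            with self._guard:
                self._values[key] = value
            return value


calls = []


def slow_query(key):
    """Stand-in for a PostgreSQL read; records each invocation."""
    calls.append(key)
    time.sleep(0.05)
    return f"row-for-{key}"


cache = SingleFlightCache(slow_query)
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get("user:42")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls), len(results))
```

Eight concurrent readers produce one call to `slow_query`, which is the property that keeps a burst of cache misses from turning into a burst of load on the primary or its replicas.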
05 / Claude meets JSON: Automating Databricks Dashboards - Steven Kempers
Claude Code generates and modifies Databricks dashboards as JSON, replacing hours of manual UI work with AI-assisted code generation. Export dashboard structure locally, use Claude with BAML subagents to handle Databricks syntax, then deploy via Asset Bundles—full multi-page dashboards now prototype in 30-60 minutes.
06 / From Human Ergonomics to Agent Ergonomics - Wes McKinney
LLM agents iterate 1-2 orders of magnitude faster than humans, making slow test cycles prohibitive. Fast compile/test and seamless distribution now matter more than readability—expect Go/Rust to dominate agent infrastructure, while Python’s performance overhead becomes a liability even for data engineers.
07 / AI Trends Reshaping Data Engineering in 2026 - Alibaba Cloud Big Data and AI
Five mega-trends emerge: data+AI infra merge, shift from volume to freshness, unstructured data taming (80% of enterprise knowledge), context over prompts, and agent-native infrastructure. Unstructured data grows at a 49.3% CAGR; streaming-first architectures and multimodal databases (Hologres) are now essential, not optional.
08 / Databricks Lakeflow vs Apache Airflow - Daniel Beach
Lakeflow wins for Databricks-only workloads with clearer pipeline health visibility; Airflow dominates multi-cloud, multi-tool orchestration with 75% of users running ≤250 DAGs. Choose Lakeflow if fully Databricks-committed; choose Airflow if you need diverse connectors and external system integration.
09 / How we use Apache DataFusion at Spice AI - Phillip LeBlanc
DataFusion is a programmable query compiler, not just an engine. Spice maintains a fork with 20+ connectors, custom optimizer rules for cache + aggregate pushdown, async UDFs for inline AI inference, and physical execution extensions—proving that query engines should be extensible foundations, not black boxes.
10 / How do you vibe-engineer? - Julien Hurault
Effective agent development requires disciplined planning upfront, aggressive simplification, PoC isolation, and multi-model loops (Claude + GPT for convergence). Manage agent loops like optimization loops; plan mode trumps DO DO DO; multi-agent workflows benefit from YAML-based orchestration as a 2026 standard.
What’s New
AWS - Spark 4 in preview on EMR Serverless
ClickHouse - acquired Langfuse, OSS LLM observability platform
Columnar - ADBC driver for Databricks
Databricks - Iceberg support in Databricks Delta Sharing
Databricks - BlackIce, Docker image with 14 OSS bundles for AI security testing
Google - launched conversational analytics in BigQuery in preview
Lance - launched lance-context, multimodal agentic context lifecycle with Lance
Microsoft - Fabric January 2026 feature summary is out
Snowflake - Open Semantic Interchange (OSI) available under the Apache 2.0 license
Snowflake - acquired Observe, observability stack
Tabular Editor - launched Semantic Bridge, semantic model compiler
Tools & demos
Anton Babenko - terraform-skill, Claude skill for Terraform and OpenTofu
CauchyIO - claudetracing, MLflow tracing for Claude Code
CocoIndex - CocoIndex, declarative pipelines for AI
Michael Drogalis - ShadowTraffic, synthetic data for your backend
Simon Späti - Data Engineering Vault, data engineering KB
Tobias Muller - ai-observer, local observability for AI coding assistants
Pixeltable - Pixeltable, declarative data infra for multimodal AI apps
Steve Yegge - beads, a memory upgrade for coding agents
Wes McKinney - roborev, automatic code review for git commits using AI agents
Wes McKinney - agent-session-viewer, manage AI coding sessions
Upcoming Events
Big Data & AI World London
March 4 / London
FabCon - Microsoft Fabric Community Conference
March 16 / Atlanta
QCon London
March 16 / London
Iceberg Summit
April 8 / SF
Data Engineering Open Forum at Netflix
April 16 / SF
Google Cloud Next
April 22 / Las Vegas
SQLBits
April 22 / Newport, Wales
ODSC
April 28 / Boston
OpenXData
April 29 / Virtual
That’s all for now — we’ll be back in your inbox in two weeks.

