The MURE LOG #1: 2025 recap, 2026 predictions. Dicer. CocoIndex and LanceDB. Iceberg and BigLake.

The data systems log, curated by humans

Jan 16, 2026

This first edition covers the 2025 retrospectives and 2026 outlook. Year-in-review posts from Pavlo, Packkildurai, Cook, and Lorica dissect what actually happened in databases, data engineering, and infrastructure. Forward-looking pieces examine where agents, open formats, and the metadata layer are taking us. Plus practical deep-dives on Iceberg datasets, incremental indexing, and AI-powered Spark optimization.

Data Reads

01 / A Diary of a Data Engineer - Simon Späti
Tools change, fundamentals don't. Learn data modeling deeply, understand how data flows. Excel isn't the enemy—it's the business telling you what they need. When you do your job well, you're invisible. When something breaks, you're under a microscope.

02 / Databases in 2025: A Year in Review - Andy Pavlo
PostgreSQL keeps eating the world. Databricks paid $1B for Neon, and Snowflake grabbed Crunchy Data for $250M. MCP became the standard for AI-to-database connections. MongoDB sued FerretDB. File format wars heated up with five new Parquet challengers. Chaotic and exciting.

03 / 10 Predictions for Data Infrastructure in 2026 - Ian Cook
Arrow approaches its 10th anniversary as ubiquitous infrastructure but faces funding strain. ADBC is becoming the connectivity layer for columnar data. Iceberg climbed past hype into production. Multi-engine stacks go mainstream. The boring work of interoperability will be the highest-leverage engineering.

04 / DEW - The Year in Review 2025 - Ananth Packkildurai
Data engineering shifted from pipeline plumbing to cognitive infrastructure—designing systems for AI agents to reason and act. MCP became "USB-C for Agents." The bottleneck moved from model capacity to managing meaning. The "Big Data" era ended; the "Context Era" began.

05 / Data Engineering in 2026: What Changes? - Ben Lorica
Two forces: more automation (agents doing real work) and more scrutiny ("close enough" fails when software decides). Databricks says 80%+ of new databases are agent-launched. The fragmentation tax kills agentic reliability. Success = version control, tests, unified execution—applied to tables and embeddings, not just code.

06 / From ETL to Autonomy: Data Engineering in 2026 - Chris Child
Data engineers are transitioning from builders to strategists. Open formats like Iceberg will be embraced by C-suites as the AI foundation—eliminating vendor lock-in. The metadata layer becomes the critical control plane. Data engineers are no longer technical resources—they're business partners.

07 / Explore public datasets with Apache Iceberg & BigLake - Talat Uyarer et al.
High-quality public datasets are now available via the Iceberg REST Catalog on BigLake. It's engine-agnostic: Spark, Trino, Flink, or BigQuery. You bring the compute, they provide storage and catalog. No file copying, no bucket management. Just configure and query. New V3 spec test datasets are coming.
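
To make "just configure and query" concrete, here is a minimal sketch of the Spark properties an Iceberg REST catalog typically needs. The property keys are the standard Iceberg Spark catalog settings; the catalog name, URI, and warehouse in the usage note are placeholders we made up, not values from the post, and authentication settings are left out entirely.

```python
def iceberg_rest_catalog_conf(catalog_name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties that register an Iceberg REST catalog.

    catalog_name, uri, and warehouse are supplied by the caller; nothing
    here is specific to BigLake, and auth properties are omitted.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog and points it at a REST endpoint.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten the properties into `--conf key=value` pairs for spark-submit."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values for illustration only; take the real endpoint and
# warehouse from the provider's documentation.
example_conf = iceberg_rest_catalog_conf(
    catalog_name="public_data",
    uri="https://example.com/iceberg/rest",
    warehouse="gs://example-bucket/warehouse",
)
```

With a catalog registered this way, tables would be addressed as `public_data.<namespace>.<table>` in Spark SQL. Other engines use their own configuration syntax for the same REST endpoint.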

08 / Keep Your Data Fresh with CocoIndex and LanceDB - Prashanth Rao et al.
CocoIndex keeps source and target in sync automatically via incremental transformations. Combined with LanceDB, a multimodal vector database, it enables real-time semantic search without manual refresh. Declare the transformations and CocoIndex handles the sync. Always-fresh data for AI apps.
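
For a feel of what "incremental" means here, the sketch below shows the general idea in plain Python: fingerprint each source item, recompute only what changed, and drop what disappeared. It is not CocoIndex's or LanceDB's API; every name in it (`incremental_sync`, `fingerprint`, the dict-based source and target) is hypothetical and chosen for illustration.

```python
import hashlib
from typing import Callable


def fingerprint(content: str) -> str:
    """Stable content hash used to detect changed source items."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def incremental_sync(
    source: dict[str, str],
    target: dict[str, str],
    seen: dict[str, str],
    transform: Callable[[str], str],
) -> dict[str, int]:
    """Bring `target` in line with `source`, transforming only changed items.

    `seen` maps each source key to the fingerprint it had on the last run.
    """
    stats = {"updated": 0, "deleted": 0, "skipped": 0}

    for key, content in source.items():
        fp = fingerprint(content)
        if seen.get(key) == fp:
            # Unchanged since the last run: skip the transform entirely.
            stats["skipped"] += 1
            continue
        target[key] = transform(content)
        seen[key] = fp
        stats["updated"] += 1

    # Remove target rows whose source item no longer exists.
    for key in list(target):
        if key not in source:
            del target[key]
            seen.pop(key, None)
            stats["deleted"] += 1

    return stats
```

In a real pipeline the transform would be something expensive, such as chunking and embedding a document, which is why skipping unchanged items matters.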

09 / How Slack achieved operational excellence for Spark on Amazon EMR using generative AI - Avijit Goswami et al.
Slack built a monitoring framework that captures 40+ Spark metrics, processes them through Kafka and Iceberg, and uses Bedrock for AI-powered tuning. Results: 30-50% cost cuts, 40-60% faster jobs. Metrics are exposed via an internal MCP server for Cursor/Claude Code integration. No more guesswork tuning.

What’s New

Databricks - Open sourced Dicer (auto sharder)
Google - A gRPC transport for the Model Context Protocol
Microsoft - Acquired Osmos, agentic AI for data engineering in Fabric
Tinybird - Launched ClickHouse for developers

Tools & demos

Kubeflow - MCP for Apache Spark History Server
DataFlint - DataFlint OSS, drop-in replacement for Apache Spark UI
Databricks - DLT-META, a metadata-driven framework for Declarative Pipelines
Data Goblin - fabric-cli-plugin, Claude Code plugin for Microsoft Fabric CLI

Upcoming Events

Big Data & AI World London
March 4 / London

FabCon
March 16 / Atlanta

QCon London
March 16 / London

Iceberg Summit
April 8 / SF

Data Engineering Open Forum at Netflix
April 16 / SF

Google Cloud Next
April 22 / Las Vegas

SQLBits
April 22 / Newport, Wales

ODSC
April 28 / Boston

That’s all for now — we’ll be back in your inbox in two weeks.
