MEQuest
Module 2 · Unit 2 of 5 · 7 min

Data Lakes & Warehouses

While data historians excel at storing time-series sensor data, a digital oilfield produces many other types of data - well reports, geological models, maintenance logs, contracts, and more. Data lakes and data warehouses provide the broader storage infrastructure to handle this variety.

Data Lake vs Data Warehouse

Data Lake

Stores raw data in its original format - structured, semi-structured, or unstructured. Schema is applied when data is read (schema-on-read).

  • Stores everything: sensor data, PDFs, images, logs
  • Flexible and cost-effective for large volumes
  • Ideal for data science and exploration
  • Risk of becoming a "data swamp" without governance

Example: An Azure Data Lake stores 10 TB of raw well test reports, seismic files, and sensor exports for a data science team.
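Schema-on-read can be shown in a few lines: the raw export lands in the lake untouched, and types are only applied when someone consumes it. This is a minimal sketch with made-up well-test data; the column names and schema are illustrative, not from any real system.

```python
import csv
import io

# Raw export dropped into the lake as-is: nothing is validated at write time.
raw_export = """well_id,oil_rate,test_date
A-01,1250.5,2024-03-01
A-02,n/a,2024-03-02
"""

# Schema-on-read: types and validation are applied only when the data is read.
schema = {"well_id": str, "oil_rate": float, "test_date": str}

def read_with_schema(text):
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        try:
            rows.append({col: cast(record[col]) for col, cast in schema.items()})
        except ValueError:
            pass  # bad record surfaces at read time, not at ingestion
    return rows

rows = read_with_schema(raw_export)
# The 'n/a' reading fails the float cast, so only the clean record survives.
```

Note that the bad record was stored without complaint and only rejected when read back - exactly the behaviour that makes lakes flexible to fill but risky to govern.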

Data Warehouse

Stores cleaned, structured, and organised data optimised for querying and reporting. Schema is applied when data is written (schema-on-write).

  • Highly structured with defined schemas
  • Fast query performance for dashboards
  • Ideal for business reporting and KPIs
  • Requires ETL pipelines to load data

Example: A Snowflake data warehouse holds curated production, deferment, and cost data used by Power BI dashboards.
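Schema-on-write is the mirror image: the table's structure and rules are declared up front, and every insert must conform before it is accepted. The sketch below uses SQLite purely as a stand-in for a warehouse engine; the table and constraint are hypothetical.

```python
import sqlite3

# Schema-on-write: the warehouse table is declared before any data arrives.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE daily_production (
        well_id   TEXT NOT NULL,
        prod_date TEXT NOT NULL,
        oil_bbl   REAL NOT NULL CHECK (oil_bbl >= 0),
        PRIMARY KEY (well_id, prod_date)
    )
""")

# A conforming row is accepted.
con.execute("INSERT INTO daily_production VALUES ('A-01', '2024-03-01', 1250.5)")

try:
    # A negative volume violates the CHECK constraint: rejected at write time.
    con.execute("INSERT INTO daily_production VALUES ('A-02', '2024-03-01', -10.0)")
except sqlite3.IntegrityError:
    pass  # the bad record never enters the warehouse
```

Because garbage is rejected before it lands, everything downstream - dashboards, KPIs - can trust the table without re-validating it on every query.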

The Data Lakehouse

The data lakehouse is a newer architecture that combines the flexibility of a data lake with the performance and structure of a data warehouse. Technologies like Databricks Delta Lake and Apache Iceberg enable this by adding ACID transactions and schema enforcement on top of data lake storage.

Many oil and gas companies are adopting the lakehouse pattern because it eliminates the need to maintain separate lake and warehouse systems. Raw sensor data, well reports, and curated production tables all live in the same platform - reducing data duplication and simplifying the architecture.
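To make the two lakehouse ingredients concrete, here is a toy sketch of what "schema enforcement plus atomic commits on top of file storage" means. This is illustrative only - it is not the Delta Lake or Iceberg API, and the table layout is invented.

```python
import json
import os
import tempfile

# Hypothetical table schema enforced on every append (as a lakehouse engine would).
SCHEMA = {"well_id": str, "oil_bbl": float}

def append_atomic(table_dir, rows):
    # Schema enforcement: validate every row before anything touches storage.
    for row in rows:
        for col, typ in SCHEMA.items():
            if not isinstance(row.get(col), typ):
                raise TypeError(f"{col} must be {typ.__name__}")
    # Atomicity: write to a temp file, then rename in one step, so a reader
    # sees either all of the new rows or none of them.
    fd, tmp = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp, os.path.join(table_dir, "part-0001.json"))
```

Real engines add far more (transaction logs, time travel, concurrent writers), but the core idea is the same: warehouse-style guarantees layered over cheap lake storage.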

ETL / ELT Pipelines

Data does not magically appear in a lake or warehouse. ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines move data from source systems, clean and transform it, and load it into the target storage.

Extract

Pull data from historians, SCADA, ERP, maintenance systems

Transform

Clean, validate, convert units, join related datasets

Load

Write to data lake, warehouse, or lakehouse tables
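The three steps above can be sketched as a minimal pipeline. The source records, field names, and unit conversion are hypothetical stand-ins for a historian export; SQLite again stands in for the target warehouse.

```python
import sqlite3

M3_TO_BBL = 6.2898  # cubic metres to barrels, applied in the transform step

def extract():
    # Stand-in for pulling from a historian or SCADA export (made-up data).
    return [
        {"well": "A-01", "oil_m3": 200.0, "status": "ok"},
        {"well": "A-02", "oil_m3": None,  "status": "sensor fault"},
    ]

def transform(records):
    # Clean: drop readings with no value; convert units to barrels.
    return [
        {"well": r["well"], "oil_bbl": round(r["oil_m3"] * M3_TO_BBL, 1)}
        for r in records
        if r["oil_m3"] is not None
    ]

def load(rows, con):
    # Write the curated rows into the target table.
    con.execute("CREATE TABLE IF NOT EXISTS production (well TEXT, oil_bbl REAL)")
    con.executemany("INSERT INTO production VALUES (:well, :oil_bbl)", rows)

con = sqlite3.connect(":memory:")
load(transform(extract()), con)
```

In ELT the order of the last two functions swaps: raw records are loaded first and transformed inside the target platform, which is the pattern lakehouses favour.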

Do not underestimate the plumbing
Data infrastructure is not glamorous, but it is the foundation of everything else. If your data pipelines are unreliable, your dashboards will show stale data, your ML models will train on garbage, and your engineers will lose trust in the entire digital system.