Incremental Models in dbt: Processing Data Efficiently
A deep dive into dbt's incremental materialization strategy, covering when to use incremental models, how to implement them correctly, and how to avoid the common pitfalls that lead to data inconsistencies.
Full table refreshes are the simplest and most reliable way to materialize dbt models. Every run rebuilds the entire table from scratch, guaranteeing that the output always reflects the current state of the upstream data. But simplicity has a cost: as data volumes grow, full refreshes become slow and expensive. A table with 500 million rows that takes 20 minutes to rebuild is tolerable when it runs once daily but unacceptable when stakeholders need hourly updates. Incremental models solve this problem by processing only new or changed data since the last run. This article covers when incremental models are appropriate, how to implement them correctly, and how to avoid the pitfalls that lead to subtle data inconsistencies.
When to Use Incremental Models
Not every model should be incremental. The added complexity of incremental logic introduces failure modes that do not exist with full refreshes. Use incremental models only when the data volume or processing cost justifies the complexity:
-- Decision framework for materialization strategy:
--
-- | Data Volume | Update Frequency | Late-Arriving Data | Recommendation |
-- |-----------------|------------------|--------------------|-------------------|
-- | < 1M rows | Any | Any | Table (full) |
-- | 1M - 100M rows | Daily | Rare | Incremental |
-- | 1M - 100M rows | Daily | Common | Incremental + lookback |
-- | > 100M rows | Any | Any | Incremental |
-- | Any | Real-time | Any | Incremental |
--
-- The threshold varies by warehouse. BigQuery handles larger full refreshes
-- more efficiently than PostgreSQL, for example.

A good rule of thumb: if the full refresh takes less than 5 minutes and runs infrequently, keep it as a full table. The simplicity and reliability are worth the warehouse cost. If it takes more than 5 minutes or runs frequently, consider incremental.
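The decision framework above can be encoded as a small helper. This is an illustrative sketch only: the function name, thresholds, and return labels are assumptions for demonstration, not dbt settings, and real-time updates are checked first since the table lists that row as overriding volume:

```python
def recommend_materialization(
    row_count: int,
    update_frequency: str,   # e.g. "daily" or "real-time"
    late_data_common: bool,
) -> str:
    """Illustrative encoding of the decision table above."""
    if update_frequency == "real-time":
        return "incremental"
    if row_count < 1_000_000:
        return "table"   # full refresh: simple and reliable
    if row_count > 100_000_000:
        return "incremental"
    # 1M-100M rows with daily updates: lookback depends on late arrivals
    if late_data_common:
        return "incremental + lookback"
    return "incremental"
```

Treat the thresholds as starting points and tune them to your warehouse's actual full-refresh timings.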
Basic Incremental Pattern
The simplest incremental model filters source data by a timestamp column, processing only records newer than the most recent record in the existing table:
-- models/marts/finance/fct_transactions.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

with source_data as (

    select
        transaction_id,
        customer_id,
        amount,
        currency,
        status,
        created_at,
        updated_at
    from {{ ref('stg_app__transactions') }}

    {% if is_incremental() %}
    -- Only process records created or updated since the last run
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}

)

select
    transaction_id,
    customer_id,
    amount,
    currency,
    status,
    created_at,
    updated_at,
    current_timestamp as dbt_updated_at
from source_data

The is_incremental() macro returns true when the target table already exists and the run is not a full refresh. The {{ this }} macro refers to the target table itself. Together, they create a filter that processes only new or updated records.
The unique_key configuration is critical. It tells dbt how to handle records that already exist in the target table. With incremental_strategy='merge', dbt generates a SQL MERGE statement that updates existing records and inserts new ones.
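As a mental model, the MERGE that dbt generates behaves like an upsert keyed on unique_key. Here is a minimal in-memory sketch of that behavior (plain Python standing in for the warehouse, not dbt's actual implementation):

```python
def merge_upsert(target: dict, new_rows: list, unique_key: str) -> dict:
    """Upsert new_rows into target, keyed by unique_key.

    Existing keys are updated in place; unseen keys are inserted,
    mirroring what a warehouse MERGE does with the incremental batch.
    """
    for row in new_rows:
        target[row[unique_key]] = row   # update if present, insert if not
    return target

# Existing target table: one settled-in-progress transaction
target = {"t1": {"transaction_id": "t1", "status": "pending"}}

# Incremental batch: one update to an existing row, one new row
batch = [
    {"transaction_id": "t1", "status": "settled"},   # update
    {"transaction_id": "t2", "status": "pending"},   # insert
]

merge_upsert(target, batch, "transaction_id")
```

Without a unique_key, dbt falls back to append-only behavior, and reprocessed records become duplicates rather than updates.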
Handling Late-Arriving Data
In real-world data pipelines, data rarely arrives in perfect chronological order. Payment confirmations may arrive hours after the initial transaction. Event logs may be batched and delivered with delays. A simple where updated_at > max(updated_at) filter misses these late arrivals.
The solution is a lookback window that reprocesses a fixed period of historical data on each run:
-- models/marts/finance/fct_transactions_with_lookback.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

{% set lookback_period = 3 %} -- days

with source_data as (

    select
        transaction_id,
        customer_id,
        amount,
        currency,
        status,
        created_at,
        updated_at
    from {{ ref('stg_app__transactions') }}

    {% if is_incremental() %}
    -- Reprocess the last N days to catch late-arriving data
    where updated_at >= (
        select dateadd(day, -{{ lookback_period }}, max(updated_at))
        from {{ this }}
    )
    {% endif %}

),

-- Deduplicate in case the lookback window produces duplicates.
-- The MERGE with unique_key handles this, but explicit dedup
-- is clearer about intent
deduplicated as (

    select
        *,
        row_number() over (
            partition by transaction_id
            order by updated_at desc
        ) as row_num
    from source_data

)

select
    transaction_id,
    customer_id,
    amount,
    currency,
    status,
    created_at,
    updated_at,
    current_timestamp as dbt_updated_at
from deduplicated
where row_num = 1

The lookback period is a trade-off between data completeness and processing cost. A 3-day lookback means every run reprocesses 3 days of data, which increases warehouse costs but ensures that late-arriving records within that window are captured. Analyze your data's actual late-arrival patterns to choose an appropriate lookback period.
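A toy simulation makes the failure mode concrete (plain Python standing in for the warehouse): a record delivered late but timestamped behind the table's high watermark is excluded by the strict `>` filter yet captured by a 3-day lookback:

```python
from datetime import date, timedelta

# max(updated_at) currently in the target table ({{ this }})
target_watermark = date(2024, 6, 10)

# A late-arriving record: delivered today, but stamped two days
# before the watermark (e.g. a delayed payment confirmation)
late_record_updated_at = date(2024, 6, 8)

# Strict high-watermark filter: updated_at > max(updated_at)
strict_filter = late_record_updated_at > target_watermark      # False -> missed

# 3-day lookback: updated_at >= max(updated_at) - 3 days
lookback_cutoff = target_watermark - timedelta(days=3)          # 2024-06-07
lookback_filter = late_record_updated_at >= lookback_cutoff     # True -> captured
```

Any record arriving more than the lookback period behind the watermark is still missed, which is why the window must be sized from observed late-arrival patterns rather than guessed.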
Incremental Strategies by Warehouse
dbt supports multiple incremental strategies, and the optimal choice depends on your warehouse:
-- Strategy 1: merge (default for Snowflake, BigQuery)
-- Uses SQL MERGE to upsert records
-- Best when: you have a reliable unique key and need to update existing records
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

-- Strategy 2: delete+insert (available on Snowflake, Redshift)
-- Deletes matching records, then inserts new ones
-- Best when: MERGE is slow or unsupported
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='delete+insert'
) }}

-- Strategy 3: insert_overwrite (BigQuery, Spark)
-- Replaces entire partitions
-- Best when: data is naturally partitioned by date
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'created_at',
        'data_type': 'date',
        'granularity': 'day'
    }
) }}

-- Strategy 4: append (all warehouses)
-- Simply inserts new records without deduplication
-- Best when: the source guarantees no duplicates (event streams)
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

For BigQuery, insert_overwrite with partition-based filtering is often the most efficient strategy because BigQuery is optimized for partition-level operations. For Snowflake, merge is generally the best choice because Snowflake's MERGE implementation is highly optimized.
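The key difference between insert_overwrite and merge is the unit of replacement: whole partitions rather than individual keys. A simplified in-memory model (not BigQuery's actual mechanics) makes the semantics visible:

```python
from collections import defaultdict

def insert_overwrite(table: dict, new_rows: list, partition_field: str) -> dict:
    """Replace, wholesale, every partition that appears in new_rows.

    Partitions absent from the batch are left untouched; partitions
    present in the batch lose their old contents entirely.
    """
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[row[partition_field]].append(row)
    for partition, rows in incoming.items():
        table[partition] = rows   # drop old partition contents entirely
    return table

# Target table keyed by date partition
table = {
    "2024-06-01": [{"created_at": "2024-06-01", "amount": 10}],
    "2024-06-02": [{"created_at": "2024-06-02", "amount": 20}],
}

# Incremental batch restates partition 2024-06-02
batch = [{"created_at": "2024-06-02", "amount": 25}]

insert_overwrite(table, batch, "created_at")
```

This is why insert_overwrite needs no unique_key: correctness comes from always restating complete partitions, so the incremental filter must select entire days, never a partial day.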
Testing Incremental Models
Incremental models require additional testing beyond what full-refresh models need. Specifically, you must verify that the incremental logic produces the same result as a full refresh:
-- tests/assert_incremental_matches_full.sql
-- Run periodically to verify that the incremental model has not drifted
-- from what a full refresh would produce

-- Compare row counts
with incremental_count as (

    select count(*) as cnt from {{ ref('fct_transactions') }}

),

full_refresh_count as (

    -- This subquery replicates the model logic without the incremental filter
    select count(*) as cnt
    from {{ ref('stg_app__transactions') }}

)

select
    i.cnt as incremental_rows,
    f.cnt as full_refresh_rows,
    abs(i.cnt - f.cnt) as difference
from incremental_count i
cross join full_refresh_count f
where abs(i.cnt - f.cnt) > 0

# Automated full-refresh reconciliation
# Run weekly as part of the data quality suite
def reconcile_incremental_model(
    model_name: str,
    warehouse_connection,
    tolerance_pct: float = 0.001,
) -> dict:
    """
    Compare the current incremental table against a temporary
    full refresh to detect drift.
    """
    # Run dbt with --full-refresh into a temporary schema.
    # run_dbt is assumed to be a thin wrapper around dbt's
    # programmatic invocation (e.g. dbtRunner in dbt Core >= 1.5).
    temp_schema = f"reconciliation_{model_name}"
    run_dbt([
        "run",
        "--select", model_name,
        "--full-refresh",
        "--target", "reconciliation",  # Points to temp_schema
    ])

    # Compare row counts
    incremental_count = warehouse_connection.query(
        f"SELECT COUNT(*) FROM production.{model_name}"
    )[0][0]
    full_count = warehouse_connection.query(
        f"SELECT COUNT(*) FROM {temp_schema}.{model_name}"
    )[0][0]
    drift_pct = abs(incremental_count - full_count) / max(full_count, 1)

    # Compare key aggregates
    incremental_sum = warehouse_connection.query(
        f"SELECT SUM(amount) FROM production.{model_name}"
    )[0][0]
    full_sum = warehouse_connection.query(
        f"SELECT SUM(amount) FROM {temp_schema}.{model_name}"
    )[0][0]
    amount_drift_pct = abs(incremental_sum - full_sum) / max(abs(full_sum), 1)

    result = {
        "model": model_name,
        "incremental_rows": incremental_count,
        "full_refresh_rows": full_count,
        "row_drift_pct": round(drift_pct, 6),
        "amount_drift_pct": round(amount_drift_pct, 6),
        "passed": drift_pct <= tolerance_pct and amount_drift_pct <= tolerance_pct,
    }

    # Clean up the temporary schema
    warehouse_connection.execute(f"DROP SCHEMA {temp_schema} CASCADE")
    return result

The Full-Refresh Escape Hatch
Despite careful implementation, incremental models occasionally drift from the truth. Schema changes in upstream sources, bugs in the incremental filter logic, or warehouse-level issues can cause the incremental table to diverge from what a full refresh would produce.
Always maintain the ability to run a full refresh:
# Run a full refresh for a specific model
dbt run --select fct_transactions --full-refresh
# Run a full refresh for all models tagged 'incremental'
dbt run --select tag:incremental --full-refresh
# Schedule periodic full refreshes (e.g., weekly on Sunday)
# This resets any drift that has accumulated

At Klivvr, we run full refreshes of all incremental models every Sunday during low-traffic hours. This provides a regular correction mechanism that prevents drift from accumulating over time. The rest of the week, incremental runs process only new and updated data, keeping the pipeline fast and warehouse costs low.
Conclusion
Incremental models are essential for scaling dbt pipelines to large data volumes. They reduce processing time and warehouse costs by orders of magnitude compared to full refreshes. But they introduce complexity that requires careful handling: late-arriving data needs lookback windows, deduplication must be explicit, and the incremental logic must be regularly validated against full refreshes.
The key principles are straightforward. Use incremental models only when the performance benefit justifies the complexity. Always define a unique key for proper upsert behavior. Implement a lookback window sized to your data's actual late-arrival patterns. Choose the incremental strategy that matches your warehouse's strengths. Test the incremental model against full refreshes regularly. And maintain the ability to run a full refresh at any time, because the full refresh is your ground truth and your escape hatch when incremental logic goes wrong.
Related Articles
Data-Driven Decision Making in Fintech
How a well-architected data platform enables better business decisions across product, finance, and operations — with practical examples of self-service analytics and data democratization at Klivvr.
Data Quality Testing: Strategies and Implementation
A comprehensive guide to implementing data quality testing across the data pipeline, from schema validation and freshness checks to statistical anomaly detection and business rule enforcement.
DAG Pipeline Design: Principles for Data Engineering
Core principles for designing directed acyclic graph (DAG) pipelines that are maintainable, observable, and resilient, with practical examples from production data engineering systems.