Incremental Models in dbt: Processing Data Efficiently
A deep dive into dbt's incremental materialization strategy, covering when to use incremental models, how to implement them correctly, and how to avoid the common pitfalls that lead to data inconsistencies.
Full table refreshes are the simplest and most reliable way to materialize dbt models. Every run rebuilds the entire table from scratch, guaranteeing that the output always reflects the current state of the upstream data. But simplicity has a cost: as data volumes grow, full refreshes become slow and expensive. A table with 500 million rows that takes 20 minutes to rebuild is tolerable when it runs once daily but unacceptable when stakeholders need hourly updates. Incremental models solve this problem by processing only new or changed data since the last run. This article covers when incremental models are appropriate, how to implement them correctly, and how to avoid the pitfalls that lead to subtle data inconsistencies.
When to Use Incremental Models
Not every model should be incremental. The added complexity of incremental logic introduces failure modes that do not exist with full refreshes. Use incremental models only when the data volume or processing cost justifies the complexity:
-- Decision framework for materialization strategy:
--
-- | Data Volume | Update Frequency | Late-Arriving Data | Recommendation |
-- |-----------------|------------------|--------------------|-------------------|
-- | < 1M rows | Any | Any | Table (full) |
-- | 1M - 100M rows | Daily | Rare | Incremental |
-- | 1M - 100M rows | Daily | Common | Incremental + lookback |
-- | > 100M rows | Any | Any | Incremental |
-- | Any | Real-time | Any | Incremental |
--
-- The threshold varies by warehouse. BigQuery handles larger full refreshes
-- more efficiently than PostgreSQL, for example.

A good rule of thumb: if the full refresh takes less than 5 minutes and runs infrequently, keep it as a full table. The simplicity and reliability are worth the warehouse cost. If it takes more than 5 minutes or runs frequently, consider incremental.
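The decision framework above can be encoded as a small helper. This is an illustrative sketch only: the function name, thresholds, and return labels are assumptions for demonstration, not dbt settings, and real-time updates are checked first since the table lists that row as overriding volume:

```python
def recommend_materialization(
    row_count: int,
    update_frequency: str,   # e.g. "daily" or "real-time"
    late_data_common: bool,
) -> str:
    """Illustrative encoding of the decision table above."""
    if update_frequency == "real-time":
        return "incremental"
    if row_count < 1_000_000:
        return "table"   # full refresh: simple and reliable
    if row_count > 100_000_000:
        return "incremental"
    # 1M-100M rows with daily updates: lookback depends on late arrivals
    if late_data_common:
        return "incremental + lookback"
    return "incremental"
```

Treat the thresholds as starting points and tune them to your warehouse's actual full-refresh timings.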
Basic Incremental Pattern
The simplest incremental model filters source data by a timestamp column, processing only records newer than the most recent record in the existing table:
-- models/marts/finance/fct_transactions.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

with source_data as (

    select
        transaction_id,
        customer_id,
        amount,
        currency,
        status,
        created_at,
        updated_at
    from {{ ref('stg_app__transactions') }}

    {% if is_incremental() %}
    -- Only process records created or updated since the last run
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}

)

select
    transaction_id,
    customer_id,
    amount,
    currency,
    status,
    created_at,
    updated_at,
    current_timestamp as dbt_updated_at
from source_data

The is_incremental() macro returns true when the target table already exists and the run is not a full refresh. The {{ this }} macro refers to the target table itself. Together, they create a filter that processes only new or updated records.
The unique_key configuration is critical. It tells dbt how to handle records that already exist in the target table. With incremental_strategy='merge', dbt generates a SQL MERGE statement that updates existing records and inserts new ones.
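As a mental model, the MERGE that dbt generates behaves like an upsert keyed on unique_key. Here is a minimal in-memory sketch of that behavior (plain Python standing in for the warehouse, not dbt's actual implementation):

```python
def merge_upsert(target: dict, new_rows: list, unique_key: str) -> dict:
    """Upsert new_rows into target, keyed by unique_key.

    Existing keys are updated in place; unseen keys are inserted,
    mirroring what a warehouse MERGE does with the incremental batch.
    """
    for row in new_rows:
        target[row[unique_key]] = row   # update if present, insert if not
    return target

# Existing target table: one settled-in-progress transaction
target = {"t1": {"transaction_id": "t1", "status": "pending"}}

# Incremental batch: one update to an existing row, one new row
batch = [
    {"transaction_id": "t1", "status": "settled"},   # update
    {"transaction_id": "t2", "status": "pending"},   # insert
]

merge_upsert(target, batch, "transaction_id")
```

Without a unique_key, dbt falls back to append-only behavior, and reprocessed records become duplicates rather than updates.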
Handling Late-Arriving Data
In real-world data pipelines, data rarely arrives in perfect chronological order. Payment confirmations may arrive hours after the initial transaction. Event logs may be batched and delivered with delays. A simple where updated_at > max(updated_at) filter misses these late arrivals.
The solution is a lookback window that reprocesses a fixed period of historical data on each run:
-- models/marts/finance/fct_transactions_with_lookback.sql
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

{% set lookback_period = 3 %} -- days

with source_data as (

    select
        transaction_id,
        customer_id,
        amount,
        currency,
        status,
        created_at,
        updated_at
    from {{ ref('stg_app__transactions') }}

    {% if is_incremental() %}
    -- Reprocess the last N days to catch late-arriving data
    where updated_at >= (
        select dateadd(day, -{{ lookback_period }}, max(updated_at))
        from {{ this }}
    )
    {% endif %}

),

-- Deduplicate in case the lookback window produces duplicates.
-- The MERGE with unique_key handles this, but explicit dedup
-- is clearer about intent
deduplicated as (

    select
        *,
        row_number() over (
            partition by transaction_id
            order by updated_at desc
        ) as row_num
    from source_data

)

select
    transaction_id,
    customer_id,
    amount,
    currency,
    status,
    created_at,
    updated_at,
    current_timestamp as dbt_updated_at
from deduplicated
where row_num = 1

The lookback period is a trade-off between data completeness and processing cost. A 3-day lookback means every run reprocesses 3 days of data, which increases warehouse costs but ensures that late-arriving records within that window are captured. Analyze your data's actual late-arrival patterns to choose an appropriate lookback period.
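A toy simulation makes the failure mode concrete (plain Python standing in for the warehouse): a record delivered late but timestamped behind the table's high watermark is excluded by the strict `>` filter yet captured by a 3-day lookback:

```python
from datetime import date, timedelta

# max(updated_at) currently in the target table ({{ this }})
target_watermark = date(2024, 6, 10)

# A late-arriving record: delivered today, but stamped two days
# before the watermark (e.g. a delayed payment confirmation)
late_record_updated_at = date(2024, 6, 8)

# Strict high-watermark filter: updated_at > max(updated_at)
strict_filter = late_record_updated_at > target_watermark      # False -> missed

# 3-day lookback: updated_at >= max(updated_at) - 3 days
lookback_cutoff = target_watermark - timedelta(days=3)          # 2024-06-07
lookback_filter = late_record_updated_at >= lookback_cutoff     # True -> captured
```

Any record arriving more than the lookback period behind the watermark is still missed, which is why the window must be sized from observed late-arrival patterns rather than guessed.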
Incremental Strategies by Warehouse
dbt supports multiple incremental strategies, and the optimal choice depends on your warehouse:
-- Strategy 1: merge (default for Snowflake, BigQuery)
-- Uses SQL MERGE to upsert records
-- Best when: you have a reliable unique key and need to update existing records
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

-- Strategy 2: delete+insert (available on Snowflake, Redshift)
-- Deletes matching records, then inserts new ones
-- Best when: MERGE is slow or unsupported
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='delete+insert'
) }}

-- Strategy 3: insert_overwrite (BigQuery, Spark)
-- Replaces entire partitions
-- Best when: data is naturally partitioned by date
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'created_at',
        'data_type': 'date',
        'granularity': 'day'
    }
) }}

-- Strategy 4: append (all warehouses)
-- Simply inserts new records without deduplication
-- Best when: the source guarantees no duplicates (event streams)
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

For BigQuery, insert_overwrite with partition-based filtering is often the most efficient strategy because BigQuery is optimized for partition-level operations. For Snowflake, merge is generally the best choice because Snowflake's MERGE implementation is highly optimized.
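The key difference between insert_overwrite and merge is the unit of replacement: whole partitions rather than individual keys. A simplified in-memory model (not BigQuery's actual mechanics) makes the semantics visible:

```python
from collections import defaultdict

def insert_overwrite(table: dict, new_rows: list, partition_field: str) -> dict:
    """Replace, wholesale, every partition that appears in new_rows.

    Partitions absent from the batch are left untouched; partitions
    present in the batch lose their old contents entirely.
    """
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[row[partition_field]].append(row)
    for partition, rows in incoming.items():
        table[partition] = rows   # drop old partition contents entirely
    return table

# Target table keyed by date partition
table = {
    "2024-06-01": [{"created_at": "2024-06-01", "amount": 10}],
    "2024-06-02": [{"created_at": "2024-06-02", "amount": 20}],
}

# Incremental batch restates partition 2024-06-02
batch = [{"created_at": "2024-06-02", "amount": 25}]

insert_overwrite(table, batch, "created_at")
```

This is why insert_overwrite needs no unique_key: correctness comes from always restating complete partitions, so the incremental filter must select entire days, never a partial day.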
Testing Incremental Models
Incremental models require additional testing beyond what full-refresh models need. Specifically, you must verify that the incremental logic produces the same result as a full refresh:
-- tests/assert_incremental_matches_full.sql
-- Run periodically to verify that the incremental model has not drifted
-- from what a full refresh would produce

-- Compare row counts
with incremental_count as (

    select count(*) as cnt from {{ ref('fct_transactions') }}

),

full_refresh_count as (

    -- This subquery replicates the model logic without the incremental filter
    select count(*) as cnt
    from {{ ref('stg_app__transactions') }}

)

select
    i.cnt as incremental_rows,
    f.cnt as full_refresh_rows,
    abs(i.cnt - f.cnt) as difference
from incremental_count i
cross join full_refresh_count f
where abs(i.cnt - f.cnt) > 0

# Automated full-refresh reconciliation
# Run weekly as part of the data quality suite
def reconcile_incremental_model(
    model_name: str,
    warehouse_connection,
    tolerance_pct: float = 0.001,
) -> dict:
    """
    Compare the current incremental table against a temporary
    full refresh to detect drift.
    """
    # Run dbt with --full-refresh into a temporary schema.
    # run_dbt is assumed to be a thin wrapper around dbt's
    # programmatic invocation (e.g. dbtRunner in dbt Core >= 1.5).
    temp_schema = f"reconciliation_{model_name}"
    run_dbt([
        "run",
        "--select", model_name,
        "--full-refresh",
        "--target", "reconciliation",  # Points to temp_schema
    ])

    # Compare row counts
    incremental_count = warehouse_connection.query(
        f"SELECT COUNT(*) FROM production.{model_name}"
    )[0][0]
    full_count = warehouse_connection.query(
        f"SELECT COUNT(*) FROM {temp_schema}.{model_name}"
    )[0][0]
    drift_pct = abs(incremental_count - full_count) / max(full_count, 1)

    # Compare key aggregates
    incremental_sum = warehouse_connection.query(
        f"SELECT SUM(amount) FROM production.{model_name}"
    )[0][0]
    full_sum = warehouse_connection.query(
        f"SELECT SUM(amount) FROM {temp_schema}.{model_name}"
    )[0][0]
    amount_drift_pct = abs(incremental_sum - full_sum) / max(abs(full_sum), 1)

    result = {
        "model": model_name,
        "incremental_rows": incremental_count,
        "full_refresh_rows": full_count,
        "row_drift_pct": round(drift_pct, 6),
        "amount_drift_pct": round(amount_drift_pct, 6),
        "passed": drift_pct <= tolerance_pct and amount_drift_pct <= tolerance_pct,
    }

    # Clean up the temporary schema
    warehouse_connection.execute(f"DROP SCHEMA {temp_schema} CASCADE")
    return result

The Full-Refresh Escape Hatch
Despite careful implementation, incremental models occasionally drift from the truth. Schema changes in upstream sources, bugs in the incremental filter logic, or warehouse-level issues can cause the incremental table to diverge from what a full refresh would produce.
Always maintain the ability to run a full refresh:
# Run a full refresh for a specific model
dbt run --select fct_transactions --full-refresh
# Run a full refresh for all models tagged 'incremental'
dbt run --select tag:incremental --full-refresh
# Schedule periodic full refreshes (e.g., weekly on Sunday)
# This resets any drift that has accumulated

At Klivvr, we run full refreshes of all incremental models every Sunday during low-traffic hours. This provides a regular correction mechanism that prevents drift from accumulating over time. The rest of the week, incremental runs process only new and updated data, keeping the pipeline fast and warehouse costs low.
Conclusion
Incremental models are essential for scaling dbt pipelines to large data volumes. They reduce processing time and warehouse costs by orders of magnitude compared to full refreshes. But they introduce complexity that requires careful handling: late-arriving data needs lookback windows, deduplication must be explicit, and the incremental logic must be regularly validated against full refreshes.
The key principles are straightforward. Use incremental models only when the performance benefit justifies the complexity. Always define a unique key for proper upsert behavior. Implement a lookback window sized to your data's actual late-arrival patterns. Choose the incremental strategy that matches your warehouse's strengths. Test the incremental model against full refreshes regularly. And maintain the ability to run a full refresh at any time, because the full refresh is your ground truth and your escape hatch when incremental logic goes wrong.
Related Articles
Data-Driven Decision Making in Fintech
How a well-architected data platform enables better business decisions across product, finance, and operations — with practical examples of self-service analytics and data democratization at Klivvr.
Data Quality Testing: Strategies and Implementation
A comprehensive guide to implementing data quality testing across the data pipeline, from schema validation and freshness checks to statistical anomaly detection and business rule enforcement.
DAG Pipeline Design: Principles for Data Engineering
Core principles for designing directed acyclic graph (DAG) pipelines that are maintainable, observable, and resilient, with practical examples from production data engineering systems.