📈 Complete Guide to dbt Incremental Strategies

Master Every Approach to Incremental Data Loading

📋 Navigation

🎯 Strategy Overview

Quick comparison of all strategies

🔄 Delete + Insert

The default strategy

🔀 Merge

Database-native MERGE

➕ Append

Insert-only strategy

🔁 Insert Overwrite

Partition-level replacement

⚡ Microbatch

Time-based processing

🧭 Decision Guide

Which strategy to choose

🎯 Strategy Overview

delete+insert (Default)

✅ Inserts · ✅ Updates · ❌ No deletes · Medium performance

Deletes existing records matching the unique_key, then inserts all new data. Most reliable and widely supported.

merge

✅ Inserts · ✅ Updates · ⚠️ Deletes (configurable) · High performance

Uses database-native MERGE statements. Best performance for large datasets with many updates.

append

✅ Inserts · ❌ No updates · ❌ No deletes · Highest performance

Simply appends new rows. Perfect for event logs and time-series data. Creates duplicates if used incorrectly.

insert_overwrite

✅ Inserts · ✅ Updates (partition-level) · ✅ Deletes (partition-level) · High performance

Overwrites entire partitions. Excellent for time-partitioned data with complete daily/hourly loads.

microbatch

✅ Inserts · ✅ Updates · ❌ No deletes · Optimized performance

Processes data in time-based batches. Ideal for large datasets with consistent time-based partitioning.

Complete Comparison Table

Strategy         | Inserts | Updates            | Deletes            | Performance | Best For                     | Warehouse Support
delete+insert    | ✅ Yes  | ✅ Yes             | ❌ No              | Medium      | General purpose, reliable    | All warehouses
merge            | ✅ Yes  | ✅ Yes             | ⚠️ Optional        | High        | Large datasets, many updates | BigQuery, Snowflake, Databricks
append           | ✅ Yes  | ❌ No              | ❌ No              | Highest     | Event logs, time-series      | All warehouses
insert_overwrite | ✅ Yes  | ✅ Partition-level | ✅ Partition-level | High        | Time-partitioned data        | BigQuery, Spark/Databricks
microbatch       | ✅ Yes  | ✅ Yes             | ❌ No              | Optimized   | Large time-series datasets   | All warehouses (dbt 1.9+)

🔄 Delete + Insert Strategy (Default)

The delete+insert strategy is dbt's default incremental approach on many adapters. It's the most reliable and universally supported method.

How It Works

📋 Code Examples: Basic Implementation | Advanced Configuration

{{ config(
    materialized='incremental',
    unique_key='customer_id'              -- Required for updates
    -- incremental_strategy='delete+insert'  -- This is the default
) }}

with incremental_data as (

    select
        customer_id,
        customer_name,
        email,
        phone,
        last_updated
    from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    -- Only process changed records
    where last_updated > (select max(last_updated) from {{ this }})
    {% endif %}

)

select * from incremental_data

-- BEHIND THE SCENES:
-- Step 1: DELETE FROM target WHERE customer_id IN (1, 2, 3, ...)
-- Step 2: INSERT INTO target SELECT * FROM incremental_data
✅ Advantages
  • Universal support: Works on all data warehouses
  • Reliable: Simple logic, easy to debug
  • Handles updates: Automatically updates existing records
  • Transactional safety: Atomic operation in most warehouses
  • No special features required: Uses basic SQL
❌ Limitations
  • Two operations: DELETE + INSERT (can be slower)
  • No deletion handling: Doesn't remove records deleted in source
  • Lock contention: May cause brief table locks
  • Inefficient for sparse updates: Rewrites entire rows

Advanced Configuration

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    incremental_strategy='delete+insert',
    -- Note: column-level options like merge_exclude_columns apply to the merge strategy, not delete+insert
    -- Limit the DELETE to recently updated rows (prunes the scan)
    incremental_predicates=[
        "DBT_INTERNAL_DEST.last_updated >= current_timestamp - interval '30 days'"
    ]
) }}

-- Example: Only update records that have actually changed
select
    customer_id,
    customer_name,
    email,
    phone,
    last_updated,
    -- Create a hash to detect actual changes
    {{ dbt_utils.generate_surrogate_key([
        'customer_name',
        'email',
        'phone'
    ]) }} as row_hash
from {{ ref('stg_customers') }}

{% if is_incremental() %}
where last_updated > (select max(last_updated) from {{ this }})
    -- Only include records whose content actually changed
    and {{ dbt_utils.generate_surrogate_key([
        'customer_name',
        'email',
        'phone'
    ]) }} not in (
        select row_hash
        from {{ this }}
        where customer_id in (
            select customer_id
            from {{ ref('stg_customers') }}
            where last_updated > (select max(last_updated) from {{ this }})
        )
    )
{% endif %}

⚠️ Important Considerations

  • unique_key is mandatory for updates to work properly
  • NULL values in unique_key will cause issues - always validate
  • Composite keys should be provided as a list: unique_key=['col1', 'col2'] (see the sketch after this list)
  • Performance impact: DELETE operations scan the entire table
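
For example, a minimal sketch of a composite-key configuration with a NULL guard (the model and column names are illustrative):

{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    -- Composite key passed as a list
    unique_key=['order_id', 'order_line_number']
) }}

select
    order_id,
    order_line_number,
    product_id,
    quantity,
    updated_at
from {{ ref('stg_order_lines') }}
-- Guard against NULLs in the key columns; NULL keys break matching on the unique_key
where order_id is not null
    and order_line_number is not null

{% if is_incremental() %}
    and updated_at > (select max(updated_at) from {{ this }})
{% endif %}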

🔀 Merge Strategy

The merge strategy uses database-native MERGE statements for optimal performance, especially with large datasets containing many updates.

Basic Implementation

📋 Code Examples: Basic Merge | Advanced with Deletes

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    incremental_strategy='merge'
) }}

select
    customer_id,
    customer_name,
    email,
    phone,
    subscription_status,
    last_updated
from {{ ref('stg_customers') }}

{% if is_incremental() %}
where last_updated > (select max(last_updated) from {{ this }})
{% endif %}

-- GENERATES A NATIVE MERGE STATEMENT:
-- MERGE target USING source ON target.customer_id = source.customer_id
-- WHEN MATCHED THEN UPDATE SET ...
-- WHEN NOT MATCHED THEN INSERT ...

Advanced Merge with Deletion Support

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    incremental_strategy='merge',
    -- Only update these columns on a match
    -- (alternatively use merge_exclude_columns=['created_at', 'first_seen']; the two options are mutually exclusive)
    merge_update_columns=['customer_name', 'email', 'is_active']
) }}

with source_with_deletes as (

    select
        customer_id,
        customer_name,
        email,
        phone,
        is_active,
        last_updated,
        -- Mark records for deletion
        case when is_active = false then true else false end as _should_delete
    from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    where last_updated > (select max(last_updated) from {{ this }})
        or is_active = false  -- Always include records marked for deletion
    {% endif %}

)

select * from source_with_deletes

-- ADVANCED: Custom merge with deletion (warehouse-specific)
-- BigQuery example:
-- MERGE target T USING source S ON T.customer_id = S.customer_id
-- WHEN MATCHED AND S._should_delete = true THEN DELETE
-- WHEN MATCHED AND S._should_delete = false THEN UPDATE SET ...
-- WHEN NOT MATCHED AND S._should_delete = false THEN INSERT ...

Warehouse-Specific Features

BigQuery

✅ Full MERGE support
✅ DELETE in MERGE
✅ Complex conditions

Snowflake

✅ Full MERGE support
✅ DELETE in MERGE
✅ Multi-table operations

Databricks

✅ Delta Lake MERGE
✅ Time travel
✅ Optimized performance

Redshift

⚠️ Limited MERGE
❌ No DELETE in MERGE
✅ Basic operations

PostgreSQL

❌ No native MERGE
⚠️ Falls back to delete+insert
✅ Works but slower
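
Because support varies, one option is to choose the strategy per adapter at compile time. A minimal sketch using dbt's built-in target variable (the adapter list is an assumption to adjust for your stack):

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    -- Use merge where the adapter supports it well, otherwise fall back
    incremental_strategy='merge' if target.type in ('snowflake', 'bigquery', 'databricks') else 'delete+insert'
) }}

select
    customer_id,
    customer_name,
    email,
    last_updated
from {{ ref('stg_customers') }}

{% if is_incremental() %}
where last_updated > (select max(last_updated) from {{ this }})
{% endif %}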

✅ Advantages
  • Best performance: Single operation vs DELETE+INSERT
  • Atomic operation: True ACID compliance
  • Column-level control: merge_update_columns / merge_exclude_columns limit which columns are overwritten
  • Deletion support: Can handle DELETEs in same operation
  • Less locking: Reduced contention on target table
❌ Limitations
  • Warehouse dependent: Not supported everywhere
  • Complex syntax: Harder to debug when issues arise
  • Feature variations: Different capabilities per warehouse
  • Memory usage: Can require more memory for large merges

➕ Append Strategy

The append strategy simply adds new rows to the target table without any deduplication or updating. Perfect for event logs and time-series data.

Basic Implementation

📋 Code Examples: Basic Append | Event Logging | Safe Patterns

{{ config(
    materialized='incremental',
    incremental_strategy='append'
    -- unique_key is IGNORED with the append strategy
) }}

with new_events as (

    select
        event_id,
        user_id,
        event_type,
        event_timestamp,
        properties
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
    -- Only get events since the last run
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}

)

select * from new_events

-- BEHIND THE SCENES:
-- Simply: INSERT INTO target SELECT * FROM new_events
-- No DELETE operations, no deduplication

Perfect Use Cases

-- Event Logging Table
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

select
    event_id,           -- Naturally unique (UUID, timestamp-based)
    user_id,
    event_type,
    event_timestamp,
    date(event_timestamp) as event_date,
    session_id,
    page_url,
    referrer,
    user_agent,
    properties          -- JSON/variant column
from {{ ref('stg_web_events') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}


-- Time Series Data (IoT Sensors)
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    cluster_by=['sensor_id', 'measurement_timestamp']
) }}

select
    sensor_id,
    measurement_timestamp,
    temperature,
    humidity,
    pressure,
    battery_level
from {{ ref('stg_sensor_readings') }}

{% if is_incremental() %}
where measurement_timestamp > (select max(measurement_timestamp) from {{ this }})
{% endif %}

⚠️ Critical Warning: Duplicate Risk

The append strategy will create duplicates if you have:

  • Records with the same business key but different timestamps
  • Late-arriving data that overlaps with previous runs
  • Source system that updates existing records
  • Backfill scenarios where you reprocess historical data
-- ❌ BAD: This will create duplicates
{{ config(
    materialized='incremental',
    incremental_strategy='append'   -- Wrong choice!
) }}

select
    customer_id,     -- Same customer can appear multiple times
    customer_name,   -- Customer name might get updated
    email,
    last_updated
from customers

-- Result: Multiple rows per customer = data corruption!

Safe Append Patterns

-- Pattern 1: Natural Uniqueness with Immutable Events
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

select
    {{ dbt_utils.generate_surrogate_key([
        'user_id', 'event_timestamp', 'event_type'
    ]) }} as event_id,   -- Generate a unique ID
    user_id,
    event_type,
    event_timestamp,
    properties
from {{ ref('stg_events') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
    -- Safety check to prevent duplicates
    and {{ dbt_utils.generate_surrogate_key([
        'user_id', 'event_timestamp', 'event_type'
    ]) }} not in (select event_id from {{ this }})
{% endif %}


-- Pattern 2: Deduplication Before Append
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    pre_hook="delete from {{ this }} where created_date = current_date"  -- Clean today's data first
) }}

with deduplicated_data as (

    select
        transaction_id,
        customer_id,
        amount,
        transaction_timestamp,
        date(transaction_timestamp) as created_date,
        -- Remove duplicates within the batch
        row_number() over (
            partition by transaction_id
            order by transaction_timestamp desc
        ) as rn
    from {{ ref('stg_transactions') }}

    {% if is_incremental() %}
    where date(transaction_timestamp) >= current_date - interval '1' day
    {% endif %}

)

select * from deduplicated_data
where rn = 1
✅ Advantages
  • Fastest performance: Single INSERT operation
  • Simple logic: Easy to understand and debug
  • No locks: Minimal contention on target table
  • Perfect for events: Natural fit for immutable data
  • Universal support: Works on all databases
❌ Limitations
  • No updates: Cannot modify existing records
  • Duplicate risk: Easy to accidentally create duplicates
  • No deletion: Cannot remove records
  • Limited use cases: Only suitable for append-only data

🔁 Insert Overwrite Strategy

The insert_overwrite strategy completely replaces entire partitions with new data. Perfect for time-partitioned data where you get complete datasets.

Basic Implementation

📋 Code Examples: Basic Implementation | Dynamic Partitions | Hourly Partitioning

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'order_date', 'data_type': 'date'}
) }}

with daily_orders as (

    select
        order_id,
        customer_id,
        order_amount,
        order_status,
        order_timestamp,
        date(order_timestamp) as order_date
    from {{ ref('stg_orders') }}

    {% if is_incremental() %}
    -- Only process recent dates (will overwrite entire partitions)
    where date(order_timestamp) >= current_date - interval '7' day
    {% endif %}

)

select * from daily_orders

-- BEHIND THE SCENES:
-- 1. Identifies which partitions contain new data
-- 2. Drops those entire partitions
-- 3. Inserts all new data for those partitions

Advanced Partition Management

-- Dynamic Partition Overwrite
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'event_date',
        'data_type': 'date',
        'granularity': 'day'
    }
) }}

with processed_events as (

    select
        event_id,
        user_id,
        event_type,
        event_timestamp,
        date(event_timestamp) as event_date,
        _loaded_at,   -- kept so later runs can detect newly loaded partitions
        -- Derived metrics that may change when a partition is reprocessed
        count(*) over (
            partition by user_id, date(event_timestamp)
        ) as daily_event_count,
        lag(event_timestamp) over (
            partition by user_id
            order by event_timestamp
        ) as previous_event_timestamp
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
    -- Dynamically determine which partitions to overwrite
    {% set overwrite_dates_query %}
        select distinct date(event_timestamp) as event_date
        from {{ ref('stg_events') }}
        where _loaded_at > (select max(_loaded_at) from {{ this }})
    {% endset %}
    {% set overwrite_dates = dbt_utils.get_query_results_as_dict(overwrite_dates_query)['event_date'] %}
    where date(event_timestamp) in (
        '{{ overwrite_dates | join("', '") }}'
    )
    {% endif %}

)

select * from processed_events


-- Hourly Partitioning Example
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'event_hour',
        'data_type': 'timestamp',
        'granularity': 'hour'
    }
) }}

select
    event_id,
    user_id,
    event_timestamp,
    timestamp_trunc(event_timestamp, HOUR) as event_hour,
    properties
from {{ ref('stg_realtime_events') }}

{% if is_incremental() %}
where timestamp_trunc(event_timestamp, HOUR) >= timestamp_sub(current_timestamp(), interval 6 hour)
{% endif %}

Perfect Use Cases

Daily Batch Loads

Complete daily files from external systems where you always get the full day's data

-- Daily sales files
where date(sale_timestamp) = current_date - 1

Data Reprocessing

When you need to recalculate derived metrics or fix data quality issues

-- Recalculate aggregations
where date >= '2024-01-01'

Late-Arriving Data

When source systems frequently send updates for past dates

-- Handle late data
where date >= current_date - 30

Warehouse Support

BigQuery

✅ Native support
✅ Date/timestamp partitioning
✅ Automatic partition pruning

Databricks

✅ Delta Lake partitions
✅ Dynamic partition overwrite
✅ OPTIMIZE commands

Spark

✅ Partition overwrite
⚠️ Configuration dependent
✅ Hive-style partitions

Snowflake

❌ No native partition overwrite
⚠️ Falls back to delete+insert
✅ Clustering keys help

Redshift

❌ Limited partition support
⚠️ Falls back to delete+insert
✅ Sort keys help

✅ Advantages
  • Complete refresh: Handles all types of changes (I/U/D)
  • Data consistency: Entire partition is always consistent
  • Handles late data: Perfect for late-arriving updates
  • Reprocessing friendly: Easy to recompute historical data
  • High performance: Leverages partition pruning
❌ Limitations
  • Requires partitioning: Only works with partitioned tables
  • Limited warehouse support: Not available everywhere
  • Coarse granularity: Updates entire partitions
  • Resource intensive: Rewrites large amounts of data

⚡ Microbatch Strategy

The microbatch strategy (dbt 1.9+) processes data in time-based batches automatically, optimizing performance for large time-series datasets.

Basic Implementation

📋 Code Examples: Basic Microbatch | Advanced Configuration | Custom Logic

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='order_timestamp',
    begin='2024-01-01',   -- required: earliest date the model should cover (date is illustrative)
    batch_size='day'
) }}

select
    order_id,
    customer_id,
    order_amount,
    order_status,
    order_timestamp,
    -- Calculations are done per batch
    sum(order_amount) over (
        partition by customer_id, date(order_timestamp)
    ) as daily_customer_total
from {{ ref('stg_orders') }}

-- dbt automatically handles:
-- 1. Splitting data into daily batches
-- 2. Processing each batch incrementally
-- 3. Handling overlapping data between batches
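
Microbatch can also filter upstream refs by their own event_time config; without it, each batch reads the full upstream table. A minimal sketch for the staging model referenced above (the source name is illustrative):

-- models/staging/stg_orders.sql
{{ config(
    materialized='view',
    event_time='order_timestamp'   -- lets microbatch models filter this ref per batch
) }}

select
    order_id,
    customer_id,
    order_amount,
    order_status,
    order_timestamp
from {{ source('shop', 'orders') }}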

Advanced Microbatch Configuration

-- Hourly Microbatches
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_timestamp',
    begin='2024-01-01',   -- required: earliest date the model should cover (date is illustrative)
    batch_size='hour',
    lookback=3            -- reprocess the last 3 batches (hours) to catch late-arriving data
) }}

with event_metrics as (

    select
        event_id,
        user_id,
        event_type,
        event_timestamp,
        -- Session boundaries (note: each batch only sees its own time window plus the lookback)
        lag(event_timestamp, 1, '1900-01-01'::timestamp) over (
            partition by user_id
            order by event_timestamp
        ) as prev_event_timestamp,
        case
            when event_timestamp - lag(event_timestamp) over (
                partition by user_id
                order by event_timestamp
            ) > interval '30 minutes'
            then 1 else 0
        end as new_session_flag
    from {{ ref('stg_events') }}

)

select
    *,
    sum(new_session_flag) over (
        partition by user_id
        order by event_timestamp
        rows unbounded preceding
    ) as session_id
from event_metrics

-- Batch boundaries are controlled entirely by event_time, begin, batch_size, and lookback;
-- there is no per-model hook for custom batch start/end logic.

Microbatch Benefits

Automatic Parallelization

dbt automatically runs batches in parallel when possible, dramatically improving performance

Memory Optimization

Processes smaller chunks of data, reducing memory pressure and avoiding OOM errors

Late Data Handling

Built-in lookback windows automatically reprocess recent batches when late data arrives

Failure Recovery

If a batch fails, only that batch needs to be rerun, not the entire model

✅ Advantages
  • Optimized performance: Automatic parallelization and memory management
  • Built-in late data handling: Configurable lookback windows
  • Failure recovery: Granular retry capabilities
  • Memory efficient: Processes data in smaller chunks
  • Universal support: Works on all warehouses
❌ Limitations
  • dbt 1.9+ only: Requires latest dbt version
  • Time-based data only: Requires event_time column
  • Complex debugging: Multiple batches make troubleshooting harder
  • No cross-batch joins: Limited for complex transformations

🧭 Decision Guide: Which Strategy to Choose

🎯 Strategy Selection Decision Tree

Step 1: Data Type Analysis
  • Immutable events/logs? → Use append
  • Time-partitioned data with complete loads? → Use insert_overwrite
  • Large time-series with updates? → Use microbatch
  • Traditional dimensional/fact tables? → Continue to Step 2
Step 2: Update Pattern Analysis
  • No updates needed (insert-only)? → Use append
  • Frequent updates to existing records? → Continue to Step 3
  • Occasional updates? → Use delete+insert (safe default)
Step 3: Performance & Scale Analysis
  • Small-medium tables (<10M rows)? → Use delete+insert
  • Large tables (>10M rows) with many updates? → Use merge
  • Warehouse doesn't support MERGE? → Use delete+insert
Step 4: Deletion Requirements
  • Need to handle source deletions? → Use merge with delete logic
  • Soft deletes acceptable? → Use any strategy + soft delete pattern (see the sketch after this list)
  • Partition-level deletion? → Use insert_overwrite
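
For the soft-delete path, a minimal sketch using merge: rows the source marks inactive are kept but flagged, so downstream models can filter them out (model and column names are illustrative):

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    incremental_strategy='merge'
) }}

select
    customer_id,
    customer_name,
    email,
    -- Soft delete: flag rows the source marks inactive instead of physically deleting them
    (coalesce(is_active, false) = false) as is_deleted,
    last_updated
from {{ ref('stg_customers') }}

{% if is_incremental() %}
where last_updated > (select max(last_updated) from {{ this }})
{% endif %}

-- Downstream models then filter with: where not is_deleted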

Quick Reference by Use Case

Use Case                    | Recommended Strategy | Alternative   | Key Configuration
Event logging / Clickstream | append               | microbatch    | Partition by date, no unique_key
Customer dimension          | merge                | delete+insert | unique_key='customer_id'
Daily sales aggregates      | insert_overwrite     | merge         | Partition by sale_date
IoT sensor data             | microbatch           | append        | event_time='timestamp', batch_size='hour'
Financial transactions      | append               | microbatch    | Immutable with unique transaction_id
User profile updates        | merge                | delete+insert | unique_key='user_id', exclude created_at
Inventory snapshots         | insert_overwrite     | delete+insert | Partition by snapshot_date
Application logs            | append               | microbatch    | Partition by log_date

Performance Optimization Tips

📋 Code Example: Performance Optimization

-- Universal Performance Tips for All Strategies

-- 1. Always use appropriate filtering
{% if is_incremental() %}
-- Subtract a buffer from the high-water mark so late-arriving data is still picked up
where updated_at > (select max(updated_at) - interval '1 hour' from {{ this }})
{% endif %}

-- 2. Use partitioning when possible
{{ config(
    partition_by={'field': 'created_date', 'data_type': 'date'},
    cluster_by=['status', 'customer_id']   -- Frequently filtered columns
) }}

-- 3. Add data quality checks
select * from source_data
where id is not null            -- Prevent NULL unique_key issues
    and created_at is not null  -- Prevent partition issues
-- Pair with schema tests (e.g. dbt_utils.not_null_proportion) for broader coverage

-- 4. Monitor query performance
{{ config(post_hook=[
    "insert into metadata.model_performance
     select '{{ this }}', current_timestamp, '{{ invocation_id }}'"
]) }}

-- 5. Use appropriate batch sizing
{% if var('full_refresh', false) %}
    -- Full refresh: no filtering
    select * from source_data
{% else %}
    -- Incremental: process the last N days to handle late-arriving data
    select * from source_data
    where created_date >= current_date - {{ var('lookback_days', 3) }}
{% endif %}

🚨 Common Mistakes to Avoid

  • Using append for dimensional data - leads to duplicates
  • Forgetting unique_key - prevents updates from working
  • NULL values in unique_key - causes unpredictable behavior
  • Wrong partition granularity - too fine causes many small files
  • No late data handling - causes data loss
  • Ignoring warehouse capabilities - using unsupported strategies
  • Not testing with real data patterns - surprises in production

🎯 Final Recommendations

For beginners: Start with delete+insert - it's reliable and works everywhere

For performance: Graduate to merge when you have large tables with frequent updates

For event data: Use append but be very careful about uniqueness

For batch loads: insert_overwrite is perfect for complete daily/hourly datasets

For modern stacks: microbatch provides the best balance of performance and reliability

🎓 Summary & Next Steps

Understanding incremental strategies is crucial for building efficient, scalable dbt models. Each strategy has its place:

  • delete+insert: Your reliable default choice
  • merge: The performance champion for updates
  • append: The speed demon for immutable data
  • insert_overwrite: The complete refresh for partitioned data
  • microbatch: The modern solution for large time-series

Practice Path:

  1. Start with delete+insert to learn the concepts
  2. Experiment with append for event data
  3. Try merge when you need better performance
  4. Explore insert_overwrite for time-partitioned data
  5. Test microbatch for large-scale processing

Remember: Choose based on your data patterns, not just performance. The wrong strategy can cause data corruption, while the right strategy makes your pipelines both fast and reliable.

📚 Official dbt Documentation

For the most up-to-date information and comprehensive details, refer to the official dbt documentation:

  • Incremental models: https://docs.getdbt.com/docs/build/incremental-models
  • Incremental strategies: https://docs.getdbt.com/docs/build/incremental-strategy
  • Microbatch: https://docs.getdbt.com/docs/build/incremental-microbatch

These resources are the authoritative source for dbt incremental strategies and are regularly updated by the dbt Labs team.