🚀 Complete Guide to dbt Merge Operations

Mastering Insert, Update, and Delete Patterns for Modern Data Warehouses

🎯 Overview: The Full Merge Challenge

In modern data warehousing, a "full merge" operation needs to handle three types of changes from your source data:

  • INSERT: New records that don't exist in the target
  • UPDATE: Existing records that have changed
  • DELETE: Records that no longer exist in the source

While inserts and updates are straightforward in dbt, deletions present unique performance challenges because they often require full table scans.

🔧 Basic Merge with Incremental Models

Simple Incremental (Insert + Update Only)

-- models/marts/dim_customers.sql
{{ config(
    materialized='incremental',
    unique_key='customer_id',
    merge_exclude_columns=['created_at']
) }}

with source_data as (

    select
        customer_id,
        customer_name,
        email,
        phone,
        address,
        is_active,
        last_updated,
        current_timestamp as created_at
    from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    where last_updated > (select max(last_updated) from {{ this }})
    {% endif %}

)

select * from source_data
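With this model in place, a normal dbt run --select dim_customers picks up only rows newer than the current max(last_updated) and upserts them on customer_id (a MERGE on warehouses that support it). An occasional dbt run --full-refresh --select dim_customers rebuilds the table from scratch, which is also the blunt workaround for the missing deletion handling noted below.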
✅ Pros
  • Simple and reliable
  • Great performance for append-mostly data
  • Easy to debug and monitor
  • Built-in dbt functionality
❌ Cons
  • No deletion handling
  • Deleted records remain forever
  • Can lead to data inconsistency

⚡ Deletion Strategies (Performance-Focused)

1. CDC (Change Data Capture)

Performance: ⭐⭐⭐⭐⭐

Captures actual database changes in real-time, providing INSERT/UPDATE/DELETE operations without full table scans.

-- Only process actual changes
with cdc_changes as (

    select *
    from {{ ref('stg_customers_cdc') }}
    where operation_type in ('INSERT', 'UPDATE', 'DELETE')
      and operation_timestamp > {{ var('last_cdc_timestamp') }}

)

select * from cdc_changes
where operation_type != 'DELETE'
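The query above only carries inserts and updates forward; the DELETE rows still need to be applied to the target in a separate step. A minimal sketch, assuming stg_customers_cdc exposes operation_type and that this statement is wired up as a post_hook on the incremental model:

-- Remove records the CDC feed marked as deleted (e.g. run as a post-hook)
delete from {{ this }}
where customer_id in (
    select customer_id
    from {{ ref('stg_customers_cdc') }}
    where operation_type = 'DELETE'
)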

2. Partitioned Soft Deletes

Performance: ⭐⭐⭐⭐

Mark records as inactive instead of deleting them. Use partitioning and clustering for efficient filtering.

{{ config(
    materialized='incremental',
    partition_by={'field': 'last_updated_date', 'data_type': 'date'},
    cluster_by=['is_active', 'customer_id']
) }}

-- Assumes stg_customers exposes last_updated_date for the partition field
select
    *,
    case
        when is_active = false then current_timestamp
        else null
    end as deleted_at
from {{ ref('stg_customers') }}
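Because soft-deleted rows stay in the table, downstream queries should filter on is_active and, where possible, on the partition column so the warehouse can prune partitions instead of scanning history. A hypothetical consumer, using BigQuery-style date functions as in the other examples and assuming the model above is named dim_customers:

-- Clustering on is_active plus partition pruning on last_updated_date
-- keeps this query cheap even as soft-deleted rows accumulate
select customer_id, customer_name, email
from {{ ref('dim_customers') }}
where is_active = true
  and last_updated_date >= date_sub(current_date, interval 30 day)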

3. Incremental Snapshots

Performance: ⭐⭐⭐

Use dbt's snapshot functionality to track changes over time, including deletions.

{% snapshot snap_customers %}

{{ config(
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='last_updated',
    invalidate_hard_deletes=true
) }}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}
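Snapshots are built with dbt snapshot rather than dbt run. Because invalidate_hard_deletes=true closes out rows that disappear from the source, the current state of the dimension is simply the rows whose validity window is still open:

-- Downstream model: current (non-deleted) customers only
select *
from {{ ref('snap_customers') }}
where dbt_valid_to is null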

4. Batch Window Processing

Performance: ⭐⭐⭐

Check for deletions periodically (e.g., weekly) rather than on every run to balance performance and freshness.

{% if should_check_deletions() %}
-- Only check deletions on Sundays or when forced
-- (should_check_deletions() is a project macro; a sketch follows below)
-- Note: "not in" silently matches nothing if any source customer_id is NULL
delete from {{ this }}
where customer_id not in (
    select customer_id from {{ ref('stg_customers') }}
)
{% endif %}
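should_check_deletions() is not a dbt built-in; one way to implement it (a sketch, assuming the macro name above and a force_deletion_check variable) is to key off the weekday and an override var:

-- macros/should_check_deletions.sql (illustrative)
{% macro should_check_deletions() %}
    {# True on Sundays, or when forced via:
       dbt run --vars '{force_deletion_check: true}' #}
    {{ return(
        var('force_deletion_check', false)
        or modules.datetime.datetime.now().weekday() == 6
    ) }}
{% endmacro %}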

5. Delta Tables / Lakehouse

Performance: ⭐⭐⭐⭐⭐

Use Delta Lake or similar technologies that provide native MERGE support with optimized deletion handling.

-- unique_key is needed for the merge strategy to update existing rows
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge',
    unique_key='customer_id'
) }}

-- Delta tables handle merge operations efficiently
select * from {{ ref('stg_customers') }}
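On Databricks, the same model can also schedule Delta maintenance after each run; a sketch (the hooks and the retention window are assumptions, not requirements):

-- OPTIMIZE compacts small files; VACUUM drops data files older than the retention window
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge',
    unique_key='customer_id',
    post_hook=[
        "OPTIMIZE {{ this }}",
        "VACUUM {{ this }} RETAIN 168 HOURS"
    ]
) }}

select * from {{ ref('stg_customers') }}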

6. Custom Materialization

Performance: ⭐⭐⭐⭐

Build a custom dbt materialization that implements database-specific MERGE statements.

{% materialization merge, default %}
-- Custom MERGE implementation (skeleton only: a real materialization must also
-- build the target relation, wrap the DML in a statement call, and return the relations)
MERGE INTO {{ target_relation }} as target
USING ({{ sql }}) as source
    ON target.{{ unique_key }} = source.{{ unique_key }}
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
WHEN NOT MATCHED BY SOURCE THEN DELETE
{% endmaterialization %}
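For reference, the rendered statement for the customers example might look like the following (column list, schema names, and dialect are illustrative; support for WHEN NOT MATCHED BY SOURCE varies by warehouse):

MERGE INTO analytics.dim_customers AS target
USING (select * from analytics.stg_customers) AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name = source.customer_name,
    email         = source.email,
    is_active     = source.is_active
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, email, is_active)
    VALUES (source.customer_id, source.customer_name, source.email, source.is_active)
WHEN NOT MATCHED BY SOURCE THEN DELETE;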

📊 Performance Comparison

Strategy | Performance | Complexity | Cost | Best Use Case
CDC | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Real-time requirements, transactional sources
Delta Tables | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Databricks/Delta Lake environments
Partitioned Soft Delete | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | Most common scenario, good balance
Incremental Snapshot | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | Need historical tracking
Batch Window | ⭐⭐⭐ | | ⭐⭐⭐⭐ | Budget-conscious, weekly cleanup OK
Full Scan Hard Delete | | | | Small tables only (<1M rows)

🎓 Advanced Merge Patterns

Complete Example: Production-Ready Soft Delete Pattern

-- models/marts/dim_customers_production.sql
-- Data quality / audit hooks are attached via pre_hook and post_hook
{{ config(
    materialized='incremental',
    unique_key='customer_id',
    partition_by={
        'field': 'last_updated_date',
        'data_type': 'date'
    },
    cluster_by=['is_active', 'customer_segment', 'customer_id'],
    pre_hook="{{ audit_table_changes(this) }}",
    post_hook=["{{ data_quality_checks() }}", "{{ update_data_lineage() }}"]
) }}

with source_data as (

    select
        customer_id,
        customer_name,
        email,
        phone,
        address,
        customer_segment,
        is_active,
        last_updated,
        date(last_updated) as last_updated_date,

        -- Deletion tracking
        case
            when is_active = false
                 and lag(is_active) over (partition by customer_id order by last_updated) = true
            then current_timestamp
            else null
        end as deleted_at,

        -- Change tracking
        case
            when lag(customer_name) over (partition by customer_id order by last_updated) != customer_name
                then 'name_changed'
            when lag(email) over (partition by customer_id order by last_updated) != email
                then 'email_changed'
            else 'no_change'
        end as change_type,

        current_timestamp as dbt_updated_at
    from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    where date(last_updated) >= date_sub(
        (select max(last_updated_date) from {{ this }}),
        interval {{ var('lookback_days', 3) }} day
    )
    {% endif %}

),

-- Data quality filtering
quality_checked as (

    select *
    from source_data
    where customer_id is not null
      and customer_name is not null
      and email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'  -- Basic email validation (Postgres-style regex operator)

)

select * from quality_checked

Supporting Macros and Views

-- macros/audit_helpers.sql
{% macro audit_table_changes(target_table) %}
    {% if target_table %}
        insert into audit.table_changes (
            table_name,
            run_timestamp,
            rows_before,
            operation_type
        )
        select
            '{{ target_table }}',
            current_timestamp,
            count(*),
            'pre_run'
        from {{ target_table }}
    {% endif %}
{% endmacro %}

-- models/marts/dim_customers_active_only.sql
-- Optimized view for most common queries
select
    customer_id,
    customer_name,
    email,
    phone,
    address,
    customer_segment,
    last_updated
from {{ ref('dim_customers_production') }}
where is_active = true
  and deleted_at is null
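The production model's post_hook also calls data_quality_checks(); what that macro does is project-specific, but a minimal sketch (the audit.data_quality_results table and the single duplicate-key check are assumptions) could look like:

-- macros/data_quality_checks.sql (illustrative)
{% macro data_quality_checks() %}
    insert into audit.data_quality_results (
        table_name, check_name, failing_rows, run_timestamp
    )
    select
        '{{ this }}',
        'duplicate_customer_id',
        count(*) - count(distinct customer_id),
        current_timestamp
    from {{ this }}
{% endmacro %}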

🎯 Best Practice Recommendations

🚀 For a Microsoft SQL Server → Modern Stack Migration

Phase 1: Start Simple

  • Begin with Partitioned Soft Deletes - familiar pattern from SQL Server
  • Use date partitioning (similar to SQL Server partitioned tables)
  • Implement proper clustering on filter columns

Phase 2: Add Sophistication

  • Implement CDC using tools like Fivetran, Stitch, or Azure Data Factory
  • Add automated data quality checks
  • Create monitoring and alerting

Phase 3: Optimize for Scale

  • Move to Delta Lake or similar lakehouse architecture
  • Implement advanced performance tuning
  • Add cost optimization strategies

Performance Optimization Checklist

✅ Partitioning Strategy
  • Partition by date columns when possible
  • Consider partition size (aim for 1GB+ per partition)
  • Use partition pruning in WHERE clauses

✅ Clustering Strategy
  • Cluster on frequently filtered columns
  • Put is_active first if using soft deletes
  • Consider cardinality of cluster columns

✅ Query Optimization
  • Create filtered views for common access patterns
  • Use column pruning (select only needed columns)
  • Leverage incremental processing

✅ Monitoring & Maintenance
  • Track query performance over time
  • Monitor storage costs and growth
  • Set up alerts for failed runs or quality issues
  • Regularly clean up old soft-deleted records

Common Pitfalls to Avoid

  • NULL handling: Always validate that unique_key columns aren't NULL (a guard is sketched after this list)
  • Timezone issues: Use consistent timezone handling across all timestamp comparisons
  • Testing deletions: Always test delete logic with real data scenarios
  • Performance monitoring: Track query costs and execution times
  • Data quality: Implement checks before and after merge operations
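For the NULL-handling pitfall above, a cheap guard is to filter out (or fail on) NULL keys before the merge ever runs; a minimal sketch against the staging model:

-- Rows with a NULL unique_key never match on merge and duplicate on every run
select *
from {{ ref('stg_customers') }}
where customer_id is not null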

When to Use Each Strategy

  • Small tables (<1M rows): simple incremental with occasional full refresh
  • Medium tables (1M-100M rows): partitioned soft deletes with weekly cleanup
  • Large tables (>100M rows): CDC or Delta Lake for optimal performance
  • Real-time requirements: CDC with streaming ingestion
  • Historical tracking needed: dbt snapshots or SCD Type 2 patterns
  • Cost-sensitive projects: batch window processing with soft deletes

🎓 Next Steps for Mastery

Now that you understand the fundamentals of merge operations in dbt:

  1. Practice: Implement the partitioned soft delete pattern first
  2. Experiment: Try different strategies with your data
  3. Monitor: Set up performance tracking and cost monitoring
  4. Optimize: Tune based on your specific data patterns and requirements
  5. Scale: Graduate to more advanced patterns as your needs evolve

Remember: Start simple, measure performance, and evolve based on actual needs rather than theoretical requirements.

📚 Official dbt Documentation

For the most up-to-date and comprehensive details, refer to the official dbt documentation on incremental models, incremental strategies, and snapshots; it is the authoritative source for dbt merge behavior and is regularly updated by the dbt Labs team.