🚀 Complete Guide to dbt Merge Operations

Mastering Insert, Update, and Delete Patterns for Modern Data Warehouses

🎯 Overview: The Full Merge Challenge

In modern data warehousing, a "full merge" operation needs to handle three types of changes from your source data:

  • INSERT: New records that don't exist in the target
  • UPDATE: Existing records that have changed
  • DELETE: Records that no longer exist in the source

While inserts and updates are straightforward in dbt, deletions present unique performance challenges because they often require full table scans.

🔧 Basic Merge with Incremental Models

Simple Incremental (Insert + Update Only)

-- models/marts/dim_customers.sql
{{ config(
    materialized='incremental',
    unique_key='customer_id',
    merge_exclude_columns=['created_at']
) }}

with source_data as (

    select
        customer_id,
        customer_name,
        email,
        phone,
        address,
        is_active,
        last_updated,
        current_timestamp as created_at
    from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    where last_updated > (select max(last_updated) from {{ this }})
    {% endif %}

)

select * from source_data
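With this model in place, a normal dbt run --select dim_customers picks up only rows newer than the current max(last_updated) and upserts them on customer_id (a MERGE on warehouses that support it). An occasional dbt run --full-refresh --select dim_customers rebuilds the table from scratch, which is also the blunt workaround for the missing deletion handling noted below.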
✅ Pros
  • Simple and reliable
  • Great performance for append-mostly data
  • Easy to debug and monitor
  • Built-in dbt functionality
❌ Cons
  • No deletion handling
  • Deleted records remain forever
  • Can lead to data inconsistency

⚡ Deletion Strategies (Performance-Focused)

1. CDC (Change Data Capture)

Performance: ⭐⭐⭐⭐⭐

Captures actual database changes in real-time, providing INSERT/UPDATE/DELETE operations without full table scans.

-- Only process actual changes
with cdc_changes as (

    select *
    from {{ ref('stg_customers_cdc') }}
    where operation_type in ('INSERT', 'UPDATE', 'DELETE')
      and operation_timestamp > {{ var('last_cdc_timestamp') }}

)

select * from cdc_changes
where operation_type != 'DELETE'
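The query above only carries inserts and updates forward; the DELETE rows still need to be applied to the target in a separate step. A minimal sketch, assuming stg_customers_cdc exposes operation_type and that this statement is wired up as a post_hook on the incremental model:

-- Remove records the CDC feed marked as deleted (e.g. run as a post-hook)
delete from {{ this }}
where customer_id in (
    select customer_id
    from {{ ref('stg_customers_cdc') }}
    where operation_type = 'DELETE'
)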

2. Partitioned Soft Deletes

Performance: ⭐⭐⭐⭐

Mark records as inactive instead of deleting them. Use partitioning and clustering for efficient filtering.

{{ config(
    materialized='incremental',
    partition_by={'field': 'last_updated_date', 'data_type': 'date'},
    cluster_by=['is_active', 'customer_id']
) }}

-- Assumes stg_customers exposes last_updated_date for the partition field
select
    *,
    case
        when is_active = false then current_timestamp
        else null
    end as deleted_at
from {{ ref('stg_customers') }}
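Because soft-deleted rows stay in the table, downstream queries should filter on is_active and, where possible, on the partition column so the warehouse can prune partitions instead of scanning history. A hypothetical consumer, using BigQuery-style date functions as in the other examples and assuming the model above is named dim_customers:

-- Clustering on is_active plus partition pruning on last_updated_date
-- keeps this query cheap even as soft-deleted rows accumulate
select customer_id, customer_name, email
from {{ ref('dim_customers') }}
where is_active = true
  and last_updated_date >= date_sub(current_date, interval 30 day)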

3. Incremental Snapshots

Performance: ⭐⭐⭐

Use dbt's snapshot functionality to track changes over time, including deletions.

{% snapshot snap_customers %}

{{ config(
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='last_updated',
    invalidate_hard_deletes=true
) }}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}
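Snapshots are built with dbt snapshot rather than dbt run. Because invalidate_hard_deletes=true closes out rows that disappear from the source, the current state of the dimension is simply the rows whose validity window is still open:

-- Downstream model: current (non-deleted) customers only
select *
from {{ ref('snap_customers') }}
where dbt_valid_to is null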

4. Batch Window Processing

Performance: ⭐⭐⭐

Check for deletions periodically (e.g., weekly) rather than on every run to balance performance and freshness.

{% if should_check_deletions() %}
-- Only check deletions on Sundays or when forced
-- (should_check_deletions() is a project macro; a sketch follows below)
-- Note: "not in" silently matches nothing if any source customer_id is NULL
delete from {{ this }}
where customer_id not in (
    select customer_id from {{ ref('stg_customers') }}
)
{% endif %}
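should_check_deletions() is not a dbt built-in; one way to implement it (a sketch, assuming the macro name above and a force_deletion_check variable) is to key off the weekday and an override var:

-- macros/should_check_deletions.sql (illustrative)
{% macro should_check_deletions() %}
    {# True on Sundays, or when forced via:
       dbt run --vars '{force_deletion_check: true}' #}
    {{ return(
        var('force_deletion_check', false)
        or modules.datetime.datetime.now().weekday() == 6
    ) }}
{% endmacro %}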

5. Delta Tables / Lakehouse

Performance: ⭐⭐⭐⭐⭐

Use Delta Lake or similar technologies that provide native MERGE support with optimized deletion handling.

-- unique_key is needed for the merge strategy to update existing rows
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge',
    unique_key='customer_id'
) }}

-- Delta tables handle merge operations efficiently
select * from {{ ref('stg_customers') }}
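On Databricks, the same model can also schedule Delta maintenance after each run; a sketch (the hooks and the retention window are assumptions, not requirements):

-- OPTIMIZE compacts small files; VACUUM drops data files older than the retention window
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge',
    unique_key='customer_id',
    post_hook=[
        "OPTIMIZE {{ this }}",
        "VACUUM {{ this }} RETAIN 168 HOURS"
    ]
) }}

select * from {{ ref('stg_customers') }}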

6. Custom Materialization

Performance: ⭐⭐⭐⭐

Build a custom dbt materialization that implements database-specific MERGE statements.

{% materialization merge, default %}
-- Custom MERGE implementation (skeleton only: a real materialization must also
-- build the target relation, wrap the DML in a statement call, and return the relations)
MERGE INTO {{ target_relation }} as target
USING ({{ sql }}) as source
    ON target.{{ unique_key }} = source.{{ unique_key }}
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
WHEN NOT MATCHED BY SOURCE THEN DELETE
{% endmaterialization %}
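For reference, the rendered statement for the customers example might look like the following (column list, schema names, and dialect are illustrative; support for WHEN NOT MATCHED BY SOURCE varies by warehouse):

MERGE INTO analytics.dim_customers AS target
USING (select * from analytics.stg_customers) AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name = source.customer_name,
    email         = source.email,
    is_active     = source.is_active
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, email, is_active)
    VALUES (source.customer_id, source.customer_name, source.email, source.is_active)
WHEN NOT MATCHED BY SOURCE THEN DELETE;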

📊 Performance Comparison

Strategy | Performance | Complexity | Cost | Best Use Case
CDC | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Real-time requirements, transactional sources
Delta Tables | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Databricks/Delta Lake environments
Partitioned Soft Delete | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | Most common scenario, good balance
Incremental Snapshot | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | Need historical tracking
Batch Window | ⭐⭐⭐ | | ⭐⭐⭐⭐ | Budget-conscious, weekly cleanup OK
Full Scan Hard Delete | | | | Small tables only (<1M rows)

🎓 Advanced Merge Patterns

Complete Example: Production-Ready Soft Delete Pattern

-- models/marts/dim_customers_production.sql
-- Data quality / audit hooks are attached via pre_hook and post_hook
{{ config(
    materialized='incremental',
    unique_key='customer_id',
    partition_by={
        'field': 'last_updated_date',
        'data_type': 'date'
    },
    cluster_by=['is_active', 'customer_segment', 'customer_id'],
    pre_hook="{{ audit_table_changes(this) }}",
    post_hook=["{{ data_quality_checks() }}", "{{ update_data_lineage() }}"]
) }}

with source_data as (

    select
        customer_id,
        customer_name,
        email,
        phone,
        address,
        customer_segment,
        is_active,
        last_updated,
        date(last_updated) as last_updated_date,

        -- Deletion tracking
        case
            when is_active = false
                 and lag(is_active) over (partition by customer_id order by last_updated) = true
            then current_timestamp
            else null
        end as deleted_at,

        -- Change tracking
        case
            when lag(customer_name) over (partition by customer_id order by last_updated) != customer_name
                then 'name_changed'
            when lag(email) over (partition by customer_id order by last_updated) != email
                then 'email_changed'
            else 'no_change'
        end as change_type,

        current_timestamp as dbt_updated_at
    from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    where date(last_updated) >= date_sub(
        (select max(last_updated_date) from {{ this }}),
        interval {{ var('lookback_days', 3) }} day
    )
    {% endif %}

),

-- Data quality filtering
quality_checked as (

    select *
    from source_data
    where customer_id is not null
      and customer_name is not null
      and email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'  -- Basic email validation (Postgres-style regex operator)

)

select * from quality_checked

Supporting Macros and Views

-- macros/audit_helpers.sql
{% macro audit_table_changes(target_table) %}
    {% if target_table %}
        insert into audit.table_changes (
            table_name,
            run_timestamp,
            rows_before,
            operation_type
        )
        select
            '{{ target_table }}',
            current_timestamp,
            count(*),
            'pre_run'
        from {{ target_table }}
    {% endif %}
{% endmacro %}

-- models/marts/dim_customers_active_only.sql
-- Optimized view for most common queries
select
    customer_id,
    customer_name,
    email,
    phone,
    address,
    customer_segment,
    last_updated
from {{ ref('dim_customers_production') }}
where is_active = true
  and deleted_at is null
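The production model's post_hook also calls data_quality_checks(); what that macro does is project-specific, but a minimal sketch (the audit.data_quality_results table and the single duplicate-key check are assumptions) could look like:

-- macros/data_quality_checks.sql (illustrative)
{% macro data_quality_checks() %}
    insert into audit.data_quality_results (
        table_name, check_name, failing_rows, run_timestamp
    )
    select
        '{{ this }}',
        'duplicate_customer_id',
        count(*) - count(distinct customer_id),
        current_timestamp
    from {{ this }}
{% endmacro %}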

🎯 Best Practice Recommendations

🚀 For a Microsoft SQL Server → Modern Stack Migration

Phase 1: Start Simple

  • Begin with Partitioned Soft Deletes - familiar pattern from SQL Server
  • Use date partitioning (similar to SQL Server partitioned tables)
  • Implement proper clustering on filter columns

Phase 2: Add Sophistication

  • Implement CDC using tools like Fivetran, Stitch, or Azure Data Factory
  • Add automated data quality checks
  • Create monitoring and alerting

Phase 3: Optimize for Scale

  • Move to Delta Lake or similar lakehouse architecture
  • Implement advanced performance tuning
  • Add cost optimization strategies

Performance Optimization Checklist

✅ Partitioning Strategy
  • Partition by date columns when possible
  • Consider partition size (aim for 1GB+ per partition)
  • Use partition pruning in WHERE clauses

✅ Clustering Strategy
  • Cluster on frequently filtered columns
  • Put is_active first if using soft deletes
  • Consider cardinality of cluster columns

✅ Query Optimization
  • Create filtered views for common access patterns
  • Use column pruning (select only needed columns)
  • Leverage incremental processing

✅ Monitoring & Maintenance
  • Track query performance over time
  • Monitor storage costs and growth
  • Set up alerts for failed runs or quality issues
  • Regularly clean up old soft-deleted records

Common Pitfalls to Avoid

  • NULL handling: Always validate that unique_key columns aren't NULL (a guard is sketched after this list)
  • Timezone issues: Use consistent timezone handling across all timestamp comparisons
  • Testing deletions: Always test delete logic with real data scenarios
  • Performance monitoring: Track query costs and execution times
  • Data quality: Implement checks before and after merge operations
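For the NULL-handling pitfall above, a cheap guard is to filter out (or fail on) NULL keys before the merge ever runs; a minimal sketch against the staging model:

-- Rows with a NULL unique_key never match on merge and duplicate on every run
select *
from {{ ref('stg_customers') }}
where customer_id is not null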

When to Use Each Strategy

  • Small tables (<1M rows): simple incremental with occasional full refresh
  • Medium tables (1M-100M rows): partitioned soft deletes with weekly cleanup
  • Large tables (>100M rows): CDC or Delta Lake for optimal performance
  • Real-time requirements: CDC with streaming ingestion
  • Historical tracking needed: dbt snapshots or SCD Type 2 patterns
  • Cost-sensitive projects: batch window processing with soft deletes

🎓 Next Steps for Mastery

Now that you understand the fundamentals of merge operations in dbt:

  1. Practice: Implement the partitioned soft delete pattern first
  2. Experiment: Try different strategies with your data
  3. Monitor: Set up performance tracking and cost monitoring
  4. Optimize: Tune based on your specific data patterns and requirements
  5. Scale: Graduate to more advanced patterns as your needs evolve

Remember: Start simple, measure performance, and evolve based on actual needs rather than theoretical requirements.

📚 Official dbt Documentation

For the most up-to-date and comprehensive details, refer to the official dbt documentation on incremental models, incremental strategies, and snapshots; it is the authoritative source for dbt merge behavior and is regularly updated by the dbt Labs team.