🎯 Overview: The Full Merge Challenge
In modern data warehousing, a "full merge" operation needs to handle three types of changes from your source data:
- INSERT: New records that don't exist in the target
- UPDATE: Existing records that have changed
- DELETE: Records that no longer exist in the source
While inserts and updates are straightforward with dbt's incremental models, deletions present a real performance challenge: detecting rows that have disappeared from the source typically means comparing the entire target table against the entire source, i.e. a full table scan on every run.
🔧 Basic Merge with Incremental Models
Simple Incremental (Insert + Update Only)
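A minimal sketch of this pattern is below. The model, source, and column names (`fct_orders`, `stg_orders`, `updated_at`) are placeholders for your own project; the configs themselves are standard dbt.

```sql
-- models/fct_orders.sql
-- Minimal incremental model: inserts new rows and updates changed ones,
-- but never removes rows that disappear from the source.
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge'
    )
}}

select
    order_id,
    customer_id,
    order_status,
    order_total,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Only pick up rows changed since the last successful run of this model
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```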
✅ Pros
- Simple and reliable
- Great performance for append-mostly data
- Easy to debug and monitor
- Built-in dbt functionality
❌ Cons
- No deletion handling
- Deleted records remain forever
- Target gradually drifts out of sync with the source
⚡ Deletion Strategies (Performance-Focused)
1. CDC (Change Data Capture)
Performance: ⭐⭐⭐⭐⭐
Captures actual database changes in real-time, providing INSERT/UPDATE/DELETE operations without full table scans.
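As a sketch, if your ingestion tool lands a change log with an operation column, deletions arrive as ordinary rows and can be applied without scanning the source. The column name `_cdc_operation`, its `I`/`U`/`D` values, the `_cdc_loaded_at` timestamp, and the staging model name are assumptions that vary by CDC tool.

```sql
-- models/dim_customers.sql
-- Sketch: consume a CDC change log and propagate deletes as a soft-delete flag.
{{
    config(
        materialized='incremental',
        unique_key='customer_id',
        incremental_strategy='merge'
    )
}}

select
    customer_id,
    customer_name,
    email,
    (_cdc_operation = 'D') as is_deleted,   -- deletes arrive as rows, no scan needed
    _cdc_loaded_at as updated_at
from {{ ref('stg_customers_cdc') }}

{% if is_incremental() %}
  where _cdc_loaded_at > (select max(updated_at) from {{ this }})
{% endif %}
```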
2. Partitioned Soft Deletes
Performance: ⭐⭐⭐⭐
Mark records as inactive instead of deleting them. Use partitioning and clustering for efficient filtering.
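A quick sketch of the shape, using BigQuery-style partition config (column names are placeholders); a fuller version of this pattern appears under Advanced Merge Patterns below.

```sql
-- models/fct_events.sql
-- Sketch: keep deleted rows, flag them, and partition/cluster so that
-- "active rows only" filters prune partitions instead of scanning everything.
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        partition_by={'field': 'event_date', 'data_type': 'date'},
        cluster_by=['is_deleted', 'event_id']
    )
}}

select
    event_id,
    event_date,
    payload,
    false as is_deleted,   -- a reconciliation step later flips this to true
    updated_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```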
3. Incremental Snapshots
Performance: ⭐⭐⭐
Use dbt's snapshot functionality to track changes over time, including deletions.
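A sketch using dbt's classic snapshot block; `invalidate_hard_deletes=True` closes out records that disappear from the source (newer dbt versions express the same idea with the `hard_deletes` config). The source name is an assumption.

```sql
-- snapshots/orders_snapshot.sql
-- Sketch: track INSERT/UPDATE/DELETE history with a dbt snapshot.
-- invalidate_hard_deletes ends the validity window of rows that vanish from the source.
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ source('erp', 'orders') }}

{% endsnapshot %}
```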
4. Batch Window Processing
Performance: ⭐⭐⭐
Check for deletions periodically (e.g., weekly) rather than on every run to balance performance and freshness.
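One hedged way to express this in dbt is to gate the deletion check behind a variable, so regular runs skip it and a scheduled weekly job enables it. The variable name `reconcile_deletes` and the model/column names are assumptions.

```sql
-- models/dim_products.sql
-- Sketch: the expensive deletion scan runs only when the weekly job passes
-- --vars '{reconcile_deletes: true}'; daily runs skip it entirely.
{{
    config(
        materialized='incremental',
        unique_key='product_id'
    )
}}

select
    product_id,
    product_name,
    false as is_deleted,
    updated_at
from {{ ref('stg_products') }}

{% if is_incremental() and var('reconcile_deletes', false) %}
union all

-- Flag target rows that no longer exist in the source
select
    t.product_id,
    t.product_name,
    true as is_deleted,
    current_timestamp() as updated_at
from {{ this }} t
where not exists (
    select 1 from {{ ref('stg_products') }} s
    where s.product_id = t.product_id
)
{% endif %}
```

A weekly job would then invoke something like `dbt run --select dim_products --vars '{reconcile_deletes: true}'`.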
5. Delta Tables / Lakehouse
Performance: ⭐⭐⭐⭐⭐
Use Delta Lake or similar technologies that provide native MERGE support with optimized deletion handling.
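On Databricks, dbt's merge strategy compiles to a native MERGE over the Delta table. A minimal config sketch (model and column names are placeholders):

```sql
-- models/fct_sales.sql (Databricks / Delta Lake)
-- Sketch: native MERGE on a Delta table via dbt's merge incremental strategy.
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='sale_id',
        file_format='delta'
    )
}}

select
    sale_id,
    store_id,
    amount,
    updated_at
from {{ ref('stg_sales') }}

{% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```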
6. Custom Materialization
Performance: ⭐⭐⭐⭐
Build a custom dbt materialization that compiles to a database-specific MERGE statement.
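As a sketch of what such a materialization might emit, here is a database-level MERGE that covers insert, update, and delete in a single pass. This uses Snowflake syntax, assumes the source carries a deletion flag (for example fed by CDC), and all table and column names are placeholders.

```sql
-- Sketch of the statement a custom materialization could generate (Snowflake syntax).
merge into analytics.dim_customers as target
using staging.customers as source
    on target.customer_id = source.customer_id
when matched and source.is_deleted then
    delete
when matched then
    update set
        customer_name = source.customer_name,
        email         = source.email,
        updated_at    = source.updated_at
when not matched and not source.is_deleted then
    insert (customer_id, customer_name, email, updated_at)
    values (source.customer_id, source.customer_name, source.email, source.updated_at)
```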
📊 Performance Comparison
More stars are better in every column; for Cost Efficiency, more stars mean lower running cost.

| Strategy | Performance | Complexity | Cost Efficiency | Best Use Case |
|---|---|---|---|---|
| CDC | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Real-time requirements, transactional sources |
| Delta Tables | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Databricks/Delta Lake environments |
| Partitioned Soft Delete | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | Most common scenario, good balance |
| Incremental Snapshot | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | Need historical tracking |
| Batch Window | ⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ | Budget-conscious, weekly cleanup OK |
| Full Scan Hard Delete | ⭐ | ⭐ | ⭐ | Small tables only (<1M rows) |
🎓 Advanced Merge Patterns
Complete Example: Production-Ready Soft Delete Pattern
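A hedged sketch of how a production-hardened version of the soft delete pattern often looks: deduplicated source rows, a configurable lookback window for detecting deletions, and audit columns. BigQuery-style syntax; the names `stg_orders`, `deletion_lookback_days`, and the column list are assumptions.

```sql
-- models/fct_orders_soft_delete.sql
-- Sketch of a production-style soft-delete incremental model.
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        partition_by={'field': 'order_date', 'data_type': 'date'},
        cluster_by=['is_deleted']
    )
}}

with source_rows as (
    -- Deduplicate on the merge key so the MERGE is deterministic
    select *
    from {{ ref('stg_orders') }}
    qualify row_number() over (
        partition by order_id order by updated_at desc
    ) = 1
),

upserts as (
    select
        order_id,
        order_date,
        customer_id,
        order_total,
        false as is_deleted,
        updated_at,
        current_timestamp() as dbt_loaded_at
    from source_rows
)

select * from upserts

{% if is_incremental() %}
union all

-- Soft-delete reconciliation: rows present in the target but missing from
-- the source, limited to a recent partition window to avoid full scans.
select
    t.order_id,
    t.order_date,
    t.customer_id,
    t.order_total,
    true as is_deleted,
    current_timestamp() as updated_at,
    current_timestamp() as dbt_loaded_at
from {{ this }} t
left join source_rows s on s.order_id = t.order_id
where s.order_id is null
  and t.is_deleted = false
  and t.order_date >= date_sub(current_date(), interval {{ var('deletion_lookback_days', 30) }} day)
{% endif %}
```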
Supporting Macros and Views
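A sketch of the supporting pieces: a reusable macro for the "active rows" predicate and a thin view that hides soft-deleted rows from consumers. The macro, model, and relation names are assumptions.

```sql
-- macros/active_only.sql
-- Reusable predicate so every consumer filters soft-deleted rows the same way.
{% macro active_only(relation_alias='') %}
    {%- set prefix = relation_alias ~ '.' if relation_alias else '' -%}
    {{ prefix }}is_deleted = false
{% endmacro %}
```

```sql
-- models/fct_orders_current.sql
-- Thin view exposing only active (non-deleted) rows to downstream users.
{{ config(materialized='view') }}

select *
from {{ ref('fct_orders_soft_delete') }}
where {{ active_only() }}
```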
🎯 Best Practice Recommendations
🚀 For Microsoft → Modern Stack Migration
Phase 1: Start Simple
- Begin with Partitioned Soft Deletes - familiar pattern from SQL Server
- Use date partitioning (similar to SQL Server partitioned tables)
- Implement proper clustering on filter columns
Phase 2: Add Sophistication
- Implement CDC using tools like Fivetran, Stitch, or Azure Data Factory
- Add automated data quality checks
- Create monitoring and alerting
Phase 3: Optimize for Scale
- Move to Delta Lake or similar lakehouse architecture
- Implement advanced performance tuning
- Add cost optimization strategies
Performance Optimization Checklist
Common Pitfalls to Avoid
- NULL handling: Always validate that unique_key columns are never NULL (see the test sketch after this list)
- Timezone issues: Use consistent timezone handling across all timestamp comparisons
- Testing deletions: Always test delete logic with real data scenarios
- Performance monitoring: Track query costs and execution times
- Data quality: Implement checks before and after merge operations
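As one hedged example for the first pitfall, a singular dbt test that fails if any merge-key values are NULL or duplicated before the merge runs; the model and column names are placeholders.

```sql
-- tests/assert_stg_orders_merge_key_valid.sql
-- Fails if the merge key is NULL or duplicated; either condition makes a
-- merge on unique_key non-deterministic.
select
    order_id,
    count(*) as n_rows
from {{ ref('stg_orders') }}
group by order_id
having order_id is null
    or count(*) > 1
```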
When to Use Each Strategy
- Small tables (<1M rows): simple incremental with occasional full refresh
- Medium tables (1M-100M rows): partitioned soft deletes with weekly cleanup
- Large tables (>100M rows): CDC or Delta Lake for optimal performance
- Real-time requirements: CDC with streaming ingestion
- Historical tracking needed: dbt snapshots or SCD Type 2 patterns
- Cost-sensitive projects: batch window processing with soft deletes
🎓 Next Steps for Mastery
Now that you understand the fundamentals of merge operations in dbt:
- Practice: Implement the partitioned soft delete pattern first
- Experiment: Try different strategies with your data
- Monitor: Set up performance tracking and cost monitoring
- Optimize: Tune based on your specific data patterns and requirements
- Scale: Graduate to more advanced patterns as your needs evolve
Remember: Start simple, measure performance, and evolve based on actual needs rather than theoretical requirements.
📚 Official dbt Documentation
For the most up-to-date information and comprehensive details, refer to the official dbt documentation:
- Incremental models - Official guide to dbt incremental materialization and merge strategies
- Materializations best practices - Guide to choosing and implementing dbt materializations
These resources provide the authoritative source for dbt merge operations and are regularly updated by the dbt Labs team.