dbt Modeling Guidelines and Best Practices
Overview: This guide provides comprehensive dbt modeling guidelines and best practices for building scalable, maintainable data transformation pipelines. Follow these patterns to create robust data models that serve your analytics needs effectively.
dbt Model Layers & Structure
1. Staging Layer (stg_)
- Purpose: Clean and standardize raw source data
- Materialization: view (default) or ephemeral
- Naming: stg_<source>__<object>.sql (e.g., stg_salesforce__accounts.sql)
- Location: models/staging/
🔧 Actions Performed in Staging:
- Data Type Casting: Convert columns to appropriate data types (strings to dates, numeric conversions)
- Column Renaming: Standardize column names to consistent naming conventions
- Basic Data Cleaning:
- Trim whitespace from string fields
- Handle null values consistently
- Standardize case (UPPER/lower) for categorical fields
- Simple Calculations:
- Generate hash keys for deduplication
- Add row creation timestamps
- Create surrogate keys
- Source Selection: Select only required columns from source tables
📋 Staging Characteristics:
- 1:1 relationship with source tables
- Light transformations only
- No business logic or complex calculations
- No joins between different source systems
- Fast execution (views are computed on-demand)
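The hash-key and surrogate-key steps above can be sketched as a small staging model. This is an illustrative sketch, not the document's own code: it assumes the commonly used dbt_utils package is installed via packages.yml, and the column names mirror the Salesforce example used later in this guide:

```sql
-- Illustrative staging snippet: hash and surrogate keys
-- Assumes dbt_utils is declared in packages.yml
select
    -- surrogate key built from the natural key plus the modification timestamp
    {{ dbt_utils.generate_surrogate_key(['id', 'last_modified_date']) }} as account_sk,
    -- plain hash for change detection / deduplication
    md5(cast(id as varchar) || cast(last_modified_date as varchar)) as row_hash,
    -- row creation timestamp for lineage
    current_timestamp as loaded_at
from {{ source('salesforce', 'accounts') }}
```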
2. Intermediate Layer (int_)
- Purpose: Complex logic that doesn't fit staging or marts
- Materialization: ephemeral or view
- Naming: int_<object>_<verb>.sql (e.g., int_orders_pivoted.sql)
- Location: models/intermediate/
⚙️ Actions Performed in Intermediate:
- Complex Joins: Join multiple staging tables from different source systems
- Data Reshaping:
- Pivot/unpivot operations
- Aggregations that will be reused
- Window functions for ranking, running totals
- Advanced Calculations:
- Complex business rules that span multiple tables
- Derived metrics that don't belong in staging
- Data enrichment from lookup tables
- Data Quality Improvements:
- Deduplication logic
- Data validation and cleansing rules
- Handling of edge cases and exceptions
- Reusable Components: Create building blocks for multiple mart models
🔄 Intermediate Characteristics:
- Breaking down complex transformations into manageable steps
- Reusable components across multiple marts
- Not exposed to end users (internal building blocks)
- Ephemeral materialization saves storage space
- Focuses on data preparation rather than final business presentation
3. Marts Layer
- Purpose: Business-ready models for analytics
- Materialization: table or incremental
- Naming:
  - Fact tables: fct_<object>.sql (e.g., fct_orders.sql)
  - Dimension tables: dim_<object>.sql (e.g., dim_customers.sql)
- Location: models/marts/
🎯 Actions Performed in Marts:
- Final Business Logic:
- Apply business rules and calculations
- Implement KPIs and metrics
- Create calculated fields for reporting
- Data Modeling:
- Fact Tables: Store transactional data, events, and measurable business processes
- Dimension Tables: Store descriptive attributes for filtering and grouping
- Implement star or snowflake schema designs
- Performance Optimization:
- Pre-aggregate frequently used metrics
- Denormalize for query performance
- Add indexes and partitioning hints
- Business-Friendly Formatting:
- Human-readable column names and descriptions
- Proper formatting of dates, currencies, percentages
- Categorical values with meaningful labels
- Data Governance:
- Apply data security and privacy rules
- Implement data retention policies
- Add comprehensive documentation
📊 Marts Characteristics:
- Final business logic and calculations
- Optimized for query performance (materialized as tables)
- Well documented and comprehensively tested
- Designed for end-user consumption
- Follows dimensional modeling principles
- Includes business-friendly naming and descriptions
Why Layer Separation is Critical
🎯 The Philosophy Behind dbt Layers
The layered approach in dbt follows the principle of separation of concerns, where each layer has a specific responsibility and builds upon the previous one. This creates a maintainable, testable, and scalable data transformation pipeline.
| Layer | Primary Focus | Key Benefits | Reduces Risk Of |
| --- | --- | --- | --- |
| Staging | Data Standardization | Consistent data types across sources; centralized source connection point; easy to debug data issues | Source schema changes breaking downstream models; inconsistent data formatting; complex debugging |
| Intermediate | Complex Logic Modularization | Reusable transformation components; easier testing of complex logic; faster development iteration | Duplicated transformation logic; overly complex mart models; difficult maintenance |
| Marts | Business Value Delivery | Optimized for end-user queries; business-friendly interface; performance-tuned storage | Poor query performance; business logic mixed with technical details; difficulty changing business rules |
🔄 Data Flow and Transformation Logic
Raw Data → Staging → Intermediate → Marts → Analytics
- Stage 1 (Staging): "Clean the data" - Make raw data usable and consistent
- Stage 2 (Intermediate): "Transform the data" - Apply complex business logic and join operations
- Stage 3 (Marts): "Present the data" - Create business-ready, performant models for analytics
Each layer can only reference models from the same layer or layers before it, creating a clear dependency flow.
Materialization Strategy
| Layer | Default Materialization | Alternative | Reasoning |
| --- | --- | --- | --- |
| Sources | N/A | N/A | Raw data |
| Staging | view | ephemeral | Lightweight, 1:1 with sources |
| Intermediate | ephemeral | view | Temporary, reusable logic |
| Marts | table | incremental | Performance for end users |
Directory Structure (Recommended)
models/
├── staging/ # Layer 1: Data Standardization
│ ├── salesforce/
│ │ ├── _salesforce__sources.yml
│ │ ├── _salesforce__models.yml
│ │ ├── stg_salesforce__accounts.sql
│ │ ├── stg_salesforce__contacts.sql
│ │ └── stg_salesforce__opportunities.sql
│ ├── stripe/
│ │ ├── _stripe__sources.yml
│ │ ├── _stripe__models.yml
│ │ ├── stg_stripe__customers.sql
│ │ ├── stg_stripe__payments.sql
│ │ └── stg_stripe__subscriptions.sql
│ └── google_analytics/
│ ├── _ga__sources.yml
│ ├── _ga__models.yml
│ └── stg_ga__sessions.sql
├── intermediate/ # Layer 2: Complex Transformations
│ ├── int_customer_segments.sql
│ ├── int_orders_enriched.sql
│ ├── int_revenue_calculations.sql
│ └── int_user_journey_pivoted.sql
├── marts/ # Layer 3: Business-Ready Models
│ ├── core/ # Company-wide metrics
│ │ ├── dim_customers.sql
│ │ ├── dim_products.sql
│ │ ├── fct_orders.sql
│ │ └── fct_payments.sql
│ ├── marketing/ # Marketing-specific models
│ │ ├── dim_campaigns.sql
│ │ ├── fct_website_sessions.sql
│ │ └── mart_marketing_attribution.sql
│ ├── finance/ # Finance-specific models
│ │ ├── fct_revenue.sql
│ │ ├── mart_monthly_recurring_revenue.sql
│ │ └── mart_customer_lifetime_value.sql
│ └── product/ # Product analytics models
│ ├── fct_user_events.sql
│ └── mart_feature_adoption.sql
└── utilities/ # Helper models and macros
├── all_dates.sql
└── fiscal_calendar.sql
Implementation Guidelines
1. Set up your models directory structure:
models/
├── staging/ # Layer 1: Data Standardization
│ └── [source_system]/
│ ├── _[source]__sources.yml # Source definitions
│ └── stg_[source]__[table].sql # Staging models
├── intermediate/ # Layer 2: Complex Transformations
│ └── int_[description].sql # Reusable transformation logic
└── marts/ # Layer 3: Business-Ready Models
└── [business_area]/
├── dim_[entity].sql # Dimension tables
└── fct_[process].sql # Fact tables
2. Configure dbt_project.yml for layered approach:
models:
  your_project_name:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
3. Follow consistent naming conventions:
- Sources: source('system', 'table')
- Staging: stg_<source>__<table>
- Intermediate: int_<description>
- Facts: fct_<business_process>
- Dimensions: dim_<object>
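In practice, only staging models call source(); every downstream model references other models with ref(), which is how dbt enforces the layer dependency flow described earlier. A minimal sketch (model names reuse this guide's examples):

```sql
-- In a staging model: read raw data through source()
select id as account_id
from {{ source('salesforce', 'accounts') }}

-- In an intermediate or mart model: build on other models through ref()
select *
from {{ ref('stg_salesforce__accounts') }}
```

Because ref() records the dependency in dbt's DAG, a mart that skipped staging and queried the raw table directly would bypass both the lineage graph and the standardization layer.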
Practical Examples of Layer Transformations
📝 Staging Layer Examples
Example 1: Salesforce Accounts Staging
-- stg_salesforce__accounts.sql
select
id as account_id,
name as account_name,
type as account_type,
industry,
cast(annual_revenue as decimal(15,2)) as annual_revenue,
cast(created_date as timestamp) as created_at,
cast(last_modified_date as timestamp) as updated_at,
upper(trim(billing_country)) as billing_country,
case
when is_deleted = 'true' then true
else false
end as is_deleted,
md5(concat(id, last_modified_date)) as row_hash
from {{ source('salesforce', 'accounts') }}
Example 2: Stripe Payments Staging
-- stg_stripe__payments.sql
select
id as payment_id,
customer as customer_id,
cast(amount as decimal(10,2)) / 100 as amount_usd,
currency,
status,
cast(created as timestamp) as created_at,
payment_method_types[0] as payment_method,
lower(trim(description)) as description
from {{ source('stripe', 'charges') }}
where status != 'failed'
⚙️ Intermediate Layer Examples
Example 1: Customer Segmentation Logic
-- int_customer_segments.sql
with customer_metrics as (
select
customer_id,
sum(amount_usd) as total_revenue,
count(*) as order_count,
max(created_at) as last_order_date,
min(created_at) as first_order_date
from {{ ref('stg_stripe__payments') }}
group by customer_id
),
customer_segments as (
select
*,
case
when total_revenue >= 10000 then 'Enterprise'
when total_revenue >= 1000 then 'Growth'
when total_revenue >= 100 then 'Standard'
else 'Basic'
end as revenue_segment,
case
-- note: date_diff argument order and signature vary by warehouse; BigQuery form shown
when date_diff(current_date, date(last_order_date), day) <= 30 then 'Active'
when date_diff(current_date, date(last_order_date), day) <= 90 then 'At Risk'
else 'Churned'
end as activity_segment
from customer_metrics
)
select * from customer_segments
Example 2: Orders with Enriched Data
-- int_orders_enriched.sql
select
p.payment_id,
p.customer_id,
p.amount_usd,
p.created_at as order_date,
a.account_name,
a.account_type,
a.industry,
a.billing_country,
cs.revenue_segment,
cs.activity_segment,
row_number() over (
partition by p.customer_id
order by p.created_at
) as customer_order_sequence
from {{ ref('stg_stripe__payments') }} p
left join {{ ref('stg_salesforce__accounts') }} a
on p.customer_id = a.account_id
left join {{ ref('int_customer_segments') }} cs
on p.customer_id = cs.customer_id
🎯 Marts Layer Examples
Example 1: Customer Dimension Table
-- dim_customers.sql
select
customer_id,
account_name as customer_name,
account_type as customer_type,
industry,
billing_country as country,
revenue_segment,
activity_segment,
first_order_date,
last_order_date,
total_revenue,
order_count,
round(total_revenue / order_count, 2) as avg_order_value,
case
when activity_segment = 'Active' then 'Healthy'
when activity_segment = 'At Risk' and revenue_segment in ('Enterprise', 'Growth') then 'High Priority'
when activity_segment = 'Churned' then 'Lost'
else 'Monitor'
end as customer_health_status,
current_timestamp() as last_updated_at
from {{ ref('int_customer_segments') }} cs
left join {{ ref('stg_salesforce__accounts') }} a
on cs.customer_id = a.account_id
Example 2: Monthly Revenue Fact Table
-- fct_monthly_revenue.sql
select
date_trunc('month', order_date) as revenue_month,
billing_country as country,
industry,
revenue_segment,
count(distinct customer_id) as active_customers,
count(*) as total_orders,
sum(amount_usd) as total_revenue,
round(avg(amount_usd), 2) as avg_order_value,
sum(case when customer_order_sequence = 1 then amount_usd else 0 end) as new_customer_revenue,
sum(case when customer_order_sequence > 1 then amount_usd else 0 end) as repeat_customer_revenue
from {{ ref('int_orders_enriched') }}
where order_date >= '2023-01-01'
group by 1, 2, 3, 4
order by 1 desc, 5 desc
Additional Best Practices
Testing Strategy
- Staging: Basic tests (not_null, unique for primary keys)
- Marts: Comprehensive testing (relationships, accepted_values, custom tests)
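A minimal schema.yml sketch covering both levels (the model, column, and accepted values are illustrative, reusing names from this guide's examples):

```yaml
version: 2

models:
  # Staging: basic tests on the primary key
  - name: stg_stripe__payments
    columns:
      - name: payment_id
        tests:
          - not_null
          - unique

  # Marts: comprehensive testing
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['completed', 'refunded']  # placeholder value set
```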
Documentation
- All models should have descriptions
- Important columns should be documented
- Business logic should be explained
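Descriptions live in the same yml files as tests; a short sketch for the customer dimension built earlier (wording of the descriptions is our own):

```yaml
version: 2

models:
  - name: dim_customers
    description: "One row per customer, enriched with revenue and activity segments."
    columns:
      - name: customer_id
        description: "Primary key; maps to the Salesforce account id."
      - name: customer_health_status
        description: "Derived status: Healthy, High Priority, Lost, or Monitor."
```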
Performance Considerations
- Use incremental models for large, append-only datasets
- Consider partitioning and clustering for very large tables
- Monitor query performance and optimize accordingly
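For append-only datasets, an incremental model only processes new rows on each run. A hedged sketch using dbt's standard incremental pattern (the model and column names, including stg_app__events, are hypothetical):

```sql
-- fct_user_events.sql (illustrative incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_name,
    event_timestamp
from {{ ref('stg_app__events') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already loaded;
-- {{ this }} refers to the existing target table
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```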
Code Organization
- One model per file
- Use consistent formatting and style
- Leverage macros for reusable logic
- Keep SQL readable and well-commented
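As an example of a reusable macro, the cents-to-dollars conversion from the Stripe staging example could be centralized like this (the macro name and signature are our own):

```sql
-- macros/cents_to_dollars.sql (illustrative)
{% macro cents_to_dollars(column_name, precision=2) %}
    round(cast({{ column_name }} as decimal(16, 4)) / 100, {{ precision }})
{% endmacro %}

-- usage in a model:
-- select {{ cents_to_dollars('amount') }} as amount_usd from ...
```

Any change to the conversion logic then happens in one place rather than in every staging model that handles currency amounts.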
📚 Official dbt Documentation
For the most up-to-date information and comprehensive details, refer to the official dbt documentation at docs.getdbt.com. It is the authoritative source for dbt best practices and is regularly updated by the dbt Labs team.