dbt Modeling Guidelines and Best Practices
Overview: This guide provides comprehensive dbt modeling guidelines and best practices for building scalable, maintainable data transformation pipelines. Follow these patterns to create robust data models that serve your analytics needs effectively.
dbt Model Layers & Structure
1. Staging Layer (stg_)
- Purpose: Clean and standardize raw source data
- Materialization: view (default) or ephemeral
- Naming: stg_<source>__<object>.sql (e.g., stg_salesforce__accounts.sql)
- Location: models/staging/
🔧 Actions Performed in Staging:
- Data Type Casting: Convert columns to appropriate data types (strings to dates, numeric conversions)
- Column Renaming: Standardize column names to consistent naming conventions
- Basic Data Cleaning:
- Trim whitespace from string fields
- Handle null values consistently
- Standardize case (UPPER/lower) for categorical fields
- Simple Calculations:
- Generate hash keys for deduplication
- Add row creation timestamps
- Create surrogate keys
- Source Selection: Select only required columns from source tables
📋 Staging Characteristics:
- 1:1 relationship with source tables
- Light transformations only
- No business logic or complex calculations
- No joins between different source systems
- Fast execution (views are computed on-demand)
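The hash-key and surrogate-key steps above can be sketched as a small staging model. This is an illustrative sketch, not the document's own code: it assumes the commonly used dbt_utils package is installed via packages.yml, and the column names mirror the Salesforce example used later in this guide:

```sql
-- Illustrative staging snippet: hash and surrogate keys
-- Assumes dbt_utils is declared in packages.yml
select
    -- surrogate key built from the natural key plus the modification timestamp
    {{ dbt_utils.generate_surrogate_key(['id', 'last_modified_date']) }} as account_sk,
    -- plain hash for change detection / deduplication
    md5(cast(id as varchar) || cast(last_modified_date as varchar)) as row_hash,
    -- row creation timestamp for lineage
    current_timestamp as loaded_at
from {{ source('salesforce', 'accounts') }}
```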
2. Intermediate Layer (int_)
- Purpose: Complex logic that doesn't fit staging or marts
- Materialization: ephemeral or view
- Naming: int_<object>_<verb>.sql (e.g., int_orders_pivoted.sql)
- Location: models/intermediate/
⚙️ Actions Performed in Intermediate:
- Complex Joins: Join multiple staging tables from different source systems
- Data Reshaping:
- Pivot/unpivot operations
- Aggregations that will be reused
- Window functions for ranking, running totals
- Advanced Calculations:
- Complex business rules that span multiple tables
- Derived metrics that don't belong in staging
- Data enrichment from lookup tables
- Data Quality Improvements:
- Deduplication logic
- Data validation and cleansing rules
- Handling of edge cases and exceptions
- Reusable Components: Create building blocks for multiple mart models
🔄 Intermediate Characteristics:
- Breaking down complex transformations into manageable steps
- Reusable components across multiple marts
- Not exposed to end users (internal building blocks)
- Ephemeral materialization saves storage space
- Focuses on data preparation rather than final business presentation
3. Marts Layer
- Purpose: Business-ready models for analytics
- Materialization: table or incremental
- Naming:
  - Fact tables: fct_<object>.sql (e.g., fct_orders.sql)
  - Dimension tables: dim_<object>.sql (e.g., dim_customers.sql)
- Location: models/marts/
🎯 Actions Performed in Marts:
- Final Business Logic:
- Apply business rules and calculations
- Implement KPIs and metrics
- Create calculated fields for reporting
- Data Modeling:
- Fact Tables: Store transactional data, events, and measurable business processes
- Dimension Tables: Store descriptive attributes for filtering and grouping
- Implement star or snowflake schema designs
- Performance Optimization:
- Pre-aggregate frequently used metrics
- Denormalize for query performance
- Add indexes and partitioning hints
- Business-Friendly Formatting:
- Human-readable column names and descriptions
- Proper formatting of dates, currencies, percentages
- Categorical values with meaningful labels
- Data Governance:
- Apply data security and privacy rules
- Implement data retention policies
- Add comprehensive documentation
📊 Marts Characteristics:
- Final business logic and calculations
- Optimized for query performance (materialized as tables)
- Well documented and comprehensively tested
- Designed for end-user consumption
- Follows dimensional modeling principles
- Includes business-friendly naming and descriptions
Why Layer Separation is Critical
🎯 The Philosophy Behind dbt Layers
The layered approach in dbt follows the principle of separation of concerns, where each layer has a specific responsibility and builds upon the previous one. This creates a maintainable, testable, and scalable data transformation pipeline.
| Layer | Primary Focus | Key Benefits | Reduces Risk Of |
| --- | --- | --- | --- |
| Staging | Data Standardization | Consistent data types across sources; centralized source connection point; easy to debug data issues | Source schema changes breaking downstream models; inconsistent data formatting; complex debugging |
| Intermediate | Complex Logic Modularization | Reusable transformation components; easier testing of complex logic; faster development iteration | Duplicated transformation logic; overly complex mart models; difficult maintenance |
| Marts | Business Value Delivery | Optimized for end-user queries; business-friendly interface; performance-tuned storage | Poor query performance; business logic mixed with technical details; difficulty changing business rules |
🔄 Data Flow and Transformation Logic
Raw Data → Staging → Intermediate → Marts → Analytics
- Stage 1 (Staging): "Clean the data" - Make raw data usable and consistent
- Stage 2 (Intermediate): "Transform the data" - Apply complex business logic and join operations
- Stage 3 (Marts): "Present the data" - Create business-ready, performant models for analytics
Each layer can only reference models from the same layer or layers before it, creating a clear dependency flow.
Materialization Strategy
| Layer | Default Materialization | Alternative | Reasoning |
| --- | --- | --- | --- |
| Sources | N/A | N/A | Raw data |
| Staging | view | ephemeral | Lightweight, 1:1 with sources |
| Intermediate | ephemeral | view | Temporary, reusable logic |
| Marts | table | incremental | Performance for end users |
Directory Structure (Recommended)
models/
├── staging/ # Layer 1: Data Standardization
│ ├── salesforce/
│ │ ├── _salesforce__sources.yml
│ │ ├── _salesforce__models.yml
│ │ ├── stg_salesforce__accounts.sql
│ │ ├── stg_salesforce__contacts.sql
│ │ └── stg_salesforce__opportunities.sql
│ ├── stripe/
│ │ ├── _stripe__sources.yml
│ │ ├── _stripe__models.yml
│ │ ├── stg_stripe__customers.sql
│ │ ├── stg_stripe__payments.sql
│ │ └── stg_stripe__subscriptions.sql
│ └── google_analytics/
│ ├── _ga__sources.yml
│ ├── _ga__models.yml
│ └── stg_ga__sessions.sql
├── intermediate/ # Layer 2: Complex Transformations
│ ├── int_customer_segments.sql
│ ├── int_orders_enriched.sql
│ ├── int_revenue_calculations.sql
│ └── int_user_journey_pivoted.sql
├── marts/ # Layer 3: Business-Ready Models
│ ├── core/ # Company-wide metrics
│ │ ├── dim_customers.sql
│ │ ├── dim_products.sql
│ │ ├── fct_orders.sql
│ │ └── fct_payments.sql
│ ├── marketing/ # Marketing-specific models
│ │ ├── dim_campaigns.sql
│ │ ├── fct_website_sessions.sql
│ │ └── mart_marketing_attribution.sql
│ ├── finance/ # Finance-specific models
│ │ ├── fct_revenue.sql
│ │ ├── mart_monthly_recurring_revenue.sql
│ │ └── mart_customer_lifetime_value.sql
│ └── product/ # Product analytics models
│ ├── fct_user_events.sql
│ └── mart_feature_adoption.sql
└── utilities/ # Helper models and macros
├── all_dates.sql
└── fiscal_calendar.sql
Implementation Guidelines
1. Set up your models directory structure:
models/
├── staging/ # Layer 1: Data Standardization
│ └── [source_system]/
│ ├── _[source]__sources.yml # Source definitions
│ └── stg_[source]__[table].sql # Staging models
├── intermediate/ # Layer 2: Complex Transformations
│ └── int_[description].sql # Reusable transformation logic
└── marts/ # Layer 3: Business-Ready Models
└── [business_area]/
├── dim_[entity].sql # Dimension tables
└── fct_[process].sql # Fact tables
2. Configure dbt_project.yml for layered approach:
models:
  your_project_name:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
3. Follow consistent naming conventions:
- Sources: source('system', 'table')
- Staging: stg_<source>__<table>
- Intermediate: int_<description>
- Facts: fct_<business_process>
- Dimensions: dim_<object>
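In practice, only staging models call source(); every downstream model references other models with ref(), which is how dbt enforces the layer dependency flow described earlier. A minimal sketch (model names reuse this guide's examples):

```sql
-- In a staging model: read raw data through source()
select id as account_id
from {{ source('salesforce', 'accounts') }}

-- In an intermediate or mart model: build on other models through ref()
select *
from {{ ref('stg_salesforce__accounts') }}
```

Because ref() records the dependency in dbt's DAG, a mart that skipped staging and queried the raw table directly would bypass both the lineage graph and the standardization layer.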
Practical Examples of Layer Transformations
📝 Staging Layer Examples
Example 1: Salesforce Accounts Staging
-- stg_salesforce__accounts.sql
select
id as account_id,
name as account_name,
type as account_type,
industry,
cast(annual_revenue as decimal(15,2)) as annual_revenue,
cast(created_date as timestamp) as created_at,
cast(last_modified_date as timestamp) as updated_at,
upper(trim(billing_country)) as billing_country,
case
when is_deleted = 'true' then true
else false
end as is_deleted,
md5(concat(id, last_modified_date)) as row_hash
from {{ source('salesforce', 'accounts') }}
Example 2: Stripe Payments Staging
-- stg_stripe__payments.sql
select
id as payment_id,
customer as customer_id,
cast(amount as decimal(10,2)) / 100 as amount_usd,
currency,
status,
cast(created as timestamp) as created_at,
payment_method_types[0] as payment_method,
lower(trim(description)) as description
from {{ source('stripe', 'charges') }}
where status != 'failed'
⚙️ Intermediate Layer Examples
Example 1: Customer Segmentation Logic
-- int_customer_segments.sql
with customer_metrics as (
select
customer_id,
sum(amount_usd) as total_revenue,
count(*) as order_count,
max(created_at) as last_order_date,
min(created_at) as first_order_date
from {{ ref('stg_stripe__payments') }}
group by customer_id
),
customer_segments as (
select
*,
case
when total_revenue >= 10000 then 'Enterprise'
when total_revenue >= 1000 then 'Growth'
when total_revenue >= 100 then 'Standard'
else 'Basic'
end as revenue_segment,
case
-- note: date_diff argument order and signature vary by warehouse; BigQuery form shown
when date_diff(current_date, date(last_order_date), day) <= 30 then 'Active'
when date_diff(current_date, date(last_order_date), day) <= 90 then 'At Risk'
else 'Churned'
end as activity_segment
from customer_metrics
)
select * from customer_segments
Example 2: Orders with Enriched Data
-- int_orders_enriched.sql
select
p.payment_id,
p.customer_id,
p.amount_usd,
p.created_at as order_date,
a.account_name,
a.account_type,
a.industry,
a.billing_country,
cs.revenue_segment,
cs.activity_segment,
row_number() over (
partition by p.customer_id
order by p.created_at
) as customer_order_sequence
from {{ ref('stg_stripe__payments') }} p
left join {{ ref('stg_salesforce__accounts') }} a
on p.customer_id = a.account_id
left join {{ ref('int_customer_segments') }} cs
on p.customer_id = cs.customer_id
🎯 Marts Layer Examples
Example 1: Customer Dimension Table
-- dim_customers.sql
select
customer_id,
account_name as customer_name,
account_type as customer_type,
industry,
billing_country as country,
revenue_segment,
activity_segment,
first_order_date,
last_order_date,
total_revenue,
order_count,
round(total_revenue / order_count, 2) as avg_order_value,
case
when activity_segment = 'Active' then 'Healthy'
when activity_segment = 'At Risk' and revenue_segment in ('Enterprise', 'Growth') then 'High Priority'
when activity_segment = 'Churned' then 'Lost'
else 'Monitor'
end as customer_health_status,
current_timestamp() as last_updated_at
from {{ ref('int_customer_segments') }} cs
left join {{ ref('stg_salesforce__accounts') }} a
on cs.customer_id = a.account_id
Example 2: Monthly Revenue Fact Table
-- fct_monthly_revenue.sql
select
date_trunc('month', order_date) as revenue_month,
billing_country as country,
industry,
revenue_segment,
count(distinct customer_id) as active_customers,
count(*) as total_orders,
sum(amount_usd) as total_revenue,
round(avg(amount_usd), 2) as avg_order_value,
sum(case when customer_order_sequence = 1 then amount_usd else 0 end) as new_customer_revenue,
sum(case when customer_order_sequence > 1 then amount_usd else 0 end) as repeat_customer_revenue
from {{ ref('int_orders_enriched') }}
where order_date >= '2023-01-01'
group by 1, 2, 3, 4
order by 1 desc, 5 desc
Additional Best Practices
Testing Strategy
- Staging: Basic tests (not_null, unique for primary keys)
- Marts: Comprehensive testing (relationships, accepted_values, custom tests)
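A minimal schema.yml sketch covering both levels (the model, column, and accepted values are illustrative, reusing names from this guide's examples):

```yaml
version: 2

models:
  # Staging: basic tests on the primary key
  - name: stg_stripe__payments
    columns:
      - name: payment_id
        tests:
          - not_null
          - unique

  # Marts: comprehensive testing
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['completed', 'refunded']  # placeholder value set
```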
Documentation
- All models should have descriptions
- Important columns should be documented
- Business logic should be explained
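Descriptions live in the same yml files as tests; a short sketch for the customer dimension built earlier (wording of the descriptions is our own):

```yaml
version: 2

models:
  - name: dim_customers
    description: "One row per customer, enriched with revenue and activity segments."
    columns:
      - name: customer_id
        description: "Primary key; maps to the Salesforce account id."
      - name: customer_health_status
        description: "Derived status: Healthy, High Priority, Lost, or Monitor."
```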
Performance Considerations
- Use incremental models for large, append-only datasets
- Consider partitioning and clustering for very large tables
- Monitor query performance and optimize accordingly
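For append-only datasets, an incremental model only processes new rows on each run. A hedged sketch using dbt's standard incremental pattern (the model and column names, including stg_app__events, are hypothetical):

```sql
-- fct_user_events.sql (illustrative incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_name,
    event_timestamp
from {{ ref('stg_app__events') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already loaded;
-- {{ this }} refers to the existing target table
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```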
Code Organization
- One model per file
- Use consistent formatting and style
- Leverage macros for reusable logic
- Keep SQL readable and well-commented
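As an example of a reusable macro, the cents-to-dollars conversion from the Stripe staging example could be centralized like this (the macro name and signature are our own):

```sql
-- macros/cents_to_dollars.sql (illustrative)
{% macro cents_to_dollars(column_name, precision=2) %}
    round(cast({{ column_name }} as decimal(16, 4)) / 100, {{ precision }})
{% endmacro %}

-- usage in a model:
-- select {{ cents_to_dollars('amount') }} as amount_usd from ...
```

Any change to the conversion logic then happens in one place rather than in every staging model that handles currency amounts.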
📚 Official dbt Documentation
For the most up-to-date information and comprehensive details, refer to the official dbt documentation at docs.getdbt.com. It is the authoritative source for dbt best practices and is regularly updated by the dbt Labs team.