# dbt Best Practices for Large-Scale Data Projects
Learn how to structure and maintain dbt projects at scale. From project organization to testing strategies, these best practices will help your data team succeed.
## Why dbt Project Structure Matters
dbt (data build tool) has become the standard for data transformation, but as projects grow, poor structure leads to maintenance nightmares. A well-structured dbt project is:
- Easy to navigate for new team members
- Simple to test and validate
- Performant at scale
- Self-documenting
## The Medallion Architecture
The most widely adopted pattern for organizing dbt models is the medallion architecture (also called bronze/silver/gold or staging/intermediate/marts):
### Staging Layer (Bronze)

```
models/
  staging/
    stripe/
      _stripe__models.yml
      _stripe__sources.yml
      stg_stripe__customers.sql
      stg_stripe__payments.sql
    salesforce/
      _salesforce__models.yml
      _salesforce__sources.yml
      stg_salesforce__accounts.sql
      stg_salesforce__opportunities.sql
```
Staging models should:
- Have a 1:1 relationship with source tables
- Perform minimal transformations (renaming, casting, basic cleaning)
- Be materialized as views (no additional storage cost)
- Use consistent naming: `stg_[source]__[entity]` (see the sketch below this list)
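For illustration, here's what a minimal staging model might look like. This is a sketch only; the Stripe column names are assumptions, not the actual schema:

```sql
-- models/staging/stripe/stg_stripe__customers.sql
-- Sketch: source column names are assumed for illustration.
with source as (
    select * from {{ source('stripe', 'customers') }}
)

select
    id as customer_id,
    lower(email) as email,
    created as created_at
from source
```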
### Intermediate Layer (Silver)

```
models/
  intermediate/
    finance/
      int_payments_pivoted_to_orders.sql
      int_orders_with_payments.sql
    marketing/
      int_campaigns_with_attribution.sql
```
Intermediate models should:
- Combine and transform staging models
- Contain business logic that's reused across marts
- Be materialized as views or ephemeral
- Not be exposed directly to BI tools
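For example, `int_orders_with_payments` might join staging models like this. A sketch only: it assumes a `stg_stripe__orders` staging model and these column names, none of which appear in the trees above:

```sql
-- models/intermediate/finance/int_orders_with_payments.sql
-- Sketch: stg_stripe__orders and all column names are assumptions.
select
    orders.order_id,
    orders.customer_id,
    orders.ordered_at,
    sum(payments.amount) as total_paid
from {{ ref('stg_stripe__orders') }} as orders
left join {{ ref('stg_stripe__payments') }} as payments
    on orders.order_id = payments.order_id
group by 1, 2, 3
```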
### Marts Layer (Gold)

```
models/
  marts/
    finance/
      dim_customers.sql
      fct_orders.sql
      fct_revenue.sql
    marketing/
      dim_campaigns.sql
      fct_conversions.sql
```
Marts should:
- Represent business entities and metrics
- Be materialized as tables for performance
- Be the primary interface for BI tools
- Follow dimensional modeling (facts and dimensions)
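A mart then assembles intermediate models into a final, BI-facing table. A minimal sketch of `fct_orders`, carrying over the assumed columns from the intermediate example above:

```sql
-- models/marts/finance/fct_orders.sql
-- Sketch: column names carried over from the intermediate example.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    ordered_at,
    total_paid as revenue
from {{ ref('int_orders_with_payments') }}
```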
## Naming Conventions
Consistent naming is crucial for maintainability:
| Layer | Prefix | Example |
|---|---|---|
| Staging | `stg_` | `stg_stripe__customers` |
| Intermediate | `int_` | `int_orders_with_payments` |
| Dimension | `dim_` | `dim_customers` |
| Fact | `fct_` | `fct_orders` |
## Testing Strategy
Testing is non-negotiable in production dbt projects:
### Schema Tests

```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
```
Data Tests (Custom)
-- tests/assert_positive_revenue.sql
select
order_id,
revenue
from {{ ref('fct_orders') }}
where revenue < 0
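A singular test like this fails when its query returns any rows, so write it to select exactly the records that violate the rule.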
### Testing Coverage Goals
- Primary keys: unique + not_null on every model
- Foreign keys: relationships tests between models
- Business rules: accepted_values for enums, custom tests for logic
- Data freshness: source freshness checks
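Freshness checks live in the source YAML and run via `dbt source freshness`. A minimal sketch, where `_loaded_at` is an assumed ingestion-timestamp column:

```yaml
# models/staging/stripe/_stripe__sources.yml (excerpt)
sources:
  - name: stripe
    loaded_at_field: _loaded_at  # assumed ingestion timestamp column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments
```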
## Documentation Best Practices
dbt's built-in documentation should be leveraged fully:
### Column Descriptions

```yaml
models:
  - name: fct_orders
    description: "Order-level fact table containing all completed transactions"
    columns:
      - name: order_id
        description: "Unique identifier for the order (PK)"
      - name: gross_revenue
        description: "Total revenue before discounts and refunds, in USD"
```
### Doc Blocks for Reusable Descriptions

```
{% docs customer_id %}
Unique identifier for a customer. Foreign key to dim_customers.
{% enddocs %}
```
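Models then reference the block instead of repeating the text:

```yaml
columns:
  - name: customer_id
    description: '{{ doc("customer_id") }}'
```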
## Performance Optimization

### Incremental Models
For large tables, use incremental materialization:
```sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='merge'
    )
}}

select
    event_id,
    user_id,
    event_timestamp,
    event_type
from {{ source('analytics', 'events') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```
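Here `{{ this }}` resolves to the already-built table, so incremental runs only scan rows newer than the current maximum timestamp, and the `merge` strategy updates rows matching `unique_key` instead of duplicating them.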
### Clustering and Partitioning
On BigQuery or Snowflake, specify clustering:
```sql
{{
    config(
        materialized='table',
        cluster_by=['customer_id', 'order_date']
    )
}}
```
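Partitioning is configured separately; on BigQuery it uses the `partition_by` config. A sketch, assuming an `order_date` date column:

```sql
{{
    config(
        materialized='table',
        partition_by={'field': 'order_date', 'data_type': 'date'},
        cluster_by=['customer_id']
    )
}}
```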
### Model Contracts (dbt 1.5+)
Enforce data types and columns:
```yaml
models:
  - name: fct_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
      - name: order_total
        data_type: numeric(18, 2)
```
## Code Organization

### Macros
Create reusable logic in macros:
```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(18, 2)
{% endmacro %}
```
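The macro is then called from any model like a regular Jinja expression (`amount_cents` is an assumed column):

```sql
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ ref('stg_stripe__payments') }}
```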
### Packages
Leverage community packages:
```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/codegen
    version: 0.12.1
```
## CI/CD for dbt
Implement continuous integration:
- On PR: Run `dbt build --select state:modified+` against a dev schema
- On merge: Run `dbt build` on production
- Scheduled: Run full refreshes and data freshness checks
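As a concrete starting point, here's a hedged GitHub Actions sketch of the PR check. The adapter, secret name, and `prod-artifacts` manifest path are all assumptions; `state:modified+` requires a `manifest.json` saved from the last production run:

```yaml
# .github/workflows/dbt-ci.yml -- hypothetical workflow; adapt names and secrets
name: dbt CI
on: pull_request

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-snowflake  # assumed warehouse adapter
      - run: dbt deps
      # Compare against production state so only modified models (and downstream) build
      - run: dbt build --select state:modified+ --state prod-artifacts --target ci
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}  # assumed secret
```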
## Key Takeaways
- Structure matters: Use medallion architecture for clarity
- Test everything: Primary keys, foreign keys, business rules
- Document as you go: Future you will thank you
- Optimize incrementally: Start with views, add tables/incremental as needed
- Use CI/CD: Catch issues before they hit production
## Need Help With Your dbt Project?
Our team has implemented dbt at scale across multiple industries. Whether you're starting from scratch or refactoring an existing project, we can help you build a maintainable, performant data transformation layer.
Get in touch to discuss your project.