# dbt Best Practices for Large-Scale Data Projects
Learn how to structure and maintain dbt projects at scale. From project organization to testing strategies, these best practices will help your data team succeed.
## Why dbt Project Structure Matters
dbt (data build tool) has become the standard for data transformation, but as projects grow, poor structure leads to maintenance nightmares. A well-structured dbt project is:
- Easy to navigate for new team members
- Simple to test and validate
- Performant at scale
- Self-documenting
## The Medallion Architecture
The most widely adopted pattern for organizing dbt models is the medallion architecture (also called bronze/silver/gold or staging/intermediate/marts):
### Staging Layer (Bronze)

```
models/
  staging/
    stripe/
      _stripe__models.yml
      _stripe__sources.yml
      stg_stripe__customers.sql
      stg_stripe__payments.sql
    salesforce/
      _salesforce__models.yml
      _salesforce__sources.yml
      stg_salesforce__accounts.sql
      stg_salesforce__opportunities.sql
```
Staging models should:
- Have a 1:1 relationship with source tables
- Perform minimal transformations (renaming, casting, basic cleaning)
- Be materialized as views (no additional storage cost)
- Use consistent naming: `stg_[source]__[entity]` (see the sketch below this list)
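For illustration, here's what a minimal staging model might look like. This is a sketch only; the Stripe column names are assumptions, not the actual schema:

```sql
-- models/staging/stripe/stg_stripe__customers.sql
-- Sketch: source column names are assumed for illustration.
with source as (
    select * from {{ source('stripe', 'customers') }}
)

select
    id as customer_id,
    lower(email) as email,
    created as created_at
from source
```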
### Intermediate Layer (Silver)

```
models/
  intermediate/
    finance/
      int_payments_pivoted_to_orders.sql
      int_orders_with_payments.sql
    marketing/
      int_campaigns_with_attribution.sql
```
Intermediate models should:
- Combine and transform staging models
- Contain business logic that's reused across marts
- Be materialized as views or ephemeral
- Not be exposed directly to BI tools
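For example, `int_orders_with_payments` might join staging models like this. A sketch only: it assumes a `stg_stripe__orders` staging model and these column names, none of which appear in the trees above:

```sql
-- models/intermediate/finance/int_orders_with_payments.sql
-- Sketch: stg_stripe__orders and all column names are assumptions.
select
    orders.order_id,
    orders.customer_id,
    orders.ordered_at,
    sum(payments.amount) as total_paid
from {{ ref('stg_stripe__orders') }} as orders
left join {{ ref('stg_stripe__payments') }} as payments
    on orders.order_id = payments.order_id
group by 1, 2, 3
```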
### Marts Layer (Gold)

```
models/
  marts/
    finance/
      dim_customers.sql
      fct_orders.sql
      fct_revenue.sql
    marketing/
      dim_campaigns.sql
      fct_conversions.sql
```
Marts should:
- Represent business entities and metrics
- Be materialized as tables for performance
- Be the primary interface for BI tools
- Follow dimensional modeling (facts and dimensions)
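A mart then assembles intermediate models into a final, BI-facing table. A minimal sketch of `fct_orders`, carrying over the assumed columns from the intermediate example above:

```sql
-- models/marts/finance/fct_orders.sql
-- Sketch: column names carried over from the intermediate example.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    ordered_at,
    total_paid as revenue
from {{ ref('int_orders_with_payments') }}
```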
## Naming Conventions
Consistent naming is crucial for maintainability:
| Layer | Prefix | Example |
|---|---|---|
| Staging | `stg_` | `stg_stripe__customers` |
| Intermediate | `int_` | `int_orders_with_payments` |
| Dimension | `dim_` | `dim_customers` |
| Fact | `fct_` | `fct_orders` |
## Testing Strategy
Testing is non-negotiable in production dbt projects:
### Schema Tests

```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
```
Data Tests (Custom)
-- tests/assert_positive_revenue.sql
select
order_id,
revenue
from {{ ref('fct_orders') }}
where revenue < 0
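A singular test like this fails when its query returns any rows, so write it to select exactly the records that violate the rule.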
### Testing Coverage Goals
- Primary keys: unique + not_null on every model
- Foreign keys: relationships tests between models
- Business rules: accepted_values for enums, custom tests for logic
- Data freshness: source freshness checks
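Freshness checks live in the source YAML and run via `dbt source freshness`. A minimal sketch, where `_loaded_at` is an assumed ingestion-timestamp column:

```yaml
# models/staging/stripe/_stripe__sources.yml (excerpt)
sources:
  - name: stripe
    loaded_at_field: _loaded_at  # assumed ingestion timestamp column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments
```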
## Documentation Best Practices
dbt's built-in documentation should be leveraged fully:
### Column Descriptions

```yaml
models:
  - name: fct_orders
    description: "Order-level fact table containing all completed transactions"
    columns:
      - name: order_id
        description: "Unique identifier for the order (PK)"
      - name: gross_revenue
        description: "Total revenue before discounts and refunds, in USD"
```
### Doc Blocks for Reusable Descriptions

```
{% docs customer_id %}
Unique identifier for a customer. Foreign key to dim_customers.
{% enddocs %}
```
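Models then reference the block instead of repeating the text:

```yaml
columns:
  - name: customer_id
    description: '{{ doc("customer_id") }}'
```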
## Performance Optimization

### Incremental Models
For large tables, use incremental materialization:
```sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='merge'
    )
}}

select
    event_id,
    user_id,
    event_timestamp,
    event_type
from {{ source('analytics', 'events') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```
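Here `{{ this }}` resolves to the already-built table, so incremental runs only scan rows newer than the current maximum timestamp, and the `merge` strategy updates rows matching `unique_key` instead of duplicating them.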
### Clustering and Partitioning
On BigQuery or Snowflake, specify clustering:
```sql
{{
    config(
        materialized='table',
        cluster_by=['customer_id', 'order_date']
    )
}}
```
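Partitioning is configured separately; on BigQuery it uses the `partition_by` config. A sketch, assuming an `order_date` date column:

```sql
{{
    config(
        materialized='table',
        partition_by={'field': 'order_date', 'data_type': 'date'},
        cluster_by=['customer_id']
    )
}}
```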
### Model Contracts (dbt 1.5+)
Enforce data types and columns:
```yaml
models:
  - name: fct_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
      - name: order_total
        data_type: numeric(18, 2)
```
## Code Organization

### Macros
Create reusable logic in macros:
```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(18, 2)
{% endmacro %}
```
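The macro is then called from any model like a regular Jinja expression (`amount_cents` is an assumed column):

```sql
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ ref('stg_stripe__payments') }}
```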
### Packages
Leverage community packages:
```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/codegen
    version: 0.12.1
```
## CI/CD for dbt
Implement continuous integration:
- On PR: Run `dbt build --select state:modified+` against a dev schema
- On merge: Run `dbt build` on production
- Scheduled: Run full refreshes and data freshness checks
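As a concrete starting point, here's a hedged GitHub Actions sketch of the PR check. The adapter, secret name, and `prod-artifacts` manifest path are all assumptions; `state:modified+` requires a `manifest.json` saved from the last production run:

```yaml
# .github/workflows/dbt-ci.yml -- hypothetical workflow; adapt names and secrets
name: dbt CI
on: pull_request

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-snowflake  # assumed warehouse adapter
      - run: dbt deps
      # Compare against production state so only modified models (and downstream) build
      - run: dbt build --select state:modified+ --state prod-artifacts --target ci
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}  # assumed secret
```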
## Key Takeaways
- Structure matters: Use medallion architecture for clarity
- Test everything: Primary keys, foreign keys, business rules
- Document as you go: Future you will thank you
- Optimize incrementally: Start with views, add tables/incremental as needed
- Use CI/CD: Catch issues before they hit production
## Need Help With Your dbt Project?
Our team has implemented dbt at scale across multiple industries. Whether you're starting from scratch or refactoring an existing project, we can help you build a maintainable, performant data transformation layer.
Get in touch to discuss your project.