
dbt Best Practices for Large-Scale Data Projects

Datapare Team
January 20, 2026
4 min read

Learn how to structure and maintain dbt projects at scale. From project organization to testing strategies, these best practices will help your data team succeed.

Why dbt Project Structure Matters

dbt (data build tool) has become the de facto standard for transforming data inside the warehouse, but as projects grow, poor structure leads to maintenance nightmares. A well-structured dbt project is:

  • Easy to navigate for new team members
  • Simple to test and validate
  • Performant at scale
  • Self-documenting

The Medallion Architecture

The most widely adopted pattern for organizing dbt models is the medallion architecture, also known as bronze/silver/gold, which in dbt projects maps to staging, intermediate, and marts layers:

Staging Layer (Bronze)

models/
  staging/
    stripe/
      _stripe__models.yml
      _stripe__sources.yml
      stg_stripe__customers.sql
      stg_stripe__payments.sql
    salesforce/
      _salesforce__models.yml
      _salesforce__sources.yml
      stg_salesforce__accounts.sql
      stg_salesforce__opportunities.sql

Staging models should:

  • Have a 1:1 relationship with source tables
  • Perform minimal transformations (renaming, casting, basic cleaning)
  • Be materialized as views (no additional storage cost)
  • Use consistent naming: stg_[source]__[entity]
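
Applying these guidelines, a minimal staging model might look like the following sketch (the raw Stripe column names are assumptions for illustration):

-- models/staging/stripe/stg_stripe__customers.sql
-- Sketch: rename, cast, and lightly clean raw columns, nothing more
select
    id as customer_id,
    lower(email) as email,
    cast(created as timestamp) as created_at
from {{ source('stripe', 'customers') }}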

Intermediate Layer (Silver)

models/
  intermediate/
    finance/
      int_payments_pivoted_to_orders.sql
      int_orders_with_payments.sql
    marketing/
      int_campaigns_with_attribution.sql

Intermediate models should:

  • Combine and transform staging models
  • Contain business logic that's reused across marts
  • Be materialized as views or ephemeral
  • Not be exposed directly to BI tools
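
For example, an intermediate model that enriches orders with payment totals might look like this sketch (stg_shop__orders is a hypothetical staging model, and the join keys are assumptions):

-- models/intermediate/finance/int_orders_with_payments.sql
-- Sketch: aggregate payments, then join them onto orders
with payments as (
    select
        order_id,
        sum(amount) as total_paid
    from {{ ref('stg_stripe__payments') }}
    group by order_id
)

select
    o.order_id,
    o.customer_id,
    o.order_date,
    p.total_paid
from {{ ref('stg_shop__orders') }} as o
left join payments as p
    on o.order_id = p.order_id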

Marts Layer (Gold)

models/
  marts/
    finance/
      dim_customers.sql
      fct_orders.sql
      fct_revenue.sql
    marketing/
      dim_campaigns.sql
      fct_conversions.sql

Marts should:

  • Represent business entities and metrics
  • Be materialized as tables for performance
  • Be the primary interface for BI tools
  • Follow dimensional modeling (facts and dimensions)
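
A mart then assembles intermediates into a business-facing table; continuing the hypothetical models above, a sketch:

-- models/marts/finance/fct_orders.sql
{{ config(materialized='table') }}

-- Sketch: one row per order, ready for BI consumption
select
    order_id,
    customer_id,
    order_date,
    total_paid as order_total
from {{ ref('int_orders_with_payments') }}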

Naming Conventions

Consistent naming is crucial for maintainability:

Layer          Prefix   Example
Staging        stg_     stg_stripe__customers
Intermediate   int_     int_orders_with_payments
Dimension      dim_     dim_customers
Fact           fct_     fct_orders

Testing Strategy

Testing is non-negotiable in production dbt projects:

Schema Tests

version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']

Data Tests (Custom)

-- tests/assert_positive_revenue.sql
select
    order_id,
    revenue
from {{ ref('fct_orders') }}
where revenue < 0

Testing Coverage Goals

  • Primary keys: unique + not_null on every model
  • Foreign keys: relationships tests between models
  • Business rules: accepted_values for enums, custom tests for logic
  • Data freshness: source freshness checks
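
Source freshness is configured on the source definition; a sketch, assuming a _loaded_at load-timestamp column on the Stripe tables:

# models/staging/stripe/_stripe__sources.yml
version: 2

sources:
  - name: stripe
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: customers
      - name: payments

Run dbt source freshness on a schedule so stale sources are caught before they reach the marts.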

Documentation Best Practices

dbt's built-in documentation should be leveraged fully:

Column Descriptions

version: 2

models:
  - name: fct_orders
    description: "Order-level fact table containing all completed transactions"
    columns:
      - name: order_id
        description: "Unique identifier for the order (PK)"
      - name: gross_revenue
        description: "Total revenue before discounts and refunds, in USD"

Doc Blocks for Reusable Descriptions

{% docs customer_id %}
Unique identifier for a customer. Foreign key to dim_customers.
{% enddocs %}
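
The block can then be referenced wherever the column appears, keeping descriptions consistent across models:

columns:
  - name: customer_id
    description: "{{ doc('customer_id') }}"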

Performance Optimization

Incremental Models

For large tables, use incremental materialization so each run processes only new or changed rows:

{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='merge'
    )
}}

select
    event_id,
    user_id,
    event_timestamp,
    event_type
from {{ source('analytics', 'events') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}

Clustering and Partitioning

On BigQuery or Snowflake, specify clustering:

{{
    config(
        materialized='table',
        cluster_by=['customer_id', 'order_date']
    )
}}
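
The heading also covers partitioning; on BigQuery, date partitioning is typically combined with clustering, for example:

{{
    config(
        materialized='table',
        partition_by={'field': 'order_date', 'data_type': 'date'},
        cluster_by=['customer_id']
    )
}}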

Model Contracts (dbt 1.5+)

Contracts enforce column names and data types at build time, failing the run if a model's output drifts from its definition:

version: 2

models:
  - name: fct_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
      - name: order_total
        data_type: numeric(18, 2)

Code Organization

Macros

Create reusable logic in macros:

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(18, 2)
{% endmacro %}
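
The macro can then be called inline in any model (the amount column name is an assumption here):

-- excerpt from a staging model
select
    id as payment_id,
    {{ cents_to_dollars('amount') }} as amount_usd
from {{ source('stripe', 'payments') }}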

Packages

Leverage community packages:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/codegen
    version: 0.12.1
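
After installing packages with dbt deps, their macros are available in any model. For instance, dbt_utils provides generate_surrogate_key (the referenced model below is hypothetical):

select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_date']) }} as customer_day_key,
    *
from {{ ref('stg_shop__orders') }}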

CI/CD for dbt

Implement continuous integration:

  1. On PR: Run dbt build --select state:modified+ against a dev schema (a minimal workflow sketch follows this list)
  2. On merge: Run dbt build on production
  3. Scheduled: Run full refreshes and data freshness checks
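
A minimal sketch of the PR job, assuming GitHub Actions, a Snowflake adapter, and warehouse credentials supplied through repository secrets (the names and paths here are illustrative):

# .github/workflows/dbt_ci.yml
name: dbt CI
on: pull_request

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-snowflake
      - run: dbt deps
      # state:modified+ compares against the production manifest;
      # fetching it from your artifact store is omitted here
      - run: dbt build --select state:modified+ --state ./prod-artifacts
        env:
          DBT_PROFILES_DIR: ./ci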

Key Takeaways

  • Structure matters: Use medallion architecture for clarity
  • Test everything: Primary keys, foreign keys, business rules
  • Document as you go: Future you will thank you
  • Optimize incrementally: Start with views, add tables/incremental as needed
  • Use CI/CD: Catch issues before they hit production

Need Help With Your dbt Project?

Our team has implemented dbt at scale across multiple industries. Whether you're starting from scratch or refactoring an existing project, we can help you build a maintainable, performant data transformation layer.

Get in touch to discuss your project.

Tags

dbt, best-practices, sql, data-modeling, analytics-engineering
