Dataform vs dbt: Choosing Your Data Transformation Tool

A detailed technical comparison of Dataform and dbt for data transformation, covering architecture, workflow, ecosystem, and practical considerations to help teams choose the right tool.

technical · 7 min read · By Klivvr Engineering

The data transformation layer sits at the center of every analytics pipeline. It takes raw, messy data from source systems and produces clean, reliable datasets that power dashboards, reports, and machine learning models. Two tools dominate this space: dbt (data build tool) and Dataform (now part of Google Cloud). Both solve the same fundamental problem of applying version-controlled SQL transformations to warehouse data, but they differ significantly in architecture, workflow, and ecosystem. This article provides a technical comparison to help teams make an informed choice, drawing on our experience using both tools in Klivvr's Data Platform.

Architecture and Execution Model

The most fundamental difference between dbt and Dataform is where and how they execute transformations.

dbt follows a CLI-first architecture. The core engine is a Python application that parses SQL templates (using Jinja), resolves dependencies, and executes queries against the warehouse. dbt itself does not store or process data; it orchestrates SQL execution in the warehouse:

-- dbt model: models/marts/finance/fct_revenue.sql
-- dbt compiles this Jinja-SQL template and executes the result against the warehouse
 
{{ config(materialized='incremental', unique_key='transaction_id') }}
 
with transactions as (
    select * from {{ ref('stg_stripe__transactions') }}
    {% if is_incremental() %}
        where created_at > (select max(created_at) from {{ this }})
    {% endif %}
),
 
enriched as (
    select
        t.transaction_id,
        t.amount_usd,
        t.created_at,
        c.customer_segment,
        {{ dbt_utils.star(from=ref('dim_products'), relation_alias='p', except=['created_at']) }}
    from transactions t
    join {{ ref('dim_customers') }} c using (customer_id)
    join {{ ref('dim_products') }} p using (product_id)
)
 
select * from enriched

Dataform uses a JavaScript/TypeScript-based configuration language called SQLX, which embeds SQL inside JavaScript control structures. Since Google acquired Dataform, it runs as a managed service within Google Cloud, though the open-source CLI remains available:

-- Dataform model: definitions/finance/fct_revenue.sqlx
-- SQLX combines SQL with JavaScript-based configuration
 
config {
    type: "incremental",
    uniqueKey: ["transaction_id"],
    bigquery: {
        partitionBy: "DATE(created_at)",
        clusterBy: ["customer_segment"]
    },
    assertions: {
        uniqueKey: ["transaction_id"],
        nonNull: ["transaction_id", "amount_usd", "created_at"]
    }
}
 
SELECT
    t.transaction_id,
    t.amount_usd,
    t.created_at,
    c.customer_segment,
    p.product_name,
    p.product_category
FROM ${ref("stg_stripe__transactions")} t
JOIN ${ref("dim_customers")} c USING (customer_id)
JOIN ${ref("dim_products")} p USING (product_id)
${when(incremental(), `WHERE t.created_at > (SELECT MAX(created_at) FROM ${self()})`)}

The syntax differences are immediately apparent. dbt uses Jinja2 templating ({{ }} and {% %}) while Dataform uses JavaScript template literals (${}) and helper functions like when() and ref(). For teams already comfortable with Python, dbt's Jinja feels natural. For teams working in JavaScript ecosystems, Dataform's SQLX may be more intuitive.
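Whichever syntax a team prefers, the compile step is conceptually the same in both tools: resolve each ref() call into a fully qualified table name before the SQL ever reaches the warehouse. A minimal Python sketch of that shared idea (illustrative only, not either tool's actual compiler; the `analytics.production` prefix is a made-up example):

```python
# Illustrative sketch of ref() resolution, not dbt's or Dataform's internals.
import re

def ref(model_name: str) -> str:
    # A real compiler looks up the model's database and schema;
    # we hardcode a fully qualified prefix for the sketch.
    return f"analytics.production.{model_name}"

def compile_sql(raw_sql: str) -> str:
    """Replace {{ ref('...') }} calls with fully qualified table names."""
    return re.sub(
        r"\{\{\s*ref\('([^']+)'\)\s*\}\}",
        lambda m: ref(m.group(1)),
        raw_sql,
    )

print(compile_sql("select * from {{ ref('stg_stripe__transactions') }}"))
# select * from analytics.production.stg_stripe__transactions
```

Only after this resolution step does the warehouse see plain SQL; neither tool moves any data itself.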

Dependency Management and DAG Construction

Both tools construct a directed acyclic graph (DAG) of model dependencies to determine execution order. The mechanism differs slightly.

dbt resolves dependencies through the ref() function. When a model references another with {{ ref('model_name') }}, dbt adds an edge to the DAG. The full graph is computed at compile time and can be visualized with dbt docs generate:

# dbt dependency resolution (simplified conceptual model)
# dbt parses all SQL files, extracts ref() calls, and builds a DAG
 
import re
from dataclasses import dataclass
from pathlib import Path
 
REF_PATTERN = re.compile(r"\{\{\s*ref\(['\"]([^'\"]+)['\"]\)\s*\}\}")
 
@dataclass
class Model:
    name: str
    raw_sql: str
 
    def extract_refs(self) -> list[str]:
        """Collect every model name referenced via {{ ref('...') }}."""
        return REF_PATTERN.findall(self.raw_sql)
 
class DbtProject:
    def __init__(self, project_dir: str):
        self.project_dir = Path(project_dir)
        self.models: dict[str, Model] = {}
        self.dag: dict[str, list[str]] = {}
 
    def parse(self):
        for sql_file in self.project_dir.rglob("*.sql"):
            model = Model(name=sql_file.stem, raw_sql=sql_file.read_text())
            self.models[model.name] = model
            # Each ref() call becomes an edge from the model to its dependency
            self.dag[model.name] = model.extract_refs()
 
    def execution_order(self) -> list[str]:
        """Topological sort of the DAG: dependencies before dependents."""
        visited: set[str] = set()
        order: list[str] = []
 
        def visit(node: str):
            if node in visited:
                return
            visited.add(node)
            for dep in self.dag.get(node, []):
                visit(dep)
            order.append(node)
 
        for node in self.dag:
            visit(node)
        return order

Dataform uses the same ref() pattern but adds config blocks that can declare dependencies explicitly. It also supports JavaScript-based dynamic dependency generation, which is useful for programmatically creating models:

// Dataform: dynamic model generation using JavaScript
// definitions/staging/generate_staging_models.js
 
const sources = [
  { name: "payments", schema: "stripe_raw", columns: ["id", "amount", "currency", "status", "created"] },
  { name: "customers", schema: "app_raw", columns: ["id", "name", "email", "segment", "country"] },
  { name: "products", schema: "app_raw", columns: ["id", "name", "category", "price"] },
];
 
sources.forEach(source => {
  publish(`stg_${source.name}`)
    .type("view")
    .query(ctx => `
      SELECT
        ${source.columns.map(col =>
          col === "id" ? `${col} AS ${source.name.slice(0, -1)}_id` : col
        ).join(",\n        ")}
      FROM ${ctx.ref(source.schema, source.name)}
    `);
});

This programmatic model generation is a genuine advantage of Dataform's JavaScript foundation. In dbt, achieving similar dynamic model generation requires either custom macros or the codegen package, both of which are less ergonomic than native JavaScript.
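For comparison, the closest dbt-shaped workaround is a codegen script that writes one staging model file per source. The helper below is hypothetical (not part of dbt or the codegen package); it just renders Jinja-SQL text, mirroring the SQLX loop above:

```python
# Hypothetical codegen helper, not part of dbt or the codegen package:
# render one staging model per source table as Jinja-SQL text.
from pathlib import Path

SOURCES = {
    "payments": ["id", "amount", "currency", "status", "created"],
    "customers": ["id", "name", "email", "segment", "country"],
}

def render_staging_model(table: str, columns: list[str]) -> str:
    # Rename the bare "id" column to e.g. payment_id, as the SQLX version does
    select_list = ",\n    ".join(
        f"{col} as {table[:-1]}_id" if col == "id" else col
        for col in columns
    )
    return f"select\n    {select_list}\nfrom {{{{ source('raw', '{table}') }}}}\n"

def write_models(out_dir: Path) -> None:
    """Write stg_<table>.sql files that dbt will pick up on its next parse."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for table, columns in SOURCES.items():
        (out_dir / f"stg_{table}.sql").write_text(
            render_staging_model(table, columns)
        )
```

The difference in ergonomics is that this script runs as a separate generation step before dbt parses the project, whereas Dataform evaluates the JavaScript as part of compilation itself.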

Testing and Assertions

dbt's testing framework is mature and extensible. Generic tests are declared in YAML files and applied to columns, while singular tests are standalone SQL queries that should return zero rows (older dbt releases called these schema tests and data tests):

# dbt: models/staging/_stg_models.yml
version: 2
 
models:
  - name: stg_stripe__payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
      - name: status
        tests:
          - accepted_values:
              values: ['succeeded', 'failed', 'pending', 'refunded']

Dataform embeds assertions directly in the model configuration and also supports standalone assertion files:

-- Dataform: inline assertions in the model config
config {
    type: "table",
    assertions: {
        uniqueKey: ["payment_id"],
        nonNull: ["payment_id", "amount", "status"],
        rowConditions: [
            "amount >= 0",
            "status IN ('succeeded', 'failed', 'pending', 'refunded')"
        ]
    }
}
 
-- Dataform also supports standalone assertions:
-- definitions/assertions/assert_revenue_positive.sqlx
config {
    type: "assertion"
}
 
SELECT *
FROM ${ref("fct_daily_revenue")}
WHERE gross_revenue_usd < 0

dbt's testing ecosystem is more extensive thanks to packages like dbt-utils and dbt-expectations (a port of the Great Expectations assertion suite to dbt). Dataform's assertion system is simpler and covers the most common cases but lacks the extensibility of dbt's package ecosystem.
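Despite the surface differences, both frameworks reduce every test to the same contract: run a query that selects the violating rows, and fail if it returns any. A self-contained sketch of that contract against an in-memory SQLite table (warehouse-agnostic in spirit, not either tool's code):

```python
# Both frameworks boil a data test down to: select the violating rows,
# fail if any exist. Sketch against in-memory SQLite, not either tool's code.
import sqlite3

def run_assertion(conn: sqlite3.Connection, name: str, sql: str) -> None:
    violations = conn.execute(sql).fetchall()
    if violations:
        raise AssertionError(f"{name}: {len(violations)} violating row(s)")

conn = sqlite3.connect(":memory:")
conn.execute("create table payments (payment_id integer, amount real, status text)")
conn.executemany(
    "insert into payments values (?, ?, ?)",
    [(1, 10.0, "succeeded"), (2, 25.5, "pending")],
)

# Passes silently: no rows violate the condition
run_assertion(conn, "amount_non_negative",
              "select * from payments where amount < 0")

# Also passes; would raise if a bad status value slipped in
run_assertion(conn, "status_accepted",
              "select * from payments where status not in "
              "('succeeded', 'failed', 'pending', 'refunded')")
```

Seen through this lens, the choice between YAML-declared tests and inline assertions is a matter of where the violating-rows query is authored, not what it does.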

Warehouse Support and Cloud Integration

This is where the tools diverge most sharply. dbt supports virtually every modern data warehouse through its adapter system: Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, DuckDB, and many others. Dataform, since its acquisition by Google, is optimized for BigQuery: the managed Google Cloud service targets BigQuery, while the legacy open-source CLI retains support for warehouses such as Snowflake, Redshift, and PostgreSQL.

# dbt: profiles.yml supports multiple warehouse targets
# This flexibility is a key differentiator
 
# profiles.yml (simplified)
klivvr_data_platform:
  target: production
  outputs:
    production:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: ANALYTICS
      warehouse: TRANSFORM_WH
      schema: production
 
    development:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: ANALYTICS_DEV
      warehouse: DEV_WH
      schema: "dev_{{ env_var('DBT_USER') }}"

For teams committed to Google Cloud and BigQuery, Dataform's native integration is compelling. It is available as a managed service within the Google Cloud console, with built-in scheduling, monitoring, and IAM integration. There is no infrastructure to manage and no CI/CD pipeline to build for basic workflows. For multi-cloud teams or those using Snowflake, Redshift, or Databricks, dbt's adapter ecosystem makes it the safer choice.
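The {{ env_var('...') }} calls in the profile above are resolved by dbt when it parses the file, which is what keeps credentials out of version control. Conceptually the substitution is just this (a simplified sketch, not dbt's implementation):

```python
# Simplified sketch of dbt's env_var() substitution, not its actual code.
import os
import re

def resolve_env_vars(text: str) -> str:
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in os.environ:
            # dbt raises a compilation error when a required env var is missing
            raise KeyError(f"env var {key} is not set")
        return os.environ[key]
    return re.sub(r"\{\{\s*env_var\('([^']+)'\)\s*\}\}", substitute, text)

os.environ["SNOWFLAKE_ACCOUNT"] = "acme-prod"
print(resolve_env_vars('account: "{{ env_var(\'SNOWFLAKE_ACCOUNT\') }}"'))
# account: "acme-prod"
```

Failing fast on a missing variable matters in practice: a profile that silently falls back to an empty credential produces confusing connection errors much later in the run.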

Making the Decision

The choice between dbt and Dataform is rarely about which tool is objectively better. It depends on the team's context:

Choose dbt when your team uses Snowflake, Redshift, Databricks, or multiple warehouses; when you need the largest ecosystem of packages and community resources; when you want maximum control over the execution environment and CI/CD pipeline; or when your analytics engineers are comfortable with Python and Jinja templating.

Choose Dataform when your team is fully committed to Google Cloud and BigQuery; when you want a managed service with minimal infrastructure overhead; when your team prefers JavaScript/TypeScript over Python; or when you need programmatic model generation that feels native rather than bolted on.

At Klivvr, our Data Platform uses both tools in different contexts. Our primary analytics pipeline runs on dbt with Snowflake, where the mature ecosystem and broad warehouse support are essential. A secondary pipeline that processes BigQuery-native data uses Dataform for its seamless Google Cloud integration. The two tools coexist by operating on different warehouses and serving different consumer groups.

Conclusion

Dataform and dbt are both excellent tools that have elevated the practice of data transformation from ad-hoc SQL scripts to disciplined, testable, version-controlled engineering. The right choice depends less on the tools themselves and more on the surrounding infrastructure: which warehouse you use, which cloud provider you are committed to, and which programming language your team prefers for configuration.

Whichever tool you choose, the most important factor is the modeling practices you build around it. Layered architectures, consistent naming, comprehensive testing, and clear documentation matter far more than whether your ref function uses Jinja or JavaScript syntax. Master the fundamentals of data modeling first, and the choice of tool becomes a pragmatic decision rather than a philosophical one.
