Data Quality Framework: From Manual Chaos to Automated Trust - Part 1

By Dimitris Lampriadis on 22 Apr, 2026

<span id="hs_cos_wrapper_name" class="hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_text" style="" data-hs-cos-general-type="meta_field" data-hs-cos-type="text" >Data Quality Framework: From Manual Chaos to Automated Trust - Part 1</span>

Data Quality Framework: From Manual Chaos to Automated Trust - Part 1
5:30

Introduction

Defining the Data Quality Framework

In the context of modern analytics engineering, a Data Quality Framework is a systematic set of automated safeguards designed to ensure that data remains accurate, reliable, and "fit for purpose" as it moves through the pipeline.

It’s a set of rules and tools that check your work at every step—from the moment you write a line of code to the moment the data hits a dashboard.

The framework embeds validation directly into the development lifecycle. By utilizing dbt for modular testing and documentation, pre-commit hooks to catch syntax errors or missing metadata before code even leaves the developer's machine, and Slim CI to dynamically test only the modified models. Engineers use these tools to fail fast.

This ensures that any data violating business logic or schema integrity is intercepted and blocked before it can ever reach a production dashboard or influence a business decision.

The framework pillars

Every company has different needs, but for this project, we built a framework that focused on two things: Trusting our Code (making sure it’s clean and following rules) and Trusting our Data (making sure the numbers actually make sense). We wanted a system that was invisible but invincible.

The "Before" state: The cost of manual chaos

Since I joined the project most of the processes were manual, everyone had their own way of writing code, and there were no real standards. This made the quality of the development process, but also the data itself unreliable and untrustworthy. Before the framework, the team spent 20% of every sprint fixing syntax errors and re-running failed production jobs. We were burning warehouse credits on code that was destined to fail.

The development process looked like this:

Technical Diagrams - Old workflow

Even after a model was developed and pushed to production there weren’t any automatic tests in place, rather than a set of manual sql scripts that would test the data integrity of the new table.

Goal

We implemented this framework to kill syntax errors at the source. We needed to enforce metadata standards and business logic transformations while ensuring continuous integration for new developments.

Trusting our Code

Pre-commit hooks

Standardization starts on the local machine. Pre-commit hooks was the to-go tool that would enable the standardization of the code, and automating the quality control during development. The tools that were selected to be included in the pre-commit hooks were:

  • SQLFluff 1: This tool enforces SQL styling and formatting. It prevents "creative" indentation and non-standard capitalization. 

  • Ruff 2: We use this for Python linting. It keeps our utility scripts clean and production ready.

  • dbt-checkpoint 3: This validates model properties. It ensures every model has a description and a primary key test before the code moves to the repository. dbt-checkpoint is a package that enables dbt related pre-commit hooks.

Part of the config file of the pre-commit hooks looked like:

fail_fast: true
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v6.0.0
    hooks:
      - id: no-commit-to-branch
        args: [--branch, dev]
        stages: [pre-commit]
      - id: check-merge-conflict
        stages: [pre-commit, manual]
        args: [--assume-in-merge]
      - id: trailing-whitespace
        stages: [pre-commit, manual]
      - id: check-case-conflict
        stages: [pre-commit, manual]
      - id: end-of-file-fixer
        stages: [pre-commit, manual]
      - id: check-toml
        stages: [pre-commit, manual]
      - id: check-yaml
        stages: [pre-commit, manual]
      - id: check-added-large-files
        stages: [pre-commit, manual]
      - id: detect-private-key
        stages: [pre-commit, manual]

  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.7
    hooks:
      - id: dbt-parse
        stages: [pre-commit]
      files: ^<path-to-dbt-directory>/.*\.(sql|yml|yaml)$
      args: ["--cmd-flags", "++project-dir", "<path-to-dbt-directory> "]
      - id: dbt-docs-generate
        stages: [pre-commit]
      files: ^<path-to-dbt-directory>.*\.(sql|yml|yaml|md)$
        args:
          [
            "--cmd-flags",
            "++no-compile",
            "++project-dir",
          "<path-to-dbt-directory>",
            "++state",
          "<path-to-dbt-directory>/target/",
            "++vars",
          "{'disable_run_results': true, 'disable_tests_results': true, 'disable_dbt_artifacts_autoupload': true}",
          ]
      - id: dbt-test
        stages: [pre-commit]
      files: ^<path-to-dbt-directory>/.*\.(sql|yml|yaml)$
      args: [
            "--cmd-flags",
            "++project-dir",
          "<path-to-dbt-directory> ",
            "++exclude",
          "test_type:generic",
            "++vars",
          "{'disable_run_results': true, 'disable_tests_results': true, 'disable_dbt_artifacts_autoupload': true}",
            "--",
          ]
      - id: check-model-has-properties-file
        stages: [pre-commit, manual]
      files: ^<path-to-dbt-directory>/models/.*\.(sql|yml|yaml)$
      exclude: ^<path-to-dbt-directory>/models/<specific-files-to-exclude-if-is-needed>.*\.sql$
        args:
        ["--manifest", "<path-to-dbt-directory>/target/manifest.json"]
      - id: check-model-has-description
        stages: [pre-commit, manual]
      files: ^<path-to-dbt-directory>/models/.*\.(sql|yml|yaml)$
        args:
        ["--manifest", "<path-to-dbt-directory>/target/manifest.json"]

    # The following hook will only test models in a specific layer.
      - id: check-model-has-tests-by-name
        stages: [pre-commit, manual]
      files: ^<path-to-dbt-directory>/models/<specific-directory>/.*\.(sql|yml|yaml)$
        args:
          [
            "--manifest",
          "<path-to-dbt-directory>/target/manifest.json",
            "--tests",
            "unique=1",
            "not_null=11",
            "test_active_records=1",
            "--",
          ]
....

 

These checks run on every commit. If the code fails a rule, the system blocks it.  We mirrored these hooks in the CI/CD pipeline to ensure code quality remains consistent across all developer environments, which will be briefly explained further below. This strategy reduces GitLab pipelines' run times and saves compute resources in the data warehouse.

Eventually, the result of the above implementation in the development process looked like:

 

CI/CD

 Local pre-commit hooks only provide the first layer of trust. CI/CD 4 (Continuous Integration/Continuous Delivery) pipeline completes the development cycle. Local hooks provide the first layer of trust; the CI/CD pipeline completes it. The system automates the build and test cycle, stopping engineers from bypassing local checks when they open a Merge Request. The image below places the pre-commit and the CI/CD pipelines implemented at the client in a context. The client had already in place a GitLab repository.

Screenshot 2026-04-07 173922

When a developer passes all local checks and pushes code to a remote branch, they open a Merge Request. This action triggers the GitLab pipeline. The system runs checks to stop engineers from bypassing local hooks.

Slim CI and state comparison

The most critical part of this pipeline is Slim CI  5. Traditional CI pipelines often test the entire project, which wastes time and warehouse credits. Slim CI uses a "state comparison" method. It compares the new code against the last successful production run to identify only the modified or downstream models.

The pipeline executes these specific models in a dedicated "CI schema." This isolation allows the team to verify that the new business logic works correctly without affecting production data or existing dashboards. The pipeline must pass every check before the "Merge" button becomes active. This automation ensures that no broken code ever reaches the production environment.

This strategy allows the pipeline to trigger specific dbt tests based on the development stage. Understanding when these tests run requires a clear map of the testing layers.

Standardizing the development process and trusting the code stopped the chaos, but logic checks don't catch bad data from sources. Part two will examine our strategy for 'Trusting our Data.' I’ll detail how we deployed Elementary for anomaly detection and dbt-expectations to handle schema validation, finally eliminating the manual SQL scripts that used to guard our warehouse.


At DDBM we help companies build modern data architectures using tools like Snowflake, dbt, and Matillion. If you are curious about how AI-augmented data engineering could help your team, feel free to reach out.


references:
  1. https://www.sqlfluff.com/
  2. https://docs.astral.sh/ruff/
  3. https://pre-commit.com/
  4. https://www.infoworld.com/article/2269266/what-is-cicd-continuous-integration-and-continuous-delivery-explained.html
  5. https://docs.getdbt.com/best-practices/best-practice-workflows?version=1.12#run-only-modified-models-to-test-changes-slim-ci