A lightweight data validation plugin for Kedro

Categories: python, mlops
In this post I introduce a lightweight data validation plugin I developed for Kedro, and share some insights from the development process.
Published: January 12, 2025

Machine learning development often involves processing data through multiple layers, which can introduce data quality issues. Implementing simple data validation early across these layers can save time in debugging and prevent subtle issues from reaching production.

I’ve developed a plugin that integrates TDDA into Kedro workflows for this purpose. Explore the project from the

What Are Kedro and TDDA?

Kedro is an open-source Python framework for creating reproducible and maintainable data science pipelines.
TDDA (Test Driven Data Analysis) is a lightweight Python library designed for simple and efficient data validation.

Why kedro_tdda?

While popular data validation tools like Great Expectations and Soda offer comprehensive solutions, they can feel too heavy for smaller projects or early-stage development.

kedro_tdda bridges this gap by integrating TDDA into Kedro workflows. This allows developers to validate data as part of their pipeline without disrupting the existing flow or introducing unnecessary complexity.

Key Features

  • TDDA Constraints: Infer data characteristics, validate (new) data and detect anomalies.
  • Lightweight Integration: Add data validation to your Kedro pipelines with minimal configuration.
  • Simplicity First: Designed for developers who need straightforward data checks, not extensive setups.
  • Ideal for Small to Medium Projects: Perfect for teams that want quick feedback on data quality during development.

Quick Start Example

For a more extensive example, please check the tutorial.

Infer constraints for the datasets in your project:

kedro tdda discover
INFO - Loading data from companies (CSVDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/companies.yml
INFO - Loading data from reviews (CSVDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/reviews.yml
...
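Conceptually, constraint discovery boils down to scanning each column and recording simple summary properties such as value lengths and, for low-cardinality columns, the set of allowed values. A minimal, hypothetical sketch of that idea in plain Python (illustrative only, not the plugin's or TDDA's actual implementation):

```python
def discover_constraints(rows, max_allowed_values=20):
    """Infer simple per-field constraints from rows (list of dicts
    mapping field name -> string value). Illustrative sketch only."""
    constraints = {}
    fields = rows[0].keys() if rows else []
    for field in fields:
        values = [row[field] for row in rows]
        lengths = [len(v) for v in values]
        spec = {
            "type": "string",
            "min_length": min(lengths),
            "max_length": max(lengths),
        }
        # Only record an allowed-values list for low-cardinality fields.
        distinct = sorted(set(values))
        if len(distinct) <= max_allowed_values:
            spec["allowed_values"] = distinct
        constraints[field] = spec
    return constraints

rows = [
    {"iata_approved": "t", "company_location": "Niue"},
    {"iata_approved": "f", "company_location": "Isle of Man"},
]
print(discover_constraints(rows)["iata_approved"])
# {'type': 'string', 'min_length': 1, 'max_length': 1, 'allowed_values': ['f', 't']}
```

The plugin then serializes constraints like these to per-dataset YAML files under `conf/base/tdda/`, as shown below.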

An example of a constraints file, which can be tweaked if necessary:

# ./conf/base/tdda/companies.yml
companies:
    fields:
        company_location:
            max_length: 29
            min_length: 4
            type: string
        company_rating:
            max_length: 4
            min_length: 2
            type: string
        iata_approved:
            allowed_values:
            - f
            - t
            max_length: 1
            min_length: 1
            type: string
# ...

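Verification then checks each field's values against these recorded properties. The following is a minimal illustrative checker for one field (a hypothetical sketch assuming the YAML above has already been loaded into a dict; the real library does considerably more):

```python
def verify_field(values, spec):
    """Return the names of the constraints in `spec` that `values`
    violate. Illustrative sketch of TDDA-style verification."""
    failures = []
    lengths = [len(v) for v in values]
    if "min_length" in spec and min(lengths) < spec["min_length"]:
        failures.append("min_length")
    if "max_length" in spec and max(lengths) > spec["max_length"]:
        failures.append("max_length")
    if "allowed_values" in spec and not set(values) <= set(spec["allowed_values"]):
        failures.append("allowed_values")
    return failures

# The `iata_approved` spec from the constraints file above.
spec = {"type": "string", "min_length": 1, "max_length": 1,
        "allowed_values": ["f", "t"]}
print(verify_field(["t", "f", "t"], spec))  # []
print(verify_field(["yes", "t"], spec))     # ['max_length', 'allowed_values']
```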
Verify data against the constraints:

kedro tdda verify --dataset companies
INFO - Loading data from companies (CSVDataset)...
INFO - Verification summary `companies`: 20 passes, 0 failures

Data verification is also executed during pipeline runs. For example, when the data is invalid, the run fails with:

# modified constraints - expecting error
kedro run
     INFO - Kedro project kedro-iris-tdda
     INFO - Loading data from companies (CSVDataset)...
>>>  kedro_tdda.utils.TddaVerificationError: Dataset `companies` deviates from constraint specification: 
>>>  ✗ company_location: max_length
...

Finally, in case of deviations, the anomalous records can be detected and persisted to a CSV file.

kedro tdda detect
    INFO - Loading data from companies (CSVDataset)...
>>> INFO - Detection for companies written to ./tdda_detect/companies.csv
>>> WARNING - Dataset `companies` deviates from constraint specification: 
>>> ✗ company_location: max_length
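Conceptually, the detect step filters the dataset down to the rows that violate a constraint and exports them for inspection. A rough pandas sketch of that idea (illustrative only; the column, limit, and output filename here are assumptions mirroring the example above, not the plugin's code):

```python
import pandas as pd

# Toy dataset: the second location exceeds the inferred max_length of 29.
df = pd.DataFrame({
    "company_location": ["Niue",
                         "A very long location name that breaks the limit"],
    "company_rating": ["4.5", "3.0"],
})

# Keep only rows violating the company_location max_length constraint,
# mimicking what a detect step might export to CSV.
anomalies = df[df["company_location"].str.len() > 29]
anomalies.to_csv("companies_anomalies.csv", index=False)
print(len(anomalies))  # 1
```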