A lightweight data validation plugin for kedro
Machine learning development often involves processing data through multiple layers, which can introduce data quality issues. Implementing simple data validation early across these layers can save time in debugging and prevent subtle issues from reaching production.
I’ve developed a plugin that integrates TDDA into Kedro workflows for this purpose. Explore the project in its repository.
What Are Kedro and TDDA?
Kedro is an open-source Python framework for creating reproducible and maintainable data science pipelines.
TDDA (Test Driven Data Analysis) is a lightweight Python library designed for simple and efficient data validation.
Why kedro_tdda?
While popular data validation tools like Great Expectations and Soda offer comprehensive solutions, they can feel too heavy for smaller projects or early-stage development.
kedro_tdda bridges this gap by integrating TDDA into Kedro workflows. This allows developers to validate data as part of their pipeline without disrupting the existing flow or introducing unnecessary complexity.
Key Features
- TDDA Constraints: Infer data characteristics, validate (new) data and detect anomalies.
- Lightweight Integration: Add data validation to your Kedro pipelines with minimal configuration.
- Simplicity First: Designed for developers who need straightforward data checks, not extensive setups.
- Ideal for Small to Medium Projects: Perfect for teams that want quick feedback on data quality during development.
Quick Start Example
For a more extensive example, please check the tutorial.
Infer the constraints for the datasets in your project.
kedro tdda discover
INFO - Loading data from companies (CSVDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/companies.yml
INFO - Loading data from reviews (CSVDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/reviews.yml
...
Example of a constraints file, which can be tweaked if necessary.
# ./conf/base/tdda/companies.yml
companies:
  fields:
    company_location:
      max_length: 29
      min_length: 4
      type: string
    company_rating:
      max_length: 4
      min_length: 2
      type: string
    iata_approved:
      allowed_values:
      - f
      - t
      max_length: 1
      min_length: 1
      type: string
    # ...
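To make the semantics of these fields concrete, here is a hand-rolled sketch in plain pandas (not the plugin's actual code) of what checking `min_length`, `max_length`, and `allowed_values` amounts to; the constraint dict mirrors the YAML above, and the sample values are illustrative:

```python
import pandas as pd

# Constraints hand-copied from the YAML file, reduced for illustration
constraints = {
    "company_location": {"type": "string", "min_length": 4, "max_length": 29},
    "iata_approved": {"type": "string", "allowed_values": ["f", "t"]},
}

def check(df: pd.DataFrame, constraints: dict) -> list:
    """Return a list of human-readable constraint violations."""
    failures = []
    for col, rules in constraints.items():
        s = df[col].dropna().astype(str)
        if "min_length" in rules and (s.str.len() < rules["min_length"]).any():
            failures.append(f"{col}: min_length")
        if "max_length" in rules and (s.str.len() > rules["max_length"]).any():
            failures.append(f"{col}: max_length")
        if "allowed_values" in rules and (~s.isin(rules["allowed_values"])).any():
            failures.append(f"{col}: allowed_values")
    return failures

df = pd.DataFrame({
    "company_location": ["Niue", "Anguilla"],
    "iata_approved": ["t", "x"],  # "x" violates allowed_values
})
print(check(df, constraints))  # → ['iata_approved: allowed_values']
```

TDDA implements these checks (and more, such as numeric min/max and null counts) for you; the point here is only to show what the YAML fields mean.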
Verify data against constraints
kedro tdda verify --dataset companies
INFO - Loading data from companies (CSVDataset)...
INFO - Verification summary `companies`: 20 passes, 0 failures
Data verification is also executed during pipeline runs. For example, when the data is not valid, the run fails with an error:
# modified constraints - expecting error
kedro run
INFO - Kedro project kedro-iris-tdda
INFO - Loading data from companies (CSVDataset)...
>>> kedro_tdda.utils.TddaVerificationError: Dataset `companies` deviates from constraint specification:
>>> ✗ company_location: max_length ...
Finally, in case of deviations, anomalies can be detected and persisted to a CSV file.
kedro tdda detect
INFO - Loading data from companies (CSVDataset)...
>>> INFO - Detection for companies written to ./tdda_detect/companies.csv
>>> WARNING - Dataset `companies` deviates from constraint specification:
>>> ✗ company_location: max_length
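Conceptually, detection filters out the offending rows and persists them for inspection. A rough pandas-only illustration (not the plugin's implementation; the output file name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    # The second value exceeds the max_length of 29 from the constraints file
    "company_location": ["Niue", "A" * 35, "Anguilla"],
})
max_length = 29  # constraint taken from ./conf/base/tdda/companies.yml

# Keep only the rows that break the constraint, mimicking what
# `kedro tdda detect` records for later inspection
anomalies = df[df["company_location"].str.len() > max_length]
anomalies.to_csv("companies_anomalies.csv", index=False)  # hypothetical path
print(len(anomalies))  # → 1
```

In practice TDDA's detection also annotates which constraint each row violates, which is far more useful for debugging than a bare row dump.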