UBC-MDS/DSCI_524_Group_30_Data_Validation

Welcome to DSCI_524_Group_30_Data_Validation

Summary

This open source package performs common data validation checks on a pandas DataFrame. Each function produces clear, informative, and concise output designed to help the user learn more about their data. The functions accept a variety of arguments so that the checks can be adapted to any pandas DataFrame, which makes the package suitable for use in a reproducible data science pipeline.

Functions Included

Function: col_types_validate
Inputs:
  dataframe: pd.DataFrame
  numeric_cols: int (default: 0)
  integer_cols: int (default: 0)
  float_cols: int (default: 0)
  boolean_cols: int (default: 0)
  categorical_cols: int (default: 0)
  text_cols: int (default: 0)
  datetime_cols: int (default: 0)
  allow_extra_cols: bool (default: False)
  column_schema: dict[str, str or type] (optional)
Outputs: str
Description: Validates that a DataFrame contains the expected number of each given logical column category.

Function: missing_values_validate
Inputs:
  df: pd.DataFrame
  col: str
  threshold: float or int
Outputs: bool
Description: Checks whether the proportion of missing values in the given column of the pandas DataFrame exceeds the given threshold (0-1).

Function: outliers_validate
Inputs:
  dataframe: pd.DataFrame
  col: str
  lower_bound: float
  upper_bound: float
  threshold: float
Outputs: str
Description: Validates that a DataFrame column contains an acceptable proportion of values outside a user-defined range.

Function: categorical_validate
Inputs:
  dataframe: pd.DataFrame
  column: str
  num_cat: int
  case: str (optional, default: None)
  spaces: bool (default: False)
Outputs: str
Description: Validates categorical column properties in a pandas DataFrame.

The project aims to re-imagine some of Pandera's functionality in a more user-friendly way, mainly by making the validation output more informative and interpretable.

Setting up the Development Environment

  1. To get started, clone the repository to your local device.
     git clone https://github.com/UBC-MDS/DSCI_524_Group_30_Data_Validation.git
  2. Change directory into the repository.
     cd DSCI_524_Group_30_Data_Validation
  3. Create the Conda environment from the lock file.
     conda-lock install -n project-env conda-lock.yml
  4. Activate the environment.
     conda activate project-env
     You should now see (project-env) in your terminal prompt.
  5. Make sure you have Quarto installed for viewing the documentation site. You can install it from here:
     https://quarto.org/docs/get-started/

Installing the Package

Install the package from TestPyPI:

pip install -i https://test.pypi.org/simple/ dsci-524-group-30-data-validation

Alternatively, install it from the local source into your preferred Python environment using pip:

pip install -e .
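
After either install route, one way to confirm that the package is visible to the current interpreter is to query its installed metadata. The sketch below uses only the Python standard library. The helper name installed_version is hypothetical, and the distribution name is assumed to match the one in the TestPyPI command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution name assumed from the TestPyPI install command above.
# Prints a version string if the package is installed, otherwise None.
print(installed_version("dsci-524-group-30-data-validation"))
```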

Running Tests

You can run the tests, which validate all functions in the package, using pytest.

pytest -v

The -v flag produces more verbose output, listing each test by name along with whether it passed.

Build Documentation

Build Quartodoc Site

Quartodoc is installed as part of the project environment (it is listed in environment.yml). To build the API reference, run:

quartodoc build --verbose

Live preview locally (requires Quarto installed)

If you have Quarto installed locally, you can generate the API reference pages and preview the documentation website:

quarto preview

Documentation building / deployment is automated through GitHub Actions.

Example Use

Column Validation Function (col_types_validate)

Count-based validation only:

import pandas as pd
from dsci_524_group_30_data_validation.col_types_validate import col_types_validate
df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", "Winnipeg"],
    "name": ["John Smith", "Bron Crift", "Pylon Gift", "Akon Sarmist"],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29],
})
col_types_validate(
    dataframe=df,
    integer_cols=1,
    text_cols=3,
)

Expected output:

'All column categories present in the expected numbers.Check complete!'

Column-specific validation using logical type strings:

col_types_validate(
    dataframe=df,
    column_schema={
        "age": "integer",
        "city": "text",
        "name": "text",
    },
)

Expected output:

'All specified columns match their expected types. Check complete!'

Combined count-based and column-specific validation:

col_types_validate(
    dataframe=df,
    integer_cols=1,
    text_cols=3,
    column_schema={
        "age": "integer",
    },
)

Expected output:

'All column categories and specified columns are valid. Check complete!'
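
For intuition about what a count-based check compares, the number of columns in each logical category can be tallied with pandas' own dtype helpers. This is a minimal sketch, not the package's implementation: mapping "integer" to is_integer_dtype and "text" to is_string_dtype is an assumption, and count_by_category is a hypothetical helper.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_string_dtype

df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", "Winnipeg"],
    "name": ["John Smith", "Bron Crift", "Pylon Gift", "Akon Sarmist"],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29],
})


def count_by_category(dataframe):
    """Tally integer and text columns (assumed mapping, for illustration only)."""
    counts = {"integer": 0, "text": 0}
    for name in dataframe.columns:
        column = dataframe[name]
        if is_integer_dtype(column):
            counts["integer"] += 1
        elif is_string_dtype(column):
            counts["text"] += 1
    return counts


counts = count_by_category(df)
print(counts)  # {'integer': 1, 'text': 3}
```

These tallies match the integer_cols=1 and text_cols=3 arguments used in the examples above.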

Missing Values Threshold Function (missing_values_validate)

Example that passes the threshold requirement:

import pandas as pd
from dsci_524_group_30_data_validation.missing_values_validate import missing_values_validate
data = pd.DataFrame({
    "name": ["Alex", None, None, "Austin", None],
    "age": [21, 43, 23, None, 38],
    "sex": ["M", "F", "F", "M", "F"],
    "married": [True, False, None, None, True],
})
missing_values_validate(df=data, col="age", threshold=0.25)

Expected output:

The amount of missing values are valid. Checks completed!
True

Example that fails the threshold requirement:

missing_values_validate(df=data, col="name", threshold=0.05)

Expected output:

Invalid check: the amount of missing values is 0.6, exceeding the threshold: 0.05. Checks completed!
False
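
The proportions reported in these two examples can be reproduced with plain pandas, which may help when choosing a threshold. This is a sketch of the underlying arithmetic only, not the package's code; missing_proportion is a hypothetical helper name.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alex", None, None, "Austin", None],
    "age": [21, 43, 23, None, 38],
})


def missing_proportion(df, col):
    """Fraction of entries in df[col] that are missing (None or NaN)."""
    return float(df[col].isna().mean())


print(missing_proportion(data, "age"))   # 0.2, below the 0.25 threshold
print(missing_proportion(data, "name"))  # 0.6, above the 0.05 threshold
```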

Values Outlier Function (outliers_validate)

Example where outlier proportion is within the threshold:

import pandas as pd
from dsci_524_group_30_data_validation.outlier_validation import outliers_validate
df = pd.DataFrame({"age": [25, 32, 41, 29, 200]})
outliers_validate(
    dataframe=df,
    col="age",
    lower_bound=0,
    upper_bound=100,
    threshold=0.20,
)

Expected output:

'The proportion of outliers is within the acceptable threshold. Check complete!'

Example where outlier proportion exceeds the threshold:

df = pd.DataFrame({"score": [10, 12, 999, 11, 1000]})
outliers_validate(
    dataframe=df,
    col="score",
    lower_bound=0,
    upper_bound=100,
    threshold=0.10,
)

Expected output:

'The proportion of outliers exceeds the threshold 0.1. Check complete!'
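
The outlier proportion behind each result can be checked by hand with plain pandas. The sketch below is illustrative rather than the package's implementation: outlier_proportion is a hypothetical helper, and treating values exactly equal to a bound as inside the range is an assumption. The first example above, where a proportion of 0.2 passes a threshold of 0.20, suggests that a proportion equal to the threshold is accepted.

```python
import pandas as pd


def outlier_proportion(series, lower_bound, upper_bound):
    """Fraction of values strictly below lower_bound or strictly above upper_bound."""
    outside = (series < lower_bound) | (series > upper_bound)
    return float(outside.mean())


ages = pd.Series([25, 32, 41, 29, 200])
scores = pd.Series([10, 12, 999, 11, 1000])

print(outlier_proportion(ages, 0, 100))    # 0.2 (one of five values)
print(outlier_proportion(scores, 0, 100))  # 0.4 (two of five values)
```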

Categorical Column Function (categorical_validate)

import pandas as pd
from dsci_524_group_30_data_validation.str_validate import categorical_validate
df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", None],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29],
})
categorical_validate(
    dataframe=df,
    column="city",
    num_cat=3,
    case="title",
    spaces=False,
)

Expected output:

Expected and actual number of categories are equal
All categories are in title case
'Checks completed!'
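
The properties reported above can also be inspected directly with pandas string methods. This is a rough sketch under stated assumptions, not the package's logic: it assumes that missing values are excluded from the category count and that the spaces argument concerns leading or trailing whitespace, neither of which is confirmed here.

```python
import pandas as pd

city = pd.Series(["Vancouver", "Toronto", "Calgary", None])

# nunique() skips missing values by default, so None is not counted.
n_categories = city.nunique()

# Drop missing entries before applying string checks.
observed = city.dropna()
all_title_case = bool(observed.str.istitle().all())

# Assumed meaning of a whitespace check: any value changed by strip().
has_edge_spaces = bool((observed != observed.str.strip()).any())

print(n_categories)     # 3
print(all_title_case)   # True
print(has_edge_spaces)  # False
```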

Contributors

Daniel Yorke, Cynthia Agata Limantono, Shrijaa Venkatasubramanian Subashini, and Wendy Frankel

Copyright

  • Copyright © 2026 Daniel Yorke, Cynthia Agata Limantono, Shrijaa Venkatasubramanian Subashini, and Wendy Frankel
  • Free software distributed under the MIT License.

About

2025-26 DSCI-524 Group 30 Data Validation: an open source package that performs common data validation checks on pandas DataFrames.
