Welcome to DSCI_524_Group_30_Data_Validation


Summary

This package is an open-source project that performs common data validation checks on a pandas DataFrame. Each function aims to return clear, informative, and concise output that helps the user learn more about their data. The functions accept a variety of arguments so that the checks can be adapted to any pandas DataFrame. This makes the package useful at the validation stage of a data science pipeline, where reproducible checks are needed.

Functions Included

| Function | Inputs | Outputs | Description |
|---|---|---|---|
| `col_types_validate` | `dataframe: pd.DataFrame`, `numeric_cols: int` (default: `0`), `integer_cols: int` (default: `0`), `float_cols: int` (default: `0`), `boolean_cols: int` (default: `0`), `categorical_cols: int` (default: `0`), `text_cols: int` (default: `0`), `datetime_cols: int` (default: `0`), `allow_extra_cols: bool` (default: `False`), `column_schema: dict[str, str or type]` (optional) | `str` | Validates that a DataFrame contains the expected number of columns in each given logical column category. |
| `missing_values_validate` | `df: pd.DataFrame`, `col: str`, `threshold: float or int` | `bool` | Checks whether the proportion of missing values in the given column of a pandas DataFrame exceeds the given threshold (between 0 and 1). |
| `outliers_validate` | `dataframe: pd.DataFrame`, `col: str`, `lower_bound: float`, `upper_bound: float`, `threshold: float` | `str` | Validates that a DataFrame column contains an acceptable proportion of values outside a user-defined range. |
| `categorical_validate` | `dataframe: pd.DataFrame`, `column: str`, `num_cat: int`, `case: str` (optional, default: `None`), `spaces: bool` (default: `False`) | `str` | Validates categorical column properties in a pandas DataFrame. |

The project re-imagines some of the functions of Pandera in a more user-friendly way. Its main aim is to improve on the output of the Pandera functions, making it more informative and interpretable.

Setting up the Development Environment

To get started, clone the repository to your local device.

git clone https://github.com/UBC-MDS/DSCI_524_Group_30_Data_Validation.git

Change directory into the repository

cd DSCI_524_Group_30_Data_Validation

Create the Conda environment from the lock file

conda-lock install -n project-env conda-lock.yml

Activate the environment

conda activate project-env

You should now see (project-env) in your terminal prompt.

Make sure you have Quarto installed for viewing the documentation site. You can install it from here:

https://quarto.org/docs/get-started/

Installing the Package

You can install the package from TestPyPI using the following:

pip install -i https://test.pypi.org/simple/ dsci-524-group-30-data-validation

Alternatively, from the root of the cloned repository, you can install the package from the local source into your preferred Python environment using pip:

pip install -e .

Running Tests

You can run tests which validate all functions in the package using pytest.

pytest -v

The -v flag produces more verbose output, showing the name of each test and whether it passed.

Build Documentation

Build Quartodoc Site

Quartodoc is listed as a dependency in the environment.yml file, so it does not need to be installed separately.

quartodoc build --verbose

Live preview locally (requires Quarto installed)

If you have Quarto installed locally, you can preview the documentation website, including the generated API reference pages:

quarto preview

Documentation building / deployment is automated through GitHub Actions.

Example Use

Column Validation Function (`col_types_validate`)

Count-based validation only:

import pandas as pd
from dsci_524_group_30_data_validation.col_types_validate import col_types_validate
df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", "Winnipeg"],
    "name": ["John Smith", "Bron Crift", "Pylon Gift", "Akon Sarmist"],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29],
})
col_types_validate(
    dataframe=df,
    integer_cols=1,
    text_cols=3,
)

Expected output:

'All column categories present in the expected numbers.Check complete!'

Column-specific validation using logical type strings:

col_types_validate(
    dataframe=df,
    column_schema={
        "age": "integer",
        "city": "text",
        "name": "text",
    },
)

Expected output:

'All specified columns match their expected types. Check complete!'

Combined count-based and column-specific validation:

col_types_validate(
    dataframe=df,
    integer_cols=1,
    text_cols=3,
    column_schema={
        "age": "integer",
    },
)

Expected output:

'All column categories and specified columns are valid. Check complete!'
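To illustrate how a column can be assigned a logical category in the first place, here is a minimal sketch built only on pandas' public dtype helpers. It is not the package's implementation: the `logical_type` helper and the way it maps dtypes to category names are assumptions made for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes


def logical_type(series: pd.Series) -> str:
    """Hypothetical helper: map a pandas Series to a coarse logical category."""
    # Test the most specific dtypes first and fall back to "text".
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_integer_dtype(series):
        return "integer"
    if ptypes.is_float_dtype(series):
        return "float"
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    if isinstance(series.dtype, pd.CategoricalDtype):
        return "categorical"
    return "text"


df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", "Winnipeg"],
    "age": [25, 32, 41, 29],
})

# Apply the helper to every column, then count columns per category.
counts = df.apply(logical_type).value_counts().to_dict()
print(counts)
```

Counting columns per category in this way is what a count-based check compares against the expected numbers. A schema-based check would instead compare the category of each named column.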

Missing Values Threshold Function (`missing_values_validate`)

Example that passes the threshold requirement:

import pandas as pd
from dsci_524_group_30_data_validation.missing_values_validate import missing_values_validate
data = pd.DataFrame({
    "name": ["Alex", None, None, "Austin", None],
    "age": [21, 43, 23, None, 38],
    "sex": ["M", "F", "F", "M", "F"],
    "married": [True, False, None, None, True],
})
missing_values_validate(df=data, col="age", threshold=0.25)

Expected output:

The amount of missing values are valid. Checks completed!
True

Example that does not pass the threshold requirement:

missing_values_validate(df=data, col="name", threshold=0.05)

Expected output:

Invalid check: the amount of missing values is 0.6, exceeding the threshold: 0.05. Checks completed!
False
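The proportions reported above can be reproduced directly with pandas. The snippet below is a minimal sketch of that arithmetic, assuming the proportion is the count of missing entries divided by the number of rows. It does not show how the package computes it internally.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alex", None, None, "Austin", None],
    "age": [21, 43, 23, None, 38],
})

# isna() flags missing entries; the mean of a boolean column is the share of True values.
missing_share = data.isna().mean()
name_share = float(missing_share["name"])
age_share = float(missing_share["age"])
print(name_share, age_share)  # 0.6 0.2
```

With these values, the "age" column (0.2) stays under a threshold of 0.25, while the "name" column (0.6) exceeds a threshold of 0.05, which matches the two results shown above.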

Values Outlier Function (`outliers_validate`)

Example where outlier proportion is within the threshold:

import pandas as pd
from dsci_524_group_30_data_validation.outlier_validation import outliers_validate
df = pd.DataFrame({"age": [25, 32, 41, 29, 200]})
outliers_validate(
    dataframe=df,
    col="age",
    lower_bound=0,
    upper_bound=100,
    threshold=0.20,
)

Expected output:

'The proportion of outliers is within the acceptable threshold. Check complete!'

Example where outlier proportion exceeds the threshold:

df = pd.DataFrame({"score": [10, 12, 999, 11, 1000]})
outliers_validate(
    dataframe=df,
    col="score",
    lower_bound=0,
    upper_bound=100,
    threshold=0.10,
)

Expected output:

'The proportion of outliers exceeds the threshold 0.1. Check complete!'
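To see why this example fails, the proportion of out-of-range values can be computed by hand. This is a minimal sketch using plain pandas. Treating both bounds as inclusive is an assumption, and the package may handle the boundaries differently.

```python
import pandas as pd

scores = pd.Series([10, 12, 999, 11, 1000])
lower_bound, upper_bound = 0, 100

# between() is inclusive on both ends by default; negate it to flag out-of-range values.
outside = ~scores.between(lower_bound, upper_bound)
proportion = float(outside.mean())
print(proportion)  # 0.4
```

Two of the five values (999 and 1000) fall outside the range, giving a proportion of 0.4, which is above the 0.10 threshold.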

Categorical Column Function (`categorical_validate`)

import pandas as pd
from dsci_524_group_30_data_validation.str_validate import categorical_validate
df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", None],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29],
})
categorical_validate(
    dataframe=df,
    column="city",
    num_cat=3,
    case="title",
    spaces=False,
)

Expected output:

Expected and actual number of categories are equal
All categories are in title case
'Checks completed!'
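The same kinds of properties can be inspected with pandas string methods. The sketch below is illustrative only. It assumes that missing values are excluded from the category count and that a spaces check means looking for leading or trailing whitespace. Neither assumption is confirmed by the package documentation above.

```python
import pandas as pd

city = pd.Series(["Vancouver", "Toronto", "Calgary", None])

# nunique() ignores missing values by default (dropna=True).
n_categories = city.nunique()

# Drop missing values before the string checks so None does not propagate as NaN.
observed = city.dropna()
all_title_case = bool(observed.str.istitle().all())
has_padding = bool((observed != observed.str.strip()).any())
print(n_categories, all_title_case, has_padding)  # 3 True False
```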

Contributors

Daniel Yorke, Cynthia Agata Limantono, Shrijaa Venkatasubramanian Subashini, and Wendy Frankel

Copyright

Free software distributed under the MIT License.
