This open source package performs common data validation checks on a Pandas DataFrame. Each function aims to produce clear, informative, and concise output that helps the user learn more about their data. The functions accept a variety of arguments so that they can be adapted to validating any Pandas DataFrame. This makes the package useful in data science pipelines where validation needs to be reproducible.
| Function | Inputs | Outputs | Description |
|---|---|---|---|
| col_types_validate | dataframe: pd.DataFrame<br>numeric_cols: int (default: 0)<br>integer_cols: int (default: 0)<br>float_cols: int (default: 0)<br>boolean_cols: int (default: 0)<br>categorical_cols: int (default: 0)<br>text_cols: int (default: 0)<br>datetime_cols: int (default: 0)<br>allow_extra_cols: bool (default: False)<br>column_schema: dict[str, str or type] (optional) | str | Validates that a DataFrame contains the expected number of each given logical column category. |
| missing_values_validate | df: pd.DataFrame<br>col: str<br>threshold: float or int | bool | Checks whether the proportion of missing values in the given column of the pandas DataFrame exceeds the given threshold (0-1). |
| outliers_validate | dataframe: pd.DataFrame<br>col: str<br>lower_bound: float<br>upper_bound: float<br>threshold: float | str | Validates that a DataFrame column contains an acceptable proportion of values outside a user-defined range. |
| categorical_validate | dataframe: pd.DataFrame<br>column: str<br>num_cat: int<br>case: str (optional, default: None)<br>spaces: bool (default: False) | str | Validates categorical column properties in a pandas DataFrame. |
The project re-imagines some of Pandera's functions in a more user-friendly way. Its main aim is to improve on the output of Pandera's functions, making it more informative and interpretable.
- To get started, clone the repository to your local device.
git clone https://github.com/UBC-MDS/DSCI_524_Group_30_Data_Validation.git
- Change directory into the repository.
cd DSCI_524_Group_30_Data_Validation
- Create the Conda environment from the lock file.
conda-lock install -n project-env conda-lock.yml
- Activate the environment.
conda activate project-env
You should now see (project-env) in your terminal prompt.
- Make sure you have Quarto installed for viewing the documentation site. You can install it from here:
https://quarto.org/docs/get-started/
Either install from TestPyPI using the following:
pip install -i https://test.pypi.org/simple/ dsci-524-group-30-data-validation
Or install this package from the local source into your preferred Python environment using pip:
pip install -e .
You can run the tests, which validate all functions in the package, using pytest.
pytest -v
The -v flag produces more verbose output, showing the name of each test and whether it passed.
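
For readers new to pytest, a test file is a module of functions whose names start with `test_` and that use plain `assert` statements. The sketch below is illustrative only: `exceeds_threshold` is a hypothetical stand-in, not a function from this package, and the file name in the comment is an assumption.

```python
# Hypothetical file: tests/test_threshold_example.py


def exceeds_threshold(proportion: float, threshold: float) -> bool:
    """Stand-in for a function under test: True when proportion is above threshold."""
    return proportion > threshold


def test_proportion_equal_to_threshold_passes():
    # A proportion exactly at the threshold is treated as acceptable here.
    assert not exceeds_threshold(0.20, 0.20)


def test_proportion_above_threshold_fails():
    assert exceeds_threshold(0.6, 0.05)
```

Running `pytest -v` in a directory containing such a file lists each test function by name alongside its result.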
Quartodoc is included in the environment.yml file.
quartodoc build --verbose
If you have Quarto installed locally, you can generate the API reference pages and preview the documentation website:
quarto preview
Documentation building and deployment are automated through GitHub Actions.
Count-based validation only:
import pandas as pd
from dsci_524_group_30_data_validation.col_types_validate import col_types_validate

df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", "Winnipeg"],
    "name": ["John Smith", "Bron Crift", "Pylon Gift", "Akon Sarmist"],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29]
})

col_types_validate(
    dataframe=df,
    integer_cols=1,
    text_cols=3
)

Expected output:
'All column categories present in the expected numbers.Check complete!'

Column-specific validation using logical type strings:
col_types_validate(
    dataframe=df,
    column_schema={
        "age": "integer",
        "city": "text",
        "name": "text"
    }
)

Expected output:
'All specified columns match their expected types. Check complete!'

Combined count-based and column-specific validation:
col_types_validate(
    dataframe=df,
    integer_cols=1,
    text_cols=3,
    column_schema={
        "age": "integer"
    }
)

Expected output:
'All column categories and specified columns are valid. Check complete!'

Passed the threshold requirement example:
import pandas as pd
from dsci_524_group_30_data_validation.missing_values_validate import missing_values_validate

data = pd.DataFrame({
    "name": ["Alex", None, None, "Austin", None],
    "age": [21, 43, 23, None, 38],
    "sex": ["M", "F", "F", "M", "F"],
    "married": [True, False, None, None, True]
})

missing_values_validate(df=data, col="age", threshold=0.25)

Expected output:
The amount of missing values are valid. Checks completed!
True

Not passing the threshold requirement example:
missing_values_validate(df=data, col="name", threshold=0.05)

Expected output:
Invalid check: the amount of missing values is 0.6, exceeding the threshold: 0.05. Checks completed!
False

Example where outlier proportion is within the threshold:
import pandas as pd
from dsci_524_group_30_data_validation.outlier_validation import outliers_validate

df = pd.DataFrame({"age": [25, 32, 41, 29, 200]})

outliers_validate(
    dataframe=df,
    col="age",
    lower_bound=0,
    upper_bound=100,
    threshold=0.20
)

Expected output:
'The proportion of outliers is within the acceptable threshold. Check complete!'

Example where outlier proportion exceeds the threshold:
df = pd.DataFrame({"score": [10, 12, 999, 11, 1000]})

outliers_validate(
    dataframe=df,
    col="score",
    lower_bound=0,
    upper_bound=100,
    threshold=0.10
)

Expected output:
'The proportion of outliers exceeds the threshold 0.1. Check complete!'

import pandas as pd
from dsci_524_group_30_data_validation.str_validate import categorical_validate

df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Calgary", None],
    "gender": ["M", "F", "F", "M"],
    "age": [25, 32, 41, 29]
})

categorical_validate(
    dataframe=df,
    column="city",
    num_cat=3,
    case="title",
    spaces=False
)

Expected output:
Expected and actual number of categories are equal
All categories are in title case
'Checks completed!'

Daniel Yorke, Cynthia Agata Limantono, Shrijaa Venkatasubramanian Subashini, and Wendy Frankel
- Copyright © 2026 Daniel Yorke, Cynthia Agata Limantono, Shrijaa Venkatasubramanian Subashini, and Wendy Frankel
- Free software distributed under the MIT License.