This project implements an ETL (Extract, Transform, Load) pipeline to process environmental crime data in Colombia. It is designed to extract data from a public JSON source, transform it for analysis, and load it into both a PostgreSQL database and Amazon S3 for cloud storage.
The data used in this ETL pipeline comes from the Environmental Crimes Dataset, which provides information on various environmental crimes reported across Colombia.
- **API Integration:** Uses the `requests` library to retrieve data from the open API, ensuring data freshness with each pipeline run.
- **Error Handling:** Implements robust exception handling to manage connectivity issues and data inconsistencies during extraction.
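A minimal extraction sketch along these lines; the endpoint URL here is a hypothetical placeholder, not the project's actual dataset URL:

```python
import requests

# Hypothetical endpoint for illustration; the real dataset URL is
# configured in the project, not shown here.
API_URL = "https://www.datos.gov.co/resource/example.json"

def extract(url=API_URL, timeout=30):
    """Fetch raw JSON records from the open API, failing loudly on errors."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
        return response.json()
    except requests.RequestException as exc:
        # Connectivity problems and bad responses surface here for logging.
        raise RuntimeError(f"Extraction failed for {url}: {exc}") from exc
```

Fetching on every call, rather than caching, is what keeps each pipeline run working against fresh data.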
- **DataFrame Manipulation:** Applies `pandas` to clean and reshape the data for analysis. Column names are standardized, and data types are adjusted to enhance usability.
- **Date Parsing:** Converts date fields into a consistent format, facilitating accurate analysis and storage.
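The transformation step might look like the following sketch; the column names (`Fecha Hecho` / `fecha_hecho`) are illustrative assumptions, not the dataset's actual schema:

```python
import pandas as pd

def transform(records):
    """Standardize column names and parse dates into consistent dtypes."""
    df = pd.DataFrame(records)
    # Normalize headers: trimmed, lowercase, underscores instead of spaces.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Parse a hypothetical date column; invalid values become NaT rather
    # than aborting the run.
    if "fecha_hecho" in df.columns:
        df["fecha_hecho"] = pd.to_datetime(df["fecha_hecho"], errors="coerce")
    return df
```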
- **PostgreSQL Loading:** Configures and loads transformed data into a PostgreSQL database using `psycopg2`. If the table does not exist, it is created, ensuring seamless integration.
- **AWS S3 Storage:** Uploads processed data to an Amazon S3 bucket, making it accessible for downstream applications and analytics.
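A hedged sketch of both loading targets. The table name, DSN, bucket, and key are placeholders; every column is stored as `TEXT` for brevity; and `psycopg2`/`boto3` are imported inside the functions so the sketch stays importable even where they are not installed:

```python
import io

import pandas as pd

def create_table_sql(table, df):
    """Build CREATE TABLE IF NOT EXISTS from the frame's columns.

    All columns are TEXT here for simplicity; the real pipeline would map
    pandas dtypes to proper PostgreSQL types.
    """
    cols = ", ".join(f"{name} TEXT" for name in df.columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols});"

def load_to_postgres(df, table, dsn):
    """Create the table if needed, then insert the rows via psycopg2."""
    import psycopg2  # local import keeps the sketch importable without it

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(create_table_sql(table, df))
        placeholders = ", ".join(["%s"] * len(df.columns))
        cur.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(row) for row in df.itertuples(index=False)],
        )

def load_to_s3(df, bucket, key):
    """Upload the frame as CSV to S3 via boto3 (bucket/key are assumptions)."""
    import boto3  # local import, same reason as above

    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8")
    )
```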
- **Centralized Logging:** Sets up a logging system to track the progress of each ETL step. Logs are saved both to the console and to a file, enabling troubleshooting and monitoring.
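A minimal version of such a dual-destination logger, using only the standard library (the log file name is an assumption):

```python
import logging

def setup_logger(log_file="etl.log"):
    """Create a logger that writes to both the console and a file."""
    logger = logging.getLogger("etl")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```

Each ETL stage can then call `setup_logger().info("...")` and the same message lands in both destinations.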
- **Automated Testing for ETL Stages:** Uses `unittest` and `unittest.mock` to validate each ETL phase, ensuring data integrity and reliability. The tests cover extraction, transformation, and loading functions, simulating various scenarios.
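For illustration, a self-contained test in this style that mocks `requests.get` so no network call is made; the `extract` function here is a stand-in for the project's real one:

```python
import unittest
from unittest.mock import MagicMock, patch

import requests

def extract(url):
    """Minimal extract under test; a stand-in for the project's real function."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

class TestExtract(unittest.TestCase):
    @patch("requests.get")
    def test_returns_parsed_json(self, mock_get):
        # Simulate a successful API response without touching the network.
        mock_get.return_value = MagicMock(json=lambda: [{"municipio": "Cali"}])
        self.assertEqual(extract("https://example.com/data.json"),
                         [{"municipio": "Cali"}])
        mock_get.assert_called_once()
```

The same patching pattern works for the load phase, swapping mocks in for the database connection and the S3 client.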
The `aws-etl` directory contains code to deploy the ETL pipeline on AWS Lambda, enabling automatic data updates in the cloud.
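A minimal sketch of what such a Lambda entry point could look like; the stage functions are stubs standing in for the real ETL code, and the handler name is only the common convention, not necessarily the repository's actual entry point:

```python
def extract():
    """Stub standing in for the real extraction stage."""
    return [{"municipio": "Cali", "delito": "deforestación"}]

def transform(records):
    """Stub: the real stage would clean and reshape the records here."""
    return records

def load(rows):
    """Stub: the real stage would write to PostgreSQL and S3 here."""

def lambda_handler(event, context):
    """Entry point AWS Lambda invokes, e.g. on an EventBridge schedule."""
    rows = transform(extract())
    load(rows)
    return {"statusCode": 200, "body": f"Processed {len(rows)} rows"}
```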
1. Clone the repository:

   ```bash
   git clone git@github.com:LiliValGo/ETL-Pipeline-Environmental-Crimes.git
   cd ETL-Pipeline-Environmental-Crimes
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Configure Database and S3 Access: Edit the `config.ini` file with your PostgreSQL and AWS credentials.
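   An illustrative `config.ini` layout; the section and key names are assumptions about what the project expects, not its actual schema:

   ```ini
   [postgresql]
   host = localhost
   port = 5432
   database = environmental_crimes
   user = your_user
   password = your_password

   [aws]
   aws_access_key_id = YOUR_ACCESS_KEY
   aws_secret_access_key = YOUR_SECRET_KEY
   bucket = your-bucket-name
   ```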
4. Run the ETL Pipeline:

   ```bash
   python etl-logging/logs.py
   ```
5. Run Unit Tests:

   ```bash
   python -m unittest discover -s tests
   ```