This project implements an ETL (Extract, Transform, Load) pipeline to process environmental crime data in Colombia. It is designed to extract data from a public JSON source, transform it for analysis, and load it into both a PostgreSQL database and Amazon S3 for cloud storage.
The data used in this ETL pipeline comes from the Environmental Crimes Dataset, which provides information on various environmental crimes reported across Colombia.
- **API Integration:** Uses the `requests` library to retrieve data from the open API, ensuring data freshness with each pipeline run.
- **Error Handling:** Implements robust exception handling to manage connectivity issues and data inconsistencies during extraction.
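A minimal extraction sketch along these lines; the endpoint URL here is a hypothetical placeholder, not the project's actual dataset URL:

```python
import requests

# Hypothetical endpoint for illustration; the real dataset URL is
# configured in the project, not shown here.
API_URL = "https://www.datos.gov.co/resource/example.json"

def extract(url=API_URL, timeout=30):
    """Fetch raw JSON records from the open API, failing loudly on errors."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
        return response.json()
    except requests.RequestException as exc:
        # Connectivity problems and bad responses surface here for logging.
        raise RuntimeError(f"Extraction failed for {url}: {exc}") from exc
```

Fetching on every call, rather than caching, is what keeps each pipeline run working against fresh data.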
- **DataFrame Manipulation:** Applies `pandas` to clean and reshape the data for analysis. Column names are standardized, and data types are adjusted to enhance usability.
- **Date Parsing:** Converts date fields into a consistent format, facilitating accurate analysis and storage.
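The transformation step might look like the following sketch; the column names (`Fecha Hecho` / `fecha_hecho`) are illustrative assumptions, not the dataset's actual schema:

```python
import pandas as pd

def transform(records):
    """Standardize column names and parse dates into consistent dtypes."""
    df = pd.DataFrame(records)
    # Normalize headers: trimmed, lowercase, underscores instead of spaces.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Parse a hypothetical date column; invalid values become NaT rather
    # than aborting the run.
    if "fecha_hecho" in df.columns:
        df["fecha_hecho"] = pd.to_datetime(df["fecha_hecho"], errors="coerce")
    return df
```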
- **PostgreSQL Loading:** Configures and loads transformed data into a PostgreSQL database using `psycopg2`. If the table does not exist, it is created, ensuring seamless integration.
- **AWS S3 Storage:** Uploads processed data to an Amazon S3 bucket, making it accessible for downstream applications and analytics.
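A hedged sketch of both loading targets. The table name, DSN, bucket, and key are placeholders; every column is stored as `TEXT` for brevity; and `psycopg2`/`boto3` are imported inside the functions so the sketch stays importable even where they are not installed:

```python
import io

import pandas as pd

def create_table_sql(table, df):
    """Build CREATE TABLE IF NOT EXISTS from the frame's columns.

    All columns are TEXT here for simplicity; the real pipeline would map
    pandas dtypes to proper PostgreSQL types.
    """
    cols = ", ".join(f"{name} TEXT" for name in df.columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols});"

def load_to_postgres(df, table, dsn):
    """Create the table if needed, then insert the rows via psycopg2."""
    import psycopg2  # local import keeps the sketch importable without it

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(create_table_sql(table, df))
        placeholders = ", ".join(["%s"] * len(df.columns))
        cur.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(row) for row in df.itertuples(index=False)],
        )

def load_to_s3(df, bucket, key):
    """Upload the frame as CSV to S3 via boto3 (bucket/key are assumptions)."""
    import boto3  # local import, same reason as above

    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8")
    )
```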
- **Centralized Logging:** Sets up a logging system to track the progress of each ETL step. Logs are saved both to the console and to a file, enabling troubleshooting and monitoring.
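A minimal version of such a dual-destination logger, using only the standard library (the log file name is an assumption):

```python
import logging

def setup_logger(log_file="etl.log"):
    """Create a logger that writes to both the console and a file."""
    logger = logging.getLogger("etl")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```

Each ETL stage can then call `setup_logger().info("...")` and the same message lands in both destinations.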
- **Automated Testing for ETL Stages:** Uses `unittest` and `unittest.mock` to validate each ETL phase, ensuring data integrity and reliability. The tests cover extraction, transformation, and loading functions, simulating various scenarios.
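For illustration, a self-contained test in this style that mocks `requests.get` so no network call is made; the `extract` function here is a stand-in for the project's real one:

```python
import unittest
from unittest.mock import MagicMock, patch

import requests

def extract(url):
    """Minimal extract under test; a stand-in for the project's real function."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

class TestExtract(unittest.TestCase):
    @patch("requests.get")
    def test_returns_parsed_json(self, mock_get):
        # Simulate a successful API response without touching the network.
        mock_get.return_value = MagicMock(json=lambda: [{"municipio": "Cali"}])
        self.assertEqual(extract("https://example.com/data.json"),
                         [{"municipio": "Cali"}])
        mock_get.assert_called_once()
```

The same patching pattern works for the load phase, swapping mocks in for the database connection and the S3 client.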
The `aws-etl` directory contains code to deploy the ETL pipeline on AWS Lambda, enabling automatic data updates in the cloud.
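A minimal sketch of what such a Lambda entry point could look like; the stage functions are stubs standing in for the real ETL code, and the handler name is only the common convention, not necessarily the repository's actual entry point:

```python
def extract():
    """Stub standing in for the real extraction stage."""
    return [{"municipio": "Cali", "delito": "deforestación"}]

def transform(records):
    """Stub: the real stage would clean and reshape the records here."""
    return records

def load(rows):
    """Stub: the real stage would write to PostgreSQL and S3 here."""

def lambda_handler(event, context):
    """Entry point AWS Lambda invokes, e.g. on an EventBridge schedule."""
    rows = transform(extract())
    load(rows)
    return {"statusCode": 200, "body": f"Processed {len(rows)} rows"}
```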
1. Clone the repository:

   ```bash
   git clone git@github.com:LiliValGo/ETL-Pipeline-Environmental-Crimes.git
   cd ETL-Pipeline-Environmental-Crimes
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Configure Database and S3 Access: Edit the `config.ini` file with your PostgreSQL and AWS credentials.
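   An illustrative `config.ini` layout; the section and key names are assumptions about what the project expects, not its actual schema:

   ```ini
   [postgresql]
   host = localhost
   port = 5432
   database = environmental_crimes
   user = your_user
   password = your_password

   [aws]
   aws_access_key_id = YOUR_ACCESS_KEY
   aws_secret_access_key = YOUR_SECRET_KEY
   bucket = your-bucket-name
   ```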
4. Run the ETL Pipeline:

   ```bash
   python etl-logging/logs.py
   ```
5. Run Unit Tests:

   ```bash
   python -m unittest discover -s tests
   ```