This project develops an intelligent Retrieval-Augmented Generation (RAG) chatbot designed to answer user queries based on a knowledge base of customer complaint narratives. By combining efficient information retrieval with a language model, the chatbot aims to provide relevant and coherent responses to inquiries in the finance sector.
- Data Preprocessing: Cleans and prepares raw complaint data for analysis.
- Text Chunking & Embedding: Breaks down large narratives into smaller, semantically rich chunks and converts them into numerical embeddings.
- FAISS Indexing: Utilizes FAISS for fast and efficient similarity search of text embeddings.
- RAG Pipeline: Retrieves relevant information from the knowledge base and uses it to augment language model responses.
- Interactive Dashboards: Provides Streamlit-based interactive visualizations for demonstrating the chatbot's capabilities and exploring complaint data.
- CI/CD: Automated testing and deployment workflows using GitHub Actions.
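To make the retrieval and augmentation steps listed above more concrete, here is a minimal sketch in plain Python. It is not the code in src/rag_pipeline.py: the function names (`cosine`, `retrieve`, `build_prompt`), the hand-made 3-dimensional vectors, and the prompt wording are all assumptions for illustration, and the brute-force loop is a stand-in for the FAISS similarity search the project uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k chunk texts most similar to the query vector.

    Brute-force stand-in for a vector index lookup.
    """
    ranked = sorted(chunks, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved chunks ahead of the question (hypothetical template)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Toy data: hand-made vectors, not real embeddings.
CHUNKS = [
    ("Card was charged twice for one purchase.", [0.9, 0.1, 0.0]),
    ("Loan payoff amount was reported incorrectly.", [0.1, 0.9, 0.0]),
    ("Savings account was closed without notice.", [0.0, 0.2, 0.9]),
]

top = retrieve([1.0, 0.0, 0.0], CHUNKS, k=2)
prompt = build_prompt("Why are customers unhappy with credit cards?", top)
print(prompt)
```

The prompt string is what would be handed to the language model. In a real pipeline the query vector would come from the same embedding model that produced the chunk vectors.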
The repository is organized as follows:
    .github/
        workflows/                  # GitHub Actions CI/CD workflows
    dashboards/                     # Streamlit applications for interactive dashboards
    docs/                           # Project documentation and images
    models/                         # (Optional) Pre-trained models or checkpoints
    notebooks/                      # Jupyter notebooks for EDA, prototyping, and evaluation
    src/                            # Core Python source code
        data_preprocessing.py       # Scripts for cleaning and preparing data
        rag_pipeline.py             # Implementation of the RAG pipeline
        text_chunking_embedding.py  # Handles text chunking and embedding generation
        utils.py                    # Utility functions
    tests/                          # Unit and integration tests
    vectorstore/                    # FAISS index and metadata storage
    .gitignore                      # Specifies intentionally untracked files to ignore
    README.md                       # Project README file
    design-doc.md                   # Detailed design document
    requirements.txt                # Python dependencies
To set up the project locally, follow these steps:
- Clone the repository:
    git clone https://github.com/your-username/CrediTrust-Complaint-RAG-Chatbot.git
    cd CrediTrust-Complaint-RAG-Chatbot
- Create a virtual environment (recommended):
    python -m venv env
    # On Windows
    .\env\Scripts\activate
    # On macOS/Linux
    source env/bin/activate
- Install dependencies:
    pip install -r requirements.txt
- Download necessary data: Ensure you have your filtered_complaints.csv file in the data/ directory. This file is required by the text_chunking_embedding.py script.
- Test the setup:
    python test_setup.py
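The contents of test_setup.py are not shown here, so the following is only a sketch of the kind of checks such a script might perform: confirming the interpreter version and that key packages can be imported. The helper names `python_ok` and `missing_modules` are hypothetical, and the module list is an assumption that should be adjusted to match requirements.txt.

```python
import importlib.util
import sys
from typing import Iterable, List, Tuple


def python_ok(minimum: Tuple[int, int] = (3, 8)) -> bool:
    """True if the running interpreter is at least the given (major, minor) version."""
    return sys.version_info[:2] >= minimum


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names, not taken from the project's requirements.txt.
REQUIRED = ["pandas", "faiss", "streamlit"]

print("Python version OK:", python_ok())
print("Missing modules:", missing_modules(REQUIRED))
```

An empty list on the second line means every listed package is importable from the active environment. A non-empty list usually points to a virtual environment that is not activated or dependencies that were not installed.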
First, process the complaint data, generate embeddings, and create the FAISS index by running the text_chunking_embedding.py script:
    python src/text_chunking_embedding.py
This script will:
- Load data from data/filtered_complaints.csv.
- Chunk the narratives and generate embeddings.
- Save the chunks, embeddings, and metadata to the vectorstore/ directory.
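The chunking step above can be pictured with a short sketch. This is not the logic in src/text_chunking_embedding.py: it assumes a simple word-based sliding window, and the function names (`chunk_text`, `build_records`), the window size, the overlap, and the record fields are all hypothetical. Embedding generation and FAISS indexing are omitted because they depend on the model and libraries the project actually uses.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into windows of chunk_size words.

    Consecutive windows share `overlap` words so that context is not
    cut off at a chunk boundary.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_records(complaint_id: str, product: str, narrative: str) -> List[Dict[str, object]]:
    """Attach source metadata to every chunk so answers can be traced back."""
    return [
        {"complaint_id": complaint_id, "product": product, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(narrative))
    ]


NARRATIVE = (
    "I was charged a late fee even though my payment "
    "was made two days before the due date"
)
records = build_records("123", "Credit card", NARRATIVE)
for record in records:
    print(record["chunk_index"], record["text"])
```

Keeping the complaint ID and product alongside each chunk is what lets a retrieved passage be attributed to its original complaint. Real pipelines often use much larger windows measured in characters or tokens.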
Use the startup script to choose your interface:
    python run_app.py
Streamlit (Recommended - Better UI):
    streamlit run dashboards/streamlit_app.py
Gradio (Alternative):
    python dashboards/app.py
Both applications will open in your web browser, allowing you to interact with the chatbot and visualize the data.
To run the project's tests, use pytest (ensure it's installed via requirements.txt):
    pytest tests/
The project uses GitHub Actions for continuous integration and continuous deployment. The workflow defined in .github/workflows/ci.yml automatically builds and tests the application on every push to main and on pull requests targeting main.
- Import Errors:
  - Ensure all dependencies are installed: pip install -r requirements.txt
  - Check that you're using Python 3.8 or higher
  - Verify your virtual environment is activated
- Model File Not Found:
  - The app will automatically use fallback models if the main GGUF model is missing
  - For best performance, download the Mistral model to the models/ directory
- Data Loading Issues:
  - Ensure filtered_complaints.csv exists in the data/ directory
  - Check that the CSV has the required columns: 'Consumer complaint narrative', 'Product', 'Complaint ID'
- Performance Issues:
  - The app uses CPU by default for compatibility
  - For better performance, ensure you have sufficient RAM (8 GB or more recommended)
- Running the Setup Test:
    python test_setup.py
  This will identify any configuration issues.
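The required-columns check listed under Data Loading Issues can be done by hand with a few lines of standard-library Python. The sketch below is an illustration, not part of the project: the helper name `missing_columns` is hypothetical, and an in-memory sample stands in for data/filtered_complaints.csv.

```python
import csv
import io
from typing import Iterable, List, TextIO

REQUIRED_COLUMNS = ["Consumer complaint narrative", "Product", "Complaint ID"]


def missing_columns(handle: TextIO, required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(handle), [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory stand-in for the real file; this sample lacks the narrative column.
sample = io.StringIO("Product,Complaint ID,Issue\nCredit card,1,Late fee\n")
print(missing_columns(sample))  # ['Consumer complaint narrative']

# Against the real file, the call would look like this:
# with open("data/filtered_complaints.csv", newline="", encoding="utf-8") as f:
#     print(missing_columns(f))
```

An empty list means all three required columns are present. Only the header row is read, so the check stays fast even on a large file.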
Contributions are welcome! Please refer to the design-doc.md for a deeper understanding of the project's architecture and components.
This project is licensed under the MIT License. See the LICENSE file for details.
