
RAGFunctionMentorChatbot

Function-Level Knowledge RAG System for Legacy Codebases (Local Execution)


📌 Background & Motivation

In many enterprise systems, especially in test automation, system operations, and industrial or infrastructure-related domains, a large amount of critical logic still exists in the form of legacy scripts or DSL-style code (e.g., .inc files).

These codebases typically have the following characteristics:

  • Written in non-mainstream or legacy scripting languages
  • Inconsistent naming conventions and insufficient documentation
  • Spread across many projects and directories
  • Difficult for new engineers or cross-team developers to understand

Traditional code search tools (file name or keyword-based) are insufficient to answer questions such as:

  • “What does this function actually do?”
  • “When should this function be used?”
  • “Which part of the system handles this behavior?”

This project was created to address exactly this problem.


🎯 Project Goal

The goal of this project is to build a function-level knowledge question-answering system for legacy codebases, with a focus on:

  1. Automatically understanding legacy code
    • Extracting complete function or logic blocks from .inc and similar scripts
    • Using local open-source LLMs (e.g., Gemma, LLaMA) to generate semantic explanations and documentation
  2. Transforming code into searchable knowledge assets
    • Structuring “function code + LLM-generated explanation + metadata” into JSON
    • Preparing high-quality input for vector search and RAG pipelines
  3. Building a local RAG-based chatbot
    • Storing embeddings in a local vector database (Chroma)
    • Retrieving the most relevant functions based on user questions
    • Using an LLM to generate clear, contextual answers

🧠 Core Design Philosophy

The system does not aim to generate new code directly.

Instead, it focuses on understanding, explaining, and making existing code queryable.

Key principles include:

  • Function-level granularity
  • Explainability over raw generation
  • Low-risk evolution (original code remains untouched)
  • Fully local execution (no cloud dependency)

🧩 End-to-End Workflow

Legacy .inc / Script Code
  ↓ Function / Code Block Extraction
  ↓ LLM-Based Semantic Explanation Generation
  ↓ Annotated Files + Structured JSON Output
  ↓ Vector Storage (Chroma)
  ↓ RAG Retrieval + Local LLM Answer Generation
  ↓ Interactive Chatbot Q&A

🔧 Key Components

1️⃣ Function-Level Code Extraction

  • Scans .inc files and similar scripts
  • Identifies function or logic block boundaries using rules and regex
  • Excludes control-flow constructs (if, switch, etc.) to avoid incorrect segmentation
  • Ensures each extracted block is a self-contained, interpretable unit
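The extraction step above can be sketched as a small regex-based scanner. The .inc syntax shown here (functions delimited by `function <name>(<params>)` … `endfunction`) is an assumption for illustration; the project's actual boundary rules live in src/backend/function_extractor/ and may differ.

```python
import re

# Assumed DSL syntax for illustration only: real .inc dialects may use
# different delimiters, which the project's rule set would encode.
FUNC_RE = re.compile(
    r"^function\s+(?P<name>\w+)\s*\((?P<params>[^)]*)\)"
    r"(?P<body>.*?)^endfunction",
    re.MULTILINE | re.DOTALL,
)

def extract_functions(source: str) -> list[dict]:
    """Return each function as a self-contained block with name and params."""
    blocks = []
    for m in FUNC_RE.finditer(source):
        blocks.append({
            "name": m.group("name"),
            "params": [p.strip() for p in m.group("params").split(",") if p.strip()],
            "code": m.group(0),  # the full block, from "function" to "endfunction"
        })
    return blocks

sample = """\
function sync_time(server, timeout)
    call ntp_update(server, timeout)
endfunction
"""
print(extract_functions(sample)[0]["name"])  # sync_time
```

Keeping the whole matched span as `code` is what makes each block a self-contained, interpretable unit for the annotation step.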

📂 Related directory:

src/backend/function_extractor/

2️⃣ LLM-Based Annotation & Explanation

  • Supports local open-source models:
    • Gemma
    • LLaMA 3 (8B Instruct)
  • Uses carefully designed prompt templates to generate:
    • Semantic descriptions
    • Input / output explanations
    • Typical usage scenarios
  • Generated explanations are:
    • Inserted into annotated versions of the source files
    • Exported as structured JSON for downstream processing
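A prompt for this annotation step might be assembled as below. The template wording is hypothetical; the project's real templates live in src/backend/function_extractor/prompt/template/, and the filled prompt would be sent to a local model (e.g., via Ollama) rather than printed.

```python
# Hypothetical annotation prompt; the actual templates ship with the project.
TEMPLATE = """You are documenting a legacy script codebase.

Explain the following function for a new engineer. Cover:
1. What it does (semantic description)
2. Inputs and outputs
3. Typical usage scenarios

Function code:
{code}

Answer concisely."""

def build_annotation_prompt(code: str) -> str:
    """Fill the template with one extracted function block."""
    return TEMPLATE.format(code=code)

prompt = build_annotation_prompt(
    "function sync_time(server, timeout)\n    ...\nendfunction"
)
print(prompt.splitlines()[0])  # You are documenting a legacy script codebase.
```

Structuring the prompt as a fixed checklist is what keeps the generated explanations uniform enough to export as JSON fields downstream.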

📂 Related directories:

src/backend/function_extractor/models/
src/backend/function_extractor/prompt/template/

3️⃣ Structured Function Knowledge Assets

For each extracted function, the system records:

  • Project name
  • File path
  • Function name and parameters
  • Original code block
  • LLM-generated semantic explanation

These JSON artifacts form the core knowledge base for the RAG system.
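One record in those JSON artifacts might look like the following. The field names and values here are illustrative assumptions; the authoritative schema is whatever the extractor actually emits.

```python
import json

# Illustrative knowledge-asset record; field names are assumptions, not the
# project's fixed schema.
record = {
    "project": "legacy_test_suite",
    "file_path": "scripts/time/sync.inc",
    "function_name": "sync_time",
    "parameters": ["server", "timeout"],
    "code": "function sync_time(server, timeout)\n    ...\nendfunction",
    "explanation": "Synchronizes the local clock against the given NTP server, "
                   "retrying until the timeout elapses.",
}

# Serialize one record; in practice records would be written per project/file.
serialized = json.dumps(record, indent=2)
print(json.loads(serialized)["function_name"])  # sync_time
```

Keeping code and explanation side by side in one record lets the RAG layer embed the explanation while still returning the original code to the user.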


4️⃣ RAG-Based Retrieval & Question Answering

  • Uses Chroma as a local vector database
  • Embeds function explanations and code semantics
  • User query → vector similarity search → relevant functions retrieved
  • Retrieved context is passed to an LLM to generate a clear and concise answer
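The retrieval step reduces to nearest-neighbor search over embedding vectors. The toy below uses hand-written vectors and plain cosine similarity to show the principle; the real system delegates both embedding and indexing to Chroma (with fastembed) in src/backend/rag/.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend embeddings keyed by function name; real vectors come from an
# embedding model, and Chroma handles storage and search.
index = {
    "sync_time": [0.9, 0.1, 0.0],
    "start_pump": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.05]  # stand-in embedding of a user question

best = max(index, key=lambda name: cosine(index[name], query_vec))
print(best)  # sync_time
```

The retrieved function records (code plus explanation) are then concatenated into the LLM prompt as context, which is what grounds the final answer in the actual codebase.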

📂 Related directory:

src/backend/rag/

💬 Example Use Cases

  • “Which function handles Windows time synchronization?”
  • “Where is the logic related to pump startup implemented?”
  • “What does this function in an .inc file actually do?”
  • “Is there an existing function that implements a similar test flow?”

📁 Project Structure Overview

RAGFunctionMentorChatbot/
├── src/
│   ├── backend/
│   │   ├── function_extractor/   # Function extraction & annotation
│   │   ├── rag/                  # RAG retrieval & QA
│   │   └── data_io/              # File read/write utilities
│   ├── frontend/                 # Chatbot / UI (extensible)
│   └── misc/
├── data/                         # Raw code and intermediate artifacts
├── resources/
├── tests/
├── requirements.txt
├── run.sh
└── README.md

🚀 Running the Project Locally

1️⃣ Install Dependencies

pip install -r requirements.txt

Or manually:

pip install langchain chromadb fastembed streamlit streamlit-chat

2️⃣ Set Up Local LLM (Ollama)

Install Ollama, then download and run a Gemma or LLaMA model locally.


3️⃣ Start the RAG Chatbot

streamlit run frontend/streamlit_app.py

Then open:

http://localhost:8501

🔒 Design Advantages

  • ✅ Original code remains unchanged (annotated versions are separate)
  • ✅ Fully local execution; no data leaves the machine
  • ✅ Designed for real-world legacy systems, not idealized greenfield projects
  • ✅ Lays the foundation for refactoring, governance, and onboarding

🧭 Project Status

  • ✅ Function-level code extraction implemented
  • ✅ LLM-based annotation and explanation validated
  • ✅ Structured JSON knowledge assets generated
  • 🧪 RAG retrieval and chatbot prototype in progress

One-Sentence Summary

A local RAG system that transforms legacy script-based code into a searchable, function-level knowledge base, enabling engineers to understand and query complex systems through natural language.
