In many enterprise systems—especially in test automation, system operations, and industrial or infrastructure-related domains—a large amount of critical logic still exists in the form of legacy scripts or DSL-style code (e.g., `.inc` files).
These codebases typically have the following characteristics:
- Written in non-mainstream or legacy scripting languages
- Inconsistent naming conventions and insufficient documentation
- Spread across many projects and directories
- Difficult for new engineers or cross-team developers to understand
Traditional code search tools (file name or keyword-based) are insufficient to answer questions such as:
- “What does this function actually do?”
- “When should this function be used?”
- “Which part of the system handles this behavior?”
This project was created to address exactly this problem.
The goal of this project is to build a function-level knowledge question-answering system for legacy codebases, with a focus on:
- Automatically understanding legacy code
  - Extracting complete function or logic blocks from `.inc` and similar scripts
  - Using local open-source LLMs (e.g., Gemma, LLaMA) to generate semantic explanations and documentation
- Transforming code into searchable knowledge assets
  - Structuring "function code + LLM-generated explanation + metadata" into JSON
  - Preparing high-quality input for vector search and RAG pipelines
- Building a local RAG-based chatbot
  - Storing embeddings in a local vector database (Chroma)
  - Retrieving the most relevant functions based on user questions
  - Using an LLM to generate clear, contextual answers
**The system does not aim to generate new code directly.
Instead, it focuses on understanding, explaining, and making existing code queryable.**
Key principles include:
- Function-level granularity
- Explainability over raw generation
- Low-risk evolution (original code remains untouched)
- Fully local execution (no cloud dependency)
```
Legacy .inc / Script Code
        ↓
Function / Code Block Extraction
        ↓
LLM-Based Semantic Explanation Generation
        ↓
Annotated Files + Structured JSON Output
        ↓
Vector Storage (Chroma)
        ↓
RAG Retrieval + Local LLM Answer Generation
        ↓
Interactive Chatbot Q&A
```
- Scans `.inc` files and similar scripts
- Identifies function or logic block boundaries using rules and regex
- Excludes control-flow constructs (`if`, `switch`, etc.) to avoid incorrect segmentation
- Ensures each extracted block is a self-contained, interpretable unit
📂 Related directory: `src/backend/function_extractor/`
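The rule/regex-based boundary detection described above can be sketched as follows. This is a minimal illustration, not the project's actual extractor: the opener pattern, the control-flow keyword set, and the sample DSL syntax are all assumptions, since real `.inc` dialects will need their own rules.

```python
import re

# Assumed block-opener pattern: a top-level identifier followed by "(".
# Real legacy DSLs will need dialect-specific rules.
OPENER_RE = re.compile(r"^(\w+)\s*\(")
# Control-flow constructs are excluded so they never start a new block.
CONTROL_FLOW = {"if", "switch", "while", "for", "elseif"}

def extract_blocks(lines):
    """Return (name, code) pairs for each top-level function-like block."""
    blocks, name, buf = [], None, []
    for line in lines:
        m = OPENER_RE.match(line)
        if m and m.group(1).lower() not in CONTROL_FLOW:
            if name:                          # close the previous block
                blocks.append((name, "".join(buf)))
            name, buf = m.group(1), []
        if name:
            buf.append(line)
    if name:
        blocks.append((name, "".join(buf)))
    return blocks

# Hypothetical DSL snippet: two functions plus a control-flow line that
# must stay inside the first block rather than splitting it.
sample = [
    "SyncTime(host)\n",
    "    call NtpQuery(host)\n",
    "if (ErrorFlag)\n",
    "    call LogError()\n",
    "StartPump(valve)\n",
    "    call OpenValve(valve)\n",
]
print([n for n, _ in extract_blocks(sample)])  # -> ['SyncTime', 'StartPump']
```

Note how `if (ErrorFlag)` is absorbed into the `SyncTime` block instead of being treated as a new unit, which is the incorrect segmentation the exclusion rule guards against.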
- Supports local open-source models:
- Gemma
- LLaMA 3 (8B Instruct)
- Uses carefully designed prompt templates to generate:
- Semantic descriptions
- Input / output explanations
- Typical usage scenarios
- Generated explanations are:
- Inserted into annotated versions of the source files
- Exported as structured JSON for downstream processing
📂 Related directories:
- `src/backend/function_extractor/models/`
- `src/backend/function_extractor/prompt/template/`
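A prompt template for the three outputs listed above (semantic description, input/output explanation, usage scenarios) might look like the sketch below. The wording and `build_prompt` helper are illustrative assumptions; the project's actual templates live under `src/backend/function_extractor/prompt/template/` and may differ.

```python
# Hypothetical explanation-generation template; placeholders {name} and
# {code} are filled per extracted function before being sent to the LLM.
EXPLAIN_TEMPLATE = """You are documenting a legacy test-automation script.

Function name: {name}
Source code:
{code}

Describe, in plain English:
1. What the function does (semantic description)
2. Its inputs and outputs
3. Typical usage scenarios
"""

def build_prompt(name: str, code: str) -> str:
    """Render the template for one extracted function."""
    return EXPLAIN_TEMPLATE.format(name=name, code=code)

prompt = build_prompt("SyncTime", "SyncTime(host)\n    call NtpQuery(host)\n")
print(prompt)
```

Keeping the requested structure explicit in the prompt makes the LLM output easier to parse into the annotated files and JSON records described below.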
For each extracted function, the system records:
- Project name
- File path
- Function name and parameters
- Original code block
- LLM-generated semantic explanation
These JSON artifacts form the core knowledge base for the RAG system.
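One such record could be shaped as in this sketch. The field names and sample values are assumptions based on the metadata listed above, not the project's actual schema.

```python
import json

# Illustrative knowledge-base record for one extracted function.
record = {
    "project": "plant_control",                      # project name
    "file_path": "scripts/pump/startup.inc",         # source file
    "function_name": "StartPump",
    "parameters": ["valve_id"],
    "code": "StartPump(valve_id)\n    call OpenValve(valve_id)\n",
    "explanation": "Opens the given valve and starts the pump sequence.",
}

# Serialize for downstream embedding and retrieval.
print(json.dumps(record, indent=2))
```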
- Uses Chroma as a local vector database
- Embeds function explanations and code semantics
- User query → vector similarity search → relevant functions retrieved
- Retrieved context is passed to an LLM to generate a clear and concise answer
📂 Related directory: `src/backend/rag/`
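The query → similarity search → retrieval step can be illustrated with a toy stand-in. The real pipeline stores fastembed vectors in Chroma; the bag-of-words cosine similarity below only demonstrates the retrieval idea, and the sample knowledge base is invented for the example.

```python
import math
from collections import Counter

# Toy stand-in for Chroma's embedding search: bag-of-words vectors over
# the LLM-generated explanations.
def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented sample knowledge base: function name -> explanation.
knowledge_base = {
    "SyncTime": "Synchronizes the windows system clock with an NTP server.",
    "StartPump": "Opens the inlet valve and starts the pump motor.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k functions whose explanations best match the question."""
    q = vectorize(question)
    ranked = sorted(knowledge_base.items(),
                    key=lambda kv: cosine(q, vectorize(kv[1])),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(retrieve("Which function handles windows time synchronization?"))
```

In the full system, the retrieved explanations and code are then passed to the local LLM as context for answer generation.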
- “Which function handles Windows time synchronization?”
- “Where is the logic related to pump startup implemented?”
- “What does this function in an `.inc` file actually do?”
- “Is there an existing function that implements a similar test flow?”
```
RAGFunctionMentorChatbot/
├── src/
│   ├── backend/
│   │   ├── function_extractor/   # Function extraction & annotation
│   │   ├── rag/                  # RAG retrieval & QA
│   │   └── data_io/              # File read/write utilities
│   ├── frontend/                 # Chatbot / UI (extensible)
│   └── misc/
├── data/                         # Raw code and intermediate artifacts
├── resources/
├── tests/
├── requirements.txt
├── run.sh
└── README.md
```
```bash
pip install -r requirements.txt
```
Or manually:
```bash
pip install langchain chromadb fastembed streamlit streamlit-chat
```
Refer to:
Download and run Gemma or LLaMA models locally.
```bash
streamlit run frontend/streamlit_app.py
```
Then open:
`http://localhost:8501`
- ✅ Original code remains unchanged (annotated versions are separate)
- ✅ Fully local execution; no data leaves the machine
- ✅ Designed for real-world legacy systems, not idealized greenfield projects
- ✅ Lays the foundation for refactoring, governance, and onboarding
- ✅ Function-level code extraction implemented
- ✅ LLM-based annotation and explanation validated
- ✅ Structured JSON knowledge assets generated
- 🧪 RAG retrieval and chatbot prototype in progress
A local RAG system that transforms legacy script-based code into a searchable, function-level knowledge base, enabling engineers to understand and query complex systems through natural language.