In many enterprise systems—especially in test automation, system operations, and industrial or infrastructure-related domains—a large amount of critical logic still exists in the form of legacy scripts or DSL-style code (e.g., `.inc` files).
These codebases typically have the following characteristics:
- Written in non-mainstream or legacy scripting languages
- Inconsistent naming conventions and insufficient documentation
- Spread across many projects and directories
- Difficult for new engineers or cross-team developers to understand
Traditional code search tools (file name or keyword-based) are insufficient to answer questions such as:
- “What does this function actually do?”
- “When should this function be used?”
- “Which part of the system handles this behavior?”
This project was created to address exactly this problem.
The goal of this project is to build a function-level knowledge question-answering system for legacy codebases, with a focus on:
- Automatically understanding legacy code
  - Extracting complete function or logic blocks from `.inc` and similar scripts
  - Using local open-source LLMs (e.g., Gemma, LLaMA) to generate semantic explanations and documentation
- Transforming code into searchable knowledge assets
  - Structuring "function code + LLM-generated explanation + metadata" into JSON
  - Preparing high-quality input for vector search and RAG pipelines
- Building a local RAG-based chatbot
  - Storing embeddings in a local vector database (Chroma)
  - Retrieving the most relevant functions based on user questions
  - Using an LLM to generate clear, contextual answers
**The system does not aim to generate new code directly.
Instead, it focuses on understanding, explaining, and making existing code queryable.**
Key principles include:
- Function-level granularity
- Explainability over raw generation
- Low-risk evolution (original code remains untouched)
- Fully local execution (no cloud dependency)
```
Legacy .inc / Script Code
        ↓
Function / Code Block Extraction
        ↓
LLM-Based Semantic Explanation Generation
        ↓
Annotated Files + Structured JSON Output
        ↓
Vector Storage (Chroma)
        ↓
RAG Retrieval + Local LLM Answer Generation
        ↓
Interactive Chatbot Q&A
```
- Scans `.inc` files and similar scripts
- Identifies function or logic block boundaries using rules and regex
- Excludes control-flow constructs (`if`, `switch`, etc.) to avoid incorrect segmentation
- Ensures each extracted block is a self-contained, interpretable unit
📂 Related directory: `src/backend/function_extractor/`
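The rule/regex-based boundary detection described above can be sketched as follows. This is a minimal illustration, not the project's actual extractor: the opener pattern, the control-flow keyword set, and the sample DSL syntax are all assumptions, since real `.inc` dialects will need their own rules.

```python
import re

# Assumed block-opener pattern: a top-level identifier followed by "(".
# Real legacy DSLs will need dialect-specific rules.
OPENER_RE = re.compile(r"^(\w+)\s*\(")
# Control-flow constructs are excluded so they never start a new block.
CONTROL_FLOW = {"if", "switch", "while", "for", "elseif"}

def extract_blocks(lines):
    """Return (name, code) pairs for each top-level function-like block."""
    blocks, name, buf = [], None, []
    for line in lines:
        m = OPENER_RE.match(line)
        if m and m.group(1).lower() not in CONTROL_FLOW:
            if name:                          # close the previous block
                blocks.append((name, "".join(buf)))
            name, buf = m.group(1), []
        if name:
            buf.append(line)
    if name:
        blocks.append((name, "".join(buf)))
    return blocks

# Hypothetical DSL snippet: two functions plus a control-flow line that
# must stay inside the first block rather than splitting it.
sample = [
    "SyncTime(host)\n",
    "    call NtpQuery(host)\n",
    "if (ErrorFlag)\n",
    "    call LogError()\n",
    "StartPump(valve)\n",
    "    call OpenValve(valve)\n",
]
print([n for n, _ in extract_blocks(sample)])  # -> ['SyncTime', 'StartPump']
```

Note how `if (ErrorFlag)` is absorbed into the `SyncTime` block instead of being treated as a new unit, which is the incorrect segmentation the exclusion rule guards against.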
- Supports local open-source models:
- Gemma
- LLaMA 3 (8B Instruct)
- Uses carefully designed prompt templates to generate:
- Semantic descriptions
- Input / output explanations
- Typical usage scenarios
- Generated explanations are:
- Inserted into annotated versions of the source files
- Exported as structured JSON for downstream processing
📂 Related directories:
- `src/backend/function_extractor/models/`
- `src/backend/function_extractor/prompt/template/`
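A prompt template for the three outputs listed above (semantic description, input/output explanation, usage scenarios) might look like the sketch below. The wording and `build_prompt` helper are illustrative assumptions; the project's actual templates live under `src/backend/function_extractor/prompt/template/` and may differ.

```python
# Hypothetical explanation-generation template; placeholders {name} and
# {code} are filled per extracted function before being sent to the LLM.
EXPLAIN_TEMPLATE = """You are documenting a legacy test-automation script.

Function name: {name}
Source code:
{code}

Describe, in plain English:
1. What the function does (semantic description)
2. Its inputs and outputs
3. Typical usage scenarios
"""

def build_prompt(name: str, code: str) -> str:
    """Render the template for one extracted function."""
    return EXPLAIN_TEMPLATE.format(name=name, code=code)

prompt = build_prompt("SyncTime", "SyncTime(host)\n    call NtpQuery(host)\n")
print(prompt)
```

Keeping the requested structure explicit in the prompt makes the LLM output easier to parse into the annotated files and JSON records described below.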
For each extracted function, the system records:
- Project name
- File path
- Function name and parameters
- Original code block
- LLM-generated semantic explanation
These JSON artifacts form the core knowledge base for the RAG system.
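One such record could be shaped as in this sketch. The field names and sample values are assumptions based on the metadata listed above, not the project's actual schema.

```python
import json

# Illustrative knowledge-base record for one extracted function.
record = {
    "project": "plant_control",                      # project name
    "file_path": "scripts/pump/startup.inc",         # source file
    "function_name": "StartPump",
    "parameters": ["valve_id"],
    "code": "StartPump(valve_id)\n    call OpenValve(valve_id)\n",
    "explanation": "Opens the given valve and starts the pump sequence.",
}

# Serialize for downstream embedding and retrieval.
print(json.dumps(record, indent=2))
```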
- Uses Chroma as a local vector database
- Embeds function explanations and code semantics
- User query → vector similarity search → relevant functions retrieved
- Retrieved context is passed to an LLM to generate a clear and concise answer
📂 Related directory: `src/backend/rag/`
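The query → similarity search → retrieval step can be illustrated with a toy stand-in. The real pipeline stores fastembed vectors in Chroma; the bag-of-words cosine similarity below only demonstrates the retrieval idea, and the sample knowledge base is invented for the example.

```python
import math
from collections import Counter

# Toy stand-in for Chroma's embedding search: bag-of-words vectors over
# the LLM-generated explanations.
def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented sample knowledge base: function name -> explanation.
knowledge_base = {
    "SyncTime": "Synchronizes the windows system clock with an NTP server.",
    "StartPump": "Opens the inlet valve and starts the pump motor.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k functions whose explanations best match the question."""
    q = vectorize(question)
    ranked = sorted(knowledge_base.items(),
                    key=lambda kv: cosine(q, vectorize(kv[1])),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(retrieve("Which function handles windows time synchronization?"))
```

In the full system, the retrieved explanations and code are then passed to the local LLM as context for answer generation.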
- “Which function handles Windows time synchronization?”
- “Where is the logic related to pump startup implemented?”
- “What does this function in an `.inc` file actually do?”
- “Is there an existing function that implements a similar test flow?”
```
RAGFunctionMentorChatbot/
├── src/
│   ├── backend/
│   │   ├── function_extractor/   # Function extraction & annotation
│   │   ├── rag/                  # RAG retrieval & QA
│   │   └── data_io/              # File read/write utilities
│   ├── frontend/                 # Chatbot / UI (extensible)
│   └── misc/
├── data/                         # Raw code and intermediate artifacts
├── resources/
├── tests/
├── requirements.txt
├── run.sh
└── README.md
```
```bash
pip install -r requirements.txt
```
Or manually:
```bash
pip install langchain chromadb fastembed streamlit streamlit-chat
```
Refer to:
Download and run Gemma or LLaMA models locally.
```bash
streamlit run frontend/streamlit_app.py
```
Then open:
`http://localhost:8501`
- ✅ Original code remains unchanged (annotated versions are separate)
- ✅ Fully local execution; no data leaves the machine
- ✅ Designed for real-world legacy systems, not idealized greenfield projects
- ✅ Lays the foundation for refactoring, governance, and onboarding
- ✅ Function-level code extraction implemented
- ✅ LLM-based annotation and explanation validated
- ✅ Structured JSON knowledge assets generated
- 🧪 RAG retrieval and chatbot prototype in progress
A local RAG system that transforms legacy script-based code into a searchable, function-level knowledge base, enabling engineers to understand and query complex systems through natural language.