A production-ready zero-shot legal document classification system powered by Mistral-7B and FAISS vector similarity validation. This hybrid approach combines the reasoning capabilities of Large Language Models with the precision of embedding-based validation to achieve high-accuracy document classification.
- Zero-Shot Classification: Leverages Mistral-7B for flexible category inference without training data
- Hybrid Validation: FAISS vector store validation ensures classification accuracy
- Production-Ready Architecture:
- FastAPI async endpoints with comprehensive middleware
- JWT authentication and rate limiting
- Performance monitoring and logging
- Current Performance (as of Feb 11, 2025):
- Response time: ~33.18s per request
- Classification accuracy: 100% on latest tests
- GPU utilization: Not optimal
- Throughput: ~1.8 requests per minute
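The hybrid flow described above can be sketched as follows. This is an illustrative stand-in, not the project's actual API: the LLM call is mocked, plain cosine similarity replaces FAISS for brevity, and the function and category names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_classify(doc_vec, llm_label, exemplars, threshold=0.85):
    """Accept the LLM's proposed label only if the document embedding
    is close enough to a stored exemplar for that label (the FAISS
    validation step, approximated here without FAISS itself)."""
    vectors = exemplars.get(llm_label)
    if not vectors:
        return None
    best = max(cosine(doc_vec, v) for v in vectors)
    return llm_label if best >= threshold else None

# Toy exemplar embeddings per category (illustrative 2-D vectors).
exemplars = {"contract": [[1.0, 0.0]], "brief": [[0.0, 1.0]]}
print(hybrid_classify([0.9, 0.1], "contract", exemplars))  # contract
print(hybrid_classify([0.9, 0.1], "brief", exemplars))     # None
```

A label the LLM proposes is rejected (returns `None`) when no exemplar of that category is sufficiently similar, which is how the validation layer catches LLM misclassifications.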
src/
└── app/
    ├── auth/        # JWT authentication and token handling
    ├── models/      # Core classification models
    ├── middleware/  # Auth and rate limiting
    └── routers/     # API endpoints and routing
tests/               # Test suite
- Hardware Requirements:
- NVIDIA GPU with 4GB+ VRAM
- 4+ CPU cores
- 16GB+ system RAM
- Expected Performance:
- Response Time: ~33s average
- Throughput: 1-2 RPM
- Classification Accuracy: 100%
Minimum Configuration (g5.xlarge):
- NVIDIA A10G GPU (24GB VRAM)
- Response Time: 3-4s
- Throughput: 30-40 RPM per instance
- Classification Accuracy: 85-90%
Target Configuration (g5.2xlarge or higher):
- Response Time: ~2s
- Throughput: 150+ RPM (with load balancing)
- Classification Accuracy: 90-95%
- High Availability: 99.9%
Classification Engine
- Mistral-7B integration via Ollama
- GPU-accelerated inference
- FAISS similarity validation (0.85 threshold)
- Response caching (1-hour TTL)
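The 1-hour TTL response cache mentioned above can be sketched as a small in-memory store. This is illustrative only; the project's actual cache implementation may differ.

```python
import time

class TTLCache:
    """Minimal in-memory response cache with a configurable TTL
    (defaulting to the 1-hour TTL described above)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            # Entry has aged past its TTL; evict and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=3600)
cache.set("doc-hash", "contract")
print(cache.get("doc-hash"))  # contract
```

Keying the cache on a hash of the document text means repeated classification requests for the same document skip the ~33s LLM round-trip entirely.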
API Layer
- Async endpoint structure
- JWT authentication
- Rate limiting (1000 req/min)
- Detailed error handling
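The 1000 req/min rate limit above is typically enforced with a sliding window per client. The sketch below shows the idea in plain Python, decoupled from FastAPI middleware; names and structure are illustrative, not the project's actual middleware.

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window rate limiter (e.g. 1000 requests per 60s)."""

    def __init__(self, max_requests=1000, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = {}  # client id -> deque of request timestamps

    def allow(self, client: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits.setdefault(client, deque())
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: middleware would return 429
        hits.append(now)
        return True

limiter = RateLimiter(max_requests=3, window_seconds=60)
print([limiter.allow("alice", now=t) for t in (0, 1, 2, 3)])
# [True, True, True, False]
```

In the FastAPI layer this check would run in middleware before the endpoint handler, returning HTTP 429 when `allow` is false.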
Current implementation status:
✅ Core Features
- Classification engine with Mistral-7B
- FAISS validation layer
- Performance monitoring and logging
✅ API & Security
- JWT authentication
- Rate limiting middleware
- FastAPI async endpoints
✅ Testing & Quality
- Basic test coverage
- Error handling
- Input validation
🚧 Optimization Goals
- Response time improvement (Current: ~33s → Target: <2s)
- GPU utilization optimization
- Throughput enhancement (Current: ~1.8 RPM → Target: 150 RPM)
- Production deployment setup
Optimization Strategy:
Performance Enhancement
- Response caching implementation
- Batch processing optimization
- GPU utilization improvements
Production Deployment
- AWS g5.xlarge/g5.2xlarge setup
- Load balancing configuration
- Auto-scaling implementation
Documentation & Monitoring
- Detailed benchmark reports
- Performance monitoring dashboards
- Production deployment guides
See BENCHMARKS.md for detailed performance analysis and optimization plans.
- NVIDIA GPU with 4GB+ VRAM
- 4+ CPU cores
- 16GB+ system RAM
- Python 3.10+
- Conda (recommended for environment management)
- Clone the repository
git clone https://github.com/yourusername/hybrid-llm-classifier.git
cd hybrid-llm-classifier
- Set up the environment
# Create and activate environment
make setup
# Install development dependencies
make install-dev
- Install and start Ollama
- Follow instructions at Ollama.ai
- Pull Mistral model:
ollama pull mistral
- Verify GPU support:
nvidia-smi
We use make to standardize development commands. Here are the available targets:
# Run basic tests
make test
# Run tests with coverage report
make test-coverage
# Run tests in watch mode (auto-rerun on changes)
make test-watch
# Run tests with verbose output
make test-verbose
# Run full benchmark suite
make benchmark
# Run continuous benchmark monitoring
make benchmark-watch
# Run memory and line profiling
make benchmark-profile
# Format code (black + isort)
make format
# Run all linters
make lint
# Start development server with hot reload
make run
# Remove all build artifacts and cache files
make clean
For a complete list of available commands:
make help
Current test suite includes:
- Unit tests for core classification
- Integration tests for API endpoints
- Authentication and rate limiting tests
- Performance metrics validation
- Error handling scenarios
- Benchmark tests
Test coverage metrics:
- Line coverage: 90%+
- Branch coverage: 85%+
- All critical paths covered
All tests are async-compatible and use pytest-asyncio for proper async testing.
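An async test in this style looks roughly like the following. The endpoint logic is mocked and the names are illustrative; with pytest-asyncio the test function would carry `@pytest.mark.asyncio` instead of being driven by `asyncio.run` directly.

```python
import asyncio

async def fake_classify(document: str) -> dict:
    """Stand-in for the awaited classification call in the real suite."""
    await asyncio.sleep(0)  # simulates an awaited model/API round-trip
    return {"label": "contract", "confidence": 0.92}

async def test_classify_returns_label():
    result = await fake_classify("This agreement is made between...")
    assert result["label"] == "contract"
    assert 0.0 <= result["confidence"] <= 1.0

# pytest-asyncio would collect and await the coroutine for us; here we
# run it directly so the sketch is self-contained.
asyncio.run(test_classify_returns_label())
print("ok")
```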
Development Environment:
- Keep documents under 2,048 tokens
- Expect ~10s response time
- 5-10 requests per minute
- Memory usage: ~3.5GB VRAM
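A cheap pre-flight check for the 2,048-token guideline above can look like this. Note the hedge: whitespace word count only approximates the model's tokenizer (Mistral tokens are often sub-word), so production code should count with the actual tokenizer.

```python
def within_token_budget(text: str, limit: int = 2048) -> bool:
    """Rough pre-check against the 2,048-token document guideline.
    Word count approximates (usually undercounts) true token count;
    use the model tokenizer for an exact figure."""
    return len(text.split()) <= limit

print(within_token_budget("short legal clause"))  # True
```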
Production Environment:
- AWS g5.xlarge or higher recommended
- Load balancing for high throughput
- Auto-scaling configuration
- Regional deployment for latency optimization
See BENCHMARKS.md for detailed performance analysis and optimization experiments.
Development Environment (Current):
- Average response time: ~33.18s
- Classification accuracy: 100%
- GPU utilization: Not optimal
- Throughput: ~1.8 requests/minute
Production Targets (AWS g5.2xlarge):
- Response time: <2s
- Throughput: 150+ RPM
- Accuracy: 90-95%
- High availability: 99.9%
Optimization Roadmap:
Response Caching
- In-memory caching for repeated queries
- Configurable TTL
- Cache hit monitoring
Performance Optimization
- Response streaming
- Batch processing
- Memory usage optimization
Infrastructure
- Docker containerization
- AWS deployment
- Load balancing setup
- Monitoring integration
Core Functionality (Day 1)
- Optimize classification engine ✅
- Implement caching layer
- Document performance baselines
API & Performance (Day 2)
- Security hardening
- Response optimization
- Load testing
Production Ready (Day 3)
- AWS deployment
- Documentation
- Final testing
This project is licensed under the MIT License - see the LICENSE file for details.
While this project is primarily for demonstration purposes, we welcome feedback and suggestions. Please open an issue to discuss potential improvements.
Note: This project is under active development. Core functionality is implemented and tested, with performance optimizations in progress.
