El Diario de Avisos is a comprehensive digital archive and semantic search platform designed to preserve and make accessible historical publications, essays, and articles. The project modernizes a legacy database system into a performant, vector-enabled web application.
- `web/`: The modern Next.js web application. Handles the UI, database queries via Drizzle ORM, and AI/vector search logic.
- `scripts/`: Utilities for extracting and transforming data from legacy MDB (Microsoft Access) files into a format suitable for PostgreSQL.
- `data/`: Intermediate storage for extracted CSVs and SQL dumps.
- `music/`: Assets for the application's audio playback features.
- `docs/`: Technical documentation and Architecture Decision Records (ADRs).
Follow the mandatory Verification Policy before reporting any task as complete: assistants must ensure the project passes type-checking and linting project-wide.
- Historical Search: Full-text and semantic search across 19th-century articles.
- AI Historian: RAG-powered chatbot that answers questions based on archive content.
- Essays & Collections: Curated essays linking to relevant historical articles.
- Audio Experience: Background music playback for immersive reading.
The system utilizes a Next.js App Router architecture integrated with a vector-capable PostgreSQL database for semantic search capabilities.
```mermaid
graph TD
    Client[Web Browser]

    subgraph Frontend [Next.js Web App]
        Pages[App Router Pages]
        Components[Shadcn/UI Components]
        Providers[Context Providers]
    end

    subgraph BackendLayer [Server Actions & Libs]
        ServerActions[Server Actions]
        VectorStore[Vector Store Lib]
        DrizzleClient[Drizzle Client]
    end

    subgraph Persistence [PostgreSQL + pgvector]
        RelationalData["Relational Data<br/>(Articles, Authors)"]
        VectorData["Vector Embeddings<br/>(768 dims)"]
    end

    subgraph Pipeline [Data Ingestion Pipeline]
        Legacy[Legacy MDB Files]
        Shell[Bash Scripts]
        Python[Python Processing]
        IngestScript[Vector Ingest Script]
    end

    Client --> Pages
    Pages --> ServerActions
    ServerActions --> DrizzleClient
    ServerActions --> VectorStore
    DrizzleClient --> RelationalData
    VectorStore --> VectorData
    Legacy --> Shell
    Shell --> Python
    Python --> RelationalData
    RelationalData --> IngestScript
    IngestScript --> VectorData
```
The database is fully normalized and managed via Drizzle ORM. Below is a simplified Entity-Relationship Diagram (ERD) of the core modules.
```mermaid
erDiagram
    PUBLICATIONS ||--o{ ISSUES : "publishes"
    ISSUES ||--o{ ARTICLES : "contains"
    AUTHORS ||--o{ ARTICLES : "writes"
    ARTICLES ||--o{ ARTICLE_TOPICS : "classified by"
    SUBJECTS ||--o{ ARTICLE_TOPICS : "defines"
    ARTICLES ||--o{ ARTICLE_IMAGES : "includes"
    IMAGES ||--o{ ARTICLE_IMAGES : "displayed in"
    ARTICLES ||--o| ARTICLE_EMBEDDINGS : "has vector"
    MEMBERS ||--o{ ESSAYS : "authors"
    ESSAYS ||--o{ ESSAY_ARTICLES : "links to"
    ARTICLES ||--|{ ESSAY_ARTICLES : "linked from"

    ARTICLES {
        int id PK
        varchar title
        varchar date "YYYY-MM-DD"
        bytea content "RTF/Text"
        boolean isEditable
    }
    AUTHORS {
        int id PK
        varchar name
        varchar reference
    }
    PUBLICATIONS {
        int id PK
        varchar name
        timestamp foundedDate
    }
    ISSUES {
        int id PK
        varchar number
        timestamp date
    }
    SUBJECTS {
        int id PK
        varchar name
        boolean isSubject
    }
    ARTICLE_EMBEDDINGS {
        int articleId FK
        vector embedding "768 dim"
    }
```
The application implements PostgreSQL's native full-text search with Spanish language support, indexing full article content for comprehensive search coverage.
- Content Indexing: Searches across full article content, not just titles
- RTF Stripping: Automatically removes RTF formatting for clean text indexing
- Spanish Language Support: Uses PostgreSQL's `spanish` text search configuration for proper stemming and stop words
- Accent-Insensitive: Searches for "musica" will find "MÚSICA" automatically
- Relevance Ranking: Results sorted by `ts_rank` score with weighted fields
- High Performance: GIN index enables sub-second searches across thousands of articles
- Automatic Updates: Trigger-based search vector maintenance requires no manual intervention
The search functionality indexes article titles and full content in a tsvector column with weighted priorities:
- Title (weight 'A'): Highest priority - exact title matches rank first
- Content (weight 'C'): Lower priority - content matches rank after title matches
RTF Processing: Article content is stored as RTF or plain text in `bytea` format. Before indexing, a PostgreSQL function (`strip_rtf_content`) automatically:
- Decodes from Windows-1252 encoding
- Detects RTF vs plain text format
- Removes RTF control words (`\rtf1`, `\ansi`, etc.)
- Removes RTF hex sequences (`\'e1` for accented characters)
- Normalizes whitespace
- Returns clean, searchable text
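The cleanup steps above can be sketched in TypeScript. This is a minimal, illustrative re-implementation of what the `strip_rtf_content` SQL function does, not the actual function; the regexes and the use of Latin-1 character codes (which match Windows-1252 for accented letters) are simplifying assumptions.

```typescript
// Illustrative sketch of RTF cleanup (the real work is done in PostgreSQL
// by strip_rtf_content; this mirrors the documented steps only).
function stripRtf(raw: string): string {
  // Plain text passes through untouched.
  if (!raw.startsWith("{\\rtf")) return raw.trim();
  return raw
    // Convert hex escapes like \'e1 to characters (Latin-1 codes, which
    // agree with Windows-1252 for the accented-letter range).
    .replace(/\\'([0-9a-fA-F]{2})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16)))
    // Remove control words such as \rtf1, \ansi, \par...
    .replace(/\\[a-z]+-?\d*\s?/gi, " ")
    // Remove group braces.
    .replace(/[{}]/g, " ")
    // Normalize whitespace.
    .replace(/\s+/g, " ")
    .trim();
}
```

For example, `stripRtf("{\\rtf1\\ansi m\\'fasica}")` yields `"música"`, while plain-text input is returned unchanged.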
When you search, PostgreSQL:
- Converts your query to a `tsquery` using Spanish stemming
- Matches against the indexed `search_vector` using the `@@` operator
- Ranks results by relevance using `ts_rank`
- Returns sorted results with the most relevant first
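Put together, the steps above correspond to a single ranked query. The sketch below builds that SQL as a parameterized string; the `articulos` table and `arti_titulo` column appear in the trigger shown later in this document, while `arti_id` and the use of `plainto_tsquery` are assumptions for illustration.

```typescript
// Hypothetical helper that returns the ranked full-text query described
// above, ready to be executed with one bound parameter (the user's input).
function buildSearchSql(): string {
  return `
    SELECT arti_id, arti_titulo,
           ts_rank(search_vector, q) AS rank
      FROM articulos,
           plainto_tsquery('spanish', $1) AS q
     WHERE search_vector @@ q
     ORDER BY rank DESC
     LIMIT 20`;
}
```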
| Feature | Coverage |
|---|---|
| Articles Indexed | 9,911 / 9,911 (100%) |
| Content Searchable | Full article text |
| RTF Handling | Automatic stripping |
| Performance | Sub-second searches |
| Language Support | Spanish stemming & accents |
The search vector is automatically maintained via a PostgreSQL trigger:
```sql
CREATE TRIGGER articulos_search_vector_trigger
BEFORE INSERT OR UPDATE OF arti_titulo, arti_contenido
ON articulos
FOR EACH ROW EXECUTE FUNCTION articulos_search_vector_update();
```

The trigger function combines title and cleaned content with weights:

```sql
NEW.search_vector :=
    setweight(to_tsvector('spanish', coalesce(NEW.arti_titulo, '')), 'A') ||
    setweight(to_tsvector('spanish', strip_rtf_content(NEW.arti_contenido)), 'C');
```

To set up full-text search on a new database:

```bash
cd web
npx tsx scripts/migrate-search.ts
```

This will:
- Add the `search_vector` column
- Create the GIN index
- Create the RTF stripping function
- Set up the trigger function
- Populate search vectors for existing articles
The application implements Retrieval-Augmented Generation (RAG) using a Hybrid Search approach that combines:
- Vector Search: Semantic understanding using Google's `text-embedding-004` (768 dims).
- Keyword Search: Precise matching using PostgreSQL's `unaccent` and `tsvector` (see ADR-001).
- LLM Generation: Answers are synthesized by Llama 3 (via Groq) using the retrieved context.

This ensures the chatbot can answer questions about specific people (e.g., "José Ánjel Montero") even if the semantic embedding is slightly off, while still handling abstract queries.
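One common way to combine the two result lists of a hybrid search is Reciprocal Rank Fusion (RRF). The sketch below is illustrative only; the project's actual merging logic is not specified in this document, and the function name and `k` constant are assumptions.

```typescript
// Hypothetical RRF merge: each document gets 1 / (k + rank) per list it
// appears in; documents found by BOTH vector and keyword search rise to
// the top. k = 60 is the value commonly used in the RRF literature.
function rrfMerge(vectorIds: number[], keywordIds: number[], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const [rank, id] of vectorIds.entries())
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  for (const [rank, id] of keywordIds.entries())
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  // Sort by combined score, best first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

For example, an article ranked third by vector search but first by keyword search outranks articles found by only one of the two.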
The `npm run ingest` command runs a vector embedding generation pipeline that:

- Identifies unprocessed articles: finds articles without embeddings (up to 500 per run)
- Processes RTF content: converts RTF to clean plain text with proper character encoding
- Generates AI embeddings: creates 768-dimensional vector representations using Google's `text-embedding-004` model
- Stores in database: saves embeddings to the `article_embeddings` table for semantic search
```mermaid
graph LR
    A[Articles Table] -->|Find unprocessed| B[RTF Processing]
    B -->|Clean text| C[Batch Processing]
    C -->|100 articles/batch| D[Google AI API]
    D -->|768-dim vectors| E[article_embeddings Table]
    E -->|Enable| F[Semantic Search]
```
Processing Pipeline:

1. RTF to Plain Text (`processRtf` function):
   - Decodes Windows-1252 encoding for special characters (accents, ñ, etc.)
   - Detects RTF format vs plain text
   - Converts RTF escape sequences (e.g., `\'e1` → 'á')
   - Strips HTML tags
   - Combines title + content (max 8000 chars)
2. Embedding Generation (`generateEmbeddingsBatch`):
   - Processes in batches of 100 to respect API rate limits
   - Uses the Google Generative AI `text-embedding-004` model
   - Returns 768-dimensional vectors representing semantic meaning
3. Database Storage:
   - Stores vectors in the `article_embeddings` table
   - Uses the `pgvector` extension for efficient vector operations
   - Handles conflicts with `onConflictDoUpdate` (safe to re-run)
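The batching step is simple to picture in code. This generic helper is a sketch (the name `toBatches` is hypothetical); the batch size of 100 comes from the pipeline description above.

```typescript
// Split pending articles into fixed-size groups before each embeddings
// API call; the ingest pipeline uses batches of 100.
function toBatches<T>(items: T[], size = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size)
    batches.push(items.slice(i, i + size));
  return batches;
}
```

For example, 250 pending articles become three batches of 100, 100, and 50.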
Run the ingestion script in these situations:

- After importing new articles: new articles need embeddings for semantic search
- After database restoration: embeddings may be missing after restoring from a backup
- To update existing embeddings: re-run to regenerate embeddings with improved models

```bash
cd web
npm run ingest
```

The script is idempotent: it only processes articles without embeddings, so it is safe to run multiple times.
The ingestion script requires:

- Environment Variable: `GEMINI_KEY` in `.env`
- Database Extension: PostgreSQL with the `pgvector` extension enabled
- Network Access: an internet connection to call the Google AI API
| Metric | Value |
|---|---|
| Batch Size | 100 articles per API call |
| Max Articles per Run | 500 articles |
| Embedding Dimensions | 768 |
| Processing Time | ~2-5 seconds per 100 articles |
Vector embeddings enable searches like:
- Conceptual matching: Search "obituary" finds articles about "necrología"
- Multilingual understanding: Handles Spanish and English semantic similarity
- Context awareness: Understands phrases and context, not just keywords
- Typo tolerance: Similar vectors even with spelling variations
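Under the hood, all of these capabilities reduce to comparing embedding vectors. `pgvector` does this in the database, but the core measure (cosine similarity) is easy to sketch in plain TypeScript for intuition:

```typescript
// Cosine similarity between two embedding vectors: 1 means identical
// direction (same meaning), 0 means unrelated. pgvector computes the
// equivalent cosine *distance* with its <=> operator.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

In the archive's case, the 768-dimensional vectors for "obituary" and "necrología" end up close in this sense, which is why the conceptual matches above work.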
No articles processed:

```
✅ All articles already have embeddings.
```

This is normal: all articles are already indexed.
API errors:

```bash
# Add to .env.local
GEMINI_KEY=your_key_here
```

- Check that `GEMINI_KEY` is set correctly in `.env`
- Verify the API quota hasn't been exceeded
- Ensure network connectivity
Character encoding issues:
- The script handles Windows-1252 encoding automatically
- RTF escape sequences are converted to proper characters
- If you see garbled text, check the source RTF format
- Node.js 18+
- PostgreSQL (with the `pgvector` extension)
- Python 3+ (for migration scripts)
The web client is located in the `web/` directory.
1. Install dependencies:

   ```bash
   cd web
   npm install
   ```

2. Environment Setup: Copy `.env.example` (if available) or create a `.env` file with your database credentials:

   ```bash
   DATABASE_URL="postgresql://user:password@host:port/dbname"
   ```

3. Run Development Server:

   ```bash
   npm run dev
   ```
To populate the database from legacy sources:
1. Extract Data: Use the scripts in `scripts/` to convert MDB files to CSV.

   ```bash
   cd scripts
   ./extract_mdb_data.sh
   ```

2. Database Migration & Seeding: The project uses Drizzle ORM for schema management and seeding.

   ```bash
   # From the web directory
   npm run db:push
   ```

3. Date Normalization: Historical dates are extracted from article content using a custom PostgreSQL function (see ADR-005). This happens automatically during migration.

4. Generate Embeddings: Process RTF content and generate vector embeddings for semantic search.

   ```bash
   npm run ingest
   ```
The application uses a robust fallback strategy to ensure high availability. It attempts to use providers in the following order:
- Groq (Primary - Fast)
- Cerebras (High Throughput)
- Google Gemini (Reliable Free Tier)
- Moonshot AI (Large Context)
- DeepSeek (Cost Effective)
- OpenAI (Ultimate Reliability)
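The fallback chain above amounts to "try each provider in order, return the first success". A minimal sketch, assuming a hypothetical `Provider` interface and `complete` method (the real integration presumably goes through the providers' SDKs):

```typescript
// Hypothetical provider shape: a name plus an async completion call.
type Provider = { name: string; complete: (prompt: string) => Promise<string> };

// Try providers in priority order; return the first successful answer,
// or rethrow the last failure if every provider is down.
async function completeWithFallback(providers: Provider[], prompt: string): Promise<string> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return await p.complete(prompt);
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next provider
    }
  }
  throw lastError ?? new Error("no providers configured");
}
```

With Groq first and OpenAI last, a Groq outage transparently shifts traffic down the chain instead of failing the request.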
To enable the full fallback chain, configure the following keys in your `.env` or `.env.local` file:
- Groq: Sign up at console.groq.com
- Cerebras: Get key from cloud.cerebras.ai
- Google (Gemini): Get key from aistudio.google.com
- Moonshot (Kimi): Register at platform.moonshot.cn (Requires ~$1 recharge)
- DeepSeek: Sign up at platform.deepseek.com
- OpenAI: Get key from platform.openai.com
```bash
GROQ_KEY=...
Cerebras_KEY=...
GOOGLE_GENERATIVE_AI_API_KEY=...
MOONSHOT_API_KEY=...
DEEPSEEK_API_KEY=...
OPENAI_API_KEY=...
```