El Diario de Avisos is a comprehensive digital archive and semantic search platform designed to preserve and make accessible historical publications, essays, and articles. The project modernizes a legacy database system into a performant, vector-enabled web application.
- `web/`: The modern Next.js web application. Handles the UI, database queries via Drizzle ORM, and AI/vector search logic.
- `scripts/`: Utilities for extracting and transforming data from legacy MDB (Microsoft Access) files into a format suitable for PostgreSQL.
- `data/`: Intermediate storage for extracted CSVs and SQL dumps.
- `music/`: Assets for the application's audio playback features.
- `docs/`: Technical documentation and Architecture Decision Records (ADRs).
Follow the mandatory Verification Policy before reporting any task as complete: assistants must ensure the project passes type-checking and linting project-wide.
- Historical Search: Full-text and semantic search across 19th-century articles.
- AI Historian: RAG-powered chatbot that answers questions based on archive content.
- Essays & Collections: Curated essays linking to relevant historical articles.
- Audio Experience: Background music playback for immersive reading.
The system utilizes a Next.js App Router architecture integrated with a vector-capable PostgreSQL database for semantic search capabilities.
```mermaid
graph TD
    Client[Web Browser]

    subgraph Frontend [Next.js Web App]
        Pages[App Router Pages]
        Components[Shadcn/UI Components]
        Providers[Context Providers]
    end

    subgraph BackendLayer [Server Actions & Libs]
        ServerActions[Server Actions]
        VectorStore[Vector Store Lib]
        DrizzleClient[Drizzle Client]
    end

    subgraph Persistence [PostgreSQL + pgvector]
        RelationalData["Relational Data<br/>(Articles, Authors)"]
        VectorData["Vector Embeddings<br/>(768 dims)"]
    end

    subgraph Pipeline [Data Ingestion Pipeline]
        Legacy[Legacy MDB Files]
        Shell[Bash Scripts]
        Python[Python Processing]
        IngestScript[Vector Ingest Script]
    end

    Client --> Pages
    Pages --> ServerActions
    ServerActions --> DrizzleClient
    ServerActions --> VectorStore
    DrizzleClient --> RelationalData
    VectorStore --> VectorData
    Legacy --> Shell
    Shell --> Python
    Python --> RelationalData
    RelationalData --> IngestScript
    IngestScript --> VectorData
```
The database is fully normalized and managed via Drizzle ORM. Below is a simplified Entity-Relationship Diagram (ERD) of the core modules.
```mermaid
erDiagram
    PUBLICATIONS ||--o{ ISSUES : "publishes"
    ISSUES ||--o{ ARTICLES : "contains"
    AUTHORS ||--o{ ARTICLES : "writes"
    ARTICLES ||--o{ ARTICLE_TOPICS : "classified by"
    SUBJECTS ||--o{ ARTICLE_TOPICS : "defines"
    ARTICLES ||--o{ ARTICLE_IMAGES : "includes"
    IMAGES ||--o{ ARTICLE_IMAGES : "displayed in"
    ARTICLES ||--o| ARTICLE_EMBEDDINGS : "has vector"
    MEMBERS ||--o{ ESSAYS : "authors"
    ESSAYS ||--o{ ESSAY_ARTICLES : "links to"
    ARTICLES ||--|{ ESSAY_ARTICLES : "linked from"

    ARTICLES {
        int id PK
        varchar title
        varchar date "YYYY-MM-DD"
        bytea content "RTF/Text"
        boolean isEditable
    }
    AUTHORS {
        int id PK
        varchar name
        varchar reference
    }
    PUBLICATIONS {
        int id PK
        varchar name
        timestamp foundedDate
    }
    ISSUES {
        int id PK
        varchar number
        timestamp date
    }
    SUBJECTS {
        int id PK
        varchar name
        boolean isSubject
    }
    ARTICLE_EMBEDDINGS {
        int articleId FK
        vector embedding "768 dim"
    }
```
The application implements PostgreSQL's native full-text search with Spanish language support, indexing full article content for comprehensive search coverage.
- Content Indexing: Searches across full article content, not just titles
- RTF Stripping: Automatically removes RTF formatting for clean text indexing
- Spanish Language Support: Uses PostgreSQL's `spanish` text search configuration for proper stemming and stop words
- Accent-Insensitive: Searches for "musica" will find "MÚSICA" automatically
- Relevance Ranking: Results sorted by `ts_rank` score with weighted fields
- High Performance: GIN index enables sub-second searches across thousands of articles
- Automatic Updates: Trigger-based search vector maintenance requires no manual intervention
The search functionality indexes article titles and full content in a tsvector column with weighted priorities:
- Title (weight 'A'): Highest priority - exact title matches rank first
- Content (weight 'C'): Lower priority - content matches rank after title matches
RTF Processing: Article content is stored as RTF or plain text in `bytea` format. Before indexing, a PostgreSQL function (`strip_rtf_content`) automatically:
- Decodes from Windows-1252 encoding
- Detects RTF vs plain text format
- Removes RTF control words (`\rtf1`, `\ansi`, etc.)
- Removes RTF hex sequences (`\'e1` for accented characters)
- Normalizes whitespace
- Returns clean, searchable text
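The cleanup steps above can be sketched in TypeScript. This is a minimal, illustrative re-implementation of what the `strip_rtf_content` SQL function does, not the actual function; the regexes and the use of Latin-1 character codes (which match Windows-1252 for accented letters) are simplifying assumptions.

```typescript
// Illustrative sketch of RTF cleanup (the real work is done in PostgreSQL
// by strip_rtf_content; this mirrors the documented steps only).
function stripRtf(raw: string): string {
  // Plain text passes through untouched.
  if (!raw.startsWith("{\\rtf")) return raw.trim();
  return raw
    // Convert hex escapes like \'e1 to characters (Latin-1 codes, which
    // agree with Windows-1252 for the accented-letter range).
    .replace(/\\'([0-9a-fA-F]{2})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16)))
    // Remove control words such as \rtf1, \ansi, \par...
    .replace(/\\[a-z]+-?\d*\s?/gi, " ")
    // Remove group braces.
    .replace(/[{}]/g, " ")
    // Normalize whitespace.
    .replace(/\s+/g, " ")
    .trim();
}
```

For example, `stripRtf("{\\rtf1\\ansi m\\'fasica}")` yields `"música"`, while plain-text input is returned unchanged.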
When you search, PostgreSQL:
- Converts your query to a `tsquery` using Spanish stemming
- Matches against the indexed `search_vector` using the `@@` operator
- Ranks results by relevance using `ts_rank`
- Returns sorted results with the most relevant first
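Put together, the steps above correspond to a single ranked query. The sketch below builds that SQL as a parameterized string; the `articulos` table and `arti_titulo` column appear in the trigger shown later in this document, while `arti_id` and the use of `plainto_tsquery` are assumptions for illustration.

```typescript
// Hypothetical helper that returns the ranked full-text query described
// above, ready to be executed with one bound parameter (the user's input).
function buildSearchSql(): string {
  return `
    SELECT arti_id, arti_titulo,
           ts_rank(search_vector, q) AS rank
      FROM articulos,
           plainto_tsquery('spanish', $1) AS q
     WHERE search_vector @@ q
     ORDER BY rank DESC
     LIMIT 20`;
}
```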
| Feature | Coverage |
|---|---|
| Articles Indexed | 9,911 / 9,911 (100%) |
| Content Searchable | Full article text |
| RTF Handling | Automatic stripping |
| Performance | Sub-second searches |
| Language Support | Spanish stemming & accents |
The search vector is automatically maintained via a PostgreSQL trigger:
```sql
CREATE TRIGGER articulos_search_vector_trigger
BEFORE INSERT OR UPDATE OF arti_titulo, arti_contenido
ON articulos
FOR EACH ROW EXECUTE FUNCTION articulos_search_vector_update();
```

The trigger function combines title and cleaned content with weights:

```sql
NEW.search_vector :=
    setweight(to_tsvector('spanish', coalesce(NEW.arti_titulo, '')), 'A') ||
    setweight(to_tsvector('spanish', strip_rtf_content(NEW.arti_contenido)), 'C');
```

To set up full-text search on a new database:

```bash
cd web
npx tsx scripts/migrate-search.ts
```

This will:
- Add the `search_vector` column
- Create the GIN index
- Create the RTF stripping function
- Set up the trigger function
- Populate search vectors for existing articles
The application implements Retrieval-Augmented Generation (RAG) using a Hybrid Search approach that combines:
- Vector Search: Semantic understanding using Google's `text-embedding-004` (768 dims).
- Keyword Search: Precise matching using PostgreSQL's `unaccent` and `tsvector` (see ADR-001).
- LLM Generation: Answers are synthesized by Llama 3 (via Groq) using the retrieved context.

This ensures the chatbot can answer questions about specific people (e.g., "José Ánjel Montero") even if the semantic embedding is slightly off, while still handling abstract queries.
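One common way to combine the two result lists of a hybrid search is Reciprocal Rank Fusion (RRF). The sketch below is illustrative only; the project's actual merging logic is not specified in this document, and the function name and `k` constant are assumptions.

```typescript
// Hypothetical RRF merge: each document gets 1 / (k + rank) per list it
// appears in; documents found by BOTH vector and keyword search rise to
// the top. k = 60 is the value commonly used in the RRF literature.
function rrfMerge(vectorIds: number[], keywordIds: number[], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const [rank, id] of vectorIds.entries())
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  for (const [rank, id] of keywordIds.entries())
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  // Sort by combined score, best first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

For example, an article ranked third by vector search but first by keyword search outranks articles found by only one of the two.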
The `npm run ingest` command runs a vector embedding generation pipeline that:

- Identifies unprocessed articles: finds articles without embeddings (up to 500 per run)
- Processes RTF content: converts RTF to clean plain text with proper character encoding
- Generates AI embeddings: creates 768-dimensional vector representations using Google's `text-embedding-004` model
- Stores in database: saves embeddings to the `article_embeddings` table for semantic search
```mermaid
graph LR
    A[Articles Table] -->|Find unprocessed| B[RTF Processing]
    B -->|Clean text| C[Batch Processing]
    C -->|100 articles/batch| D[Google AI API]
    D -->|768-dim vectors| E[article_embeddings Table]
    E -->|Enable| F[Semantic Search]
```
Processing Pipeline:

1. RTF to Plain Text (`processRtf` function):
   - Decodes Windows-1252 encoding for special characters (accents, ñ, etc.)
   - Detects RTF format vs plain text
   - Converts RTF escape sequences (e.g., `\'e1` → 'á')
   - Strips HTML tags
   - Combines title + content (max 8000 chars)
2. Embedding Generation (`generateEmbeddingsBatch`):
   - Processes in batches of 100 to respect API rate limits
   - Uses the Google Generative AI `text-embedding-004` model
   - Returns 768-dimensional vectors representing semantic meaning
3. Database Storage:
   - Stores vectors in the `article_embeddings` table
   - Uses the `pgvector` extension for efficient vector operations
   - Handles conflicts with `onConflictDoUpdate` (safe to re-run)
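The batching step is simple to picture in code. This generic helper is a sketch (the name `toBatches` is hypothetical); the batch size of 100 comes from the pipeline description above.

```typescript
// Split pending articles into fixed-size groups before each embeddings
// API call; the ingest pipeline uses batches of 100.
function toBatches<T>(items: T[], size = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size)
    batches.push(items.slice(i, i + size));
  return batches;
}
```

For example, 250 pending articles become three batches of 100, 100, and 50.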
Run the ingestion script in these situations:

- After importing new articles: new articles need embeddings for semantic search
- After database restoration: embeddings may be missing after restoring from a backup
- To update existing embeddings: re-run to regenerate embeddings with improved models

```bash
cd web
npm run ingest
```

The script is idempotent: it only processes articles without embeddings, so it is safe to run multiple times.
The ingestion script requires:

- Environment Variable: `GEMINI_KEY` in `.env`
- Database Extension: PostgreSQL with the `pgvector` extension enabled
- Network Access: an internet connection to call the Google AI API
| Metric | Value |
|---|---|
| Batch Size | 100 articles per API call |
| Max Articles per Run | 500 articles |
| Embedding Dimensions | 768 |
| Processing Time | ~2-5 seconds per 100 articles |
Vector embeddings enable searches like:
- Conceptual matching: Search "obituary" finds articles about "necrología"
- Multilingual understanding: Handles Spanish and English semantic similarity
- Context awareness: Understands phrases and context, not just keywords
- Typo tolerance: Similar vectors even with spelling variations
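Under the hood, all of these capabilities reduce to comparing embedding vectors. `pgvector` does this in the database, but the core measure (cosine similarity) is easy to sketch in plain TypeScript for intuition:

```typescript
// Cosine similarity between two embedding vectors: 1 means identical
// direction (same meaning), 0 means unrelated. pgvector computes the
// equivalent cosine *distance* with its <=> operator.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

In the archive's case, the 768-dimensional vectors for "obituary" and "necrología" end up close in this sense, which is why the conceptual matches above work.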
No articles processed:

```
✅ All articles already have embeddings.
```

This is normal: all articles are already indexed.
API errors:

```bash
# Add to .env.local
GEMINI_KEY=your_key_here
```

- Check that `GEMINI_KEY` is set correctly in `.env`
- Verify the API quota hasn't been exceeded
- Ensure network connectivity
Character encoding issues:
- The script handles Windows-1252 encoding automatically
- RTF escape sequences are converted to proper characters
- If you see garbled text, check the source RTF format
- Node.js 18+
- PostgreSQL (with the `pgvector` extension)
- Python 3+ (for migration scripts)
The web client is located in the `web/` directory.
1. Install dependencies:

   ```bash
   cd web
   npm install
   ```

2. Environment Setup: Copy `.env.example` (if available) or create a `.env` file with your database credentials:

   ```bash
   DATABASE_URL="postgresql://user:password@host:port/dbname"
   ```

3. Run Development Server:

   ```bash
   npm run dev
   ```
To populate the database from legacy sources:
1. Extract Data: Use the scripts in `scripts/` to convert MDB files to CSV.

   ```bash
   cd scripts
   ./extract_mdb_data.sh
   ```

2. Database Migration & Seeding: The project uses Drizzle ORM for schema management and seeding.

   ```bash
   # From the web directory
   npm run db:push
   ```

3. Date Normalization: Historical dates are extracted from article content using a custom PostgreSQL function (see ADR-005). This happens automatically during migration.

4. Generate Embeddings: Process RTF content and generate vector embeddings for semantic search.

   ```bash
   npm run ingest
   ```
The application uses a robust fallback strategy to ensure high availability. It attempts to use providers in the following order:
- Groq (Primary - Fast)
- Cerebras (High Throughput)
- Google Gemini (Reliable Free Tier)
- Moonshot AI (Large Context)
- DeepSeek (Cost Effective)
- OpenAI (Ultimate Reliability)
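The fallback chain above amounts to "try each provider in order, return the first success". A minimal sketch, assuming a hypothetical `Provider` interface and `complete` method (the real integration presumably goes through the providers' SDKs):

```typescript
// Hypothetical provider shape: a name plus an async completion call.
type Provider = { name: string; complete: (prompt: string) => Promise<string> };

// Try providers in priority order; return the first successful answer,
// or rethrow the last failure if every provider is down.
async function completeWithFallback(providers: Provider[], prompt: string): Promise<string> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return await p.complete(prompt);
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next provider
    }
  }
  throw lastError ?? new Error("no providers configured");
}
```

With Groq first and OpenAI last, a Groq outage transparently shifts traffic down the chain instead of failing the request.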
To enable the full fallback chain, configure the following keys in your `.env` or `.env.local` file:
- Groq: Sign up at console.groq.com
- Cerebras: Get key from cloud.cerebras.ai
- Google (Gemini): Get key from aistudio.google.com
- Moonshot (Kimi): Register at platform.moonshot.cn (Requires ~$1 recharge)
- DeepSeek: Sign up at platform.deepseek.com
- OpenAI: Get key from platform.openai.com
```bash
GROQ_KEY=...
Cerebras_KEY=...
GOOGLE_GENERATIVE_AI_API_KEY=...
MOONSHOT_API_KEY=...
DEEPSEEK_API_KEY=...
OPENAI_API_KEY=...
```