PDF Research Assistant

A local PDF research assistant built with PaperQA2. It can use either a manifest-controlled document library or all PDFs under a chosen root folder.

Core indexing and querying work on Windows, macOS, and Linux. The Streamlit Copy answer clipboard button is currently Windows-only.

Important PaperQA2 uses retrieval-augmented generation (RAG). It returns real source passages and page references, but it can still overstate, paraphrase, or extrapolate beyond what the source explicitly says. Treat answers as a starting point for exploration, not as a citable summary. Always verify important claims against the source passages and the original PDF.

Getting Started

Requirements

Python 3.11 or newer
An OpenAI API key

Install the Python dependencies:

cd ~/gitrepos/pdf-research-assistant
python -m pip install -e .

More info: where do these commands come from?

Running python -m pip install -e . installs this project in editable mode.

That does two useful things:

it makes the pdf_research_assistant package importable in your current Python environment
it creates the pdf-research and pdf-research-rebuild console commands defined in pyproject.toml

These are three ways to run the rebuild flow:

Installed console command: pdf-research-rebuild
Module form: python -m pdf_research_assistant.rebuild
Direct file path form: python src/pdf_research_assistant/rebuild.py

The first is the normal user-facing command. The second is a Python module form many developers prefer. The third is the old-school "run this file directly" form.

Create an OpenAI API key in the OpenAI dashboard:

https://platform.openai.com/api-keys

Setup

Copy .env.example to .env.
Set OPENAI_API_KEY in .env.
Set PAPER_DIR in .env to the root folder containing your PDFs.
Optional: copy manifest.example.csv to manifest.csv if you want curated scope and metadata.
If you use manifest.csv, replace the example rows with paths relative to your chosen PAPER_DIR.

Project Layout

This project uses a standard Python src layout.

src/pdf_research_assistant/ contains the application code
tests/ contains the test suite
pyproject.toml defines dependencies and the pdf-research and pdf-research-rebuild commands

The pdf_research_assistant folder under src is the importable package, so code uses imports such as from pdf_research_assistant.bootstrap import build_settings.

Usage

Streamlit UI

cd ~/gitrepos/pdf-research-assistant
streamlit run src/pdf_research_assistant/app.py

Streamlit usually opens the app in your browser automatically and prints the local URL in the terminal. By default, it uses http://localhost:8501 unless that port is already in use.

The Streamlit sidebar shows:

query count for the current session
total session cost
a button to clear the chat

Each assistant response also includes:

a Copy answer button that copies the full answer text to the clipboard on Windows
a Show source passages expander with the retrieved evidence passages used for the answer

Clipboard support for the Copy answer button is currently implemented for Windows only.

On first use, there is no search index yet. The first query builds it, which can take a while for a large PDF library.

Each question runs in a fresh helper process so repeated questions in the same session start with clean query state.

CLI

cd ~/gitrepos/pdf-research-assistant
pdf-research

Type a question at the prompt to search your indexed PDFs and return a cited answer with page references. Type quit to exit.

Like the Streamlit app, the CLI runs each question in a fresh helper process so repeated questions start with clean query state.

Rebuild Index

cd ~/gitrepos/pdf-research-assistant
pdf-research-rebuild

Use this when:

running the project for the first time and you want to build the index explicitly
you add new PDFs and want to rebuild the index before querying again
you want a terminal-only indexing run instead of letting the first query build the index

On a clean rebuild, it is normal to see Manifest PDFs: <n> and Indexed before run: 0 before indexing starts.

See windows-helper-commands.example.md for optional PowerShell commands that help check index build progress and troubleshoot rebuild issues on Windows. If you want a version with your own local paths ready to copy and paste, create windows-helper-commands.md from it.

Environment Variables

The app and CLI read settings from environment variables and load values from .env when present.

By default, INDEX_DIR and MANIFEST_PATH are resolved relative to the repository root, so the project can be moved without changing code. PAPER_DIR can point to any folder on your system using normal paths for your OS. If manifest.csv is present, it stores paths relative to that root folder. If manifest.csv is absent, the app indexes all PDFs under PAPER_DIR.

Variable	Purpose	Default
`OPENAI_API_KEY`	OpenAI API key for PaperQA2 queries	unset
`PAPER_DIR`	Common root folder containing your PDFs	required
`INDEX_DIR`	Optional override for where the local PaperQA index is stored	`<repo-root>/index`
`MANIFEST_PATH`	Optional CSV manifest of allowed PDFs	`<repo-root>/manifest.csv`
`PDF_RESEARCH_ASSISTANT_SYNC_DIR`	Optional destination folder for post-push copies of `.env`, `manifest.csv`, and your private `windows-helper-commands.md` notes	unset

See .env.example for the expected keys. Leave INDEX_DIR and MANIFEST_PATH unset to use the repo-root defaults.

If PDF_RESEARCH_ASSISTANT_SYNC_DIR is set, the tracked post-push hook copies .env and manifest.csv when present. It also copies windows-helper-commands.md if you created a private local version for your own use.

Additional Notes

The search index is stored in the folder set by INDEX_DIR.
If manifest.csv exists, the app uses it to decide which PDFs are in scope and which metadata to use.
If manifest.csv does not exist, the app indexes all PDFs under PAPER_DIR.
windows-helper-commands.example.md is a safe-to-share template. If you want a version with your own local paths ready to copy and paste, create windows-helper-commands.md from it.
The app automatically ignores the broken loopback proxy placeholder 127.0.0.1:9 if it appears in HTTP_PROXY, HTTPS_PROXY, or ALL_PROXY.
src/pdf_research_assistant/query_once.py is an internal helper used by the app and CLI; it is not intended as a separate user entry point.
If PaperQA reports that a PDF is empty but the file opens normally, it may be image-only and need OCR before it can be indexed.
Answers cite specific pages from your PDFs when available.

Feature Requests

Planned improvements are tracked in GitHub Issues.

The current implementation order is tracked in the pinned roadmap issue: Roadmap: Current Implementation Order

If an issue matters to you, please add a thumbs-up reaction to the issue instead of commenting +1. That makes it easier to see which ideas are most useful to potential users.

Browse open ideas: GitHub Issues
Use a thumbs-up reaction to vote for an idea
Add a comment if you have a specific use case or suggestion

Adding New PDFs

Add the PDF under your configured PAPER_DIR.
If you are using a manifest, add a row to manifest.csv.
Rebuild the index with pdf-research-rebuild before querying again.
Start the UI or CLI and ask a question about your PDFs.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.githooks		.githooks
.github		.github
images		images
src/pdf_research_assistant		src/pdf_research_assistant
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
TESTING.md		TESTING.md
manifest.example.csv		manifest.example.csv
pyproject.toml		pyproject.toml
windows-helper-commands.example.md		windows-helper-commands.example.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Research Assistant

Getting Started

Requirements

Setup

Project Layout

Usage

Streamlit UI

CLI

Rebuild Index

Environment Variables

Additional Notes

Feature Requests

Adding New PDFs

About

Uh oh!

Releases 8

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PDF Research Assistant

Getting Started

Requirements

Setup

Project Layout

Usage

Streamlit UI

CLI

Rebuild Index

Environment Variables

Additional Notes

Feature Requests

Adding New PDFs

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 8

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages