A local PDF research assistant built with PaperQA2. It can use either a manifest-controlled document library or all PDFs under a chosen root folder.
Core indexing and querying work on Windows, macOS, and Linux. The Streamlit Copy answer clipboard button is currently Windows-only.
Important: PaperQA2 uses retrieval-augmented generation (RAG). It returns real source passages and page references, but it can still overstate, paraphrase, or extrapolate beyond what the source explicitly says. Treat answers as a starting point for exploration, not as a citable summary. Always verify important claims against the source passages and the original PDF.
- Python 3.11 or newer
- An OpenAI API key
Install the Python dependencies:
cd ~/gitrepos/pdf-research-assistant
python -m pip install -e .

More info: where do these commands come from?
Running python -m pip install -e . installs this project in editable mode.
That does two useful things:
- it makes the pdf_research_assistant package importable in your current Python environment
- it creates the pdf-research and pdf-research-rebuild console commands defined in pyproject.toml
These are three ways to run the rebuild flow:
- Installed console command: pdf-research-rebuild
- Module form: python -m pdf_research_assistant.rebuild
- Direct file path form: python src/pdf_research_assistant/rebuild.py
The first is the normal user-facing command. The second is a Python module form many developers prefer. The third is the old-school "run this file directly" form.
Create an OpenAI API key in the OpenAI dashboard:
- Copy .env.example to .env.
- Set OPENAI_API_KEY in .env.
- Set PAPER_DIR in .env to the root folder containing your PDFs.
- Optional: copy manifest.example.csv to manifest.csv if you want curated scope and metadata.
- If you use manifest.csv, replace the example rows with paths relative to your chosen PAPER_DIR.
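The manifest rule in the last step can be pictured with a short sketch. It assumes a hypothetical `path` column in manifest.csv and hypothetical function names; the project's real manifest columns and loader may differ. When the manifest file is absent, the sketch falls back to every PDF under the root folder.

```python
import csv
from pathlib import Path


def manifest_pdf_paths(manifest_path: Path, paper_dir: Path) -> list[Path]:
    """Resolve manifest rows against the PDF root folder.

    Assumes a hypothetical "path" column holding paths relative to paper_dir.
    """
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return [paper_dir / row["path"] for row in rows if row.get("path")]


def pdfs_in_scope(paper_dir: Path, manifest_path: Path) -> list[Path]:
    """Use the manifest when it exists, otherwise every PDF under paper_dir."""
    if manifest_path.is_file():
        return manifest_pdf_paths(manifest_path, paper_dir)
    # Note: this pattern only matches a lowercase ".pdf" suffix on
    # case-sensitive file systems.
    return sorted(paper_dir.rglob("*.pdf"))
```

Because rows are stored relative to the root folder, the same manifest keeps working if the PDF library moves and only PAPER_DIR changes.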
This project uses a standard Python src layout.
- src/pdf_research_assistant/ contains the application code
- tests/ contains the test suite
- pyproject.toml defines dependencies and the pdf-research and pdf-research-rebuild commands
The pdf_research_assistant folder under src is the importable package, so code uses imports such as from pdf_research_assistant.bootstrap import build_settings.
cd ~/gitrepos/pdf-research-assistant
streamlit run src/pdf_research_assistant/app.py

Streamlit usually opens the app in your browser automatically and prints the local URL in the terminal. By default, it uses http://localhost:8501 unless that port is already in use.
The Streamlit sidebar shows:
- query count for the current session
- total session cost
- a button to clear the chat
Each assistant response also includes:
- a Copy answer button that copies the full answer text to the clipboard on Windows
- a Show source passages expander with the retrieved evidence passages used for the answer
Clipboard support for the Copy answer button is currently implemented for Windows only.
On first use, there is no search index yet. The first query builds it, which can take a while for a large PDF library.
Each question runs in a fresh helper process so repeated questions in the same session start with clean query state.
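The fresh-process pattern can be sketched as follows. This is an illustration only: the inline `HELPER_SOURCE` script, the JSON reply shape, and the `ask_in_fresh_process` name are assumptions standing in for the project's internal helper, whose real interface is not described here.

```python
import json
import subprocess
import sys

# Stand-in for the real helper: reads a question from stdin and writes a
# JSON reply to stdout. The reply fields are hypothetical.
HELPER_SOURCE = """
import json, sys
question = sys.stdin.read().strip()
print(json.dumps({"question": question, "answer": "stub answer"}))
"""


def ask_in_fresh_process(question: str, timeout: float = 60.0) -> dict:
    """Run one question in a new Python process and parse its JSON reply."""
    completed = subprocess.run(
        [sys.executable, "-c", HELPER_SOURCE],
        input=question,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the helper exits non-zero
    )
    return json.loads(completed.stdout)
```

Since every call starts a new interpreter, nothing held in memory by one question can leak into the next; the cost is a process start-up per query.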
cd ~/gitrepos/pdf-research-assistant
pdf-research

Type a question at the prompt to search your indexed PDFs and return a cited answer with page references. Type quit to exit.
Like the Streamlit app, the CLI runs each question in a fresh helper process so repeated questions start with clean query state.
cd ~/gitrepos/pdf-research-assistant
pdf-research-rebuild

Use this when:
- running the project for the first time and you want to build the index explicitly
- you add new PDFs and want to rebuild the index before querying again
- you want a terminal-only indexing run instead of letting the first query build the index
On a clean rebuild, it is normal to see Manifest PDFs: <n> and Indexed before run: 0 before indexing starts.
See windows-helper-commands.example.md for optional PowerShell commands that help check index build progress and troubleshoot rebuild issues on Windows. If you want a version with your own local paths ready to copy and paste, create windows-helper-commands.md from it.
The app and CLI read settings from environment variables and load values from .env when present.
By default, INDEX_DIR and MANIFEST_PATH are resolved relative to the repository root, so the project can be moved without changing code. PAPER_DIR can point to any folder on your system using normal paths for your OS. If manifest.csv is present, it stores paths relative to that root folder. If manifest.csv is absent, the app indexes all PDFs under PAPER_DIR.
| Variable | Purpose | Default |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key for PaperQA2 queries | unset |
| PAPER_DIR | Common root folder containing your PDFs | required |
| INDEX_DIR | Optional override for where the local PaperQA index is stored | <repo-root>/index |
| MANIFEST_PATH | Optional CSV manifest of allowed PDFs | <repo-root>/manifest.csv |
| PDF_RESEARCH_ASSISTANT_SYNC_DIR | Optional destination folder for post-push copies of .env, manifest.csv, and your private windows-helper-commands.md notes | unset |
See .env.example for the expected keys. Leave INDEX_DIR and MANIFEST_PATH unset to use the repo-root defaults.
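The defaults described above can be illustrated with a small sketch. The `Settings` class and `resolve_settings` function are hypothetical names, not the project's own settings code, and using an explicit override exactly as given is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    paper_dir: Path
    index_dir: Path
    manifest_path: Path


def resolve_settings(env: Mapping[str, str], repo_root: Path) -> Settings:
    """Apply repo-root defaults for unset values; PAPER_DIR has no default."""
    paper_dir = env.get("PAPER_DIR", "").strip()
    if not paper_dir:
        raise ValueError("PAPER_DIR is required")
    index_dir = env.get("INDEX_DIR", "").strip()
    manifest_path = env.get("MANIFEST_PATH", "").strip()
    return Settings(
        paper_dir=Path(paper_dir).expanduser(),
        # Assumption: an explicit override is used exactly as given.
        index_dir=Path(index_dir) if index_dir else repo_root / "index",
        manifest_path=(
            Path(manifest_path) if manifest_path else repo_root / "manifest.csv"
        ),
    )
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the sketch easy to exercise with a plain dictionary.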
If PDF_RESEARCH_ASSISTANT_SYNC_DIR is set, the tracked post-push hook copies .env and manifest.csv when present. It also copies windows-helper-commands.md if you created a private local version for your own use.
- The search index is stored in the folder set by INDEX_DIR.
- If manifest.csv exists, the app uses it to decide which PDFs are in scope and which metadata to use.
- If manifest.csv does not exist, the app indexes all PDFs under PAPER_DIR.
- windows-helper-commands.example.md is a safe-to-share template. If you want a version with your own local paths ready to copy and paste, create windows-helper-commands.md from it.
- The app automatically ignores the broken loopback proxy placeholder 127.0.0.1:9 if it appears in HTTP_PROXY, HTTPS_PROXY, or ALL_PROXY.
- src/pdf_research_assistant/query_once.py is an internal helper used by the app and CLI; it is not intended as a separate user entry point.
- If PaperQA reports that a PDF is empty but the file opens normally, it may be image-only and need OCR before it can be indexed.
- Answers cite specific pages from your PDFs when available.
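The proxy note above can be illustrated with a short sketch of one way to detect and drop the placeholder. The function names are hypothetical, and matching lowercase variable names such as http_proxy is an assumption; the app's actual check may be implemented differently.

```python
from typing import Mapping
from urllib.parse import urlsplit

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY")


def is_placeholder_proxy(value: str) -> bool:
    """Return True when the proxy value points at 127.0.0.1 port 9."""
    candidate = value.strip()
    if "://" not in candidate:
        # Without a scheme, urlsplit needs a leading "//" to see host:port.
        candidate = "//" + candidate
    try:
        parts = urlsplit(candidate)
        return parts.hostname == "127.0.0.1" and parts.port == 9
    except ValueError:
        # Malformed values (for example a non-numeric port) are left alone.
        return False


def without_placeholder_proxies(env: Mapping[str, str]) -> dict[str, str]:
    """Copy env, dropping proxy variables set to the placeholder."""
    return {
        key: value
        for key, value in env.items()
        if not (key.upper() in PROXY_VARS and is_placeholder_proxy(value))
    }
```

Parsing the host and port, instead of searching for the text 127.0.0.1:9, avoids treating a real local proxy on a port such as 9000 as the placeholder.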
Planned improvements are tracked in GitHub Issues.
The current implementation order is tracked in the pinned roadmap issue: Roadmap: Current Implementation Order
If an issue matters to you, please add a thumbs-up reaction to the issue instead of commenting +1. That makes it easier to see which ideas are most useful to potential users.
- Browse open ideas: GitHub Issues
- Use a thumbs-up reaction to vote for an idea
- Add a comment if you have a specific use case or suggestion
- Add the PDF under your configured PAPER_DIR.
- If you are using a manifest, add a row to manifest.csv.
- Rebuild the index with pdf-research-rebuild before querying again.
- Start the UI or CLI and ask a question about your PDFs.
