Add: built-in agent skill for AI coding assistants (agentskills.io) by hyunhee-jo · Pull Request #395 · opendataloader-project/opendataloader-pdf

hyunhee-jo · 2026-04-09T02:01:47Z

Objective

AI coding assistants (Claude Code, Codex, Cursor, Gemini CLI, VS Code, etc.) have no way to know how to use opendataloader-pdf optimally. Users must manually figure out which of 26 options, 7 formats, and 3 install methods to use for their specific PDF type and downstream pipeline. This is the most common friction point for new users.

Approach

Ship a built-in agent skill at skills/odl-pdf/ following the agentskills.io specification, compatible with 26+ AI platforms.

The skill acts as a Document Intelligence Engineer persona with a 5-phase workflow:

Phase	Description
Discover	Runs `detect-env.sh` to detect OS, Java, Python, Node.js, ODL installation
Prescribe	Recommends optimal install method, options, format, and mode via decision trees
Execute	Generates ready-to-run commands (Guide mode) or runs conversions directly (Action mode)
Diagnose	Identifies quality issues with metric-driven escalation (local → cluster → hybrid → hybrid-mode full)
Optimize	Tunes batch processing, RAG integration, and performance for production pipelines

Key design decisions:

Dual-path option reference — built-in summary in options-matrix.md + dynamic options.json lookup, so pip-install users (no source code) also get full guidance
Progressive disclosure — L1 description (always loaded) → L2 SKILL.md (on activation) → L3 references (on demand), minimizing token usage
CI drift check — skill-drift-check.yml automatically detects when CLI options change in Java but options-matrix.md isn't updated

New files

File	Purpose
`skills/odl-pdf/SKILL.md`	Main skill — persona, 5-phase workflow, decision trees, gotchas
`skills/odl-pdf/references/options-matrix.md`	26 options summarized in 6 categories + interaction rules
`skills/odl-pdf/references/hybrid-guide.md`	Hybrid mode setup, triage modes, troubleshooting
`skills/odl-pdf/references/format-guide.md`	7 output formats with downstream use mapping
`skills/odl-pdf/references/installation-matrix.md`	3 install methods + 2 variants decision tree
`skills/odl-pdf/references/eval-metrics.md`	NID, TEDS, MHS metric definitions + diagnostic guide
`skills/odl-pdf/scripts/detect-env.sh`	Cross-platform environment detection (Windows/macOS/Linux)
`skills/odl-pdf/scripts/hybrid-health.sh`	Hybrid server health check
`skills/odl-pdf/scripts/quick-eval.py`	Quick quality eval (difflib, optional rapidfuzz)
`skills/odl-pdf/scripts/sync-skill-refs.py`	Drift detection between options.json and skill references
`skills/odl-pdf/evals/evals.json`	5 eval test scenarios
`skills/README.md`	Skill creation and maintenance guide
`.claude-plugin/marketplace.json`	Plugin registry (v1.0.0)
`.github/workflows/skill-drift-check.yml`	CI: auto-detect option/skill drift

Updated files

README.md — Added Agent Skills section with install command and phase table
CLAUDE.md — Added skill development notes for internal contributors
CONTRIBUTING.md — Added skill maintenance guide for external contributors

Evidence

Ran 5 eval scenarios via independent AI agents with zero prior knowledge of opendataloader-pdf. Each agent loaded only SKILL.md and references, then responded to user queries.

Scenario	must_mention	must_not_mention	Result
500 scientific papers, RAG pipeline, source citation	hybrid, json, bounding box, batch	text as primary, unwarned loop	PASS
Korean government scanned PDFs, M1 Mac	hybrid, --force-ocr, --ocr-lang, ko,en, java	local sufficient, GPU	PASS
Broken tables, default settings	--table-method cluster, hybrid, TEDS	no workaround, --use-struct-tree as fix	PASS
Windows + Node.js, hybrid setup	pip install, opendataloader-pdf-hybrid, two terminals, java	npm-only hybrid, GPU	PASS
Math formulas + chart descriptions	--enrich-formula, --enrich-picture-description, --hybrid-mode full	auto mode works, client-side enrichments	PASS

Drift check: sync-skill-refs.py confirmed 26/26 options in sync between options.json and options-matrix.md.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Adds an opendataloader-pdf agent skill and marketplace manifest, plus CLI helpers for environment detection, hybrid-server health checks, and quick extraction quality evaluation; includes evaluation scenarios/benchmarks.
Documentation
- Large docs added/updated: skill guide, installation matrix, hybrid/format/eval guides, options matrix, README, CONTRIBUTING, and CLAUDE notes.
Chores
- CI workflow to detect drift between CLI options and docs; minor .gitignore cleanup.

coderabbitai · 2026-04-09T02:02:03Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds an Agent Skills subsystem for opendataloader-pdf: a Claude marketplace manifest, skill documentation and references, evaluation scenarios, utility CLI/scripts (env detection, hybrid health, quick-eval, sync-skill-refs), and a CI workflow to detect options drift; minor .gitignore and top-level docs updates.

Changes

Cohort / File(s)	Summary
Marketplace manifest `.claude-plugin/marketplace.json`	New Claude marketplace manifest registering `opendataloader-pdf` and the `odl-pdf-skills` entry pointing to `./skills/odl-pdf`.
CI / Workflow / Ignore `.github/workflows/skill-drift-check.yml`, `.gitignore`	New GitHub Actions workflow running `skills/odl-pdf/scripts/sync-skill-refs.py` on pushes/PRs affecting `options.json`; trivial `.gitignore` trailing-line removal.
Top-level docs `CLAUDE.md`, `CONTRIBUTING.md`, `README.md`	Added "Agent Skills" sections and contributor guidance requiring updates to `skills/odl-pdf/references/options-matrix.md` when CLI options change; documents CI drift-check.
Skills authoring guide `skills/README.md`	New skills authoring and loading model, directory layout, testing steps, options reference strategy, and CI drift-check instructions.
Skill content & evals `skills/odl-pdf/SKILL.md`, `skills/odl-pdf/evals/evals.json`	Comprehensive SKILL.md (five-phase workflow, modes, gotchas) and an eval manifest with five scenario-driven tests.
Reference docs `skills/odl-pdf/references/...` `.../options-matrix.md`, `installation-matrix.md`, `format-guide.md`, `hybrid-guide.md`, `eval-metrics.md`	Added reference docs: 26-option matrix with interaction rules, installation matrix, output format guide, hybrid-mode guide, and evaluation metrics/diagnostics.
Utility scripts `skills/odl-pdf/scripts/detect-env.sh`, `skills/odl-pdf/scripts/hybrid-health.sh`, `skills/odl-pdf/scripts/quick-eval.py`, `skills/odl-pdf/scripts/sync-skill-refs.py`	Added environment detector (key=value), hybrid health probe (key=value), quick-eval text-similarity CLI (rapidfuzz fallback), and sync-skill-refs drift-check script (JSON ⇄ matrix comparison).

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer (PR)
    participant GH as GitHub Actions
    participant Repo as Repository
    participant Sync as sync-skill-refs.py

    Dev->>GH: Push / open PR (affecting `options.json`)
    GH->>Repo: checkout
    GH->>Sync: run `python skills/odl-pdf/scripts/sync-skill-refs.py`
    Sync->>Repo: read `options.json`
    Sync->>Repo: read `skills/odl-pdf/references/options-matrix.md`
    Sync-->>GH: exit 0 (match) or exit 1 + drift report

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

bundolee
MaximPlusov
LonelyMidoriya

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 77.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: adding a built-in agent skill for AI coding assistants compatible with the agentskills.io standard, which is the primary objective of the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-04-09T02:05:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Objective: AI coding assistants (Claude Code, Codex, Cursor, etc.) have no way to know how to use opendataloader-pdf optimally. Users must manually figure out which of 26 options, 7 formats, and 3 install methods to use for their specific PDF type and downstream pipeline. Approach: Ship a built-in agent skill at skills/odl-pdf/ following the agentskills.io spec. The skill acts as a Document Intelligence Engineer persona that detects the user's environment, recommends optimal options via decision trees, executes conversions directly, and diagnoses quality issues with metric-driven escalation (local → cluster → hybrid → hybrid-mode full). Key design decisions: - Dual-path option reference (built-in summary + dynamic options.json) so pip-install users without source code also get full guidance - Progressive disclosure (L1 description → L2 SKILL.md → L3 references) to minimize token usage - CI drift check (skill-drift-check.yml) catches option/skill mismatch when CLI options change in Java New files: - skills/odl-pdf/ — SKILL.md, 5 references, 3 scripts, evals - skills/README.md — skill creation and maintenance guide - .claude-plugin/marketplace.json — plugin registry - .github/workflows/skill-drift-check.yml — CI drift detection Updated: README.md (Agent Skills section), CLAUDE.md (skill dev notes), CONTRIBUTING.md (skill maintenance guide) Evidence: Ran 5 eval scenarios via independent AI agents with zero prior knowledge of opendataloader-pdf. Each agent loaded only SKILL.md and responded to user queries. All 5 passed must_mention/must_not_mention checks covering RAG pipelines, Korean OCR, table diagnostics, Windows Node.js hybrid setup, and formula+chart enrichment. Drift check (sync-skill-refs.py) confirmed 26/26 options in sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 21

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 5-18: Add explicit least-privilege workflow permissions by adding
a top-level permissions block (before jobs) that limits GITHUB_TOKEN to only
what's required for the read-only drift check; for this workflow add something
like permissions: contents: read (so the check-drift job can read repository
files but nothing else). Update the .github/workflows/skill-drift-check.yml file
and ensure the permissions block is present at the workflow root above the
check-drift job.
- Around line 9-13: The workflow trigger is too narrow because the current
'paths' filters only include 'options.json' for both push and pull_request
events; update the GitHub Actions workflow
(.github/workflows/skill-drift-check.yml) to include the related docs file(s)
that can introduce drift (e.g., add
'skills/odl-pdf/references/options-matrix.md' or a broader pattern like
'skills/**' to the 'paths' lists) so the 'push' and 'pull_request' triggers run
when those files change; modify the existing 'paths' entries under both the
top-level event and the 'pull_request' block rather than adding a new event
block.
- Around line 27-33: The workflow message references a non-existent --fix flag
for the script sync-skill-refs.py; either update the message in the workflow to
instruct contributors to run the correct command (remove --fix or replace it
with the actual supported flag from sync-skill-refs.py) or add a --fix/--apply
option to the sync-skill-refs.py CLI (update its argument parser and handler) so
the workflow guidance is accurate; search for sync-skill-refs.py and the
workflow snippet that echoes "Run 'python
skills/odl-pdf/scripts/sync-skill-refs.py --fix' ..." and change the guidance or
implement the flag accordingly.

In @.gitignore:
- Line 79: The .gitignore contains a redundant specific entry
"skills/odl-pdf/scripts/__pycache__/" which is already covered by the broader
pattern "**/__pycache__/"; remove the specific
"skills/odl-pdf/scripts/__pycache__/" line (or keep only the global
"**/__pycache__/" entry) to eliminate duplication and keep the file DRY.

In `@skills/odl-pdf/references/eval-metrics.md`:
- Around line 182-186: Update the documentation in
skills/odl-pdf/references/eval-metrics.md to accurately reflect quick-eval.py's
CLI: state that quick-eval.py requires two positional arguments (extracted and
ground_truth) and optionally supports --verbose, and replace the incorrect
description about subset/per-metric benchmarking with a precise one-line summary
that the script compares a single extracted file against a single ground-truth
file and reports a single similarity score (and optional diff snippets when
--verbose is used); reference the script name quick-eval.py and the positional
argument names extracted and ground_truth so the change is easy to locate.

In `@skills/odl-pdf/references/format-guide.md`:
- Around line 47-49: The fenced code blocks containing the CLI examples
"opendataloader-pdf input.pdf --format markdown,json" and "opendataloader-pdf
input.pdf --format text --to-stdout | my-indexer" are missing language
identifiers; update those triple-backtick fences to include a language tag
(e.g., bash) so markdownlint MD040 is satisfied and the snippets are highlighted
correctly.

In `@skills/odl-pdf/references/installation-matrix.md`:
- Line 7: The fenced code blocks in installation-matrix.md are missing language
specifiers causing linter warnings; update each triple-backtick block that
contains CLI output or commands to include an appropriate language tag (e.g.,
replace ``` with ```text for prompt/decision text like "Do you have Python
available?" and use ```bash for command lines such as "java -version" and
"opendataloader-pdf --version"); apply the same change to the other occurrences
called out (lines with the same blocks at 33 and 92) so all fenced blocks have
proper language specifiers.
- Around line 78-85: Update the installation matrix table rows for the runtimes
so they match the package manifests: change the "Minimum Runtime" for the
pip/pip [hybrid]/pip langchain rows to "Python 3.10+" (per pyproject.toml
requires-python = ">=3.10"), change the "Minimum Runtime" for the npm row to
"Node.js 20.19.0+" (per package.json "node": ">=20.19.0"), and leave the Maven
row as "Java 11+"; update the pip/npm table cells in installation-matrix.md
(rows identified by "pip", "pip [hybrid]", "pip langchain", "npm", and "Maven")
so the documented versions match the manifests.

In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 18: The fallback branch in detect-env.sh currently maps unknown platforms
to "linux" (the case pattern `*) echo "linux" ;;`), which can mislead callers;
change the fallback to emit "unknown" instead so callers can handle
unsupported/undetected OSes explicitly—update the `*)` case in the script (the
fallback case in the main OS-detection logic) to return or print "unknown"
rather than "linux".
- Line 37: The grep pipelines that extract values (for example the ver
assignment using grep -oE '"[^"]+"') can cause the script to exit under set -euo
pipefail when there is no match; update each such pipeline (the ver assignment
and the other grep-based extractions at the same pattern locations) to append
"|| true" so that grep failures do not abort the script and the subsequent
${...:-none} fallbacks and conditional checks remain reachable.

In `@skills/odl-pdf/scripts/hybrid-health.sh`:
- Line 55: The message hardcodes "--port 5002"; update the echo to use the
script's resolved host/port or URL variables instead (e.g., the variable(s)
populated when parsing --url/--port such as HYBRID_URL or HYBRID_PORT) so the
hint reflects the actual endpoint the script is checking; modify the echo in
hybrid-health.sh (the stopped-server hint line) to print the resolved URL/port
(or suggest using "--url <resolved-url>" if a custom URL was provided) rather
than the literal 5002.
- Around line 15-17: When handling the --url case in the argument parser, guard
against missing/empty values before assigning HYBRID_URL="$2" (this currently
fails under set -u). Change the --url branch so it first checks that the next
positional parameter exists and is not another flag (e.g. test that "${2-}" is
non-empty and does not start with '-') and if the check fails emit a clear error
and exit non-zero; otherwise assign HYBRID_URL="$2" and shift 2. Ensure you
reference the --url case and the HYBRID_URL variable when making this change.

In `@skills/odl-pdf/scripts/quick-eval.py`:
- Around line 146-155: The snippet builder currently can append a pre-hunk file
header (lines like '--- ' or '+++ ') as its own snippet; update the loop that
iterates over diff to skip unified-diff file header lines when current_hunk is
empty (ignore lines starting with '--- ' or '+++ ' before the first '@@'), or
alternatively, before appending a current_hunk to snippets check and discard
hunks that contain only header lines; modify the logic around the variables
diff, current_hunk, snippets, and max_snippets so only real hunks starting with
'@@' are captured as snippets.

In `@skills/odl-pdf/scripts/sync-skill-refs.py`:
- Around line 177-185: The two print calls that currently use pointless
f-strings should be changed to plain string literals: replace print(f"\nNEW
options (in options.json, not in skill):") and print(f"\nREMOVED options (in
skill, not in options.json):") with print("\nNEW options (in options.json, not
in skill):") and print("\nREMOVED options (in skill, not in options.json):")
respectively so the linter warning is resolved; these lines appear near the
handling of the new_options and removed_options variables in sync-skill-refs.py.
- Around line 31-35: The try-except-pass around sys.stdout.reconfigure can be
simplified using contextlib.suppress for clearer intent: replace the try/except
block that calls sys.stdout.reconfigure(encoding="utf-8") with a
contextlib.suppress(Exception) context manager so any exception is silently
ignored; reference the existing symbol sys.stdout.reconfigure and import
contextlib at the top if not already present.

In `@skills/odl-pdf/SKILL.md`:
- Around line 558-559: The pipeline in the example uses opendataloader-pdf with
--format markdown and then pipes to jq, which fails because jq expects JSON;
update the example to either emit JSON by changing --format markdown to --format
json (e.g., use opendataloader-pdf input.pdf --format json --to-stdout | jq .)
or remove the jq pipe and keep markdown output; modify the command shown in the
diff to use the chosen approach so the output type matches the consumer (jq or
stdout).
- Around line 13-14: The SKILL.md file has markdownlint warnings (MD022, MD031,
MD040) caused by missing blank lines around headings and fenced code blocks and
by fence blocks lacking a language tag; update SKILL.md by inserting a single
blank line before and after each heading (fix MD022), ensure fenced code blocks
are separated from surrounding text with blank lines (fix MD031), and add
appropriate language identifiers to all fenced code blocks (e.g., ```bash,
```json) to resolve MD040; check the repeated regions referenced in the comment
and apply the same fixes to headings and fences throughout the document.
- Around line 63-66: The SKILL.md environment table is inconsistent with
skills/odl-pdf/scripts/detect-env.sh and can break Phase 1 parsing; update the
table entries for the environment fields to match detect-env.sh semantics: use
"none" (not "missing") for absent interpreters/runtimes for PYTHON and NODE, and
specify that `JAVA` contains only the major Java version (e.g., `17`) or `none`
rather than a full version string; explicitly mention the `OS` field values
remain (linux, macos, windows) and reference detect-env.sh so future edits stay
in sync.
- Around line 605-608: The multiline Bash example for the opendataloader-pdf
command is broken because the line continuation backslash is followed by spaces
and an inline comment after --hybrid-mode full, so the next line runs as a
separate command; fix by ensuring the backslash is the last character on the
continuation line (remove trailing spaces and move the comment to its own line
or above the command) for the opendataloader-pdf invocation and its flags
(--hybrid, --hybrid-mode, --format) so the entire command executes as one shell
statement.
- Around line 272-284: The example in SKILL.md calls await at top-level while
using CommonJS require(), which will throw; wrap the async work in an async
function (e.g., async function run() { ... } then run()) or use an async IIFE so
that PdfConverter construction and the await on
converter.convert(['file1.pdf','file2.pdf']) run inside an async context; update
the snippet around the PdfConverter instantiation and converter.convert call
accordingly so top-level await is removed but functionality is unchanged.

In `@skills/README.md`:
- Line 7: The README has plain triple-backtick fenced code blocks for directory
tree snippets; update those fences to include a language specifier (e.g., change
opening ``` to ```text) for each snippet in skills/README.md (including the
snippet around the top-level tree and the snippet at the my-skill example around
line 52) so the blocks become ```text ... ``` and render with the correct
formatting.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 150ff9d6-34f6-4f68-8a2a-72cc403814dd

📥 Commits

Reviewing files that changed from the base of the PR and between e134531 and 665a28d.

📒 Files selected for processing (18)

.claude-plugin/marketplace.json
.github/workflows/skill-drift-check.yml
.gitignore
CLAUDE.md
CONTRIBUTING.md
README.md
skills/README.md
skills/odl-pdf/SKILL.md
skills/odl-pdf/evals/evals.json
skills/odl-pdf/references/eval-metrics.md
skills/odl-pdf/references/format-guide.md
skills/odl-pdf/references/hybrid-guide.md
skills/odl-pdf/references/installation-matrix.md
skills/odl-pdf/references/options-matrix.md
skills/odl-pdf/scripts/detect-env.sh
skills/odl-pdf/scripts/hybrid-health.sh
skills/odl-pdf/scripts/quick-eval.py
skills/odl-pdf/scripts/sync-skill-refs.py

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (10)

skills/odl-pdf/scripts/hybrid-health.sh (2)

15-17: ⚠️ Potential issue | 🟠 Major

Guard --url against missing values.

With set -u enabled, HYBRID_URL="$2" will cause an unbound variable error if --url is passed without a value.

🛠️ Proposed fix

     --url)
+      if [[ $# -lt 2 || "$2" == --* ]]; then
+        echo "Missing value for --url" >&2
+        echo "Usage: $0 [--url <url>]" >&2
+        exit 1
+      fi
       HYBRID_URL="$2"
       shift 2
       ;;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/hybrid-health.sh` around lines 15 - 17, The --url)
case assigns HYBRID_URL="$2" without checking that a value exists; with set -u
this will raise an unbound variable error when --url is passed with no argument.
Update the --url) branch in the option-parsing switch to validate the next
positional ($2) before assigning to HYBRID_URL: if the next token is missing or
looks like another option (starts with '-'), print a clear error and exit
nonzero; otherwise assign HYBRID_URL="$2" and shift 2. Ensure you reference the
--url) case and the HYBRID_URL variable so reviewers can find the change.

55-55: ⚠️ Potential issue | 🟡 Minor

Avoid hardcoded port in the stopped-server hint.

Line 55 always suggests --port 5002, which is misleading when a custom --url was provided. Use the resolved URL instead.

🛠️ Proposed fix

-  echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --port 5002"
+  echo "Hybrid server is not running at ${HYBRID_URL}. Start it with: opendataloader-pdf-hybrid [--port <port>]"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/hybrid-health.sh` at line 55, The echo currently
hardcodes "--port 5002" in the stopped-server hint; replace that literal with
the script's resolved hybrid server URL variable (the same variable used earlier
to build/check the URL, e.g. HYBRID_URL or RESOLVED_URL) so the message prints
the actual URL/port the script used when a custom --url was provided; update the
echo in the block that prints "Hybrid server is not running..." to reference
that variable (e.g. echo "Hybrid server is not running. Start it with:
opendataloader-pdf-hybrid --url $HYBRID_URL").

skills/odl-pdf/scripts/detect-env.sh (1)

18-18: ⚠️ Potential issue | 🟠 Major

Do not classify unknown OS as Linux.

Defaulting unknown platforms to linux can trigger incorrect install/runtime guidance. Emit unknown and let the caller handle it explicitly.
🛠️ Proposed fix
-    *)                echo "linux"   ;;  # best-effort fallback
+    *)                echo "unknown" ;;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/detect-env.sh` at line 18, The script detect-env.sh
currently maps the case default pattern "* )" to echo "linux" which
misclassifies unknown platforms; change the default branch in the case statement
(the "* )" entry) to emit "unknown" instead of "linux" and update the comment
accordingly so callers can handle unknown OS explicitly.

skills/odl-pdf/scripts/quick-eval.py (1)

143-157: ⚠️ Potential issue | 🟡 Minor

Skip unified-diff file headers when building snippet hunks.

Current logic can emit a first "snippet" containing only ---/+++ headers (before the first @@), which wastes snippet slots and may confuse output.

🛠️ Proposed fix

     # Collect individual hunks (separated by @@ markers)
     snippets: list[str] = []
     current_hunk: list[str] = []
+    in_hunk = False
     for line in diff:
-        if line.startswith("@@") and current_hunk:
-            snippets.append("\n".join(current_hunk))
-            current_hunk = [line]
-            if len(snippets) >= max_snippets:
-                break
-        else:
-            current_hunk.append(line)
+        if line.startswith("@@"):
+            if in_hunk and current_hunk:
+                snippets.append("\n".join(current_hunk))
+                if len(snippets) >= max_snippets:
+                    break
+            current_hunk = [line]
+            in_hunk = True
+        elif in_hunk:
+            current_hunk.append(line)
     if current_hunk and len(snippets) < max_snippets:
         snippets.append("\n".join(current_hunk))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/quick-eval.py` around lines 143 - 157, The snippet
builder can produce a first hunk that contains only unified-diff file headers
(lines like "diff ", "index ", "--- ", "+++ ") before the first "@@" which
wastes slots; update the loop that fills snippets/current_hunk so it ignores
header lines until the first "@@" is seen (e.g., only begin a new current_hunk
when line.startswith("@@") or append to current_hunk only after an "@@" has been
encountered), and ensure the final append only adds hunks that contain an "@@"
(or at least non-header content) and still respects max_snippets; refer to
variables/arrays snippets, current_hunk, diff and max_snippets to locate the
change.

skills/odl-pdf/references/eval-metrics.md (1)

182-186: ⚠️ Potential issue | 🟠 Major

quick-eval.py command and behavior are still documented incorrectly.

Line 183 omits required positional arguments, and Line 186 describes subset/per-metric benchmarking that this script does not perform.
Proposed fix
 ```bash
-python skills/odl-pdf/scripts/quick-eval.py
+python skills/odl-pdf/scripts/quick-eval.py extracted.md ground-truth.md
-This script runs a subset evaluation suitable for rapid iteration. It processes a small representative sample and reports per-metric scores without requiring the full benchmark corpus.
+This script compares one extracted file against one ground-truth file and reports a single similarity score (plus optional diff snippets with --verbose).
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @skills/odl-pdf/references/eval-metrics.md around lines 182 - 186, Update the
docs for skills/odl-pdf/scripts/quick-eval.py to show the required positional
arguments and correct behavior: change the example invocation to require two
files (e.g., "python skills/odl-pdf/scripts/quick-eval.py extracted.md
ground-truth.md") and replace the description text to state that quick-eval.py
compares one extracted file against one ground-truth file and reports a single
similarity score, noting that optional verbose output (e.g., --verbose) can
include diff snippets; reference the script name quick-eval.py and the sample
filenames extracted.md and ground-truth.md so the change is unambiguous.
</details>

</blockquote></details>
<details>
<summary>skills/odl-pdf/references/installation-matrix.md (2)</summary><blockquote>

`7-7`: _⚠️ Potential issue_ | _🟡 Minor_

**Add language specifiers to fenced code blocks.**

Line 7, Line 33, and Line 92 should include fence languages (`text`/`bash`) to satisfy markdown lint and improve rendering.

 
<details>
<summary>Proposed fix</summary>

```diff
-```
+```text
 Do you have Python available?
 ...
-```
+```

-```
+```bash
 java -version
-```
+```

-```
+```bash
 opendataloader-pdf --version
-```
+```
Also applies to: 33-33, 92-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/references/installation-matrix.md` at line 7, Add language
specifiers to the three fenced code blocks that currently lack them: the block
containing "Do you have Python available?" should be ```text, the block with the
`java -version` command should be ```bash, and the block with
`opendataloader-pdf --version` should be ```bash; update each opening fence to
include the appropriate language to satisfy markdown linting and improve
rendering.
80-83: ⚠️ Potential issue | 🟠 Major

Version compatibility table appears outdated versus package constraints.

Line 80-Line 83 likely understate minimum versions (Python 3.8+, Node.js 16+), which can lead users to installation failures.
#!/bin/bash
# Verify documented runtime requirements against package manifests.
set -euo pipefail

echo "Searching declared runtime constraints..."
rg -n -- 'requires-python|python_requires|\"node\"\\s*:' pyproject.toml setup.cfg setup.py package.json || true

echo ""
echo "Current installation-matrix table rows:"
rg -n -- '^\| pip |^\| pip \[hybrid\] |^\| pip langchain |^\| npm ' skills/odl-pdf/references/installation-matrix.md
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/references/installation-matrix.md` around lines 80 - 83, The
installation matrix rows for "| pip | Python 3.8+ | Yes |", "| pip [hybrid] |
Python 3.8+ | Yes |", "| pip langchain | Python 3.8+, LangChain 0.1+ | Yes |",
and "| npm | Node.js 16+ | Yes |" are likely out of sync with package manifests;
run the provided verification (search python_requires/requires-python in
pyproject.toml/setup.cfg/setup.py and "engines"/node constraint in
package.json), determine the true minimum Python and Node versions, then update
these exact table rows in skills/odl-pdf/references/installation-matrix.md to
reflect the manifest values (and adjust the "LangChain" version text if package
metadata dictates), ensuring the documented constraints exactly match the
package manifests.

.github/workflows/skill-drift-check.yml (3)

27-33: ⚠️ Potential issue | 🟠 Major

Fix non-idiomatic exit handling and invalid --fix remediation text.

Line 29 uses indirect $? checking, and Line 32 instructs users to run --fix, which is not supported by sync-skill-refs.py.

#!/bin/bash
# Verify whether --fix is supported and where it's referenced.
set -euo pipefail

echo "Checking workflow references..."
rg -n -- '--fix|sync-skill-refs.py' .github/workflows/skill-drift-check.yml

echo ""
echo "Checking script CLI surface..."
rg -n -- '--fix|argparse|add_argument|parse_args' skills/odl-pdf/scripts/sync-skill-refs.py

Proposed fix

       - name: Check skill drift
         run: |
-          python skills/odl-pdf/scripts/sync-skill-refs.py
-          if [ $? -ne 0 ]; then
+          if ! python skills/odl-pdf/scripts/sync-skill-refs.py; then
             echo ""
             echo "Drift detected: skill references are out of sync with options.json."
-            echo "Run 'python skills/odl-pdf/scripts/sync-skill-refs.py --fix' locally to update them."
+            echo "Update skills/odl-pdf/references/options-matrix.md to match options.json, then re-run the check locally."
             exit 1
           fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 27 - 33, Replace the
non-idiomatic "$?" check with a direct command test (use a negated command form:
"if ! python ...; then") so the script exit is checked immediately and reliably,
and update the remediation message to not reference the unsupported "--fix"
flag; instead instruct users to run the same script (sync-skill-refs.py) locally
to view/fix differences or point to the correct supported flag if the script
provides one. Target the workflow step invoking sync-skill-refs.py in
skill-drift-check.yml and the script name sync-skill-refs.py when making these
changes.

5-7: ⚠️ Potential issue | 🟠 Major

Add explicit least-privilege workflow permissions.

There is no top-level permissions block, so GITHUB_TOKEN defaults are broader than needed for this read-only check.

Proposed fix

 name: Skill Drift Check
+
+permissions:
+  contents: read
 
 on:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 5 - 7, Add an explicit
top-level permissions block to the "Skill Drift Check" workflow so GITHUB_TOKEN
is scoped to least privilege; update the workflow YAML to include a permissions
section (e.g., permissions: contents: read, checks: read) instead of relying on
defaults so the job only has the read rights it needs.

9-13: ⚠️ Potential issue | 🟠 Major

paths filter is too narrow to catch all drift cases.

Line 10 and Line 13 only watch options.json; drift can also be introduced by edits to skills/odl-pdf/references/options-matrix.md or the checker script itself.

Proposed fix

 on:
   push:
     paths:
       - 'options.json'
+      - 'skills/odl-pdf/references/options-matrix.md'
+      - 'skills/odl-pdf/scripts/sync-skill-refs.py'
   pull_request:
     paths:
       - 'options.json'
+      - 'skills/odl-pdf/references/options-matrix.md'
+      - 'skills/odl-pdf/scripts/sync-skill-refs.py'
   workflow_dispatch:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 9 - 13, The workflow's
paths filter currently only lists 'options.json' under the paths keys and misses
other sources of drift; update the paths arrays (the paths key entries) to
include the options-matrix markdown (options-matrix.md) and the drift checker
script plus broader globs (e.g., skills/** and scripts/** or a specific checker
script name) so changes to those files trigger the job; ensure both the
top-level push and pull_request paths arrays are updated consistently.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/odl-pdf/references/format-guide.md`:
- Around line 36-39: The docs for the image-output option are inconsistent:
replace the example value `image-output=dir` with the authoritative value
`image-output=external` wherever it appears (notably in the line describing
`image-output` and in any examples referencing `image-output`), and ensure the
description of `image-dir` clearly states it applies when
`image-output=external`; update references near `image-output`, `image-dir`, and
related `*-page-separator` notes to match the valid values `off`, `embedded`,
`external` documented in options-matrix.

In `@skills/odl-pdf/references/hybrid-guide.md`:
- Line 84: The table entry for `--hybrid-timeout` is inconsistent: replace the
default marker "—" with `0` so it matches the documented default in
`options-matrix.md` (Line 59) and ensures the hybrid-guide.md `|
--hybrid-timeout <ms> | Integer | 0 | Request timeout in milliseconds... |`
reflects that `0` disables the timeout.

In `@skills/odl-pdf/references/installation-matrix.md`:
- Around line 68-74: Update the Maven dependency snippet in
installation-matrix.md that currently shows groupId "io.opendataloader" and
artifactId "opendataloader-pdf" so it matches the canonical published
coordinates used elsewhere: change groupId to "org.opendataloader" and
artifactId to "opendataloader-pdf-core" (leave the <version> tag for you to pin
to a specific release rather than "LATEST"); edit the dependency block
referenced by the groupId/artifactId lines to reflect these exact identifiers so
it aligns with the README and Maven Central badge.

In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 37: The pipelines that extract versions (assignments to ver, major, and
version variables such as the ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' |
tr -d '"' | head -1)" line and the other ver/major/version extractions) can
return non-zero when grep finds no match and cause the script to exit under set
-euo pipefail; fix by appending "|| true" to those grep-containing pipelines so
they never fail the pipeline (apply to the shown ver assignment and the other
locations referenced: major=..., ver=... at line 70, and the version=... lines
at ~114 and ~122), leaving the fallback logic intact.
- Around line 42-50: The script currently returns only the major version
(variable ver -> echoing BASH_REMATCH[1] or extracting major) which conflicts
with SKILL.md expecting the full version string; update the logic in the
detect-env.sh snippet so both branches return the full semantic version (e.g.,
17.0.9) instead of just the major: for the legacy 1.x match stop echoing
BASH_REMATCH[1] and echo the original ver (or a normalized full version), and in
the else branch replace the major-only extraction (variable major with grep -oE
'^[0-9]+') with a pattern that captures the full dotted version prefix (e.g.,
grep -oE '^[0-9]+(\.[0-9]+)*') so the function returns the complete version
string.

---

Duplicate comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 27-33: Replace the non-idiomatic "$?" check with a direct command
test (use a negated command form: "if ! python ...; then") so the script exit is
checked immediately and reliably, and update the remediation message to not
reference the unsupported "--fix" flag; instead instruct users to run the same
script (sync-skill-refs.py) locally to view/fix differences or point to the
correct supported flag if the script provides one. Target the workflow step
invoking sync-skill-refs.py in skill-drift-check.yml and the script name
sync-skill-refs.py when making these changes.
- Around line 5-7: Add an explicit top-level permissions block to the "Skill
Drift Check" workflow so GITHUB_TOKEN is scoped to least privilege; update the
workflow YAML to include a permissions section (e.g., permissions: contents:
read, checks: read) instead of relying on defaults so the job only has the read
rights it needs.
- Around line 9-13: The workflow's paths filter currently only lists
'options.json' under the paths keys and misses other sources of drift; update
the paths arrays (the paths key entries) to include the options-matrix markdown
(options-matrix.md) and the drift checker script plus broader globs (e.g.,
skills/** and scripts/** or a specific checker script name) so changes to those
files trigger the job; ensure both the top-level push and pull_request paths
arrays are updated consistently.

In `@skills/odl-pdf/references/eval-metrics.md`:
- Around line 182-186: Update the docs for skills/odl-pdf/scripts/quick-eval.py
to show the required positional arguments and correct behavior: change the
example invocation to require two files (e.g., "python
skills/odl-pdf/scripts/quick-eval.py extracted.md ground-truth.md") and replace
the description text to state that quick-eval.py compares one extracted file
against one ground-truth file and reports a single similarity score, noting that
optional verbose output (e.g., --verbose) can include diff snippets; reference
the script name quick-eval.py and the sample filenames extracted.md and
ground-truth.md so the change is unambiguous.

In `@skills/odl-pdf/references/installation-matrix.md`:
- Line 7: Add language specifiers to the three fenced code blocks that currently
lack them: the block containing "Do you have Python available?" should be
```text, the block with the `java -version` command should be ```bash, and the
block with `opendataloader-pdf --version` should be ```bash; update each opening
fence to include the appropriate language to satisfy markdown linting and
improve rendering.
- Around line 80-83: The installation matrix rows for "| pip | Python 3.8+ | Yes
|", "| pip [hybrid] | Python 3.8+ | Yes |", "| pip langchain | Python 3.8+,
LangChain 0.1+ | Yes |", and "| npm | Node.js 16+ | Yes |" are likely out of
sync with package manifests; run the provided verification (search
python_requires/requires-python in pyproject.toml/setup.cfg/setup.py and
"engines"/node constraint in package.json), determine the true minimum Python
and Node versions, then update these exact table rows in
skills/odl-pdf/references/installation-matrix.md to reflect the manifest values
(and adjust the "LangChain" version text if package metadata dictates), ensuring
the documented constraints exactly match the package manifests.

In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 18: The script detect-env.sh currently maps the case default pattern "*
)" to echo "linux" which misclassifies unknown platforms; change the default
branch in the case statement (the "* )" entry) to emit "unknown" instead of
"linux" and update the comment accordingly so callers can handle unknown OS
explicitly.

In `@skills/odl-pdf/scripts/hybrid-health.sh`:
- Around line 15-17: The --url) case assigns HYBRID_URL="$2" without checking
that a value exists; with set -u this will raise an unbound variable error when
--url is passed with no argument. Update the --url) branch in the option-parsing
switch to validate the next positional ($2) before assigning to HYBRID_URL: if
the next token is missing or looks like another option (starts with '-'), print
a clear error and exit nonzero; otherwise assign HYBRID_URL="$2" and shift 2.
Ensure you reference the --url) case and the HYBRID_URL variable so reviewers
can find the change.
- Line 55: The echo currently hardcodes "--port 5002" in the stopped-server
hint; replace that literal with the script's resolved hybrid server URL variable
(the same variable used earlier to build/check the URL, e.g. HYBRID_URL or
RESOLVED_URL) so the message prints the actual URL/port the script used when a
custom --url was provided; update the echo in the block that prints "Hybrid
server is not running..." to reference that variable (e.g. echo "Hybrid server
is not running. Start it with: opendataloader-pdf-hybrid --url $HYBRID_URL").

In `@skills/odl-pdf/scripts/quick-eval.py`:
- Around line 143-157: The snippet builder can produce a first hunk that
contains only unified-diff file headers (lines like "diff ", "index ", "--- ",
"+++ ") before the first "@@" which wastes slots; update the loop that fills
snippets/current_hunk so it ignores header lines until the first "@@" is seen
(e.g., only begin a new current_hunk when line.startswith("@@") or append to
current_hunk only after an "@@" has been encountered), and ensure the final
append only adds hunks that contain an "@@" (or at least non-header content) and
still respects max_snippets; refer to variables/arrays snippets, current_hunk,
diff and max_snippets to locate the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fdd6b05b-ad83-4507-8990-82c61279d8d7

📥 Commits

Reviewing files that changed from the base of the PR and between 665a28d and 14cf497.

📒 Files selected for processing (18)

.claude-plugin/marketplace.json
.github/workflows/skill-drift-check.yml
.gitignore
CLAUDE.md
CONTRIBUTING.md
README.md
skills/README.md
skills/odl-pdf/SKILL.md
skills/odl-pdf/evals/evals.json
skills/odl-pdf/references/eval-metrics.md
skills/odl-pdf/references/format-guide.md
skills/odl-pdf/references/hybrid-guide.md
skills/odl-pdf/references/installation-matrix.md
skills/odl-pdf/references/options-matrix.md
skills/odl-pdf/scripts/detect-env.sh
skills/odl-pdf/scripts/hybrid-health.sh
skills/odl-pdf/scripts/quick-eval.py
skills/odl-pdf/scripts/sync-skill-refs.py

Objective: Code review identified 6 issues across CI workflow, shell scripts, SKILL.md references, and .gitignore. Approach: Accept 6 items, reject 4 items with technical reasoning. - CI: remove nonexistent --fix flag from error message, fix unreachable exit-code logic with set +e - Shell: add executable bit (100755) to detect-env.sh, hybrid-health.sh - SKILL.md: add eval-metrics.md to reference table, fix quick-eval.py usage from --input/--reference flags to positional args - .gitignore: remove duplicate __pycache__ entry (already covered by **/__pycache__/ on line 53) Evidence: git ls-tree confirms 100755 for .sh files. grep confirms no remaining --fix references in CI. SKILL.md quick-eval.py example matches argparse positional interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (11)

skills/odl-pdf/SKILL.md (4)

272-284: ⚠️ Potential issue | 🟠 Major

Node.js example uses invalid top-level await in CommonJS.

The snippet uses require() plus top-level await, which will throw in CommonJS execution.

Proposed patch

 const { PdfConverter } = require('@opendataloader/pdf');
@@
-// Batch all files in one call
-const results = await converter.convert(['file1.pdf', 'file2.pdf']);
-results.forEach(r => console.log(r.markdown));
+async function main() {
+  // Batch all files in one call
+  const results = await converter.convert(['file1.pdf', 'file2.pdf']);
+  results.forEach((r) => console.log(r.markdown));
+}
+
+main().catch(console.error);

#!/bin/bash
# Verify CommonJS top-level await failure mode.
set -euo pipefail
cat >/tmp/odl-skill-commonjs-check.cjs <<'EOF'
const { PdfConverter } = require('@opendataloader/pdf');
const results = await Promise.resolve([]);
console.log(results);
EOF
node --check /tmp/odl-skill-commonjs-check.cjs || true

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` around lines 272 - 284, The snippet mixes CommonJS
require() with top-level await which fails in CommonJS; update the example using
either an async IIFE (e.g., wrap the PdfConverter creation and await
converter.convert(...) inside an immediately-invoked async function) or convert
the snippet to ESM (use import and keep top-level await) so that the call to
PdfConverter.convert can await correctly; reference PdfConverter and
converter.convert to locate and modify the example accordingly.

603-606: ⚠️ Potential issue | 🟠 Major

Fix broken Bash line continuation in enrichment command.

The backslash on Line 605 is followed by spaces/comment text, so the command continuation is broken.

Proposed patch

 opendataloader-pdf input.pdf \
   --hybrid docling-fast \
-  --hybrid-mode full \        # required for enrichments
+  --hybrid-mode full \
+  # required for enrichments
   --format markdown

Based on learnings, enrichment flags must be paired with --hybrid-mode full, so this command must remain copy-paste safe.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` around lines 603 - 606, The Bash line continuation
for the enrichment example using opendataloader-pdf is broken because the
backslash after "--hybrid-mode full" is followed by spaces and a comment; update
the SKILL.md example so the continuation is copy-paste safe by removing the
trailing spaces/comment after the backslash (or place the comment on its own
line) so the command that includes "--hybrid docling-fast", "--hybrid-mode full"
and "--format markdown" continues correctly; ensure the flags remain paired with
"--hybrid-mode full" and the example is executable.

64-66: ⚠️ Potential issue | 🟠 Major

Align environment field semantics with detect-env.sh.

These rows currently document missing and full Java version, but skills/odl-pdf/scripts/detect-env.sh emits none and Java major version. This mismatch can break Phase 1 parsing behavior.

Proposed patch

-| `JAVA` | Java version detected (e.g., `17.0.9`) or `missing` |
-| `PYTHON` | Python version or `missing` |
-| `NODE` | Node.js version or `missing` |
+| `JAVA` | Java major version detected (e.g., `17`) or `none` |
+| `PYTHON` | Python version or `none` |
+| `NODE` | Node.js version or `none` |

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` around lines 64 - 66, The SKILL.md table rows for
environment detection are out of sync with scripts/detect-env.sh; update the
`JAVA`, `PYTHON`, and `NODE` row descriptions in skills/odl-pdf/SKILL.md to
match detect-env.sh's output semantics (use `none` instead of `missing`, and for
`JAVA` document the reported Java major version like `17` rather than a full
runtime string), ensuring the wording matches the exact tokens emitted by
scripts/detect-env.sh so Phase 1 parsing remains compatible.

556-557: ⚠️ Potential issue | 🟠 Major

jq cannot parse Markdown output in this example.

This command emits Markdown and pipes it to jq, which expects JSON.

Proposed patch

-opendataloader-pdf input.pdf --format markdown --to-stdout | jq .
+opendataloader-pdf input.pdf --format json --to-stdout | jq .

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` around lines 556 - 557, The example pipes Markdown
into jq which fails because jq expects JSON; update the example command string
"opendataloader-pdf input.pdf --format markdown --to-stdout | jq ." to either
emit JSON (change --format markdown to --format json) when piping to jq or
remove the " | jq ." pipe and leave the Markdown output; ensure the example
shows a matching output processor for the chosen --format.

skills/odl-pdf/scripts/detect-env.sh (2)

18-18: ⚠️ Potential issue | 🟠 Major

Return unknown for unmapped OS instead of forcing linux.

Line 18 currently misclassifies unknown platforms as Linux, which can produce incorrect setup guidance.

Proposed patch

-    *)                echo "linux"   ;;  # best-effort fallback
+    *)                echo "unknown" ;;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/detect-env.sh` at line 18, The default case in the OS
detection case statement currently returns "linux" (pattern '*)'), which
misclassifies unknown platforms; update the default branch (the '*)' pattern in
detect-env.sh) to echo "unknown" instead of "linux" and update the inline
comment to reflect it's an unknown-platform fallback so callers get a clear
"unknown" result rather than being forced to Linux.

37-37: ⚠️ Potential issue | 🔴 Critical

Guard grep-based pipelines under strict mode to avoid premature exits.

With set -euo pipefail, these grep pipelines can terminate the script when no match exists, bypassing your none fallback logic.

Proposed patch

-  ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1)"
+  ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1 || true)"
...
-    major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+')"
+    major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+' || true)"
...
-  ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?'| head -1)"
+  ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -1 || true)"
...
-    version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)"
+    version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1 || true)"
...
-      version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)"
+      version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1 || true)"

#!/bin/bash
# Verifies strict-mode pipeline failure behavior (no repo modification).
set -euo pipefail
bash -lc '
  set -euo pipefail
  raw="no-version-token"
  # This exits non-zero without "|| true"
  ver="$(printf "%s" "$raw" | grep -oE "\"[^\"]+\"" | tr -d "\"" | head -1)"
  echo "ver=$ver"
' || echo "Confirmed: pipeline aborts under strict mode when grep has no match."

Also applies to: 48-48, 70-70, 114-114, 122-122

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/detect-env.sh` at line 37, The grep-based pipeline
that sets ver (ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' |
head -1)") will cause the script to exit under set -euo pipefail when grep finds
no match; update the pipeline in the ver assignment (and the other similar
pipelines at the mentioned locations) to guard against non-zero grep exits by
appending a safe fallback (for example append "|| true" or use "grep ... || :"
before head or capture via a conditional) so the variable falls back to your
existing "none" logic instead of aborting the script; locate the ver assignment
and analogous pipelines (the grep -oE '"[^"]+"' | tr -d '"' | head -1 sequences)
and make them resilient to no-match failures.

skills/odl-pdf/scripts/hybrid-health.sh (2)

15-17: ⚠️ Potential issue | 🟠 Major

Guard --url for missing values before reading $2.

With set -u, passing --url without a value can terminate the script with an unbound variable.

Proposed patch

     --url)
+      if [[ $# -lt 2 || "${2-}" == --* || -z "${2-}" ]]; then
+        echo "Missing value for --url" >&2
+        echo "Usage: $0 [--url <url>]" >&2
+        exit 1
+      fi
       HYBRID_URL="$2"
       shift 2
       ;;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/hybrid-health.sh` around lines 15 - 17, The --url case
reads $2 directly which under set -u will fail if the option is provided without
a value; update the --url branch in hybrid-health.sh to first validate that a
next argument exists and is not another option (e.g., check that ${2-} is
non-empty and does not start with '-') before assigning HYBRID_URL="$2",
otherwise print a helpful error and exit; keep the existing shift 2 behavior
only after the validation succeeds so HYBRID_URL and shifts are safe.

55-55: ⚠️ Potential issue | 🟡 Minor

Avoid hardcoded --port 5002 in the stopped-server hint.

This message is inaccurate when users pass a custom --url.

Proposed patch

-  echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --port 5002"
+  echo "Hybrid server is not running at ${HYBRID_URL}. Start it with: opendataloader-pdf-hybrid [--port <port>] or use --url ${HYBRID_URL}"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/hybrid-health.sh` at line 55, The hint message in
hybrid-health.sh currently hardcodes "--port 5002" which is wrong when a custom
--url is used; update the echo to construct the hint dynamically from the
script's configured variables (use the variable that holds the configured
URL/port — e.g., HYBRID_URL or HYBRID_PORT or URL used by the health check
logic) and fall back to the default (5002) so the printed start command reflects
the actual configured endpoint used by the script.

.github/workflows/skill-drift-check.yml (3)

5-7: ⚠️ Potential issue | 🟠 Major

Add explicit least-privilege workflow permissions.

This workflow currently relies on default GITHUB_TOKEN permissions. Pin it to read-only for this check.

Proposed patch

 name: Skill Drift Check
+
+permissions:
+  contents: read
 
 on:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 5 - 7, The workflow
"Skill Drift Check" is using default GITHUB_TOKEN permissions; restrict it to
least privilege by adding an explicit top-level permissions block that sets
GITHUB_TOKEN to read-only (e.g., permissions: contents: read) so the workflow
only has the minimum required access; update the workflow header near the "name:
Skill Drift Check" entry and ensure the block includes the specific
permission(s) your job needs (reference GITHUB_TOKEN) rather than relying on
defaults.

9-13: ⚠️ Potential issue | 🟠 Major

Expand path triggers so drift checks run when skill docs/scripts change.

Only watching options.json misses drift introduced by editing skills/odl-pdf/references/options-matrix.md or skills/odl-pdf/scripts/sync-skill-refs.py.

Proposed patch

 on:
   push:
     paths:
       - 'options.json'
+      - 'skills/odl-pdf/references/options-matrix.md'
+      - 'skills/odl-pdf/scripts/sync-skill-refs.py'
   pull_request:
     paths:
       - 'options.json'
+      - 'skills/odl-pdf/references/options-matrix.md'
+      - 'skills/odl-pdf/scripts/sync-skill-refs.py'

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 9 - 13, The workflow's
trigger only watches 'options.json' under the 'paths' arrays for push and the
'pull_request' job, so edits to skill docs or scripts (e.g., references markdown
or sync scripts) won't run drift checks; update the 'paths' entries in
.github/workflows/skill-drift-check.yml to include glob patterns that cover
skill documentation and scripts in the skills tree (for example add patterns for
skills/**/*.md and skills/**/scripts/** or similar) alongside the existing
'options.json' so pushes and PRs that change skill docs/scripts will trigger the
drift check.

31-35: ⚠️ Potential issue | 🟡 Minor

Differentiate drift (exit 1) from execution/input errors (exit 2).

sync-skill-refs.py uses distinct exit codes; treating all non-zero as drift makes failures harder to triage.

Proposed patch

           python skills/odl-pdf/scripts/sync-skill-refs.py
           EXIT_CODE=$?
-          if [ $EXIT_CODE -ne 0 ]; then
+          if [ "$EXIT_CODE" -eq 1 ]; then
             echo ""
             echo "Drift detected: skill references are out of sync with options.json."
             echo "Update skills/odl-pdf/references/options-matrix.md to match options.json."
             exit 1
+          elif [ "$EXIT_CODE" -ne 0 ]; then
+            echo ""
+            echo "Skill drift check failed to run (exit code: $EXIT_CODE)."
+            exit "$EXIT_CODE"
           fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 31 - 35, The workflow
currently treats any non-zero EXIT_CODE from sync-skill-refs.py as drift
(exiting 1); change the branch that handles non-zero EXIT_CODE so it preserves
the script's semantics: if EXIT_CODE equals 2, re-exit with code 2 (and an
appropriate message about execution/input error), otherwise exit 1 for actual
drift; reference the EXIT_CODE variable and the sync-skill-refs.py invocation
and update the echo/exit behavior in that conditional block.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/odl-pdf/scripts/detect-env.sh`:
- Around line 140-157: The detect_hybrid_extras function currently probes for a
non-existent module docling_serve which makes HYBRID_EXTRAS incorrectly false;
update detect_hybrid_extras to test importing the actual required packages
(docling, fastapi, uvicorn) instead of docling_serve, i.e., run a Python import
check that attempts "import docling, fastapi, uvicorn" (or import them
individually) and set HYBRID_EXTRAS=true only if those imports succeed, while
preserving the existing python/pycmd detection and error suppression behavior.

---

Duplicate comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 5-7: The workflow "Skill Drift Check" is using default
GITHUB_TOKEN permissions; restrict it to least privilege by adding an explicit
top-level permissions block that sets GITHUB_TOKEN to read-only (e.g.,
permissions: contents: read) so the workflow only has the minimum required
access; update the workflow header near the "name: Skill Drift Check" entry and
ensure the block includes the specific permission(s) your job needs (reference
GITHUB_TOKEN) rather than relying on defaults.
- Around line 9-13: The workflow's trigger only watches 'options.json' under the
'paths' arrays for push and the 'pull_request' job, so edits to skill docs or
scripts (e.g., references markdown or sync scripts) won't run drift checks;
update the 'paths' entries in .github/workflows/skill-drift-check.yml to include
glob patterns that cover skill documentation and scripts in the skills tree (for
example add patterns for skills/**/*.md and skills/**/scripts/** or similar)
alongside the existing 'options.json' so pushes and PRs that change skill
docs/scripts will trigger the drift check.
- Around line 31-35: The workflow currently treats any non-zero EXIT_CODE from
sync-skill-refs.py as drift (exiting 1); change the branch that handles non-zero
EXIT_CODE so it preserves the script's semantics: if EXIT_CODE equals 2, re-exit
with code 2 (and an appropriate message about execution/input error), otherwise
exit 1 for actual drift; reference the EXIT_CODE variable and the
sync-skill-refs.py invocation and update the echo/exit behavior in that
conditional block.

In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 18: The default case in the OS detection case statement currently returns
"linux" (pattern '*)'), which misclassifies unknown platforms; update the
default branch (the '*)' pattern in detect-env.sh) to echo "unknown" instead of
"linux" and update the inline comment to reflect it's an unknown-platform
fallback so callers get a clear "unknown" result rather than being forced to
Linux.
- Line 37: The grep-based pipeline that sets ver (ver="$(printf '%s' "${raw}" |
grep -oE '"[^"]+"' | tr -d '"' | head -1)") will cause the script to exit under
set -euo pipefail when grep finds no match; update the pipeline in the ver
assignment (and the other similar pipelines at the mentioned locations) to guard
against non-zero grep exits by appending a safe fallback (for example append "||
true" or use "grep ... || :" before head or capture via a conditional) so the
variable falls back to your existing "none" logic instead of aborting the
script; locate the ver assignment and analogous pipelines (the grep -oE
'"[^"]+"' | tr -d '"' | head -1 sequences) and make them resilient to no-match
failures.

In `@skills/odl-pdf/scripts/hybrid-health.sh`:
- Around line 15-17: The --url case reads $2 directly which under set -u will
fail if the option is provided without a value; update the --url branch in
hybrid-health.sh to first validate that a next argument exists and is not
another option (e.g., check that ${2-} is non-empty and does not start with '-')
before assigning HYBRID_URL="$2", otherwise print a helpful error and exit; keep
the existing shift 2 behavior only after the validation succeeds so HYBRID_URL
and shifts are safe.
- Line 55: The hint message in hybrid-health.sh currently hardcodes "--port
5002" which is wrong when a custom --url is used; update the echo to construct
the hint dynamically from the script's configured variables (use the variable
that holds the configured URL/port — e.g., HYBRID_URL or HYBRID_PORT or URL used
by the health check logic) and fall back to the default (5002) so the printed
start command reflects the actual configured endpoint used by the script.

In `@skills/odl-pdf/SKILL.md`:
- Around line 272-284: The snippet mixes CommonJS require() with top-level await
which fails in CommonJS; update the example using either an async IIFE (e.g.,
wrap the PdfConverter creation and await converter.convert(...) inside an
immediately-invoked async function) or convert the snippet to ESM (use import
and keep top-level await) so that the call to PdfConverter.convert can await
correctly; reference PdfConverter and converter.convert to locate and modify the
example accordingly.
- Around line 603-606: The Bash line continuation for the enrichment example
using opendataloader-pdf is broken because the backslash after "--hybrid-mode
full" is followed by spaces and a comment; update the SKILL.md example so the
continuation is copy-paste safe by removing the trailing spaces/comment after
the backslash (or place the comment on its own line) so the command that
includes "--hybrid docling-fast", "--hybrid-mode full" and "--format markdown"
continues correctly; ensure the flags remain paired with "--hybrid-mode full"
and the example is executable.
- Around line 64-66: The SKILL.md table rows for environment detection are out
of sync with scripts/detect-env.sh; update the `JAVA`, `PYTHON`, and `NODE` row
descriptions in skills/odl-pdf/SKILL.md to match detect-env.sh's output
semantics (use `none` instead of `missing`, and for `JAVA` document the reported
Java major version like `17` rather than a full runtime string), ensuring the
wording matches the exact tokens emitted by scripts/detect-env.sh so Phase 1
parsing remains compatible.
- Around line 556-557: The example pipes Markdown into jq which fails because jq
expects JSON; update the example command string "opendataloader-pdf input.pdf
--format markdown --to-stdout | jq ." to either emit JSON (change --format
markdown to --format json) when piping to jq or remove the " | jq ." pipe and
leave the Markdown output; ensure the example shows a matching output processor
for the chosen --format.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 33aec8d3-c2e3-4797-9349-9097189a0188

📥 Commits

Reviewing files that changed from the base of the PR and between 14cf497 and faf526c.

📒 Files selected for processing (5)

.github/workflows/skill-drift-check.yml
.gitignore
skills/odl-pdf/SKILL.md
skills/odl-pdf/scripts/detect-env.sh
skills/odl-pdf/scripts/hybrid-health.sh

💤 Files with no reviewable changes (1)

.gitignore

Objective: CodeRabbit and CodeQL identified 24 issues across skill files — factual errors, security, and edge cases. Approach: Accept 15 items, reject 9 with technical reasoning. Critical fixes: - Python/Node.js/Java API examples replaced with actual APIs (opendataloader_pdf.convert, convert from @opendataloader/pdf, OpenDataLoaderPDF.processFile) - NID renamed to Normalized Indel Distance (was Information/Inversion) - MHS renamed to Markdown Heading Similarity (was Mean) - Hybrid install corrected to pip install "opendataloader-pdf[hybrid]" (was separate pip install opendataloader-pdf-hybrid) - detect-env.sh: grep pipefail fix (|| true), unknown OS fallback, hybrid detection via import docling (was docling_serve) Other fixes: - CI: added permissions { contents: read } - installation-matrix.md: Python 3.10+, Node 20.19+, Maven coordinates - hybrid-guide.md: timeout default 0, docling-serve description - SKILL.md: jq pipe format, bash continuation, env table none/missing - hybrid-health.sh: --url guard, dynamic port hint - format-guide.md: image-output values corrected - installation-matrix.md: GitHub URL corrected Rejected (with push back): - CI trigger expansion to options-matrix.md (false positives) - markdownlint code block language tags (no markdownlint CI) - SKILL.md top-level await concern (example snippet, not executed) - contextlib.suppress suggestion (reviewer agreed acceptable) - f-string placeholder linting (no functional impact) - hybrid-health.sh exit code (key=value parsing, no caller) - CI auto-PR creation (v1.0 scope, tracked as follow-up) Evidence: opus audit verified 17 previously-flagged checks + 9 fresh scan checks all PASS. Drift check 26/26 options in sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (8)

skills/odl-pdf/scripts/hybrid-health.sh (1)
15-25: ⚠️ Potential issue | 🟡 Minor

Harden --url parsing for empty/flag-like values.

--url --foo and --url= currently pass parsing and lead to invalid health endpoints instead of a clear argument error.
Proposed fix
     --url)
-      if [[ $# -lt 2 ]]; then
+      if [[ $# -lt 2 || -z "${2-}" || "${2}" == --* ]]; then
         echo "Error: --url requires a value" >&2
         exit 1
       fi
       HYBRID_URL="$2"
       shift 2
       ;;
     --url=*)
-      HYBRID_URL="${1#--url=}"
+      HYBRID_URL="${1#--url=}"
+      if [[ -z "${HYBRID_URL}" ]]; then
+        echo "Error: --url requires a non-empty value" >&2
+        exit 1
+      fi
       shift
       ;;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/hybrid-health.sh` around lines 15 - 25, The --url
parsing must reject empty or flag-like values: when handling the separate-arg
branch (case for --url) validate that "$2" is non-empty and does not start with
'-' before assigning HYBRID_URL and shifting; when handling the --url=* branch
validate that the extracted value (from "${1#--url=}") is non-empty before
assigning HYBRID_URL and shifting; on validation failure print an error ("--url
requires a non-empty value") to stderr and exit 1 to prevent creating invalid
endpoints.
.github/workflows/skill-drift-check.yml (1)
9-13: ⚠️ Potential issue | 🟠 Major

Broaden path triggers to avoid missed drift checks.

This workflow can miss drift when only skills/odl-pdf/references/options-matrix.md or skills/odl-pdf/scripts/sync-skill-refs.py changes.
Proposed fix
 on:
   push:
     paths:
       - 'options.json'
+      - 'skills/odl-pdf/references/options-matrix.md'
+      - 'skills/odl-pdf/scripts/sync-skill-refs.py'
   pull_request:
     paths:
       - 'options.json'
+      - 'skills/odl-pdf/references/options-matrix.md'
+      - 'skills/odl-pdf/scripts/sync-skill-refs.py'
   workflow_dispatch:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/skill-drift-check.yml around lines 9 - 13, The workflow's
path filters only watch 'options.json' so changes to skill files like
'skills/odl-pdf/references/options-matrix.md' or
'skills/odl-pdf/scripts/sync-skill-refs.py' don't trigger the drift check;
update the push and pull_request path patterns (the 'paths' entries in the
workflow) to include the relevant skill directories or files (for example
broaden to 'skills/**', or add the specific paths
'skills/odl-pdf/references/options-matrix.md' and
'skills/odl-pdf/scripts/sync-skill-refs.py') so edits to those files will run
the workflow.
skills/odl-pdf/references/installation-matrix.md (1)
7-21: ⚠️ Potential issue | 🟡 Minor

Add language tags to all unlabeled fenced blocks.

The decision tree and command fences should declare a language (text/bash) to resolve MD040 warnings.
Proposed fix
-```
+```text
 Do you have Python available?
 ├── Yes
 │   ├── Do you need LangChain integration?
@@
 └── Unsure?
     └── pip install opendataloader-pdf  (simplest, works on all platforms)
@@
- +bash
java -version
@@
-```
+```bash
opendataloader-pdf --version
</details>


Also applies to: 33-35, 92-94

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @skills/odl-pdf/references/installation-matrix.md around lines 7 - 21, The
unlabeled fenced code blocks in installation-matrix.md (the decision-tree ASCII
block and the command examples like the java --version and opendataloader-pdf
--version blocks) are triggering MD040; add explicit language tags: change the
decision-tree fence to text and change command/example fences to bash, and
apply the same fixes to the other occurrences referenced (around the other
blocks at the indicated sections), ensuring every triple-backtick fence has a
language identifier.
</details>

</blockquote></details>
<details>
<summary>skills/odl-pdf/references/format-guide.md (1)</summary><blockquote>

`47-49`: _⚠️ Potential issue_ | _🟡 Minor_

**Add language identifiers to fenced CLI examples.**

Both fenced blocks should include a language tag (e.g., `bash`) to satisfy MD040 and improve readability.

 
<details>
<summary>Proposed fix</summary>

```diff
-```
+```bash
 opendataloader-pdf input.pdf --format markdown,json
 ```
@@
-```
+```bash
 opendataloader-pdf input.pdf --format text --to-stdout | my-indexer
 ```
```
</details>


Also applies to: 57-59

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @skills/odl-pdf/references/format-guide.md around lines 47 - 49, The fenced
CLI examples using the opendataloader-pdf command need language identifiers for
markdown linting (MD040); change the opening backticks for the blocks that
contain "opendataloader-pdf input.pdf --format markdown,json" and the block with
"opendataloader-pdf input.pdf --format text --to-stdout | my-indexer" so they
start with bash instead of (i.e., replace withbash for those two
fenced code blocks) and keep the contents unchanged.
</details>

</blockquote></details>
<details>
<summary>skills/odl-pdf/scripts/quick-eval.py (1)</summary><blockquote>

`146-155`: _⚠️ Potential issue_ | _🟡 Minor_

**Skip unified-diff file headers when building snippets.**

Current hunk parsing can emit a header-only snippet (`---/+++`) before the first `@@`, which wastes snippet slots.

 
<details>
<summary>Proposed fix</summary>

```diff
     snippets: list[str] = []
     current_hunk: list[str] = []
+    in_hunk = False
     for line in diff:
-        if line.startswith("@@") and current_hunk:
-            snippets.append("\n".join(current_hunk))
-            current_hunk = [line]
-            if len(snippets) >= max_snippets:
-                break
-        else:
+        if line.startswith("@@"):
+            in_hunk = True
+            if current_hunk:
+                snippets.append("\n".join(current_hunk))
+                if len(snippets) >= max_snippets:
+                    break
+            current_hunk = [line]
+            continue
+        if in_hunk:
             current_hunk.append(line)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/scripts/quick-eval.py` around lines 146 - 155, The loop that
builds hunks appends header-only snippets (lines like '---'/'+++') before the
first '@@', wasting snippet slots; modify the logic in the diff-processing loop
that iterates "for line in diff:" so that you skip or drop file-header lines
(e.g., lines starting with '---' or '+++') when current_hunk contains only
headers and you haven't hit a hunk header '@@' yet, only starting current_hunk
when you see the first '@@' or non-header content; ensure the final append block
that checks "if current_hunk and len(snippets) < max_snippets:" also ignores
header-only current_hunk so snippets only contain real hunks (references:
variables diff, current_hunk, snippets, max_snippets).
```

</details>

</blockquote></details>
<details>
<summary>skills/odl-pdf/SKILL.md (3)</summary><blockquote>

`37-37`: _⚠️ Potential issue_ | _🟡 Minor_

**Address remaining markdownlint warnings for CI cleanliness.**

Multiple fenced code blocks still lack language specifications (MD040) and surrounding blank lines (MD031). These were flagged in previous reviews but remain unresolved:

- **MD040**: Add language identifiers to fences at lines 37, 100, 133, 182, 328, 377, 392, 405, 417, 425, 433, 441 (use `text` for decision trees/pseudo-code, or appropriate language for code examples)
- **MD031**: Add blank lines before/after fenced blocks at lines 244, 253, 266, 278, 292, 377, 392, 405, 417, 425, 433, 540, 545, 550, 556, 704, 709

 


Also applies to: 100-100, 133-133, 182-182, 244-244, 253-253, 266-266, 278-278, 292-292, 328-328, 377-377, 392-392, 405-405, 417-417, 425-425, 433-433, 441-441, 540-540, 545-545, 550-550, 556-556, 704-704, 709-709

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` at line 37, The markdown has multiple fenced code
blocks missing language identifiers (MD040) and missing surrounding blank lines
(MD031); update SKILL.md by adding appropriate language tags (use "text" for
decision trees/pseudo-code or the correct language for code examples) to the
fences at the listed locations (e.g., the fences reported at lines
~37,100,133,182,328,377,392,405,417,425,433,441 and others) and ensure there is
a blank line before and after each fenced block for the reported positions
(e.g., around the blocks at
~244,253,266,278,292,377,392,405,417,425,433,540,545,550,556,704,709) so the
file passes markdownlint checks MD040 and MD031.
```

</details>

---

`13-14`: _⚠️ Potential issue_ | _🟡 Minor_

**Add blank line between metadata headers.**

Line 14 still violates MD022 (headings should be surrounded by blank lines). A blank line is needed between lines 13 and 14 to separate these two heading-level metadata comments.



<details>
<summary>📝 Suggested fix</summary>

```diff
 # Targets: opendataloader-pdf >= 2.2.0
+
 # Last synced options.json: 26 options
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` around lines 13 - 14, Add a blank line between the
two metadata headings "# Targets: opendataloader-pdf >= 2.2.0" and "# Last
synced options.json: 26 options" in SKILL.md so the second header is separated
and the file no longer violates MD022 (headings should be surrounded by blank
lines).
```

</details>

---

`63-63`: _⚠️ Potential issue_ | _🟡 Minor_

**Correct macOS capitalization.**

Apple's operating system should be capitalized as "macOS" not "macos".



<details>
<summary>✏️ Suggested fix</summary>

```diff
-| `OS` | Operating system (linux, macos, windows) |
+| `OS` | Operating system (linux, macOS, windows) |
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skills/odl-pdf/SKILL.md` at line 63, Update the OS table entry in SKILL.md to
use the correct Apple capitalization by changing the value for the `OS` column
from "macos" to "macOS" so the row reads "`OS` | Operating system (linux, macOS,
windows)"; locate the table line that contains the `OS` symbol and replace only
that token.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/skill-drift-check.yml:

Around line 31-38: The workflow treats any non-zero exit from the sync script
as "Drift detected"; change the logic around the EXIT_CODE from running
skills/odl-pdf/scripts/sync-skill-refs.py so that only the specific drift exit
code (e.g., 1) triggers the drift message and exit 1, while other non-zero codes
(e.g., parser/input failures) are treated as script errors—log a distinct error
message referencing sync-skill-refs.py and propagate the original EXIT_CODE (or
exit 2) instead; update the conditional that inspects EXIT_CODE (the block that
currently echoes "Drift detected: skill references are out of sync with
options.json." and exits 1) to branch on EXIT_CODE values accordingly.

In @skills/odl-pdf/scripts/quick-eval.py:

Around line 70-83: The function compute_similarity_rapidfuzz currently uses
Levenshtein.distance and incorrect normalization; replace that with RapidFuzz's
Indel.normalized_distance to get the correct Normalized Indel Distance (NID) and
then return 1 - NID as the similarity. Specifically, import and call
Indel.normalized_distance(extracted, ground_truth) to obtain NID (which already
normalizes by len(a)+len(b)), handle the empty-string case (return 1.0 when both
strings are empty) and then compute similarity = max(0.0, 1.0 - nid) before
returning. Ensure the function docstring reflects the Indel API usage and remove
the old max_len division logic.

Duplicate comments:
In @.github/workflows/skill-drift-check.yml:

Around line 9-13: The workflow's path filters only watch 'options.json' so
changes to skill files like 'skills/odl-pdf/references/options-matrix.md' or
'skills/odl-pdf/scripts/sync-skill-refs.py' don't trigger the drift check;
update the push and pull_request path patterns (the 'paths' entries in the
workflow) to include the relevant skill directories or files (for example
broaden to 'skills/**', or add the specific paths
'skills/odl-pdf/references/options-matrix.md' and
'skills/odl-pdf/scripts/sync-skill-refs.py') so edits to those files will run
the workflow.

In @skills/odl-pdf/references/format-guide.md:

Around line 47-49: The fenced CLI examples using the opendataloader-pdf
command need language identifiers for markdown linting (MD040); change the
opening backticks for the blocks that contain "opendataloader-pdf input.pdf
--format markdown,json" and the block with "opendataloader-pdf input.pdf
--format text --to-stdout | my-indexer" so they start with bash instead of (i.e., replace withbash for those two fenced code blocks) and keep
the contents unchanged.

In @skills/odl-pdf/references/installation-matrix.md:

Around line 7-21: The unlabeled fenced code blocks in installation-matrix.md
(the decision-tree ASCII block and the command examples like the java --version
and opendataloader-pdf --version blocks) are triggering MD040; add explicit
language tags: change the decision-tree fence to text and change command/example fences to bash, and apply the same fixes to the other
occurrences referenced (around the other blocks at the indicated sections),
ensuring every triple-backtick fence has a language identifier.

In @skills/odl-pdf/scripts/hybrid-health.sh:

Around line 15-25: The --url parsing must reject empty or flag-like values:
when handling the separate-arg branch (case for --url) validate that "$2" is
non-empty and does not start with '-' before assigning HYBRID_URL and shifting;
when handling the --url=* branch validate that the extracted value (from
"${1#--url=}") is non-empty before assigning HYBRID_URL and shifting; on
validation failure print an error ("--url requires a non-empty value") to stderr
and exit 1 to prevent creating invalid endpoints.

In @skills/odl-pdf/scripts/quick-eval.py:

Around line 146-155: The loop that builds hunks appends header-only snippets
(lines like '---'/'+++') before the first '@@', wasting snippet slots; modify
the logic in the diff-processing loop that iterates "for line in diff:" so that
you skip or drop file-header lines (e.g., lines starting with '---' or '+++')
when current_hunk contains only headers and you haven't hit a hunk header '@@'
yet, only starting current_hunk when you see the first '@@' or non-header
content; ensure the final append block that checks "if current_hunk and
len(snippets) < max_snippets:" also ignores header-only current_hunk so snippets
only contain real hunks (references: variables diff, current_hunk, snippets,
max_snippets).

In @skills/odl-pdf/SKILL.md:

Line 37: The markdown has multiple fenced code blocks missing language
identifiers (MD040) and missing surrounding blank lines (MD031); update SKILL.md
by adding appropriate language tags (use "text" for decision trees/pseudo-code
or the correct language for code examples) to the fences at the listed locations
(e.g., the fences reported at lines
~37,100,133,182,328,377,392,405,417,425,433,441 and others) and ensure there is
a blank line before and after each fenced block for the reported positions
(e.g., around the blocks at
~244,253,266,278,292,377,392,405,417,425,433,540,545,550,556,704,709) so the
file passes markdownlint checks MD040 and MD031.

Around line 13-14: Add a blank line between the two metadata headings "#
Targets: opendataloader-pdf >= 2.2.0" and "# Last synced options.json: 26
options" in SKILL.md so the second header is separated and the file no longer
violates MD022 (headings should be surrounded by blank lines).

Line 63: Update the OS table entry in SKILL.md to use the correct Apple
capitalization by changing the value for the OS column from "macos" to "macOS"
so the row reads "OS | Operating system (linux, macOS, windows)"; locate the
table line that contains the OS symbol and replace only that token.
</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ]  Push a commit to this branch (recommended)
- [ ]  Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: ASSERTIVE

**Plan**: Pro

**Run ID**: `9350321c-5ef8-4af5-9d0f-036a033709d4`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between faf526c6b096d5da927d43fd7bab14307fb4bbad and 98c6cc75e402ecc904e8ff05818fba86a8f80ada.

</details>

<details>
<summary>📒 Files selected for processing (9)</summary>

* `.github/workflows/skill-drift-check.yml`
* `skills/odl-pdf/SKILL.md`
* `skills/odl-pdf/references/eval-metrics.md`
* `skills/odl-pdf/references/format-guide.md`
* `skills/odl-pdf/references/hybrid-guide.md`
* `skills/odl-pdf/references/installation-matrix.md`
* `skills/odl-pdf/scripts/detect-env.sh`
* `skills/odl-pdf/scripts/hybrid-health.sh`
* `skills/odl-pdf/scripts/quick-eval.py`

</details>

</details>

Objective: quick-eval.py claims to compute Normalized Indel Distance but uses Levenshtein.distance (substitution weight 1, not Indel) with incorrect normalization (max(len) instead of sum(len)). Approach: Replace rapidfuzz.distance.Levenshtein with rapidfuzz.distance.Indel.normalized_distance which computes the correct NID metric and normalization directly. Evidence: Indel.normalized_distance uses distance / (len(a) + len(b)) per RapidFuzz docs, matching the NID definition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/odl-pdf/scripts/quick-eval.py`:
- Around line 157-166: The build_report function has a boolean positional
parameter verbose which can cause positional-boolean footguns; change the
signature of build_report so verbose (and any parameters after it) are
keyword-only by inserting a bare * before verbose (e.g., add "*" before the
verbose parameter) and keep its type (bool) and default if any; update any
callers if needed (the existing call already passes arguments as keywords so no
changes required). Ensure the function name build_report and its parameter list
are updated accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 76d0bb95-48bc-45c1-9cfc-2047d7662c86

📥 Commits

Reviewing files that changed from the base of the PR and between 98c6cc7 and c9d3d48.

📒 Files selected for processing (1)

skills/odl-pdf/scripts/quick-eval.py

…v.sh Objective: Hybrid extras detection only checked for docling, missing fastapi and uvicorn which are also required to run the hybrid server. Approach: Update import check to verify all three packages are present. Evidence: pyproject.toml [hybrid] extras require docling, fastapi, and uvicorn. hybrid_server.py imports all three at startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…dling Objective: CI workflow had two issues: (1) changes to sync-skill-refs.py itself could silently break drift detection without triggering a check, (2) exit code 2 (JSON parse/input errors) was conflated with exit code 1 (drift detected), producing misleading "Drift detected" messages. Approach: - Add sync-skill-refs.py to trigger paths for self-validation on change - Differentiate exit 1 (drift) from exit 2+ (input/script errors) with separate messages Evidence: sync-skill-refs.py uses sys.exit(2) for JSON parse failure, missing options key, and missing name field — all distinct from drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

julioccorderoc · 2026-04-10T10:17:03Z

Subscribed and looking to test asap

bundolee · 2026-04-15T10:15:38Z

Review: Strong execution — 4 questions before merge

The technical quality is high, and the 2-round review feedback handling was mature (21 accepted, 13 rejected with reasoning). I want to confirm 4 things before merging — they're about sharpening the skill's reason to exist, not about the code itself.

1. Baseline comparison (most important — drives the other 3)

Observation: The Evidence section proves 5/5 PASS with the skill. There is no result without it.

Question: For eval-002 (Korean scanned PDF, M1 Mac) — does an agent with only README + options.json + --help reach --force-ocr + --ocr-lang "ko,en"? What about eval-005 (the silent --hybrid-mode full requirement for enrichments)?

Ask: Run the same 5 scenarios once without the skill (agent gets README + options.json only). If 3+ fail, the skill's value is proven. If all 5 pass, the skill's scope needs redefinition.

2. Target user definition

Observation: SKILL.md has trigger keywords in the description but no definition of "whose problem, in what situation."

Question: Should 729 lines load for a senior developer who already uses opendataloader-pdf daily and just wants to convert one more file?

Ask: Add one line to the top of SKILL.md — "This skill is for ___ who need to ___." This also helps platform skill routers make better activation decisions.

3. Skill size vs. code-readable content

Observation: options-matrix.md (235 lines) largely duplicates options.json. installation-matrix.md (98 lines) largely duplicates the README install section. Agents can read options.json directly when the source repo is checked out.

Question: In a source-checkout environment, what is the justification for this content existing in the skill as well?

Ask: Try removing content the agent can derive from code (options.json, --help, README). Count the remaining lines. If the evals still pass at that size, that's the optimal size. The "dual-path for pip users" argument is valid — but quantify how much of the 2595 lines it justifies.

4. Maintenance surface area

Observation: The drift check covers options-matrix.md only. But when a new option is added, these also need manual updates:

SKILL.md options quick-reference table (line ~530)
hybrid-guide.md server options table
format-guide.md format list

Question: How many files need updating when the next CLI option ships?

Ask: Pick one:

(a) Expand the drift check to cover SKILL.md and other references, or
(b) Remove duplicate option tables from SKILL.md/hybrid-guide.md (point to options-matrix.md instead), or
(c) Explicitly list the manual-update targets in CONTRIBUTING.md so they don't go stale silently

Priority: Start with #1 (baseline comparison). The results will determine how much weight #2–#4 carry. If the skill provides a large delta over no-skill, the maintenance cost is justified. If the delta is small, slimming down becomes more important.

bundolee · 2026-04-17T13:37:17Z

Here's how the review process will go from here:

CodeRabbit — all review threads resolved with a fix commit
CI — all checks pass (codecov excluded)
Component owner approval
- Java (java/) → @MaximPlusov
- Others → @bundolee

hyunhee-jo requested review from LonelyMidoriya, MaximPlusov, bundolee and hnc-jglee as code owners April 9, 2026 02:01

github-advanced-security AI found potential problems Apr 9, 2026

View reviewed changes

Comment thread .github/workflows/skill-drift-check.yml Fixed

hyunhee-jo force-pushed the feature/agent-skill-odl-pdf branch from 665a28d to eb2159e Compare April 9, 2026 02:08

hyunhee-jo force-pushed the feature/agent-skill-odl-pdf branch from eb2159e to 14cf497 Compare April 9, 2026 02:09

coderabbitai bot reviewed Apr 9, 2026

View reviewed changes

hyunhee-jo force-pushed the feature/agent-skill-odl-pdf branch from 14cf497 to 3bb3f9c Compare April 9, 2026 04:08

hyunhee-jo force-pushed the feature/agent-skill-odl-pdf branch from 3bb3f9c to faf526c Compare April 9, 2026 04:09

coderabbitai bot reviewed Apr 9, 2026

View reviewed changes

Comment thread skills/odl-pdf/scripts/detect-env.sh Outdated

coderabbitai bot reviewed Apr 9, 2026

View reviewed changes

Comment thread .github/workflows/skill-drift-check.yml

Comment thread skills/odl-pdf/scripts/quick-eval.py

coderabbitai bot reviewed Apr 9, 2026

View reviewed changes

Comment thread skills/odl-pdf/scripts/quick-eval.py

hyunhee-jo and others added 2 commits April 9, 2026 15:42

Conversation

hyunhee-jo commented Apr 9, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Objective

Approach

New files

Updated files

Evidence

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

Uh oh!

codecov bot commented Apr 9, 2026

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

julioccorderoc commented Apr 10, 2026

Uh oh!

bundolee commented Apr 15, 2026

Review: Strong execution — 4 questions before merge

1. Baseline comparison (most important — drives the other 3)

2. Target user definition

3. Skill size vs. code-readable content

4. Maintenance surface area

Uh oh!

bundolee commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hyunhee-jo commented Apr 9, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 9, 2026 •

edited

Loading