Add: built-in agent skill for AI coding assistants (agentskills.io)#395
Add: built-in agent skill for AI coding assistants (agentskills.io)#395hyunhee-jo wants to merge 6 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds an Agent Skills subsystem for opendataloader-pdf: a Claude marketplace manifest, skill documentation and references, evaluation scenarios, utility CLI/scripts (env detection, hybrid health, quick-eval, sync-skill-refs), and a CI workflow to detect options drift; minor .gitignore and top-level docs updates. Changes
Sequence Diagram(s)sequenceDiagram
participant Dev as Developer (PR)
participant GH as GitHub Actions
participant Repo as Repository
participant Sync as sync-skill-refs.py
Dev->>GH: Push / open PR (affecting `options.json`)
GH->>Repo: checkout
GH->>Sync: run `python skills/odl-pdf/scripts/sync-skill-refs.py`
Sync->>Repo: read `options.json`
Sync->>Repo: read `skills/odl-pdf/references/options-matrix.md`
Sync-->>GH: exit 0 (match) or exit 1 + drift report
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
665a28d to
eb2159e
Compare
Objective: AI coding assistants (Claude Code, Codex, Cursor, etc.) have no way to know how to use opendataloader-pdf optimally. Users must manually figure out which of 26 options, 7 formats, and 3 install methods to use for their specific PDF type and downstream pipeline. Approach: Ship a built-in agent skill at skills/odl-pdf/ following the agentskills.io spec. The skill acts as a Document Intelligence Engineer persona that detects the user's environment, recommends optimal options via decision trees, executes conversions directly, and diagnoses quality issues with metric-driven escalation (local → cluster → hybrid → hybrid-mode full). Key design decisions: - Dual-path option reference (built-in summary + dynamic options.json) so pip-install users without source code also get full guidance - Progressive disclosure (L1 description → L2 SKILL.md → L3 references) to minimize token usage - CI drift check (skill-drift-check.yml) catches option/skill mismatch when CLI options change in Java New files: - skills/odl-pdf/ — SKILL.md, 5 references, 3 scripts, evals - skills/README.md — skill creation and maintenance guide - .claude-plugin/marketplace.json — plugin registry - .github/workflows/skill-drift-check.yml — CI drift detection Updated: README.md (Agent Skills section), CLAUDE.md (skill dev notes), CONTRIBUTING.md (skill maintenance guide) Evidence: Ran 5 eval scenarios via independent AI agents with zero prior knowledge of opendataloader-pdf. Each agent loaded only SKILL.md and responded to user queries. All 5 passed must_mention/must_not_mention checks covering RAG pipelines, Korean OCR, table diagnostics, Windows Node.js hybrid setup, and formula+chart enrichment. Drift check (sync-skill-refs.py) confirmed 26/26 options in sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eb2159e to
14cf497
Compare
There was a problem hiding this comment.
Actionable comments posted: 21
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 5-18: Add explicit least-privilege workflow permissions by adding
a top-level permissions block (before jobs) that limits GITHUB_TOKEN to only
what's required for the read-only drift check; for this workflow add something
like permissions: contents: read (so the check-drift job can read repository
files but nothing else). Update the .github/workflows/skill-drift-check.yml file
and ensure the permissions block is present at the workflow root above the
check-drift job.
- Around line 9-13: The workflow trigger is too narrow because the current
'paths' filters only include 'options.json' for both push and pull_request
events; update the GitHub Actions workflow
(.github/workflows/skill-drift-check.yml) to include the related docs file(s)
that can introduce drift (e.g., add
'skills/odl-pdf/references/options-matrix.md' or a broader pattern like
'skills/**' to the 'paths' lists) so the 'push' and 'pull_request' triggers run
when those files change; modify the existing 'paths' entries under both the
top-level event and the 'pull_request' block rather than adding a new event
block.
- Around line 27-33: The workflow message references a non-existent --fix flag
for the script sync-skill-refs.py; either update the message in the workflow to
instruct contributors to run the correct command (remove --fix or replace it
with the actual supported flag from sync-skill-refs.py) or add a --fix/--apply
option to the sync-skill-refs.py CLI (update its argument parser and handler) so
the workflow guidance is accurate; search for sync-skill-refs.py and the
workflow snippet that echoes "Run 'python
skills/odl-pdf/scripts/sync-skill-refs.py --fix' ..." and change the guidance or
implement the flag accordingly.
In @.gitignore:
- Line 79: The .gitignore contains a redundant specific entry
"skills/odl-pdf/scripts/__pycache__/" which is already covered by the broader
pattern "**/__pycache__/"; remove the specific
"skills/odl-pdf/scripts/__pycache__/" line (or keep only the global
"**/__pycache__/" entry) to eliminate duplication and keep the file DRY.
In `@skills/odl-pdf/references/eval-metrics.md`:
- Around line 182-186: Update the documentation in
skills/odl-pdf/references/eval-metrics.md to accurately reflect quick-eval.py's
CLI: state that quick-eval.py requires two positional arguments (extracted and
ground_truth) and optionally supports --verbose, and replace the incorrect
description about subset/per-metric benchmarking with a precise one-line summary
that the script compares a single extracted file against a single ground-truth
file and reports a single similarity score (and optional diff snippets when
--verbose is used); reference the script name quick-eval.py and the positional
argument names extracted and ground_truth so the change is easy to locate.
In `@skills/odl-pdf/references/format-guide.md`:
- Around line 47-49: The fenced code blocks containing the CLI examples
"opendataloader-pdf input.pdf --format markdown,json" and "opendataloader-pdf
input.pdf --format text --to-stdout | my-indexer" are missing language
identifiers; update those triple-backtick fences to include a language tag
(e.g., bash) so markdownlint MD040 is satisfied and the snippets are highlighted
correctly.
In `@skills/odl-pdf/references/installation-matrix.md`:
- Line 7: The fenced code blocks in installation-matrix.md are missing language
specifiers causing linter warnings; update each triple-backtick block that
contains CLI output or commands to include an appropriate language tag (e.g.,
replace ``` with ```text for prompt/decision text like "Do you have Python
available?" and use ```bash for command lines such as "java -version" and
"opendataloader-pdf --version"); apply the same change to the other occurrences
called out (lines with the same blocks at 33 and 92) so all fenced blocks have
proper language specifiers.
- Around line 78-85: Update the installation matrix table rows for the runtimes
so they match the package manifests: change the "Minimum Runtime" for the
pip/pip [hybrid]/pip langchain rows to "Python 3.10+" (per pyproject.toml
requires-python = ">=3.10"), change the "Minimum Runtime" for the npm row to
"Node.js 20.19.0+" (per package.json "node": ">=20.19.0"), and leave the Maven
row as "Java 11+"; update the pip/npm table cells in installation-matrix.md
(rows identified by "pip", "pip [hybrid]", "pip langchain", "npm", and "Maven")
so the documented versions match the manifests.
In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 18: The fallback branch in detect-env.sh currently maps unknown platforms
to "linux" (the case pattern `*) echo "linux" ;;`), which can mislead callers;
change the fallback to emit "unknown" instead so callers can handle
unsupported/undetected OSes explicitly—update the `*)` case in the script (the
fallback case in the main OS-detection logic) to return or print "unknown"
rather than "linux".
- Line 37: The grep pipelines that extract values (for example the ver
assignment using grep -oE '"[^"]+"') can cause the script to exit under set -euo
pipefail when there is no match; update each such pipeline (the ver assignment
and the other grep-based extractions at the same pattern locations) to append
"|| true" so that grep failures do not abort the script and the subsequent
${...:-none} fallbacks and conditional checks remain reachable.
In `@skills/odl-pdf/scripts/hybrid-health.sh`:
- Line 55: The message hardcodes "--port 5002"; update the echo to use the
script's resolved host/port or URL variables instead (e.g., the variable(s)
populated when parsing --url/--port such as HYBRID_URL or HYBRID_PORT) so the
hint reflects the actual endpoint the script is checking; modify the echo in
hybrid-health.sh (the stopped-server hint line) to print the resolved URL/port
(or suggest using "--url <resolved-url>" if a custom URL was provided) rather
than the literal 5002.
- Around line 15-17: When handling the --url case in the argument parser, guard
against missing/empty values before assigning HYBRID_URL="$2" (this currently
fails under set -u). Change the --url branch so it first checks that the next
positional parameter exists and is not another flag (e.g. test that "${2-}" is
non-empty and does not start with '-') and if the check fails emit a clear error
and exit non-zero; otherwise assign HYBRID_URL="$2" and shift 2. Ensure you
reference the --url case and the HYBRID_URL variable when making this change.
In `@skills/odl-pdf/scripts/quick-eval.py`:
- Around line 146-155: The snippet builder currently can append a pre-hunk file
header (lines like '--- ' or '+++ ') as its own snippet; update the loop that
iterates over diff to skip unified-diff file header lines when current_hunk is
empty (ignore lines starting with '--- ' or '+++ ' before the first '@@'), or
alternatively, before appending a current_hunk to snippets check and discard
hunks that contain only header lines; modify the logic around the variables
diff, current_hunk, snippets, and max_snippets so only real hunks starting with
'@@' are captured as snippets.
In `@skills/odl-pdf/scripts/sync-skill-refs.py`:
- Around line 177-185: The two print calls that currently use pointless
f-strings should be changed to plain string literals: replace print(f"\nNEW
options (in options.json, not in skill):") and print(f"\nREMOVED options (in
skill, not in options.json):") with print("\nNEW options (in options.json, not
in skill):") and print("\nREMOVED options (in skill, not in options.json):")
respectively so the linter warning is resolved; these lines appear near the
handling of the new_options and removed_options variables in sync-skill-refs.py.
- Around line 31-35: The try-except-pass around sys.stdout.reconfigure can be
simplified using contextlib.suppress for clearer intent: replace the try/except
block that calls sys.stdout.reconfigure(encoding="utf-8") with a
contextlib.suppress(Exception) context manager so any exception is silently
ignored; reference the existing symbol sys.stdout.reconfigure and import
contextlib at the top if not already present.
In `@skills/odl-pdf/SKILL.md`:
- Around line 558-559: The pipeline in the example uses opendataloader-pdf with
--format markdown and then pipes to jq, which fails because jq expects JSON;
update the example to either emit JSON by changing --format markdown to --format
json (e.g., use opendataloader-pdf input.pdf --format json --to-stdout | jq .)
or remove the jq pipe and keep markdown output; modify the command shown in the
diff to use the chosen approach so the output type matches the consumer (jq or
stdout).
- Around line 13-14: The SKILL.md file has markdownlint warnings (MD022, MD031,
MD040) caused by missing blank lines around headings and fenced code blocks and
by fence blocks lacking a language tag; update SKILL.md by inserting a single
blank line before and after each heading (fix MD022), ensure fenced code blocks
are separated from surrounding text with blank lines (fix MD031), and add
appropriate language identifiers to all fenced code blocks (e.g., ```bash,
```json) to resolve MD040; check the repeated regions referenced in the comment
and apply the same fixes to headings and fences throughout the document.
- Around line 63-66: The SKILL.md environment table is inconsistent with
skills/odl-pdf/scripts/detect-env.sh and can break Phase 1 parsing; update the
table entries for the environment fields to match detect-env.sh semantics: use
"none" (not "missing") for absent interpreters/runtimes for PYTHON and NODE, and
specify that `JAVA` contains only the major Java version (e.g., `17`) or `none`
rather than a full version string; explicitly mention the `OS` field values
remain (linux, macos, windows) and reference detect-env.sh so future edits stay
in sync.
- Around line 605-608: The multiline Bash example for the opendataloader-pdf
command is broken because the line continuation backslash is followed by spaces
and an inline comment after --hybrid-mode full, so the next line runs as a
separate command; fix by ensuring the backslash is the last character on the
continuation line (remove trailing spaces and move the comment to its own line
or above the command) for the opendataloader-pdf invocation and its flags
(--hybrid, --hybrid-mode, --format) so the entire command executes as one shell
statement.
- Around line 272-284: The example in SKILL.md calls await at top-level while
using CommonJS require(), which will throw; wrap the async work in an async
function (e.g., async function run() { ... } then run()) or use an async IIFE so
that PdfConverter construction and the await on
converter.convert(['file1.pdf','file2.pdf']) run inside an async context; update
the snippet around the PdfConverter instantiation and converter.convert call
accordingly so top-level await is removed but functionality is unchanged.
In `@skills/README.md`:
- Line 7: The README has plain triple-backtick fenced code blocks for directory
tree snippets; update those fences to include a language specifier (e.g., change
opening ``` to ```text) for each snippet in skills/README.md (including the
snippet around the top-level tree and the snippet at the my-skill example around
line 52) so the blocks become ```text ... ``` and render with the correct
formatting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 150ff9d6-34f6-4f68-8a2a-72cc403814dd
📒 Files selected for processing (18)
.claude-plugin/marketplace.json.github/workflows/skill-drift-check.yml.gitignoreCLAUDE.mdCONTRIBUTING.mdREADME.mdskills/README.mdskills/odl-pdf/SKILL.mdskills/odl-pdf/evals/evals.jsonskills/odl-pdf/references/eval-metrics.mdskills/odl-pdf/references/format-guide.mdskills/odl-pdf/references/hybrid-guide.mdskills/odl-pdf/references/installation-matrix.mdskills/odl-pdf/references/options-matrix.mdskills/odl-pdf/scripts/detect-env.shskills/odl-pdf/scripts/hybrid-health.shskills/odl-pdf/scripts/quick-eval.pyskills/odl-pdf/scripts/sync-skill-refs.py
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (10)
skills/odl-pdf/scripts/hybrid-health.sh (2)
15-17:⚠️ Potential issue | 🟠 MajorGuard
--urlagainst missing values.With
set -uenabled,HYBRID_URL="$2"will cause an unbound variable error if--urlis passed without a value.🛠️ Proposed fix
--url) + if [[ $# -lt 2 || "$2" == --* ]]; then + echo "Missing value for --url" >&2 + echo "Usage: $0 [--url <url>]" >&2 + exit 1 + fi HYBRID_URL="$2" shift 2 ;;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/hybrid-health.sh` around lines 15 - 17, The --url) case assigns HYBRID_URL="$2" without checking that a value exists; with set -u this will raise an unbound variable error when --url is passed with no argument. Update the --url) branch in the option-parsing switch to validate the next positional ($2) before assigning to HYBRID_URL: if the next token is missing or looks like another option (starts with '-'), print a clear error and exit nonzero; otherwise assign HYBRID_URL="$2" and shift 2. Ensure you reference the --url) case and the HYBRID_URL variable so reviewers can find the change.
55-55:⚠️ Potential issue | 🟡 MinorAvoid hardcoded port in the stopped-server hint.
Line 55 always suggests
--port 5002, which is misleading when a custom--urlwas provided. Use the resolved URL instead.🛠️ Proposed fix
- echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --port 5002" + echo "Hybrid server is not running at ${HYBRID_URL}. Start it with: opendataloader-pdf-hybrid [--port <port>]"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/hybrid-health.sh` at line 55, The echo currently hardcodes "--port 5002" in the stopped-server hint; replace that literal with the script's resolved hybrid server URL variable (the same variable used earlier to build/check the URL, e.g. HYBRID_URL or RESOLVED_URL) so the message prints the actual URL/port the script used when a custom --url was provided; update the echo in the block that prints "Hybrid server is not running..." to reference that variable (e.g. echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --url $HYBRID_URL").skills/odl-pdf/scripts/detect-env.sh (1)
18-18:⚠️ Potential issue | 🟠 MajorDo not classify unknown OS as Linux.
Defaulting unknown platforms to
linuxcan trigger incorrect install/runtime guidance. Emitunknownand let the caller handle it explicitly.🛠️ Proposed fix
- *) echo "linux" ;; # best-effort fallback + *) echo "unknown" ;;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/detect-env.sh` at line 18, The script detect-env.sh currently maps the case default pattern "* )" to echo "linux" which misclassifies unknown platforms; change the default branch in the case statement (the "* )" entry) to emit "unknown" instead of "linux" and update the comment accordingly so callers can handle unknown OS explicitly.skills/odl-pdf/scripts/quick-eval.py (1)
143-157:⚠️ Potential issue | 🟡 MinorSkip unified-diff file headers when building snippet hunks.
Current logic can emit a first "snippet" containing only
---/+++headers (before the first@@), which wastes snippet slots and may confuse output.🛠️ Proposed fix
# Collect individual hunks (separated by @@ markers) snippets: list[str] = [] current_hunk: list[str] = [] + in_hunk = False for line in diff: - if line.startswith("@@") and current_hunk: - snippets.append("\n".join(current_hunk)) - current_hunk = [line] - if len(snippets) >= max_snippets: - break - else: - current_hunk.append(line) + if line.startswith("@@"): + if in_hunk and current_hunk: + snippets.append("\n".join(current_hunk)) + if len(snippets) >= max_snippets: + break + current_hunk = [line] + in_hunk = True + elif in_hunk: + current_hunk.append(line) if current_hunk and len(snippets) < max_snippets: snippets.append("\n".join(current_hunk))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/quick-eval.py` around lines 143 - 157, The snippet builder can produce a first hunk that contains only unified-diff file headers (lines like "diff ", "index ", "--- ", "+++ ") before the first "@@" which wastes slots; update the loop that fills snippets/current_hunk so it ignores header lines until the first "@@" is seen (e.g., only begin a new current_hunk when line.startswith("@@") or append to current_hunk only after an "@@" has been encountered), and ensure the final append only adds hunks that contain an "@@" (or at least non-header content) and still respects max_snippets; refer to variables/arrays snippets, current_hunk, diff and max_snippets to locate the change.skills/odl-pdf/references/eval-metrics.md (1)
182-186:⚠️ Potential issue | 🟠 Major
quick-eval.pycommand and behavior are still documented incorrectly.Line 183 omits required positional arguments, and Line 186 describes subset/per-metric benchmarking that this script does not perform.
Proposed fix
```bash -python skills/odl-pdf/scripts/quick-eval.py +python skills/odl-pdf/scripts/quick-eval.py extracted.md ground-truth.md-This script runs a subset evaluation suitable for rapid iteration. It processes a small representative sample and reports per-metric scores without requiring the full benchmark corpus.
+This script compares one extracted file against one ground-truth file and reports a single similarity score (plus optional diff snippets with--verbose).</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@skills/odl-pdf/references/eval-metrics.mdaround lines 182 - 186, Update the
docs for skills/odl-pdf/scripts/quick-eval.py to show the required positional
arguments and correct behavior: change the example invocation to require two
files (e.g., "python skills/odl-pdf/scripts/quick-eval.py extracted.md
ground-truth.md") and replace the description text to state that quick-eval.py
compares one extracted file against one ground-truth file and reports a single
similarity score, noting that optional verbose output (e.g., --verbose) can
include diff snippets; reference the script name quick-eval.py and the sample
filenames extracted.md and ground-truth.md so the change is unambiguous.</details> </blockquote></details> <details> <summary>skills/odl-pdf/references/installation-matrix.md (2)</summary><blockquote> `7-7`: _⚠️ Potential issue_ | _🟡 Minor_ **Add language specifiers to fenced code blocks.** Line 7, Line 33, and Line 92 should include fence languages (`text`/`bash`) to satisfy markdown lint and improve rendering. <details> <summary>Proposed fix</summary> ```diff -``` +```text Do you have Python available? ... -``` +``` -``` +```bash java -version -``` +``` -``` +```bash opendataloader-pdf --version -``` +```Also applies to: 33-33, 92-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/references/installation-matrix.md` at line 7, Add language specifiers to the three fenced code blocks that currently lack them: the block containing "Do you have Python available?" should be ```text, the block with the `java -version` command should be ```bash, and the block with `opendataloader-pdf --version` should be ```bash; update each opening fence to include the appropriate language to satisfy markdown linting and improve rendering.
80-83:⚠️ Potential issue | 🟠 MajorVersion compatibility table appears outdated versus package constraints.
Line 80-Line 83 likely understate minimum versions (
Python 3.8+,Node.js 16+), which can lead users to installation failures.#!/bin/bash # Verify documented runtime requirements against package manifests. set -euo pipefail echo "Searching declared runtime constraints..." rg -n -- 'requires-python|python_requires|\"node\"\\s*:' pyproject.toml setup.cfg setup.py package.json || true echo "" echo "Current installation-matrix table rows:" rg -n -- '^\| pip |^\| pip \[hybrid\] |^\| pip langchain |^\| npm ' skills/odl-pdf/references/installation-matrix.md🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/references/installation-matrix.md` around lines 80 - 83, The installation matrix rows for "| pip | Python 3.8+ | Yes |", "| pip [hybrid] | Python 3.8+ | Yes |", "| pip langchain | Python 3.8+, LangChain 0.1+ | Yes |", and "| npm | Node.js 16+ | Yes |" are likely out of sync with package manifests; run the provided verification (search python_requires/requires-python in pyproject.toml/setup.cfg/setup.py and "engines"/node constraint in package.json), determine the true minimum Python and Node versions, then update these exact table rows in skills/odl-pdf/references/installation-matrix.md to reflect the manifest values (and adjust the "LangChain" version text if package metadata dictates), ensuring the documented constraints exactly match the package manifests..github/workflows/skill-drift-check.yml (3)
27-33:⚠️ Potential issue | 🟠 MajorFix non-idiomatic exit handling and invalid
--fixremediation text.Line 29 uses indirect
$?checking, and Line 32 instructs users to run--fix, which is not supported bysync-skill-refs.py.#!/bin/bash # Verify whether --fix is supported and where it's referenced. set -euo pipefail echo "Checking workflow references..." rg -n -- '--fix|sync-skill-refs.py' .github/workflows/skill-drift-check.yml echo "" echo "Checking script CLI surface..." rg -n -- '--fix|argparse|add_argument|parse_args' skills/odl-pdf/scripts/sync-skill-refs.pyProposed fix
- name: Check skill drift run: | - python skills/odl-pdf/scripts/sync-skill-refs.py - if [ $? -ne 0 ]; then + if ! python skills/odl-pdf/scripts/sync-skill-refs.py; then echo "" echo "Drift detected: skill references are out of sync with options.json." - echo "Run 'python skills/odl-pdf/scripts/sync-skill-refs.py --fix' locally to update them." + echo "Update skills/odl-pdf/references/options-matrix.md to match options.json, then re-run the check locally." exit 1 fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 27 - 33, Replace the non-idiomatic "$?" check with a direct command test (use a negated command form: "if ! python ...; then") so the script exit is checked immediately and reliably, and update the remediation message to not reference the unsupported "--fix" flag; instead instruct users to run the same script (sync-skill-refs.py) locally to view/fix differences or point to the correct supported flag if the script provides one. Target the workflow step invoking sync-skill-refs.py in skill-drift-check.yml and the script name sync-skill-refs.py when making these changes.
5-7:⚠️ Potential issue | 🟠 MajorAdd explicit least-privilege workflow permissions.
There is no top-level
permissionsblock, soGITHUB_TOKENdefaults are broader than needed for this read-only check.Proposed fix
name: Skill Drift Check + +permissions: + contents: read on:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 5 - 7, Add an explicit top-level permissions block to the "Skill Drift Check" workflow so GITHUB_TOKEN is scoped to least privilege; update the workflow YAML to include a permissions section (e.g., permissions: contents: read, checks: read) instead of relying on defaults so the job only has the read rights it needs.
9-13:⚠️ Potential issue | 🟠 Major
pathsfilter is too narrow to catch all drift cases.Line 10 and Line 13 only watch
options.json; drift can also be introduced by edits toskills/odl-pdf/references/options-matrix.mdor the checker script itself.Proposed fix
on: push: paths: - 'options.json' + - 'skills/odl-pdf/references/options-matrix.md' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' pull_request: paths: - 'options.json' + - 'skills/odl-pdf/references/options-matrix.md' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' workflow_dispatch:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 9 - 13, The workflow's paths filter currently only lists 'options.json' under the paths keys and misses other sources of drift; update the paths arrays (the paths key entries) to include the options-matrix markdown (options-matrix.md) and the drift checker script plus broader globs (e.g., skills/** and scripts/** or a specific checker script name) so changes to those files trigger the job; ensure both the top-level push and pull_request paths arrays are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/odl-pdf/references/format-guide.md`:
- Around line 36-39: The docs for the image-output option are inconsistent:
replace the example value `image-output=dir` with the authoritative value
`image-output=external` wherever it appears (notably in the line describing
`image-output` and in any examples referencing `image-output`), and ensure the
description of `image-dir` clearly states it applies when
`image-output=external`; update references near `image-output`, `image-dir`, and
related `*-page-separator` notes to match the valid values `off`, `embedded`,
`external` documented in options-matrix.
In `@skills/odl-pdf/references/hybrid-guide.md`:
- Line 84: The table entry for `--hybrid-timeout` is inconsistent: replace the
default marker "—" with `0` so it matches the documented default in
`options-matrix.md` (Line 59) and ensures the hybrid-guide.md `|
--hybrid-timeout <ms> | Integer | 0 | Request timeout in milliseconds... |`
reflects that `0` disables the timeout.
In `@skills/odl-pdf/references/installation-matrix.md`:
- Around line 68-74: Update the Maven dependency snippet in
installation-matrix.md that currently shows groupId "io.opendataloader" and
artifactId "opendataloader-pdf" so it matches the canonical published
coordinates used elsewhere: change groupId to "org.opendataloader" and
artifactId to "opendataloader-pdf-core" (leave the <version> tag for you to pin
to a specific release rather than "LATEST"); edit the dependency block
referenced by the groupId/artifactId lines to reflect these exact identifiers so
it aligns with the README and Maven Central badge.
In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 37: The pipelines that extract versions (assignments to ver, major, and
version variables such as the ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' |
tr -d '"' | head -1)" line and the other ver/major/version extractions) can
return non-zero when grep finds no match and cause the script to exit under set
-euo pipefail; fix by appending "|| true" to those grep-containing pipelines so
they never fail the pipeline (apply to the shown ver assignment and the other
locations referenced: major=..., ver=... at line 70, and the version=... lines
at ~114 and ~122), leaving the fallback logic intact.
- Around line 42-50: The script currently returns only the major version
(variable ver -> echoing BASH_REMATCH[1] or extracting major) which conflicts
with SKILL.md expecting the full version string; update the logic in the
detect-env.sh snippet so both branches return the full semantic version (e.g.,
17.0.9) instead of just the major: for the legacy 1.x match stop echoing
BASH_REMATCH[1] and echo the original ver (or a normalized full version), and in
the else branch replace the major-only extraction (variable major with grep -oE
'^[0-9]+') with a pattern that captures the full dotted version prefix (e.g.,
grep -oE '^[0-9]+(\.[0-9]+)*') so the function returns the complete version
string.
---
Duplicate comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 27-33: Replace the non-idiomatic "$?" check with a direct command
test (use a negated command form: "if ! python ...; then") so the script exit is
checked immediately and reliably, and update the remediation message to not
reference the unsupported "--fix" flag; instead instruct users to run the same
script (sync-skill-refs.py) locally to view/fix differences or point to the
correct supported flag if the script provides one. Target the workflow step
invoking sync-skill-refs.py in skill-drift-check.yml and the script name
sync-skill-refs.py when making these changes.
- Around line 5-7: Add an explicit top-level permissions block to the "Skill
Drift Check" workflow so GITHUB_TOKEN is scoped to least privilege; update the
workflow YAML to include a permissions section (e.g., permissions: contents:
read, checks: read) instead of relying on defaults so the job only has the read
rights it needs.
- Around line 9-13: The workflow's paths filter currently only lists
'options.json' under the paths keys and misses other sources of drift; update
the paths arrays (the paths key entries) to include the options-matrix markdown
(options-matrix.md) and the drift checker script plus broader globs (e.g.,
skills/** and scripts/** or a specific checker script name) so changes to those
files trigger the job; ensure both the top-level push and pull_request paths
arrays are updated consistently.
In `@skills/odl-pdf/references/eval-metrics.md`:
- Around line 182-186: Update the docs for skills/odl-pdf/scripts/quick-eval.py
to show the required positional arguments and correct behavior: change the
example invocation to require two files (e.g., "python
skills/odl-pdf/scripts/quick-eval.py extracted.md ground-truth.md") and replace
the description text to state that quick-eval.py compares one extracted file
against one ground-truth file and reports a single similarity score, noting that
optional verbose output (e.g., --verbose) can include diff snippets; reference
the script name quick-eval.py and the sample filenames extracted.md and
ground-truth.md so the change is unambiguous.
In `@skills/odl-pdf/references/installation-matrix.md`:
- Line 7: Add language specifiers to the three fenced code blocks that currently
lack them: the block containing "Do you have Python available?" should be
```text, the block with the `java -version` command should be ```bash, and the
block with `opendataloader-pdf --version` should be ```bash; update each opening
fence to include the appropriate language to satisfy markdown linting and
improve rendering.
- Around line 80-83: The installation matrix rows for "| pip | Python 3.8+ | Yes
|", "| pip [hybrid] | Python 3.8+ | Yes |", "| pip langchain | Python 3.8+,
LangChain 0.1+ | Yes |", and "| npm | Node.js 16+ | Yes |" are likely out of
sync with package manifests; run the provided verification (search
python_requires/requires-python in pyproject.toml/setup.cfg/setup.py and
"engines"/node constraint in package.json), determine the true minimum Python
and Node versions, then update these exact table rows in
skills/odl-pdf/references/installation-matrix.md to reflect the manifest values
(and adjust the "LangChain" version text if package metadata dictates), ensuring
the documented constraints exactly match the package manifests.
In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 18: The script detect-env.sh currently maps the case default pattern "*
)" to echo "linux" which misclassifies unknown platforms; change the default
branch in the case statement (the "* )" entry) to emit "unknown" instead of
"linux" and update the comment accordingly so callers can handle unknown OS
explicitly.
In `@skills/odl-pdf/scripts/hybrid-health.sh`:
- Around line 15-17: The --url) case assigns HYBRID_URL="$2" without checking
that a value exists; with set -u this will raise an unbound variable error when
--url is passed with no argument. Update the --url) branch in the option-parsing
switch to validate the next positional ($2) before assigning to HYBRID_URL: if
the next token is missing or looks like another option (starts with '-'), print
a clear error and exit nonzero; otherwise assign HYBRID_URL="$2" and shift 2.
Ensure you reference the --url) case and the HYBRID_URL variable so reviewers
can find the change.
- Line 55: The echo currently hardcodes "--port 5002" in the stopped-server
hint; replace that literal with the script's resolved hybrid server URL variable
(the same variable used earlier to build/check the URL, e.g. HYBRID_URL or
RESOLVED_URL) so the message prints the actual URL/port the script used when a
custom --url was provided; update the echo in the block that prints "Hybrid
server is not running..." to reference that variable (e.g. echo "Hybrid server
is not running. Start it with: opendataloader-pdf-hybrid --url $HYBRID_URL").
In `@skills/odl-pdf/scripts/quick-eval.py`:
- Around line 143-157: The snippet builder can produce a first hunk that
contains only unified-diff file headers (lines like "diff ", "index ", "--- ",
"+++ ") before the first "@@" which wastes slots; update the loop that fills
snippets/current_hunk so it ignores header lines until the first "@@" is seen
(e.g., only begin a new current_hunk when line.startswith("@@") or append to
current_hunk only after an "@@" has been encountered), and ensure the final
append only adds hunks that contain an "@@" (or at least non-header content) and
still respects max_snippets; refer to variables/arrays snippets, current_hunk,
diff and max_snippets to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fdd6b05b-ad83-4507-8990-82c61279d8d7
📒 Files selected for processing (18)
.claude-plugin/marketplace.json.github/workflows/skill-drift-check.yml.gitignoreCLAUDE.mdCONTRIBUTING.mdREADME.mdskills/README.mdskills/odl-pdf/SKILL.mdskills/odl-pdf/evals/evals.jsonskills/odl-pdf/references/eval-metrics.mdskills/odl-pdf/references/format-guide.mdskills/odl-pdf/references/hybrid-guide.mdskills/odl-pdf/references/installation-matrix.mdskills/odl-pdf/references/options-matrix.mdskills/odl-pdf/scripts/detect-env.shskills/odl-pdf/scripts/hybrid-health.shskills/odl-pdf/scripts/quick-eval.pyskills/odl-pdf/scripts/sync-skill-refs.py
14cf497 to
3bb3f9c
Compare
Objective: Code review identified 6 issues across CI workflow, shell scripts, SKILL.md references, and .gitignore. Approach: Accept 6 items, reject 4 items with technical reasoning. - CI: remove nonexistent --fix flag from error message, fix unreachable exit-code logic with set +e - Shell: add executable bit (100755) to detect-env.sh, hybrid-health.sh - SKILL.md: add eval-metrics.md to reference table, fix quick-eval.py usage from --input/--reference flags to positional args - .gitignore: remove duplicate __pycache__ entry (already covered by **/__pycache__/ on line 53) Evidence: git ls-tree confirms 100755 for .sh files. grep confirms no remaining --fix references in CI. SKILL.md quick-eval.py example matches argparse positional interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3bb3f9c to
faf526c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (11)
skills/odl-pdf/SKILL.md (4)
272-284:⚠️ Potential issue | 🟠 MajorNode.js example uses invalid top-level
awaitin CommonJS.The snippet uses
require()plus top-levelawait, which will throw in CommonJS execution.Proposed patch
const { PdfConverter } = require('@opendataloader/pdf'); @@ -// Batch all files in one call -const results = await converter.convert(['file1.pdf', 'file2.pdf']); -results.forEach(r => console.log(r.markdown)); +async function main() { + // Batch all files in one call + const results = await converter.convert(['file1.pdf', 'file2.pdf']); + results.forEach((r) => console.log(r.markdown)); +} + +main().catch(console.error);#!/bin/bash # Verify CommonJS top-level await failure mode. set -euo pipefail cat >/tmp/odl-skill-commonjs-check.cjs <<'EOF' const { PdfConverter } = require('@opendataloader/pdf'); const results = await Promise.resolve([]); console.log(results); EOF node --check /tmp/odl-skill-commonjs-check.cjs || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` around lines 272 - 284, The snippet mixes CommonJS require() with top-level await which fails in CommonJS; update the example using either an async IIFE (e.g., wrap the PdfConverter creation and await converter.convert(...) inside an immediately-invoked async function) or convert the snippet to ESM (use import and keep top-level await) so that the call to PdfConverter.convert can await correctly; reference PdfConverter and converter.convert to locate and modify the example accordingly.
603-606:⚠️ Potential issue | 🟠 MajorFix broken Bash line continuation in enrichment command.
The backslash on Line 605 is followed by spaces/comment text, so the command continuation is broken.
Proposed patch
opendataloader-pdf input.pdf \ --hybrid docling-fast \ - --hybrid-mode full \ # required for enrichments + --hybrid-mode full \ + # required for enrichments --format markdownBased on learnings, enrichment flags must be paired with
--hybrid-mode full, so this command must remain copy-paste safe.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` around lines 603 - 606, The Bash line continuation for the enrichment example using opendataloader-pdf is broken because the backslash after "--hybrid-mode full" is followed by spaces and a comment; update the SKILL.md example so the continuation is copy-paste safe by removing the trailing spaces/comment after the backslash (or place the comment on its own line) so the command that includes "--hybrid docling-fast", "--hybrid-mode full" and "--format markdown" continues correctly; ensure the flags remain paired with "--hybrid-mode full" and the example is executable.
64-66:⚠️ Potential issue | 🟠 MajorAlign environment field semantics with
detect-env.sh.These rows currently document
missingand full Java version, butskills/odl-pdf/scripts/detect-env.shemitsnoneand Java major version. This mismatch can break Phase 1 parsing behavior.Proposed patch
-| `JAVA` | Java version detected (e.g., `17.0.9`) or `missing` | -| `PYTHON` | Python version or `missing` | -| `NODE` | Node.js version or `missing` | +| `JAVA` | Java major version detected (e.g., `17`) or `none` | +| `PYTHON` | Python version or `none` | +| `NODE` | Node.js version or `none` |🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` around lines 64 - 66, The SKILL.md table rows for environment detection are out of sync with scripts/detect-env.sh; update the `JAVA`, `PYTHON`, and `NODE` row descriptions in skills/odl-pdf/SKILL.md to match detect-env.sh's output semantics (use `none` instead of `missing`, and for `JAVA` document the reported Java major version like `17` rather than a full runtime string), ensuring the wording matches the exact tokens emitted by scripts/detect-env.sh so Phase 1 parsing remains compatible.
556-557:⚠️ Potential issue | 🟠 Major
jqcannot parse Markdown output in this example.This command emits Markdown and pipes it to
jq, which expects JSON.Proposed patch
-opendataloader-pdf input.pdf --format markdown --to-stdout | jq . +opendataloader-pdf input.pdf --format json --to-stdout | jq .🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` around lines 556 - 557, The example pipes Markdown into jq which fails because jq expects JSON; update the example command string "opendataloader-pdf input.pdf --format markdown --to-stdout | jq ." to either emit JSON (change --format markdown to --format json) when piping to jq or remove the " | jq ." pipe and leave the Markdown output; ensure the example shows a matching output processor for the chosen --format.skills/odl-pdf/scripts/detect-env.sh (2)
18-18:⚠️ Potential issue | 🟠 MajorReturn
unknownfor unmapped OS instead of forcinglinux.Line 18 currently misclassifies unknown platforms as Linux, which can produce incorrect setup guidance.
Proposed patch
- *) echo "linux" ;; # best-effort fallback + *) echo "unknown" ;;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/detect-env.sh` at line 18, The default case in the OS detection case statement currently returns "linux" (pattern '*)'), which misclassifies unknown platforms; update the default branch (the '*)' pattern in detect-env.sh) to echo "unknown" instead of "linux" and update the inline comment to reflect it's an unknown-platform fallback so callers get a clear "unknown" result rather than being forced to Linux.
37-37:⚠️ Potential issue | 🔴 CriticalGuard grep-based pipelines under strict mode to avoid premature exits.
With
set -euo pipefail, thesegreppipelines can terminate the script when no match exists, bypassing yournonefallback logic.Proposed patch
- ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1)" + ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1 || true)" ... - major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+')" + major="$(printf '%s' "${ver}" | grep -oE '^[0-9]+' || true)" ... - ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?'| head -1)" + ver="$(printf '%s' "${raw}" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -1 || true)" ... - version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + version="$(printf '%s' "${cli_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1 || true)" ... - version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + version="$(printf '%s' "${mod_ver}" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1 || true)"#!/bin/bash # Verifies strict-mode pipeline failure behavior (no repo modification). set -euo pipefail bash -lc ' set -euo pipefail raw="no-version-token" # This exits non-zero without "|| true" ver="$(printf "%s" "$raw" | grep -oE "\"[^\"]+\"" | tr -d "\"" | head -1)" echo "ver=$ver" ' || echo "Confirmed: pipeline aborts under strict mode when grep has no match."Also applies to: 48-48, 70-70, 114-114, 122-122
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/detect-env.sh` at line 37, The grep-based pipeline that sets ver (ver="$(printf '%s' "${raw}" | grep -oE '"[^"]+"' | tr -d '"' | head -1)") will cause the script to exit under set -euo pipefail when grep finds no match; update the pipeline in the ver assignment (and the other similar pipelines at the mentioned locations) to guard against non-zero grep exits by appending a safe fallback (for example append "|| true" or use "grep ... || :" before head or capture via a conditional) so the variable falls back to your existing "none" logic instead of aborting the script; locate the ver assignment and analogous pipelines (the grep -oE '"[^"]+"' | tr -d '"' | head -1 sequences) and make them resilient to no-match failures.skills/odl-pdf/scripts/hybrid-health.sh (2)
15-17:⚠️ Potential issue | 🟠 MajorGuard
--urlfor missing values before reading$2.With
set -u, passing--urlwithout a value can terminate the script with an unbound variable.Proposed patch
--url) + if [[ $# -lt 2 || "${2-}" == --* || -z "${2-}" ]]; then + echo "Missing value for --url" >&2 + echo "Usage: $0 [--url <url>]" >&2 + exit 1 + fi HYBRID_URL="$2" shift 2 ;;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/hybrid-health.sh` around lines 15 - 17, The --url case reads $2 directly which under set -u will fail if the option is provided without a value; update the --url branch in hybrid-health.sh to first validate that a next argument exists and is not another option (e.g., check that ${2-} is non-empty and does not start with '-') before assigning HYBRID_URL="$2", otherwise print a helpful error and exit; keep the existing shift 2 behavior only after the validation succeeds so HYBRID_URL and shifts are safe.
55-55:⚠️ Potential issue | 🟡 MinorAvoid hardcoded
--port 5002in the stopped-server hint.This message is inaccurate when users pass a custom
--url.Proposed patch
- echo "Hybrid server is not running. Start it with: opendataloader-pdf-hybrid --port 5002" + echo "Hybrid server is not running at ${HYBRID_URL}. Start it with: opendataloader-pdf-hybrid [--port <port>] or use --url ${HYBRID_URL}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/hybrid-health.sh` at line 55, The hint message in hybrid-health.sh currently hardcodes "--port 5002" which is wrong when a custom --url is used; update the echo to construct the hint dynamically from the script's configured variables (use the variable that holds the configured URL/port — e.g., HYBRID_URL or HYBRID_PORT or URL used by the health check logic) and fall back to the default (5002) so the printed start command reflects the actual configured endpoint used by the script..github/workflows/skill-drift-check.yml (3)
5-7:⚠️ Potential issue | 🟠 MajorAdd explicit least-privilege workflow permissions.
This workflow currently relies on default
GITHUB_TOKENpermissions. Pin it to read-only for this check.Proposed patch
name: Skill Drift Check + +permissions: + contents: read on:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 5 - 7, The workflow "Skill Drift Check" is using default GITHUB_TOKEN permissions; restrict it to least privilege by adding an explicit top-level permissions block that sets GITHUB_TOKEN to read-only (e.g., permissions: contents: read) so the workflow only has the minimum required access; update the workflow header near the "name: Skill Drift Check" entry and ensure the block includes the specific permission(s) your job needs (reference GITHUB_TOKEN) rather than relying on defaults.
9-13:⚠️ Potential issue | 🟠 MajorExpand path triggers so drift checks run when skill docs/scripts change.
Only watching
options.jsonmisses drift introduced by editingskills/odl-pdf/references/options-matrix.mdorskills/odl-pdf/scripts/sync-skill-refs.py.Proposed patch
on: push: paths: - 'options.json' + - 'skills/odl-pdf/references/options-matrix.md' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' pull_request: paths: - 'options.json' + - 'skills/odl-pdf/references/options-matrix.md' + - 'skills/odl-pdf/scripts/sync-skill-refs.py'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 9 - 13, The workflow's trigger only watches 'options.json' under the 'paths' arrays for push and the 'pull_request' job, so edits to skill docs or scripts (e.g., references markdown or sync scripts) won't run drift checks; update the 'paths' entries in .github/workflows/skill-drift-check.yml to include glob patterns that cover skill documentation and scripts in the skills tree (for example add patterns for skills/**/*.md and skills/**/scripts/** or similar) alongside the existing 'options.json' so pushes and PRs that change skill docs/scripts will trigger the drift check.
31-35:⚠️ Potential issue | 🟡 MinorDifferentiate drift (exit 1) from execution/input errors (exit 2).
sync-skill-refs.pyuses distinct exit codes; treating all non-zero as drift makes failures harder to triage.Proposed patch
python skills/odl-pdf/scripts/sync-skill-refs.py EXIT_CODE=$? - if [ $EXIT_CODE -ne 0 ]; then + if [ "$EXIT_CODE" -eq 1 ]; then echo "" echo "Drift detected: skill references are out of sync with options.json." echo "Update skills/odl-pdf/references/options-matrix.md to match options.json." exit 1 + elif [ "$EXIT_CODE" -ne 0 ]; then + echo "" + echo "Skill drift check failed to run (exit code: $EXIT_CODE)." + exit "$EXIT_CODE" fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 31 - 35, The workflow currently treats any non-zero EXIT_CODE from sync-skill-refs.py as drift (exiting 1); change the branch that handles non-zero EXIT_CODE so it preserves the script's semantics: if EXIT_CODE equals 2, re-exit with code 2 (and an appropriate message about execution/input error), otherwise exit 1 for actual drift; reference the EXIT_CODE variable and the sync-skill-refs.py invocation and update the echo/exit behavior in that conditional block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/odl-pdf/scripts/detect-env.sh`:
- Around line 140-157: The detect_hybrid_extras function currently probes for a
non-existent module docling_serve which makes HYBRID_EXTRAS incorrectly false;
update detect_hybrid_extras to test importing the actual required packages
(docling, fastapi, uvicorn) instead of docling_serve, i.e., run a Python import
check that attempts "import docling, fastapi, uvicorn" (or import them
individually) and set HYBRID_EXTRAS=true only if those imports succeed, while
preserving the existing python/pycmd detection and error suppression behavior.
---
Duplicate comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 5-7: The workflow "Skill Drift Check" is using default
GITHUB_TOKEN permissions; restrict it to least privilege by adding an explicit
top-level permissions block that sets GITHUB_TOKEN to read-only (e.g.,
permissions: contents: read) so the workflow only has the minimum required
access; update the workflow header near the "name: Skill Drift Check" entry and
ensure the block includes the specific permission(s) your job needs (reference
GITHUB_TOKEN) rather than relying on defaults.
- Around line 9-13: The workflow's trigger only watches 'options.json' under the
'paths' arrays for push and the 'pull_request' job, so edits to skill docs or
scripts (e.g., references markdown or sync scripts) won't run drift checks;
update the 'paths' entries in .github/workflows/skill-drift-check.yml to include
glob patterns that cover skill documentation and scripts in the skills tree (for
example add patterns for skills/**/*.md and skills/**/scripts/** or similar)
alongside the existing 'options.json' so pushes and PRs that change skill
docs/scripts will trigger the drift check.
- Around line 31-35: The workflow currently treats any non-zero EXIT_CODE from
sync-skill-refs.py as drift (exiting 1); change the branch that handles non-zero
EXIT_CODE so it preserves the script's semantics: if EXIT_CODE equals 2, re-exit
with code 2 (and an appropriate message about execution/input error), otherwise
exit 1 for actual drift; reference the EXIT_CODE variable and the
sync-skill-refs.py invocation and update the echo/exit behavior in that
conditional block.
In `@skills/odl-pdf/scripts/detect-env.sh`:
- Line 18: The default case in the OS detection case statement currently returns
"linux" (pattern '*)'), which misclassifies unknown platforms; update the
default branch (the '*)' pattern in detect-env.sh) to echo "unknown" instead of
"linux" and update the inline comment to reflect it's an unknown-platform
fallback so callers get a clear "unknown" result rather than being forced to
Linux.
- Line 37: The grep-based pipeline that sets ver (ver="$(printf '%s' "${raw}" |
grep -oE '"[^"]+"' | tr -d '"' | head -1)") will cause the script to exit under
set -euo pipefail when grep finds no match; update the pipeline in the ver
assignment (and the other similar pipelines at the mentioned locations) to guard
against non-zero grep exits by appending a safe fallback (for example append "||
true" or use "grep ... || :" before head or capture via a conditional) so the
variable falls back to your existing "none" logic instead of aborting the
script; locate the ver assignment and analogous pipelines (the grep -oE
'"[^"]+"' | tr -d '"' | head -1 sequences) and make them resilient to no-match
failures.
In `@skills/odl-pdf/scripts/hybrid-health.sh`:
- Around line 15-17: The --url case reads $2 directly which under set -u will
fail if the option is provided without a value; update the --url branch in
hybrid-health.sh to first validate that a next argument exists and is not
another option (e.g., check that ${2-} is non-empty and does not start with '-')
before assigning HYBRID_URL="$2", otherwise print a helpful error and exit; keep
the existing shift 2 behavior only after the validation succeeds so HYBRID_URL
and shifts are safe.
- Line 55: The hint message in hybrid-health.sh currently hardcodes "--port
5002" which is wrong when a custom --url is used; update the echo to construct
the hint dynamically from the script's configured variables (use the variable
that holds the configured URL/port — e.g., HYBRID_URL or HYBRID_PORT or URL used
by the health check logic) and fall back to the default (5002) so the printed
start command reflects the actual configured endpoint used by the script.
In `@skills/odl-pdf/SKILL.md`:
- Around line 272-284: The snippet mixes CommonJS require() with top-level await
which fails in CommonJS; update the example using either an async IIFE (e.g.,
wrap the PdfConverter creation and await converter.convert(...) inside an
immediately-invoked async function) or convert the snippet to ESM (use import
and keep top-level await) so that the call to PdfConverter.convert can await
correctly; reference PdfConverter and converter.convert to locate and modify the
example accordingly.
- Around line 603-606: The Bash line continuation for the enrichment example
using opendataloader-pdf is broken because the backslash after "--hybrid-mode
full" is followed by spaces and a comment; update the SKILL.md example so the
continuation is copy-paste safe by removing the trailing spaces/comment after
the backslash (or place the comment on its own line) so the command that
includes "--hybrid docling-fast", "--hybrid-mode full" and "--format markdown"
continues correctly; ensure the flags remain paired with "--hybrid-mode full"
and the example is executable.
- Around line 64-66: The SKILL.md table rows for environment detection are out
of sync with scripts/detect-env.sh; update the `JAVA`, `PYTHON`, and `NODE` row
descriptions in skills/odl-pdf/SKILL.md to match detect-env.sh's output
semantics (use `none` instead of `missing`, and for `JAVA` document the reported
Java major version like `17` rather than a full runtime string), ensuring the
wording matches the exact tokens emitted by scripts/detect-env.sh so Phase 1
parsing remains compatible.
- Around line 556-557: The example pipes Markdown into jq which fails because jq
expects JSON; update the example command string "opendataloader-pdf input.pdf
--format markdown --to-stdout | jq ." to either emit JSON (change --format
markdown to --format json) when piping to jq or remove the " | jq ." pipe and
leave the Markdown output; ensure the example shows a matching output processor
for the chosen --format.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 33aec8d3-c2e3-4797-9349-9097189a0188
📒 Files selected for processing (5)
.github/workflows/skill-drift-check.yml.gitignoreskills/odl-pdf/SKILL.mdskills/odl-pdf/scripts/detect-env.shskills/odl-pdf/scripts/hybrid-health.sh
💤 Files with no reviewable changes (1)
- .gitignore
Objective: CodeRabbit and CodeQL identified 24 issues across
skill files — factual errors, security, and edge cases.
Approach: Accept 15 items, reject 9 with technical reasoning.
Critical fixes:
- Python/Node.js/Java API examples replaced with actual APIs
(opendataloader_pdf.convert, convert from @opendataloader/pdf,
OpenDataLoaderPDF.processFile)
- NID renamed to Normalized Indel Distance (was Information/Inversion)
- MHS renamed to Markdown Heading Similarity (was Mean)
- Hybrid install corrected to pip install "opendataloader-pdf[hybrid]"
(was separate pip install opendataloader-pdf-hybrid)
- detect-env.sh: grep pipefail fix (|| true), unknown OS fallback,
hybrid detection via import docling (was docling_serve)
Other fixes:
- CI: added permissions { contents: read }
- installation-matrix.md: Python 3.10+, Node 20.19+, Maven coordinates
- hybrid-guide.md: timeout default 0, docling-serve description
- SKILL.md: jq pipe format, bash continuation, env table none/missing
- hybrid-health.sh: --url guard, dynamic port hint
- format-guide.md: image-output values corrected
- installation-matrix.md: GitHub URL corrected
Rejected (with push back):
- CI trigger expansion to options-matrix.md (false positives)
- markdownlint code block language tags (no markdownlint CI)
- SKILL.md top-level await concern (example snippet, not executed)
- contextlib.suppress suggestion (reviewer agreed acceptable)
- f-string placeholder linting (no functional impact)
- hybrid-health.sh exit code (key=value parsing, no caller)
- CI auto-PR creation (v1.0 scope, tracked as follow-up)
Evidence: opus audit verified 17 previously-flagged checks + 9 fresh
scan checks all PASS. Drift check 26/26 options in sync.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (8)
skills/odl-pdf/scripts/hybrid-health.sh (1)
15-25:⚠️ Potential issue | 🟡 MinorHarden
--urlparsing for empty/flag-like values.
--url --fooand--url=currently pass parsing and lead to invalid health endpoints instead of a clear argument error.Proposed fix
--url) - if [[ $# -lt 2 ]]; then + if [[ $# -lt 2 || -z "${2-}" || "${2}" == --* ]]; then echo "Error: --url requires a value" >&2 exit 1 fi HYBRID_URL="$2" shift 2 ;; --url=*) - HYBRID_URL="${1#--url=}" + HYBRID_URL="${1#--url=}" + if [[ -z "${HYBRID_URL}" ]]; then + echo "Error: --url requires a non-empty value" >&2 + exit 1 + fi shift ;;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/hybrid-health.sh` around lines 15 - 25, The --url parsing must reject empty or flag-like values: when handling the separate-arg branch (case for --url) validate that "$2" is non-empty and does not start with '-' before assigning HYBRID_URL and shifting; when handling the --url=* branch validate that the extracted value (from "${1#--url=}") is non-empty before assigning HYBRID_URL and shifting; on validation failure print an error ("--url requires a non-empty value") to stderr and exit 1 to prevent creating invalid endpoints..github/workflows/skill-drift-check.yml (1)
9-13:⚠️ Potential issue | 🟠 MajorBroaden path triggers to avoid missed drift checks.
This workflow can miss drift when only
skills/odl-pdf/references/options-matrix.mdorskills/odl-pdf/scripts/sync-skill-refs.pychanges.Proposed fix
on: push: paths: - 'options.json' + - 'skills/odl-pdf/references/options-matrix.md' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' pull_request: paths: - 'options.json' + - 'skills/odl-pdf/references/options-matrix.md' + - 'skills/odl-pdf/scripts/sync-skill-refs.py' workflow_dispatch:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/skill-drift-check.yml around lines 9 - 13, The workflow's path filters only watch 'options.json' so changes to skill files like 'skills/odl-pdf/references/options-matrix.md' or 'skills/odl-pdf/scripts/sync-skill-refs.py' don't trigger the drift check; update the push and pull_request path patterns (the 'paths' entries in the workflow) to include the relevant skill directories or files (for example broaden to 'skills/**', or add the specific paths 'skills/odl-pdf/references/options-matrix.md' and 'skills/odl-pdf/scripts/sync-skill-refs.py') so edits to those files will run the workflow.skills/odl-pdf/references/installation-matrix.md (1)
7-21:⚠️ Potential issue | 🟡 MinorAdd language tags to all unlabeled fenced blocks.
The decision tree and command fences should declare a language (
text/bash) to resolve MD040 warnings.Proposed fix
-``` +```text Do you have Python available? ├── Yes │ ├── Do you need LangChain integration? @@ └── Unsure? └── pip install opendataloader-pdf (simplest, works on all platforms)@@
-+bash
java -version@@ -``` +```bash opendataloader-pdf --version</details> Also applies to: 33-35, 92-94 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@skills/odl-pdf/references/installation-matrix.mdaround lines 7 - 21, The
unlabeled fenced code blocks in installation-matrix.md (the decision-tree ASCII
block and the command examples like the java --version and opendataloader-pdf
--version blocks) are triggering MD040; add explicit language tags: change the
decision-tree fence totext and change command/example fences tobash, and
apply the same fixes to the other occurrences referenced (around the other
blocks at the indicated sections), ensuring every triple-backtick fence has a
language identifier.</details> </blockquote></details> <details> <summary>skills/odl-pdf/references/format-guide.md (1)</summary><blockquote> `47-49`: _⚠️ Potential issue_ | _🟡 Minor_ **Add language identifiers to fenced CLI examples.** Both fenced blocks should include a language tag (e.g., `bash`) to satisfy MD040 and improve readability. <details> <summary>Proposed fix</summary> ```diff -``` +```bash opendataloader-pdf input.pdf --format markdown,json ``` @@ -``` +```bash opendataloader-pdf input.pdf --format text --to-stdout | my-indexer ``` ``` </details> Also applies to: 57-59 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@skills/odl-pdf/references/format-guide.mdaround lines 47 - 49, The fenced
CLI examples using the opendataloader-pdf command need language identifiers for
markdown linting (MD040); change the opening backticks for the blocks that
contain "opendataloader-pdf input.pdf --format markdown,json" and the block with
"opendataloader-pdf input.pdf --format text --to-stdout | my-indexer" so they
start withbash instead of(i.e., replacewithbash for those two
fenced code blocks) and keep the contents unchanged.</details> </blockquote></details> <details> <summary>skills/odl-pdf/scripts/quick-eval.py (1)</summary><blockquote> `146-155`: _⚠️ Potential issue_ | _🟡 Minor_ **Skip unified-diff file headers when building snippets.** Current hunk parsing can emit a header-only snippet (`---/+++`) before the first `@@`, which wastes snippet slots. <details> <summary>Proposed fix</summary> ```diff snippets: list[str] = [] current_hunk: list[str] = [] + in_hunk = False for line in diff: - if line.startswith("@@") and current_hunk: - snippets.append("\n".join(current_hunk)) - current_hunk = [line] - if len(snippets) >= max_snippets: - break - else: + if line.startswith("@@"): + in_hunk = True + if current_hunk: + snippets.append("\n".join(current_hunk)) + if len(snippets) >= max_snippets: + break + current_hunk = [line] + continue + if in_hunk: current_hunk.append(line) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/scripts/quick-eval.py` around lines 146 - 155, The loop that builds hunks appends header-only snippets (lines like '---'/'+++') before the first '@@', wasting snippet slots; modify the logic in the diff-processing loop that iterates "for line in diff:" so that you skip or drop file-header lines (e.g., lines starting with '---' or '+++') when current_hunk contains only headers and you haven't hit a hunk header '@@' yet, only starting current_hunk when you see the first '@@' or non-header content; ensure the final append block that checks "if current_hunk and len(snippets) < max_snippets:" also ignores header-only current_hunk so snippets only contain real hunks (references: variables diff, current_hunk, snippets, max_snippets). ``` </details> </blockquote></details> <details> <summary>skills/odl-pdf/SKILL.md (3)</summary><blockquote> `37-37`: _⚠️ Potential issue_ | _🟡 Minor_ **Address remaining markdownlint warnings for CI cleanliness.** Multiple fenced code blocks still lack language specifications (MD040) and surrounding blank lines (MD031). These were flagged in previous reviews but remain unresolved: - **MD040**: Add language identifiers to fences at lines 37, 100, 133, 182, 328, 377, 392, 405, 417, 425, 433, 441 (use `text` for decision trees/pseudo-code, or appropriate language for code examples) - **MD031**: Add blank lines before/after fenced blocks at lines 244, 253, 266, 278, 292, 377, 392, 405, 417, 425, 433, 540, 545, 550, 556, 704, 709 Also applies to: 100-100, 133-133, 182-182, 244-244, 253-253, 266-266, 278-278, 292-292, 328-328, 377-377, 392-392, 405-405, 417-417, 425-425, 433-433, 441-441, 540-540, 545-545, 550-550, 556-556, 704-704, 709-709 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` at line 37, The markdown has multiple fenced code blocks missing language identifiers (MD040) and missing surrounding blank lines (MD031); update SKILL.md by adding appropriate language tags (use "text" for decision trees/pseudo-code or the correct language for code examples) to the fences at the listed locations (e.g., the fences reported at lines ~37,100,133,182,328,377,392,405,417,425,433,441 and others) and ensure there is a blank line before and after each fenced block for the reported positions (e.g., around the blocks at ~244,253,266,278,292,377,392,405,417,425,433,540,545,550,556,704,709) so the file passes markdownlint checks MD040 and MD031. ``` </details> --- `13-14`: _⚠️ Potential issue_ | _🟡 Minor_ **Add blank line between metadata headers.** Line 14 still violates MD022 (headings should be surrounded by blank lines). A blank line is needed between lines 13 and 14 to separate these two heading-level metadata comments. <details> <summary>📝 Suggested fix</summary> ```diff # Targets: opendataloader-pdf >= 2.2.0 + # Last synced options.json: 26 options ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` around lines 13 - 14, Add a blank line between the two metadata headings "# Targets: opendataloader-pdf >= 2.2.0" and "# Last synced options.json: 26 options" in SKILL.md so the second header is separated and the file no longer violates MD022 (headings should be surrounded by blank lines). ``` </details> --- `63-63`: _⚠️ Potential issue_ | _🟡 Minor_ **Correct macOS capitalization.** Apple's operating system should be capitalized as "macOS" not "macos". <details> <summary>✏️ Suggested fix</summary> ```diff -| `OS` | Operating system (linux, macos, windows) | +| `OS` | Operating system (linux, macOS, windows) | ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@skills/odl-pdf/SKILL.md` at line 63, Update the OS table entry in SKILL.md to use the correct Apple capitalization by changing the value for the `OS` column from "macos" to "macOS" so the row reads "`OS` | Operating system (linux, macOS, windows)"; locate the table line that contains the `OS` symbol and replace only that token. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 31-38: The workflow treats any non-zero exit from the sync script
as "Drift detected"; change the logic around the EXIT_CODE from running
skills/odl-pdf/scripts/sync-skill-refs.py so that only the specific drift exit
code (e.g., 1) triggers the drift message and exit 1, while other non-zero codes
(e.g., parser/input failures) are treated as script errors—log a distinct error
message referencing sync-skill-refs.py and propagate the original EXIT_CODE (or
exit 2) instead; update the conditional that inspects EXIT_CODE (the block that
currently echoes "Drift detected: skill references are out of sync with
options.json." and exits 1) to branch on EXIT_CODE values accordingly.In
@skills/odl-pdf/scripts/quick-eval.py:
- Around line 70-83: The function compute_similarity_rapidfuzz currently uses
Levenshtein.distance and incorrect normalization; replace that with RapidFuzz's
Indel.normalized_distance to get the correct Normalized Indel Distance (NID) and
then return 1 - NID as the similarity. Specifically, import and call
Indel.normalized_distance(extracted, ground_truth) to obtain NID (which already
normalizes by len(a)+len(b)), handle the empty-string case (return 1.0 when both
strings are empty) and then compute similarity = max(0.0, 1.0 - nid) before
returning. Ensure the function docstring reflects the Indel API usage and remove
the old max_len division logic.
Duplicate comments:
In @.github/workflows/skill-drift-check.yml:
- Around line 9-13: The workflow's path filters only watch 'options.json' so
changes to skill files like 'skills/odl-pdf/references/options-matrix.md' or
'skills/odl-pdf/scripts/sync-skill-refs.py' don't trigger the drift check;
update the push and pull_request path patterns (the 'paths' entries in the
workflow) to include the relevant skill directories or files (for example
broaden to 'skills/**', or add the specific paths
'skills/odl-pdf/references/options-matrix.md' and
'skills/odl-pdf/scripts/sync-skill-refs.py') so edits to those files will run
the workflow.In
@skills/odl-pdf/references/format-guide.md:
- Around line 47-49: The fenced CLI examples using the opendataloader-pdf
command need language identifiers for markdown linting (MD040); change the
opening backticks for the blocks that contain "opendataloader-pdf input.pdf
--format markdown,json" and the block with "opendataloader-pdf input.pdf
--format text --to-stdout | my-indexer" so they start withbash instead of(i.e., replacewithbash for those two fenced code blocks) and keep
the contents unchanged.In
@skills/odl-pdf/references/installation-matrix.md:
- Around line 7-21: The unlabeled fenced code blocks in installation-matrix.md
(the decision-tree ASCII block and the command examples like the java --version
and opendataloader-pdf --version blocks) are triggering MD040; add explicit
language tags: change the decision-tree fence totext and change command/example fences tobash, and apply the same fixes to the other
occurrences referenced (around the other blocks at the indicated sections),
ensuring every triple-backtick fence has a language identifier.In
@skills/odl-pdf/scripts/hybrid-health.sh:
- Around line 15-25: The --url parsing must reject empty or flag-like values:
when handling the separate-arg branch (case for --url) validate that "$2" is
non-empty and does not start with '-' before assigning HYBRID_URL and shifting;
when handling the --url=* branch validate that the extracted value (from
"${1#--url=}") is non-empty before assigning HYBRID_URL and shifting; on
validation failure print an error ("--url requires a non-empty value") to stderr
and exit 1 to prevent creating invalid endpoints.In
@skills/odl-pdf/scripts/quick-eval.py:
- Around line 146-155: The loop that builds hunks appends header-only snippets
(lines like '---'/'+++') before the first '@@', wasting snippet slots; modify
the logic in the diff-processing loop that iterates "for line in diff:" so that
you skip or drop file-header lines (e.g., lines starting with '---' or '+++')
when current_hunk contains only headers and you haven't hit a hunk header '@@'
yet, only starting current_hunk when you see the first '@@' or non-header
content; ensure the final append block that checks "if current_hunk and
len(snippets) < max_snippets:" also ignores header-only current_hunk so snippets
only contain real hunks (references: variables diff, current_hunk, snippets,
max_snippets).In
@skills/odl-pdf/SKILL.md:
- Line 37: The markdown has multiple fenced code blocks missing language
identifiers (MD040) and missing surrounding blank lines (MD031); update SKILL.md
by adding appropriate language tags (use "text" for decision trees/pseudo-code
or the correct language for code examples) to the fences at the listed locations
(e.g., the fences reported at lines
~37,100,133,182,328,377,392,405,417,425,433,441 and others) and ensure there is
a blank line before and after each fenced block for the reported positions
(e.g., around the blocks at
~244,253,266,278,292,377,392,405,417,425,433,540,545,550,556,704,709) so the
file passes markdownlint checks MD040 and MD031.- Around line 13-14: Add a blank line between the two metadata headings "#
Targets: opendataloader-pdf >= 2.2.0" and "# Last synced options.json: 26
options" in SKILL.md so the second header is separated and the file no longer
violates MD022 (headings should be surrounded by blank lines).- Line 63: Update the OS table entry in SKILL.md to use the correct Apple
capitalization by changing the value for theOScolumn from "macos" to "macOS"
so the row reads "OS| Operating system (linux, macOS, windows)"; locate the
table line that contains theOSsymbol and replace only that token.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: ASSERTIVE **Plan**: Pro **Run ID**: `9350321c-5ef8-4af5-9d0f-036a033709d4` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between faf526c6b096d5da927d43fd7bab14307fb4bbad and 98c6cc75e402ecc904e8ff05818fba86a8f80ada. </details> <details> <summary>📒 Files selected for processing (9)</summary> * `.github/workflows/skill-drift-check.yml` * `skills/odl-pdf/SKILL.md` * `skills/odl-pdf/references/eval-metrics.md` * `skills/odl-pdf/references/format-guide.md` * `skills/odl-pdf/references/hybrid-guide.md` * `skills/odl-pdf/references/installation-matrix.md` * `skills/odl-pdf/scripts/detect-env.sh` * `skills/odl-pdf/scripts/hybrid-health.sh` * `skills/odl-pdf/scripts/quick-eval.py` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
Objective: quick-eval.py claims to compute Normalized Indel Distance but uses Levenshtein.distance (substitution weight 1, not Indel) with incorrect normalization (max(len) instead of sum(len)). Approach: Replace rapidfuzz.distance.Levenshtein with rapidfuzz.distance.Indel.normalized_distance which computes the correct NID metric and normalization directly. Evidence: Indel.normalized_distance uses distance / (len(a) + len(b)) per RapidFuzz docs, matching the NID definition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/odl-pdf/scripts/quick-eval.py`:
- Around line 157-166: The build_report function has a boolean positional
parameter verbose which can cause positional-boolean footguns; change the
signature of build_report so verbose (and any parameters after it) are
keyword-only by inserting a bare * before verbose (e.g., add "*" before the
verbose parameter) and keep its type (bool) and default if any; update any
callers if needed (the existing call already passes arguments as keywords so no
changes required). Ensure the function name build_report and its parameter list
are updated accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 76d0bb95-48bc-45c1-9cfc-2047d7662c86
📒 Files selected for processing (1)
skills/odl-pdf/scripts/quick-eval.py
…v.sh Objective: Hybrid extras detection only checked for docling, missing fastapi and uvicorn which are also required to run the hybrid server. Approach: Update import check to verify all three packages are present. Evidence: pyproject.toml [hybrid] extras require docling, fastapi, and uvicorn. hybrid_server.py imports all three at startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dling Objective: CI workflow had two issues: (1) changes to sync-skill-refs.py itself could silently break drift detection without triggering a check, (2) exit code 2 (JSON parse/input errors) was conflated with exit code 1 (drift detected), producing misleading "Drift detected" messages. Approach: - Add sync-skill-refs.py to trigger paths for self-validation on change - Differentiate exit 1 (drift) from exit 2+ (input/script errors) with separate messages Evidence: sync-skill-refs.py uses sys.exit(2) for JSON parse failure, missing options key, and missing name field — all distinct from drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Subscribed and looking to test asap |
Review: Strong execution — 4 questions before mergeThe technical quality is high, and the 2-round review feedback handling was mature (21 accepted, 13 rejected with reasoning). I want to confirm 4 things before merging — they're about sharpening the skill's reason to exist, not about the code itself. 1. Baseline comparison (most important — drives the other 3)Observation: The Evidence section proves 5/5 PASS with the skill. There is no result without it. Question: For eval-002 (Korean scanned PDF, M1 Mac) — does an agent with only README + Ask: Run the same 5 scenarios once without the skill (agent gets README + 2. Target user definitionObservation: SKILL.md has trigger keywords in the description but no definition of "whose problem, in what situation." Question: Should 729 lines load for a senior developer who already uses opendataloader-pdf daily and just wants to convert one more file? Ask: Add one line to the top of SKILL.md — "This skill is for ___ who need to ___." This also helps platform skill routers make better activation decisions. 3. Skill size vs. code-readable contentObservation: Question: In a source-checkout environment, what is the justification for this content existing in the skill as well? Ask: Try removing content the agent can derive from code ( 4. Maintenance surface areaObservation: The drift check covers
Question: How many files need updating when the next CLI option ships? Ask: Pick one:
Priority: Start with #1 (baseline comparison). The results will determine how much weight #2–#4 carry. If the skill provides a large delta over no-skill, the maintenance cost is justified. If the delta is small, slimming down becomes more important. |
|
Here's how the review process will go from here:
|
Objective
AI coding assistants (Claude Code, Codex, Cursor, Gemini CLI, VS Code, etc.) have no way to know how to use opendataloader-pdf optimally. Users must manually figure out which of 26 options, 7 formats, and 3 install methods to use for their specific PDF type and downstream pipeline. This is the most common friction point for new users.
Approach
Ship a built-in agent skill at
skills/odl-pdf/following the agentskills.io specification, compatible with 26+ AI platforms.The skill acts as a Document Intelligence Engineer persona with a 5-phase workflow:
detect-env.shto detect OS, Java, Python, Node.js, ODL installationKey design decisions:
options-matrix.md+ dynamicoptions.jsonlookup, so pip-install users (no source code) also get full guidanceskill-drift-check.ymlautomatically detects when CLI options change in Java butoptions-matrix.mdisn't updatedNew files
skills/odl-pdf/SKILL.mdskills/odl-pdf/references/options-matrix.mdskills/odl-pdf/references/hybrid-guide.mdskills/odl-pdf/references/format-guide.mdskills/odl-pdf/references/installation-matrix.mdskills/odl-pdf/references/eval-metrics.mdskills/odl-pdf/scripts/detect-env.shskills/odl-pdf/scripts/hybrid-health.shskills/odl-pdf/scripts/quick-eval.pyskills/odl-pdf/scripts/sync-skill-refs.pyskills/odl-pdf/evals/evals.jsonskills/README.md.claude-plugin/marketplace.json.github/workflows/skill-drift-check.ymlUpdated files
README.md— Added Agent Skills section with install command and phase tableCLAUDE.md— Added skill development notes for internal contributorsCONTRIBUTING.md— Added skill maintenance guide for external contributorsEvidence
Ran 5 eval scenarios via independent AI agents with zero prior knowledge of opendataloader-pdf. Each agent loaded only SKILL.md and references, then responded to user queries.
Drift check:
sync-skill-refs.pyconfirmed 26/26 options in sync betweenoptions.jsonandoptions-matrix.md.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores