Fix: handle invalid input paths in Java API to prevent crash#375
Fix: handle invalid input paths in Java API to prevent crash#375pavanpai769 wants to merge 4 commits intoopendataloader-project:mainfrom
Conversation
Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant OpenDataLoaderPDF
participant Filesystem
participant DocumentProcessor
Client->>OpenDataLoaderPDF: processFile(inputPdfName, config)
OpenDataLoaderPDF->>Filesystem: validateInputFile(inputPdfName)
alt validation fails
Filesystem-->>OpenDataLoaderPDF: error (invalid path / not found / not .pdf)
OpenDataLoaderPDF->>Client: ProcessingResult.failure(inputPdfName, error)
else validation succeeds
Filesystem-->>OpenDataLoaderPDF: file OK
OpenDataLoaderPDF->>DocumentProcessor: processFile(path, config)
alt processing throws IOException
DocumentProcessor-->>OpenDataLoaderPDF: IOException
OpenDataLoaderPDF->>Client: ProcessingResult.failure(inputPdfName, IOException)
else processing succeeds
DocumentProcessor-->>OpenDataLoaderPDF: success
OpenDataLoaderPDF->>Client: ProcessingResult.success(inputPdfName)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 78-79: The locale-sensitive toLowerCase call on inputPdfName
should be made locale-stable: replace
inputPdfName.toLowerCase().endsWith(".pdf") with a root-locale folding (e.g.,
inputPdfName.toLowerCase(Locale.ROOT).endsWith(".pdf")) in the check inside
class OpenDataLoaderPDF (the inputPdfName validation branch); also add the
necessary import for java.util.Locale if missing so the code compiles. Ensure
the LOGGER.log(Level.WARNING, ...) behavior remains unchanged.
- Around line 66-67: The code calls Paths.get(inputPdfName) without handling
InvalidPathException, so malformed input paths can crash batch processing;
update the method that constructs the Path (the location with
Paths.get(inputPdfName) in OpenDataLoaderPDF.java) to wrap that call in a
try/catch that catches java.nio.file.InvalidPathException, logs the invalid path
(including inputPdfName) and then skips/continues processing the rest of the
batch instead of rethrowing; make sure any existing logic that assumes a
non-null Path is guarded by this flow so invalid inputs are safely ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3ad5e3e6-68fd-4606-82eb-f0fb2c964910
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)
41-52:⚠️ Potential issue | 🟠 MajorSilent return breaks the documented exception contract.
The Javadoc states
@throws IOException If an error occurs during file reading or processing, but invalid inputs now silently return with only a log warning. Callers who depend onIOExceptionfor error handling (e.g., setting exit codes, reporting failures in batch jobs) will receive no signal that processing was skipped.Consider one of:
- Return a status: Change return type to
booleanor a result object to indicate success/failure- Throw for invalid input: Use
IllegalArgumentExceptionfor programmatic errors (null/blank) vs.IOExceptionfor file-system issues (missing file)- Update Javadoc: If silent skipping is intentional, document that invalid inputs are logged and skipped without throwing
From the relevant context,
DocumentProcessor.processFile()throwsIOExceptionon failures—swallowing these at the API layer breaks error propagation.Option 2: Throw exceptions for invalid input
public static void processFile(String inputPdfName, Config config) throws IOException { - if (!isValidInputFile(inputPdfName)) return; + validateInputFile(inputPdfName); // throws on invalid input DocumentProcessor.processFile(inputPdfName, config); } - private static boolean isValidInputFile(String inputPdfName) { + private static void validateInputFile(String inputPdfName) throws IOException { if (inputPdfName == null || inputPdfName.isBlank()) { - LOGGER.log(Level.WARNING, "Input file path is null or empty"); - return false; + throw new IllegalArgumentException("Input file path is null or empty"); } // ... similar changes for other validations, using IOException for file-system errors🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java` around lines 41 - 52, The method OpenDataLoaderPDF.processFile currently silently returns on invalid input (isValidInputFile) which violates the documented `@throws` IOException contract; change behavior to throw an IllegalArgumentException (for null/blank or clearly invalid input) instead of returning so callers get a deterministic error, keep the method signature throwing IOException for file-system failures from DocumentProcessor.processFile, and update the Javadoc to document both IllegalArgumentException for invalid input and IOException for processing errors; ensure you remove the silent return and call DocumentProcessor.processFile(inputPdfName, config) only after validation passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 41-52: The method OpenDataLoaderPDF.processFile currently silently
returns on invalid input (isValidInputFile) which violates the documented
`@throws` IOException contract; change behavior to throw an
IllegalArgumentException (for null/blank or clearly invalid input) instead of
returning so callers get a deterministic error, keep the method signature
throwing IOException for file-system failures from
DocumentProcessor.processFile, and update the Javadoc to document both
IllegalArgumentException for invalid input and IOException for processing
errors; ensure you remove the silent return and call
DocumentProcessor.processFile(inputPdfName, config) only after validation
passes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c293be9f-32c0-4a83-b6ed-9941dce2d3a8
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 49-59: CLIMain.processFile currently ignores the ProcessingResult
from OpenDataLoaderPDF.processFile (which now swallows exceptions), causing it
to always return true; change CLIMain.processFile to capture the
ProcessingResult (from OpenDataLoaderPDF.processFile), then return
result.isSuccess() (i.e., return false on failure) so hasFailure in run() can
become true and the process exit code reflects failures; update any dead
exception handling in CLIMain.processFile accordingly and ensure the test
testProcessingFailureReturnsNonZeroExitCode passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 57331170-1397-447c-b5ec-a98bcf34c4ea
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java`:
- Around line 18-20: The factory method ProcessingResult.failure(String
inputPath, Exception error) must guard against a null error and refuse to create
an invalid failure; update ProcessingResult.failure to validate error (e.g., use
Objects.requireNonNull(error) or explicitly throw an
IllegalArgumentException/NullPointerException with a clear message) before
calling the ProcessingResult constructor, ensuring callers cannot receive a
failure with a null exception and preserving invariants in the ProcessingResult
class.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 381015ce-828f-4cca-8c2e-8b803bb9a022
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java
| public static ProcessingResult failure(String inputPath, Exception error) { | ||
| return new ProcessingResult(inputPath, false, error); | ||
| } |
There was a problem hiding this comment.
Guard against null error in failure factory.
ProcessingResult.failure(inputPath, null) currently creates an invalid failure state and pushes null-handling risk to callers.
Proposed fix
package org.opendataloader.pdf.entities;
+
+import java.util.Objects;
@@
public static ProcessingResult failure(String inputPath, Exception error) {
- return new ProcessingResult(inputPath, false, error);
+ return new ProcessingResult(inputPath, false, Objects.requireNonNull(error, "error must not be null"));
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| public static ProcessingResult failure(String inputPath, Exception error) { | |
| return new ProcessingResult(inputPath, false, error); | |
| } | |
| public static ProcessingResult failure(String inputPath, Exception error) { | |
| return new ProcessingResult(inputPath, false, Objects.requireNonNull(error, "error must not be null")); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java`
around lines 18 - 20, The factory method ProcessingResult.failure(String
inputPath, Exception error) must guard against a null error and refuse to create
an invalid failure; update ProcessingResult.failure to validate error (e.g., use
Objects.requireNonNull(error) or explicitly throw an
IllegalArgumentException/NullPointerException with a clear message) before
calling the ProcessingResult constructor, ensuring callers cannot receive a
failure with a null exception and preserving invariants in the ProcessingResult
class.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java`:
- Around line 151-157: OpenDataLoaderPDF.processFile() currently only catches
IllegalArgumentException and IOException, so UncheckedIOException from parallel
work in DocumentProcessor.processFile() can escape and kill the batch loop;
update OpenDataLoaderPDF.processFile() to also catch RuntimeException (or
specifically UncheckedIOException) and convert it into a ProcessingResult
failure (include the exception in the failure result) so callers
(processPath/processDirectory/CLIMain.run) receive a normal failure instead of
an uncaught exception.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 472b8640-3b88-425c-a0d5-0f628d0a6f71
📒 Files selected for processing (2)
java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java
| try { | ||
| OpenDataLoaderPDF.processFile(file.getAbsolutePath(), config); | ||
| return true; | ||
| } catch (Exception exception) { | ||
| LOGGER.log(Level.SEVERE, "Exception during processing file " + file.getAbsolutePath() + ": " + | ||
| exception.getMessage(), exception); | ||
| return false; | ||
| ProcessingResult result = OpenDataLoaderPDF.processFile(file.getAbsolutePath(), config); | ||
| if (result.isFailure()) { | ||
| LOGGER.log(Level.WARNING, "Exception during processing file {0}: {1}", | ||
| new Object[]{file.getAbsolutePath(), result.getError().getMessage()}); | ||
| } | ||
| return result.isSuccess(); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify UncheckedIOException is thrown and not caught in the call chain
echo "=== Checking UncheckedIOException throw sites in DocumentProcessor ==="
rg -n "UncheckedIOException" --type=java -C3
echo ""
echo "=== Checking catch blocks in OpenDataLoaderPDF.processFile ==="
ast-grep --pattern $'catch ($_ | $_) {
$$$
}'Repository: opendataloader-project/opendataloader-pdf
Length of output: 2114
🏁 Script executed:
#!/bin/bash
echo "=== OpenDataLoaderPDF.java - around line 54 ==="
fd -t f "OpenDataLoaderPDF.java" --exec wc -l {} \;
fd -t f "OpenDataLoaderPDF.java" --exec sed -n '40,70p' {} \;
echo ""
echo "=== DocumentProcessor.java - around line 200 ==="
fd -t f "DocumentProcessor.java" --exec sed -n '180,210p' {} \;
echo ""
echo "=== Check for ForkJoinPool and propagateState usage ==="
rg -n "ForkJoinPool|propagateState" --type=java -B2 -A2Repository: opendataloader-project/opendataloader-pdf
Length of output: 10842
🏁 Script executed:
#!/bin/bash
echo "=== CLIMain.processFile() full method context ==="
fd -t f "CLIMain.java" --exec sed -n '140,180p' {} \;
echo ""
echo "=== Check who calls processFile and exception handling ==="
fd -t f "CLIMain.java" --exec sed -n '1,50p' {} \;Repository: opendataloader-project/opendataloader-pdf
Length of output: 2937
🏁 Script executed:
#!/bin/bash
echo "=== CLIMain.run() and batch processing loop ==="
fd -t f "CLIMain.java" --exec wc -l {} \;
fd -t f "CLIMain.java" --exec sed -n '50,150p' {} \;Repository: opendataloader-project/opendataloader-pdf
Length of output: 3625
Catch RuntimeException in processFile() to prevent uncaught UncheckedIOException from crashing the batch loop.
The fix is incomplete. UncheckedIOException from parallel processing in DocumentProcessor.processFile() (line 200) is not caught by OpenDataLoaderPDF.processFile() (which only catches IllegalArgumentException | IOException). The exception propagates uncaught through processFile(), processPath(), processDirectory(), and the batch loop in run(), terminating batch processing and preventing remaining files from being processed.
Proposed fix: catch RuntimeException to handle UncheckedIOException
private static boolean processFile(File file, Config config) {
if (!isPdfFile(file)) {
LOGGER.log(Level.FINE, "Skipping non-PDF file " + file.getAbsolutePath());
return true;
}
try {
ProcessingResult result = OpenDataLoaderPDF.processFile(file.getAbsolutePath(), config);
if (result.isFailure()) {
LOGGER.log(Level.WARNING, "Exception during processing file {0}: {1}",
new Object[]{file.getAbsolutePath(), result.getError().getMessage()});
}
return result.isSuccess();
+ } catch (RuntimeException e) {
+ // Catch unchecked exceptions (e.g., UncheckedIOException from parallel processing)
+ LOGGER.log(Level.WARNING, "Exception during processing file {0}: {1}",
+ new Object[]{file.getAbsolutePath(), e.getMessage()});
+ return false;
} finally {
StaticLayoutContainers.closeContrastRatioConsumer();
}
}Alternatively, fix the root cause by catching UncheckedIOException in OpenDataLoaderPDF.processFile() and returning a failure result.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java`
around lines 151 - 157, OpenDataLoaderPDF.processFile() currently only catches
IllegalArgumentException and IOException, so UncheckedIOException from parallel
work in DocumentProcessor.processFile() can escape and kill the batch loop;
update OpenDataLoaderPDF.processFile() to also catch RuntimeException (or
specifically UncheckedIOException) and convert it into a ProcessingResult
failure (include the exception in the failure result) so callers
(processPath/processDirectory/CLIMain.run) receive a normal failure instead of
an uncaught exception.
Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
1d19a95 to
4e23f14
Compare
bundolee
left a comment
There was a problem hiding this comment.
Thanks for identifying this gap between CLI and Java API behavior — it's a real usability issue for batch processing.
However, the current approach silently swallows all errors (including real IOException from corrupt PDFs, disk-full, etc.) by catching and logging them. This removes the caller's ability to detect failures, which is a breaking change for existing users who rely on the throws IOException contract.
A library API should report errors — the caller decides how to handle them. The documented batch loop example can simply wrap processFile in a try-catch:
for (String pdf : files) {
try {
OpenDataLoaderPDF.processFile(pdf, config);
} catch (IOException e) {
logger.warning("Skipping " + pdf + ": " + e.getMessage());
}
}If you'd like to add input validation (null/blank/invalid path), that's reasonable — but it should throw IllegalArgumentException so the caller can catch it, not silently swallow it. The throws IOException on the signature should stay.
Also a minor typo: "File not fount at" → "File not found at".
Would you be open to reworking this so validation throws rather than swallows?
|
Closing this PR as there's been no response to the review feedback for 2+ weeks and CI is still failing. The underlying problem is valid — I've opened #430 to track it. The key design constraint: the API should throw, not swallow errors. See the review thread above for details. @pavanpai769 Thanks for identifying this issue! If you'd like to take another shot based on the feedback, feel free to reference #430. |
Problem
When using the Java API (
OpenDataLoaderPDF.processFile) in a loop (as shown in the documentation https://opendataloader.org/docs/quick-start-java ), providing an invalid file path causes an exception and stops the entire processing.Reproduction Example
The issue can be reproduced using the documented example:
❌ Current Behavior
If one of the input paths is invalid or does not exist, the program throws an exception and stops execution, preventing other valid files from being processed.
✅ Fix
OpenDataLoaderPDF.processFile📈 Impact
🧪 Testing
Tested manually with:
Unit tests can be added if required.
Summary by CodeRabbit
Bug Fixes
Changes