Skip to content

Fix: handle invalid input paths in Java API to prevent crash#375

Closed
pavanpai769 wants to merge 4 commits intoopendataloader-project:mainfrom
pavanpai769:fix/handle-invalid-input-paths
Closed

Fix: handle invalid input paths in Java API to prevent crash#375
pavanpai769 wants to merge 4 commits intoopendataloader-project:mainfrom
pavanpai769:fix/handle-invalid-input-paths

Conversation

@pavanpai769
Copy link
Copy Markdown

@pavanpai769 pavanpai769 commented Mar 31, 2026

Problem

When using the Java API (OpenDataLoaderPDF.processFile) in a loop (as shown in the documentation https://opendataloader.org/docs/quick-start-java ), providing an invalid file path causes an exception and stops the entire processing.

Reproduction Example

The issue can be reproduced using the documented example:

import org.opendataloader.pdf.api.Config;
import org.opendataloader.pdf.api.OpenDataLoaderPDF;

public class Sample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.setOutputFolder("path/to/output");
        config.setGeneratePDF(true);
        config.setGenerateMarkdown(true);
        config.setGenerateHtml(true);

        try {
            // Process multiple files in one JVM invocation
            for (String pdf : new String[]{"valid.pdf", "invalid.pdf"}) {
                OpenDataLoaderPDF.processFile(pdf, config);
            }
        } finally {
            OpenDataLoaderPDF.shutdown();
        }
    }
}

❌ Current Behavior

If one of the input paths is invalid or does not exist, the program throws an exception and stops execution, preventing other valid files from being processed.

⚠️ Inconsistency

  • CLI handles invalid paths gracefully
  • Java API does not
  • Documentation suggests batch processing should work reliably

✅ Fix

  • Added input file validation in OpenDataLoaderPDF.processFile
  • If a path is invalid:
    • Log a warning
    • Skip processing that file
  • Continue processing remaining valid inputs

📈 Impact

  • Prevents crashes during batch processing
  • Improves robustness of the Java API
  • Aligns behavior with CLI and documented usage

🧪 Testing

Tested manually with:

  • Valid paths only
  • Mix of valid and invalid paths
  • All invalid paths

Unit tests can be added if required.

Summary by CodeRabbit

  • Bug Fixes

    • PDF inputs are validated before processing: null/blank, invalid paths, missing or non-regular files, and non-.pdf names are detected; invalid inputs are skipped and validation failures are logged.
  • Changes

    • Processing now returns a result object indicating success or failure instead of throwing exceptions; failures are logged and reported in the returned result.

Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 31, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

processFile in OpenDataLoaderPDF now validates the input PDF path, logs warnings on invalid input or IO/argument failures, and returns a ProcessingResult instead of throwing IOException. On valid input it delegates to DocumentProcessor.processFile(...) and returns a success result; on errors it returns a failure result.

Changes

Cohort / File(s) Summary
API: Input validation & error handling
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java
Changed processFile(String, Config) to return ProcessingResult (removed throws IOException). Added validateInputFile(...) checking null/blank, InvalidPathExceptionIllegalArgumentException, existence, regular file, and case-insensitive .pdf extension. Added LOGGER; processFile catches IllegalArgumentException and IOException, logs warnings, and returns ProcessingResult.failure(...) on errors; calls DocumentProcessor.processFile(...) and returns ProcessingResult.success(...) on success.
New entity: result wrapper
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java
Added ProcessingResult class with private constructor and static factories success(inputPath) and failure(inputPath, error). Exposes getInputPath(), isSuccess(), isFailure(), and getError().
CLI: result-driven flow
java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java
Updated CLIMain.processFile to consume ProcessingResult from OpenDataLoaderPDF.processFile(...). Logs a warning and returns failure when result.isFailure() (uses result.getError().getMessage()); removed previous catch-all exception handling and unconditional success path.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant OpenDataLoaderPDF
  participant Filesystem
  participant DocumentProcessor

  Client->>OpenDataLoaderPDF: processFile(inputPdfName, config)
  OpenDataLoaderPDF->>Filesystem: validateInputFile(inputPdfName)
  alt validation fails
    Filesystem-->>OpenDataLoaderPDF: error (invalid path / not found / not .pdf)
    OpenDataLoaderPDF->>Client: ProcessingResult.failure(inputPdfName, error)
  else validation succeeds
    Filesystem-->>OpenDataLoaderPDF: file OK
    OpenDataLoaderPDF->>DocumentProcessor: processFile(path, config)
    alt processing throws IOException
      DocumentProcessor-->>OpenDataLoaderPDF: IOException
      OpenDataLoaderPDF->>Client: ProcessingResult.failure(inputPdfName, IOException)
    else processing succeeds
      DocumentProcessor-->>OpenDataLoaderPDF: success
      OpenDataLoaderPDF->>Client: ProcessingResult.success(inputPdfName)
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding input path validation to prevent crashes in the Java API when processing invalid paths.
Docstring Coverage ✅ Passed Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 78-79: The locale-sensitive toLowerCase call on inputPdfName
should be made locale-stable: replace
inputPdfName.toLowerCase().endsWith(".pdf") with a root-locale folding (e.g.,
inputPdfName.toLowerCase(Locale.ROOT).endsWith(".pdf")) in the check inside
class OpenDataLoaderPDF (the inputPdfName validation branch); also add the
necessary import for java.util.Locale if missing so the code compiles. Ensure
the LOGGER.log(Level.WARNING, ...) behavior remains unchanged.
- Around line 66-67: The code calls Paths.get(inputPdfName) without handling
InvalidPathException, so malformed input paths can crash batch processing;
update the method that constructs the Path (the location with
Paths.get(inputPdfName) in OpenDataLoaderPDF.java) to wrap that call in a
try/catch that catches java.nio.file.InvalidPathException, logs the invalid path
(including inputPdfName) and then skips/continues processing the rest of the
batch instead of rethrowing; make sure any existing logic that assumes a
non-null Path is guarded by this flow so invalid inputs are safely ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3ad5e3e6-68fd-4606-82eb-f0fb2c964910

📥 Commits

Reviewing files that changed from the base of the PR and between 29da86d and faaddc8.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java (1)

41-52: ⚠️ Potential issue | 🟠 Major

Silent return breaks the documented exception contract.

The Javadoc states @throws IOException If an error occurs during file reading or processing, but invalid inputs now silently return with only a log warning. Callers who depend on IOException for error handling (e.g., setting exit codes, reporting failures in batch jobs) will receive no signal that processing was skipped.

Consider one of:

  1. Return a status: Change return type to boolean or a result object to indicate success/failure
  2. Throw for invalid input: Use IllegalArgumentException for programmatic errors (null/blank) vs. IOException for file-system issues (missing file)
  3. Update Javadoc: If silent skipping is intentional, document that invalid inputs are logged and skipped without throwing

From the relevant context, DocumentProcessor.processFile() throws IOException on failures—swallowing these at the API layer breaks error propagation.

Option 2: Throw exceptions for invalid input
     public static void processFile(String inputPdfName, Config config) throws IOException {
-        if (!isValidInputFile(inputPdfName)) return;
+        validateInputFile(inputPdfName);  // throws on invalid input
 
         DocumentProcessor.processFile(inputPdfName, config);
     }
 
-    private static boolean isValidInputFile(String inputPdfName) {
+    private static void validateInputFile(String inputPdfName) throws IOException {
 
         if (inputPdfName == null || inputPdfName.isBlank()) {
-            LOGGER.log(Level.WARNING, "Input file path is null or empty");
-            return false;
+            throw new IllegalArgumentException("Input file path is null or empty");
         }
         // ... similar changes for other validations, using IOException for file-system errors
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`
around lines 41 - 52, The method OpenDataLoaderPDF.processFile currently
silently returns on invalid input (isValidInputFile) which violates the
documented `@throws` IOException contract; change behavior to throw an
IllegalArgumentException (for null/blank or clearly invalid input) instead of
returning so callers get a deterministic error, keep the method signature
throwing IOException for file-system failures from
DocumentProcessor.processFile, and update the Javadoc to document both
IllegalArgumentException for invalid input and IOException for processing
errors; ensure you remove the silent return and call
DocumentProcessor.processFile(inputPdfName, config) only after validation
passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 41-52: The method OpenDataLoaderPDF.processFile currently silently
returns on invalid input (isValidInputFile) which violates the documented
`@throws` IOException contract; change behavior to throw an
IllegalArgumentException (for null/blank or clearly invalid input) instead of
returning so callers get a deterministic error, keep the method signature
throwing IOException for file-system failures from
DocumentProcessor.processFile, and update the Javadoc to document both
IllegalArgumentException for invalid input and IOException for processing
errors; ensure you remove the silent return and call
DocumentProcessor.processFile(inputPdfName, config) only after validation
passes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c293be9f-32c0-4a83-b6ed-9941dce2d3a8

📥 Commits

Reviewing files that changed from the base of the PR and between faaddc8 and cead1a8.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java`:
- Around line 49-59: CLIMain.processFile currently ignores the ProcessingResult
from OpenDataLoaderPDF.processFile (which now swallows exceptions), causing it
to always return true; change CLIMain.processFile to capture the
ProcessingResult (from OpenDataLoaderPDF.processFile), then return
result.isSuccess() (i.e., return false on failure) so hasFailure in run() can
become true and the process exit code reflects failures; update any dead
exception handling in CLIMain.processFile accordingly and ensure the test
testProcessingFailureReturnsNonZeroExitCode passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57331170-1397-447c-b5ec-a98bcf34c4ea

📥 Commits

Reviewing files that changed from the base of the PR and between cead1a8 and 91f498c.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java`:
- Around line 18-20: The factory method ProcessingResult.failure(String
inputPath, Exception error) must guard against a null error and refuse to create
an invalid failure; update ProcessingResult.failure to validate error (e.g., use
Objects.requireNonNull(error) or explicitly throw an
IllegalArgumentException/NullPointerException with a clear message) before
calling the ProcessingResult constructor, ensuring callers cannot receive a
failure with a null exception and preserving invariants in the ProcessingResult
class.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 381015ce-828f-4cca-8c2e-8b803bb9a022

📥 Commits

Reviewing files that changed from the base of the PR and between 91f498c and e65ff26.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java

Comment on lines +18 to +20
public static ProcessingResult failure(String inputPath, Exception error) {
return new ProcessingResult(inputPath, false, error);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard against null error in failure factory.

ProcessingResult.failure(inputPath, null) currently creates an invalid failure state and pushes null-handling risk to callers.

Proposed fix
 package org.opendataloader.pdf.entities;
+
+import java.util.Objects;
@@
     public static ProcessingResult failure(String inputPath, Exception error) {
-        return new ProcessingResult(inputPath, false, error);
+        return new ProcessingResult(inputPath, false, Objects.requireNonNull(error, "error must not be null"));
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
public static ProcessingResult failure(String inputPath, Exception error) {
return new ProcessingResult(inputPath, false, error);
}
public static ProcessingResult failure(String inputPath, Exception error) {
return new ProcessingResult(inputPath, false, Objects.requireNonNull(error, "error must not be null"));
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java`
around lines 18 - 20, The factory method ProcessingResult.failure(String
inputPath, Exception error) must guard against a null error and refuse to create
an invalid failure; update ProcessingResult.failure to validate error (e.g., use
Objects.requireNonNull(error) or explicitly throw an
IllegalArgumentException/NullPointerException with a clear message) before
calling the ProcessingResult constructor, ensuring callers cannot receive a
failure with a null exception and preserving invariants in the ProcessingResult
class.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java`:
- Around line 151-157: OpenDataLoaderPDF.processFile() currently only catches
IllegalArgumentException and IOException, so UncheckedIOException from parallel
work in DocumentProcessor.processFile() can escape and kill the batch loop;
update OpenDataLoaderPDF.processFile() to also catch RuntimeException (or
specifically UncheckedIOException) and convert it into a ProcessingResult
failure (include the exception in the failure result) so callers
(processPath/processDirectory/CLIMain.run) receive a normal failure instead of
an uncaught exception.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 472b8640-3b88-425c-a0d5-0f628d0a6f71

📥 Commits

Reviewing files that changed from the base of the PR and between e65ff26 and 1d19a95.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/entities/ProcessingResult.java

Comment on lines +151 to +157
try {
OpenDataLoaderPDF.processFile(file.getAbsolutePath(), config);
return true;
} catch (Exception exception) {
LOGGER.log(Level.SEVERE, "Exception during processing file " + file.getAbsolutePath() + ": " +
exception.getMessage(), exception);
return false;
ProcessingResult result = OpenDataLoaderPDF.processFile(file.getAbsolutePath(), config);
if (result.isFailure()) {
LOGGER.log(Level.WARNING, "Exception during processing file {0}: {1}",
new Object[]{file.getAbsolutePath(), result.getError().getMessage()});
}
return result.isSuccess();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify UncheckedIOException is thrown and not caught in the call chain

echo "=== Checking UncheckedIOException throw sites in DocumentProcessor ==="
rg -n "UncheckedIOException" --type=java -C3

echo ""
echo "=== Checking catch blocks in OpenDataLoaderPDF.processFile ==="
ast-grep --pattern $'catch ($_ | $_) {
  $$$
}'

Repository: opendataloader-project/opendataloader-pdf

Length of output: 2114


🏁 Script executed:

#!/bin/bash

echo "=== OpenDataLoaderPDF.java - around line 54 ==="
fd -t f "OpenDataLoaderPDF.java" --exec wc -l {} \;
fd -t f "OpenDataLoaderPDF.java" --exec sed -n '40,70p' {} \;

echo ""
echo "=== DocumentProcessor.java - around line 200 ==="
fd -t f "DocumentProcessor.java" --exec sed -n '180,210p' {} \;

echo ""
echo "=== Check for ForkJoinPool and propagateState usage ==="
rg -n "ForkJoinPool|propagateState" --type=java -B2 -A2

Repository: opendataloader-project/opendataloader-pdf

Length of output: 10842


🏁 Script executed:

#!/bin/bash

echo "=== CLIMain.processFile() full method context ==="
fd -t f "CLIMain.java" --exec sed -n '140,180p' {} \;

echo ""
echo "=== Check who calls processFile and exception handling ==="
fd -t f "CLIMain.java" --exec sed -n '1,50p' {} \;

Repository: opendataloader-project/opendataloader-pdf

Length of output: 2937


🏁 Script executed:

#!/bin/bash

echo "=== CLIMain.run() and batch processing loop ==="
fd -t f "CLIMain.java" --exec wc -l {} \;
fd -t f "CLIMain.java" --exec sed -n '50,150p' {} \;

Repository: opendataloader-project/opendataloader-pdf

Length of output: 3625


Catch RuntimeException in processFile() to prevent uncaught UncheckedIOException from crashing the batch loop.

The fix is incomplete. UncheckedIOException from parallel processing in DocumentProcessor.processFile() (line 200) is not caught by OpenDataLoaderPDF.processFile() (which only catches IllegalArgumentException | IOException). The exception propagates uncaught through processFile(), processPath(), processDirectory(), and the batch loop in run(), terminating batch processing and preventing remaining files from being processed.

Proposed fix: catch RuntimeException to handle UncheckedIOException
     private static boolean processFile(File file, Config config) {
         if (!isPdfFile(file)) {
             LOGGER.log(Level.FINE, "Skipping non-PDF file " + file.getAbsolutePath());
             return true;
         }
         try {
             ProcessingResult result = OpenDataLoaderPDF.processFile(file.getAbsolutePath(), config);
             if (result.isFailure()) {
                 LOGGER.log(Level.WARNING, "Exception during processing file {0}: {1}",
                     new Object[]{file.getAbsolutePath(), result.getError().getMessage()});
             }
             return result.isSuccess();
+        } catch (RuntimeException e) {
+            // Catch unchecked exceptions (e.g., UncheckedIOException from parallel processing)
+            LOGGER.log(Level.WARNING, "Exception during processing file {0}: {1}",
+                new Object[]{file.getAbsolutePath(), e.getMessage()});
+            return false;
         } finally {
             StaticLayoutContainers.closeContrastRatioConsumer();
         }
     }

Alternatively, fix the root cause by catching UncheckedIOException in OpenDataLoaderPDF.processFile() and returning a failure result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java`
around lines 151 - 157, OpenDataLoaderPDF.processFile() currently only catches
IllegalArgumentException and IOException, so UncheckedIOException from parallel
work in DocumentProcessor.processFile() can escape and kill the batch loop;
update OpenDataLoaderPDF.processFile() to also catch RuntimeException (or
specifically UncheckedIOException) and convert it into a ProcessingResult
failure (include the exception in the failure result) so callers
(processPath/processDirectory/CLIMain.run) receive a normal failure instead of
an uncaught exception.

Signed-off-by: pavanpai769 <151814231+pavanpai769@users.noreply.github.com>
@pavanpai769 pavanpai769 force-pushed the fix/handle-invalid-input-paths branch from 1d19a95 to 4e23f14 Compare April 1, 2026 04:58
Copy link
Copy Markdown
Contributor

@bundolee bundolee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for identifying this gap between CLI and Java API behavior — it's a real usability issue for batch processing.

However, the current approach silently swallows all errors (including real IOException from corrupt PDFs, disk-full, etc.) by catching and logging them. This removes the caller's ability to detect failures, which is a breaking change for existing users who rely on the throws IOException contract.

A library API should report errors — the caller decides how to handle them. The documented batch loop example can simply wrap processFile in a try-catch:

for (String pdf : files) {
    try {
        OpenDataLoaderPDF.processFile(pdf, config);
    } catch (IOException e) {
        logger.warning("Skipping " + pdf + ": " + e.getMessage());
    }
}

If you'd like to add input validation (null/blank/invalid path), that's reasonable — but it should throw IllegalArgumentException so the caller can catch it, not silently swallow it. The throws IOException on the signature should stay.

Also a minor typo: "File not fount at""File not found at".

Would you be open to reworking this so validation throws rather than swallows?

@bundolee
Copy link
Copy Markdown
Contributor

Closing this PR as there's been no response to the review feedback for 2+ weeks and CI is still failing.

The underlying problem is valid — I've opened #430 to track it. The key design constraint: the API should throw, not swallow errors. See the review thread above for details.

@pavanpai769 Thanks for identifying this issue! If you'd like to take another shot based on the feedback, feel free to reference #430.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants