Skip to content

[BugFix] Fix NOT LIKE includes null/missing field rows (#5169)#5338

Draft
songkant-aws wants to merge 6 commits intoopensearch-project:mainfrom
songkant-aws:fix/not-like-null-5169
Draft

[BugFix] Fix NOT LIKE includes null/missing field rows (#5169)#5338
songkant-aws wants to merge 6 commits intoopensearch-project:mainfrom
songkant-aws:fix/not-like-null-5169

Conversation

@songkant-aws
Copy link
Copy Markdown
Collaborator

Description

SimpleQueryExpression.not() in PredicateAnalyzer.java generated boolQuery().mustNot(builder()) without an existsQuery filter. This caused OpenSearch's must_not to match documents where the field doesn't exist (null/missing), leaking null rows into NOT LIKE results.

Fixed by adding .must(existsQuery(getFieldReference())) alongside the mustNot, consistent with how notLike() and notEquals() are already correctly implemented in the same class.

Related Issues

Resolves #5169

Check List

  • New functionality includes testing
  • Commits signed per DCO (-s)
  • spotlessCheck passed
  • Unit tests passed
  • Integration tests passed

@songkant-aws
Copy link
Copy Markdown
Collaborator Author

Decision Log

Root Cause: SimpleQueryExpression.not() in PredicateAnalyzer.java (line ~1300) generated boolQuery().mustNot(builder()) without an existsQuery filter. OpenSearch's must_not clause alone matches documents where the field doesn't exist, causing null/missing rows to leak through in any negated predicate that routes through not() (e.g., NOT ... LIKE, NOT field > N).

Approach: Added .must(existsQuery(getFieldReference())) to the not() method, matching the existing pattern used by notLike() (line ~1346) and notEquals() (line ~1368) in the same class. This ensures all generic negations exclude null/missing documents.

Alternatives Rejected:

  • Modifying the prefix() method to inject exists logic before calling not() — rejected because it would duplicate the pattern and not() is the correct place for this concern.
  • Only fixing for LIKE specifically — rejected because not() is generic and affects all negations routed through it (e.g., NOT field > N).

Pitfalls:

  • The getFieldReference() returns the base field name (e.g., b), not the keyword subfield (e.g., b.keyword). This is correct because OpenSearch's exists query works with base field names.
  • NOT(field = value) is not affected because it uses notEquals() directly rather than routing through not().

Things to Watch:

  • Verify that the not() path doesn't cause double-exists checks when combined with other operators that already add exists filters.
  • The notEquals() method at line ~1379 has a comment saying "NOT LIKE should return false when field is NULL" but it's in the notEquals method — this is a copy-paste comment issue, not a bug.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 13, 2026

PR Reviewer Guide 🔍

(Review updated until commit da780a9)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Core fix: add existsQuery filter to NOT predicate in PredicateAnalyzer

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/request/PredicateAnalyzerTest.java
  • integ-test/src/test/resources/expectedOutput/calcite/clickbench/q23.yaml

Sub-PR theme: Integration tests for NOT LIKE null/missing field fix

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotLikeNullIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5169.yml

⚡ Recommended focus areas for review

Incomplete Coverage

The isNullIntolerantPredicate method only handles RexCall nodes. If the inner operand is not a RexCall (e.g., a RexInputRef for a boolean field or other node types), the method returns false and falls through to expr.not() without the exists filter. This may silently miss some null-intolerant cases that are not represented as RexCall.

private static boolean isNullIntolerantPredicate(RexNode node) {
  if (!(node instanceof RexCall innerCall)) {
    return false;
  }
  return switch (innerCall.getKind()) {
    case LIKE,
        EQUALS,
        NOT_EQUALS,
        GREATER_THAN,
        GREATER_THAN_OR_EQUAL,
        LESS_THAN,
        LESS_THAN_OR_EQUAL,
        BETWEEN,
        SEARCH ->
        true;
    default -> false;
  };
}
Field Reference Assumption

notWithExistsFilter() calls getFieldReference() which presumably returns the field from the SimpleQueryExpression. For compound predicates (e.g., BETWEEN or SEARCH) that may involve multiple fields or a different field reference than expected, this could produce an incorrect existsQuery. It should be validated that getFieldReference() always returns the correct field for all supported isNullIntolerantPredicate kinds.

QueryExpression notWithExistsFilter() {
  builder = boolQuery().must(existsQuery(getFieldReference())).mustNot(builder());
  return this;
}
Missing Field Test

The integration test covers null field values (document 5 has keyword_field: null) but does not include a document where keyword_field is entirely absent/missing from the document. The bug description mentions both null and missing fields — a missing-field document should also be tested to ensure the fix covers both cases.

Request bulkRequest = new Request("POST", "/" + TEST_INDEX + "/_bulk?refresh=true");
bulkRequest.setJsonEntity(
    "{\"index\":{\"_id\":\"1\"}}\n"
        + "{\"keyword_field\": \"hello\", \"int_field\": 1}\n"
        + "{\"index\":{\"_id\":\"2\"}}\n"
        + "{\"keyword_field\": \"world\", \"int_field\": 2}\n"
        + "{\"index\":{\"_id\":\"3\"}}\n"
        + "{\"keyword_field\": \"\", \"int_field\": 3}\n"
        + "{\"index\":{\"_id\":\"4\"}}\n"
        + "{\"keyword_field\": \"special chars...\", \"int_field\": 4}\n"
        + "{\"index\":{\"_id\":\"5\"}}\n"
        + "{\"keyword_field\": null, \"int_field\": null}\n");
client().performRequest(bulkRequest);

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 13, 2026

PR Code Suggestions ✨

Latest suggestions up to da780a9

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Handle unexpected expression type for null-intolerant predicates

If isNullIntolerantPredicate returns true but expr is not a SimpleQueryExpression,
the code silently falls through to expr.not() without the exists filter, meaning
null/missing rows would still be included. Consider throwing an exception or logging
a warning when the predicate is null-intolerant but the expression type is
unexpected, to avoid silent incorrect behavior.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [594-597]

-if (isNullIntolerantPredicate(innerOperand) && expr instanceof SimpleQueryExpression sqe) {
-  return sqe.notWithExistsFilter();
+if (isNullIntolerantPredicate(innerOperand)) {
+  if (expr instanceof SimpleQueryExpression sqe) {
+    return sqe.notWithExistsFilter();
+  }
+  // Log or handle unexpected expression type for null-intolerant predicate
+  throw new PredicateAnalyzerException(
+      format(Locale.ROOT, "Expected SimpleQueryExpression for null-intolerant predicate, got: %s", expr.getClass()));
 }
 return expr.not();
Suggestion importance[1-10]: 4

__

Why: The suggestion identifies a valid edge case where isNullIntolerantPredicate returns true but expr is not a SimpleQueryExpression, causing silent fallthrough without the exists filter. However, it's unclear how common this scenario is in practice, and throwing an exception could break existing functionality. The impact is moderate.

Low
General
Add teardown to clean up test index

The createTestIndex method deletes and recreates the index in init(), but if the
test class is run multiple times or in parallel, there could be race conditions.
More importantly, the index is never cleaned up after tests complete — consider
adding a @AfterClass or tearDown method to delete the test index, preventing index
pollution across test runs.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotLikeNullIT.java [33-40]

 private void createTestIndex() throws IOException {
   try {
     Request deleteIndex = new Request("DELETE", "/" + TEST_INDEX);
     client().performRequest(deleteIndex);
   } catch (ResponseException e) {
     // Index doesn't exist, which is fine
   }
+  // ... rest of method
+}
 
+@Override
+public void tearDown() throws Exception {
+  try {
+    Request deleteIndex = new Request("DELETE", "/" + TEST_INDEX);
+    client().performRequest(deleteIndex);
+  } catch (ResponseException e) {
+    // Ignore
+  }
+  super.tearDown();
+}
+
Suggestion importance[1-10]: 3

__

Why: Adding a tearDown method to clean up the test index is a good practice to prevent test pollution, but the init() method already deletes the index before recreating it, mitigating the risk. The improvement is minor and primarily affects test hygiene.

Low
Verify exists filter uses correct base field reference

The notWithExistsFilter method calls getFieldReference() to get the field for the
exists query, but for LIKE queries on keyword fields, the actual query targets a
.keyword sub-field (e.g., b.keyword) while the exists filter should target the
parent field (b). Verify that getFieldReference() returns the correct parent field
name and not the analyzed sub-field, to avoid the exists filter referencing a
non-existent mapping path.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [1341-1344]

 /** Negate with an exists filter to exclude null/missing documents. */
 QueryExpression notWithExistsFilter() {
+  // getFieldReference() must return the base field name (not a sub-field like .keyword)
+  // so the exists query correctly checks for null/missing on the parent field.
   builder = boolQuery().must(existsQuery(getFieldReference())).mustNot(builder());
   return this;
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion asks to verify that getFieldReference() returns the base field name, but the test output in the PR already shows "field": "b" (not "b.keyword") in the exists query, confirming the behavior is correct. The improved_code is essentially identical to the existing_code with only a comment added.

Low

Previous suggestions

Suggestions up to commit dac4c42
CategorySuggestion                                                                                                                                    Impact
Possible issue
Capture original builder before reassignment

The notWithExistsFilter method calls builder() after reassigning builder, which may
return the new builder rather than the original one. The original builder should be
captured before reassignment to ensure mustNot wraps the original query, not the new
bool query being constructed.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [1334-1337]

 /** Negate with an exists filter to exclude null/missing documents. */
 QueryExpression notWithExistsFilter() {
-  builder = boolQuery().must(existsQuery(getFieldReference())).mustNot(builder());
+  QueryBuilder original = builder();
+  builder = boolQuery().must(existsQuery(getFieldReference())).mustNot(original);
   return this;
 }
Suggestion importance[1-10]: 7

__

Why: This is a valid concern — if builder() returns the current builder field, calling it after reassignment could return the new boolQuery() instead of the original query. Capturing the original builder() result before reassignment ensures mustNot correctly wraps the original query expression.

Medium
Add trailing newline to bulk request body

The bulk request body is missing a trailing newline after the last document.
OpenSearch bulk API requires each action and document to end with a newline
character, and the last line must also end with \n. Without it, the last document
may be silently ignored or cause a parse error.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotLikeNullIT.java [65-66]

 "{\"index\":{\"_id\":\"5\"}}\n"
-        + "{\"keyword_field\": null, \"int_field\": null}\n");
+        + "{\"keyword_field\": null, \"int_field\": null}\n"
+        + "\n");
 client().performRequest(bulkRequest);
Suggestion importance[1-10]: 2

__

Why: The bulk request already ends with \n after the last document ("{"keyword_field": null, "int_field": null}\n"), so the trailing newline is already present. The suggestion's improved_code adds an extra blank line which is unnecessary and the existing code is already correct.

Low
General
Handle non-SimpleQueryExpression case for null-intolerant predicates

When isNullIntolerantPredicate returns true but expr is not a SimpleQueryExpression
(e.g., it's a CompoundQueryExpression), the code silently falls through to
expr.not() without the exists filter, potentially reintroducing the null-inclusion
bug for compound predicates. Consider either throwing an exception or handling the
non-SimpleQueryExpression case explicitly.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [594-597]

-if (isNullIntolerantPredicate(innerOperand) && expr instanceof SimpleQueryExpression sqe) {
-    return sqe.notWithExistsFilter();
+if (isNullIntolerantPredicate(innerOperand)) {
+    if (expr instanceof SimpleQueryExpression sqe) {
+      return sqe.notWithExistsFilter();
+    }
+    // For compound expressions, fall back to standard not() — exists filter not applicable
   }
   return expr.not();
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies a potential gap where a null-intolerant predicate that resolves to a non-SimpleQueryExpression would silently fall through to expr.not() without the exists filter. However, the improved_code is functionally equivalent to the existing code (both fall through to expr.not()), so it only adds clarity via a comment rather than fixing the issue.

Low
Suggestions up to commit 89e1bc5
CategorySuggestion                                                                                                                                    Impact
General
Add missing-field document to test coverage

The bulk request body is missing a trailing newline after the last document.
OpenSearch bulk API requires each action and source to end with a newline character
(\n), and the last line also needs a terminating newline. The current code ends with
\n after the last document, which appears correct, but there is no document with a
missing keyword_field (as opposed to explicitly null). Since the issue is about
null/missing fields, consider adding a document where keyword_field is absent
entirely to cover the "missing" case.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotLikeNullIT.java [64-66]

 "{\"index\":{\"_id\":\"5\"}}\n"
-    + "{\"keyword_field\": null, \"int_field\": null}\n");
+    + "{\"keyword_field\": null, \"int_field\": null}\n"
+    + "{\"index\":{\"_id\":\"6\"}}\n"
+    + "{\"int_field\": 6}\n");
 client().performRequest(bulkRequest);
Suggestion importance[1-10]: 4

__

Why: Adding a document with a truly missing keyword_field (as opposed to explicitly null) would improve test coverage for the "missing" case. However, this would also require updating the expected row counts in the test assertions, and the current tests already cover the null case which is the primary concern of the fix.

Low
Handle null field expression gracefully

The exists query is being added based on the field from SimpleQueryExpression, but
NOT_EQUALS (<>) already semantically excludes nulls in OpenSearch (a must_not term
query does not match missing fields, but the existing code for <> may already add an
exists filter). More critically, SEARCH kind (used for IN expressions) may produce a
CompoundQueryExpression rather than a SimpleQueryExpression, causing the exists
filter to be silently skipped. Consider also handling CompoundQueryExpression to
extract the field and add the exists filter, or document that SEARCH is
intentionally excluded.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [595-600]

-if (isNullIntolerantPredicate(innerOperand)
-    && negated instanceof SimpleQueryExpression sqe
-    && sqe.getFieldExpression() != null) {
-  return CompoundQueryExpression.and(
-      false, negated, QueryExpression.create(sqe.getFieldExpression()).exists());
+if (isNullIntolerantPredicate(innerOperand)) {
+  NamedFieldExpression fieldExpr = null;
+  if (negated instanceof SimpleQueryExpression sqe && sqe.getFieldExpression() != null) {
+    fieldExpr = sqe.getFieldExpression();
+  }
+  if (fieldExpr != null) {
+    return CompoundQueryExpression.and(
+        false, negated, QueryExpression.create(fieldExpr).exists());
+  }
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion raises a valid concern about SEARCH kind potentially producing a CompoundQueryExpression rather than a SimpleQueryExpression, causing the exists filter to be silently skipped. However, the improved_code is functionally equivalent to the existing code for the SimpleQueryExpression case and doesn't actually add handling for CompoundQueryExpression, making the improvement marginal.

Low
Suggestions up to commit 161375c
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null field reference in negation

The not() method unconditionally adds an exists filter, but for compound expressions
(e.g., NOT (a > 1 AND b < 5)), getFieldReference() may return a field from only one
side of the expression, producing an incorrect or incomplete query. Ensure that
getFieldReference() reliably returns the correct field for all expression types
handled by this method, or guard against cases where it may return null or an
unexpected value.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [1301-1305]

-builder =
-    boolQuery()
-        // Negation should exclude documents where the field is NULL/missing
-        .must(existsQuery(getFieldReference()))
-        .mustNot(builder());
+String fieldRef = getFieldReference();
+if (fieldRef != null && !fieldRef.isEmpty()) {
+    builder =
+        boolQuery()
+            // Negation should exclude documents where the field is NULL/missing
+            .must(existsQuery(fieldRef))
+            .mustNot(builder());
+} else {
+    builder = boolQuery().mustNot(builder());
+}
Suggestion importance[1-10]: 5

__

Why: The suggestion raises a valid concern about getFieldReference() potentially returning null for compound expressions, but the PR's design appears to handle single-field expressions in this context. The guard would add defensive programming value, though the impact depends on whether compound expressions actually reach this code path.

Low
General
Add missing-field document to test data

The bulk request body is missing a document for the "missing field" case (a document
with no keyword_field at all), which is part of the bug described in the issue title
("null/missing field rows"). Without a document where the field is entirely absent,
the test does not fully validate the fix for missing fields, only for explicit
nulls.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotLikeNullIT.java [54-66]

 Request bulkRequest = new Request("POST", "/" + TEST_INDEX + "/_bulk?refresh=true");
 bulkRequest.setJsonEntity(
     "{\"index\":{\"_id\":\"1\"}}\n"
         + "{\"keyword_field\": \"hello\", \"int_field\": 1}\n"
         + "{\"index\":{\"_id\":\"2\"}}\n"
         + "{\"keyword_field\": \"world\", \"int_field\": 2}\n"
         + "{\"index\":{\"_id\":\"3\"}}\n"
         + "{\"keyword_field\": \"\", \"int_field\": 3}\n"
         + "{\"index\":{\"_id\":\"4\"}}\n"
         + "{\"keyword_field\": \"special chars...\", \"int_field\": 4}\n"
         + "{\"index\":{\"_id\":\"5\"}}\n"
-        + "{\"keyword_field\": null, \"int_field\": null}\n");
+        + "{\"keyword_field\": null, \"int_field\": null}\n"
+        + "{\"index\":{\"_id\":\"6\"}}\n"
+        + "{\"int_field\": 6}\n");
 client().performRequest(bulkRequest);
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies that the test only covers explicit null values but not truly missing fields. Adding a document without keyword_field would improve test coverage for the "missing" case mentioned in the issue title, though it would also require updating the expected row counts in the test assertions.

Low
Suggestions up to commit 702b3f7
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard exists query against invalid field references

The not() method now unconditionally adds an exists filter, but getFieldReference()
may not always return a valid single field reference (e.g., for compound expressions
or expressions involving multiple fields). This could cause incorrect queries or
exceptions. Consider verifying that the expression has a single, valid field
reference before adding the exists clause, and falling back to the original
mustNot-only behavior otherwise.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [1301-1305]

-builder =
-    boolQuery()
-        // Negation should exclude documents where the field is NULL/missing
-        .must(existsQuery(getFieldReference()))
-        .mustNot(builder());
+String fieldRef = getFieldReference();
+if (fieldRef != null && !fieldRef.isEmpty()) {
+    builder =
+        boolQuery()
+            // Negation should exclude documents where the field is NULL/missing
+            .must(existsQuery(fieldRef))
+            .mustNot(builder());
+} else {
+    builder = boolQuery().mustNot(builder());
+}
Suggestion importance[1-10]: 5

__

Why: The suggestion raises a valid concern about getFieldReference() potentially returning null or empty for compound expressions. However, in the context of PredicateAnalyzer, the not() method is called on expression nodes that already have a field reference established, so this edge case may be handled elsewhere. The fallback logic is reasonable but may be overly defensive.

Low
General
Add missing-field document to test coverage

The bulk request body is missing a document with a missing (absent) keyword_field to
test the "missing field" case mentioned in the issue title and test class
description. The fix is specifically for null/missing fields, but only a null value
is tested — a document without the field at all should also be included to ensure
full coverage.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotLikeNullIT.java [54-66]

 bulkRequest.setJsonEntity(
     "{\"index\":{\"_id\":\"1\"}}\n"
         + "{\"keyword_field\": \"hello\", \"int_field\": 1}\n"
         + "{\"index\":{\"_id\":\"2\"}}\n"
         + "{\"keyword_field\": \"world\", \"int_field\": 2}\n"
         + "{\"index\":{\"_id\":\"3\"}}\n"
         + "{\"keyword_field\": \"\", \"int_field\": 3}\n"
         + "{\"index\":{\"_id\":\"4\"}}\n"
         + "{\"keyword_field\": \"special chars...\", \"int_field\": 4}\n"
         + "{\"index\":{\"_id\":\"5\"}}\n"
-        + "{\"keyword_field\": null, \"int_field\": null}\n");
+        + "{\"keyword_field\": null, \"int_field\": null}\n"
+        + "{\"index\":{\"_id\":\"6\"}}\n"
+        + "{\"int_field\": 6}\n");
Suggestion importance[1-10]: 4

__

Why: Adding a document with a missing keyword_field would improve test coverage for the "missing" case described in the issue. However, this is a minor enhancement to test completeness and doesn't affect correctness of the existing tests.

Low

…dling

Update the expected DSL output to include the exists filter for the
NOT(LIKE(URL, pattern)) clause, matching the new behavior where
must(exists(field)) is emitted alongside must_not(wildcard(...)).

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 161375c

…rant predicates only

The blanket exists injection in SimpleQueryExpression.not() broke expressions
like NOT(IS_NOT_NULL(a)), NOT(IS_TRUE(e)), and NOT(IS_FALSE(e)) by always
adding an exists filter. This was incorrect because truth-test and null-test
operators already encode null semantics.

The fix reverts not() to its original mustNot-only behavior and moves the
exists injection to the prefix() call site, where it is applied only when the
inner operand is a null-intolerant predicate (LIKE, comparisons, SEARCH, etc.).

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 89e1bc5

…PredicateAnalyzer

Remove the getFieldExpression() accessor that exposed internal state and
replace it with notWithExistsFilter(), which encapsulates the exists +
mustNot logic inside SimpleQueryExpression. This produces a flatter DSL
(must[exists] + mustNot[original]) instead of a nested bool wrapper, and
keeps the same semantic behavior for null-intolerant predicate negation.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit dac4c42

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit da780a9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] NOT LIKE includes null/missing field rows (same root cause as A003)

1 participant