Skip to content

Latest commit

 

History

History
753 lines (559 loc) · 32.2 KB

File metadata and controls

753 lines (559 loc) · 32.2 KB

Kimi Code Review Guidelines

This document provides instructions for conducting thorough code reviews that match or exceed the quality of other AI code reviewers (e.g., Claude, Copilot, Codex).

Review Philosophy

Don't just read code—attack it. Assume bugs exist and find them. Question every assumption. Test edge cases mentally or actually. Verify that the implementation matches the specification, not just that it doesn't crash.

Review Workflow

Branch Management

CRITICAL: Stay on the development branch for the entire review session. Do not switch back to main or any other branch until:

  • The user explicitly says the PR is merged
  • The user asks you to switch branches
  • The review is completely finished and the user is done working on that PR

Why: Switching branches disrupts the development workflow, loses uncommitted context, and makes it harder to verify incremental fixes.


Phase 1: Static Analysis Checklist

Verify every AGENTS.md requirement explicitly:

1.1 Documentation Compliance

For every public class, interface, enum, and inner class:

# Search for public types without GroovyDoc
grep -B2 "^class\|^interface\|^enum\|static class" src/main/groovy/**/*.groovy | grep -v "/\*\*"
  • Class-level GroovyDoc exists
  • Public methods have GroovyDoc with @param, @return, @throws
  • Complex algorithms have implementation comments

1.2 Idiomatic Groovy Patterns

Search for and flag:

grep -n "new ArrayList<>()\|new HashSet<>()\|new LinkedHashSet<>()\|Arrays.asList()" src/main/groovy/**/*.groovy
grep -n "\.doubleValue()\|\.intValue()" src/main/groovy/**/*.groovy  # Unnecessary unboxing
grep -n "case.*:" src/main/groovy/**/*.groovy  # Old-style switch (JDK 21+ should use ->)

Required patterns per AGENTS.md:

  • [] instead of new ArrayList<>()
  • [] as Set instead of new HashSet<>()
  • [*list] instead of new ArrayList<>(list) (spread operator)
  • Arrow syntax case X -> for switches (unless return-in-case needed)
  • == instead of .equals()
  • "${var}" instead of string concatenation

1.3 Type Safety

Search for Object abuse:

grep -n "Object " src/main/groovy/**/*.groovy | grep -v "Object>" | grep -v "import"
  • No Object parameters when specific types are known
  • No Object return types when specific types are known
  • Use method overloads instead of Object with instanceof chains

1.4 @CompileStatic Compliance

grep -L "@CompileStatic" src/main/groovy/**/*.groovy | grep -v test
  • All production classes have @CompileStatic unless the build script sets it to compile statically per default (e.g. in matrix-stats).
  • Only use @CompileDynamic when explicitly justified

Phase 2: Specification Compliance (CRITICAL)

Verify the implementation matches the documented requirements.

2.1 Roadmap/Requirements Cross-Reference

When a PR claims to implement a specific section or feature:

  1. Read the requirements first - Don't review code in isolation
  2. Check every bullet point in the requirements against the implementation
  3. Question rejections - If the code throws an error, verify it SHOULD error per the spec

Example:

Roadmap says: "Supported syntax: dot expansion (.), interaction expansion (^)"
Test shows: "Dot expansion with interaction power (^) cannot be normalized"
Question: Does the roadmap imply these should work together? (.^2)
If yes: Implementation is incomplete
If no: Roadmap should document the limitation

Red flags:

  • Requirements say "supports X and Y" but code rejects "X + Y" combination
  • Tests validate rejection of syntax that should work per requirements
  • Requirements are updated to match implementation rather than vice versa

2.2 Feature Completeness Matrix

For features with multiple components, verify the matrix of combinations:

Feature A Feature B Status Test
Supported Supported Should work testAWithB()
Supported Rejected Should error Document why
Rejected Rejected Should error N/A

Example:

  • . (dot expansion) = Supported
  • ^ (power) = Supported
  • .^2 (combination) = ??? (Must check requirements)

Phase 3: Semantic Correctness Testing (CRITICAL)

This phase finds logic bugs that don't crash but produce wrong results.

2.1 Round-trip Verification

For parser/normalizer implementations, verify that transformations are intentional:

// Example: A normalizer may change syntax while preserving semantics
input = 'y ~ x + z'
output = normalize(input).toString()
// If input != output, is the transformation:
//   1. Correct (semantics preserved)?
//   2. Intentional (documented behavior)?
//   3. Consistent (same input always produces same output)?

2.2 Invalid Input Rejection

Question: Should this input be accepted?

Test cases that should REJECT with clear errors:

// Example: In a formula DSL, response-side should reject predictor-side operators
// These should error if response-side operations are not supported:
//   'y - z ~ x'       // Subtraction on response
//   'y + z ~ x'       // Multiple responses
//   'y - 1 ~ x'       // Intercept control on response

// Example: Missing operands
//   'y ~ / x'         // Missing left operand
//   'y ~ x :'         // Missing right operand

Red flag: If malformed inputs produce valid-looking outputs instead of errors.

2.3 Reference Implementation Comparison

For domain-specific implementations (parsers, statistical tests, protocol handlers):

  • Test against known reference behavior (reference implementations, standards, specifications)
  • Document intentional deviations from reference behavior
  • Test common patterns from real-world usage

Example:

// If implementing R-style formula parsing:
// Test real column naming patterns from R datasets:
//   'sepal.length ~ sepal.width'  // Dotted identifiers
//   '`gross margin` ~ revenue'     // Backtick quoting
//   'y ~ x + z'                    // Basic form
//   update(base, '~ . - z + w')    // Shorthand syntax

2.4 Context-Sensitive Semantics

For DSLs with context-sensitive semantics (e.g., left-side vs right-side), verify that operators/keywords have the correct meaning in each context:

// Example: Formula parsing has different semantics on each side of '~'
// Right side (predictors): accepts operators like +, -, *, /, ^
// Left side (response): should reject most operators

// Valid on right side, invalid on left side:
//   'y ~ x - z'     // OK: remove z from predictors
//   'y - z ~ x'     // ERROR: can't subtract from response

// Valid on right side, invalid on left side:
//   'y ~ 0 + x'     // OK: no intercept
//   '0 ~ x'         // ERROR: literal 0 as response

Check: Does the implementation correctly track context and apply appropriate validation rules?

2.5 Tokenization Ambiguity

Test token boundaries where lexical elements might be confused:

// Example: Dotted identifiers vs dot operator in formula parsing
// Is 'y.foo' one identifier or 'y' '.' 'foo'?
// Test both: 'y.foo ~ x' vs 'y . foo ~ x'

// General cases to consider:
// - Special characters in identifiers (dots, underscores, numbers)
// - Multi-character operators vs sequences of single-character operators
// - Escaped/quoted content that should be treated as literal
// - Whitespace sensitivity (when present vs absent)

2.6 Syntax Variation Testing (CRITICAL)

Test ALL variations of supported syntax, not just the canonical form:

Principle: If a feature has multiple syntactic forms, test each one.

// Example: R-style formula update syntax has multiple forms
// All of these should work equivalently:
// update(base, '. ~ . - z + w')      // Standard: explicit dot
// update(base, '~ . - z + w')        // Shorthand: omitted LHS
// update(base, '. ~ - z + w')        // Without dot on RHS

Key insight: Different user communities (R, Python, SAS, SQL) have different syntax conventions. Users will try variations based on their background. Test the variations that users from each background might try.

2.7 Real-World Pattern Testing

Test with realistic data patterns, not just synthetic examples:

// Example: Formula parsing should handle real column naming patterns
// Synthetic: 'y ~ x + z'
// Real-world: 'sepal.length ~ sepal.width' (dot-separated)
// Real-world: '`Gross Margin` ~ Revenue' (spaces in names)
// Real-world: 'log(Income) ~ Age + Education' (transforms + multiple terms)

Use patterns from:

  • Famous public datasets in the domain
  • The project's own example data
  • Documentation examples from reference implementations
  • Community usage patterns (Stack Overflow, forums)

2.8 Feature Interaction Testing (CRITICAL)

Test combinations of features, not just features in isolation:

// Example: Dot expansion + Power operator in formulas
// Tested separately:
//   normalize('y ~ .')      // Dot expansion
//   normalize('y ~ (a+b)^2') // Power operator
// Must ALSO test combined:
//   normalize('y ~ .^2')    // Dot + Power interaction

// Example: Grouping + Operators
//   normalize('y ~ (a + b)^2 - a:b')  // Power then subtraction
//   normalize('y ~ (a + b) * c')      // Grouping then expansion

Key insight: Features that work independently may fail when combined. The matrix of feature combinations must be tested.

2.9 Rejection Policy Completeness

When a feature should reject certain inputs, test ALL syntactic forms of that invalid input:

// Example: Multi-response formulas should be rejected
// Same conceptual error, different syntax:
//   'y1 + y2 ~ x'           // Binary operator
//   'cbind(y1, y2) ~ x'     // Function call
//   'c(y1, y2) ~ x'         // Alternative function name
//   '(y1 + y2) ~ x'         // Grouped expression

// Example: Invalid response-side operations
//   'y - z ~ x'             // Subtraction
//   'y * z ~ x'             // Multiplication
//   'y:z ~ x'               // Interaction syntax

Check: Is the rejection logic consistent across all syntactic forms of the same conceptual error?


Phase 3: Edge Case Attack Testing

For every public method, mentally execute these inputs:

3.1 Null and Empty Inputs

Input Type Test Case Expected Behavior
String null, "", " " IllegalArgumentException or FormulaParseException
List/Map null, [], [:] Same as above
Array null, [] Same as above
Number null, NaN, Infinity Validate appropriately

Example:

// For a parser method:
parse(null)                // Should throw, not NPE
parse("")                  // Should throw descriptive error
parse("   ")               // Should throw (blank check)
parse("unterminated `foo") // Unterminated quote/escape

3.2 Boundary Values

  • Integers: 0, 1, -1, Integer.MAX_VALUE, Integer.MIN_VALUE
  • Decimals: 0.0, very small (1e-10), very large (1e308)
  • String length: empty, single char, very long (10k+ chars)
  • Collection size: empty, single element, very large

3.3 Malformed/Suspicious Input

  • Unbalanced delimiters: (, [, {, `
  • Invalid escape sequences
  • Unicode edge cases: emoji, zero-width spaces, combining chars
  • Scientific notation edge cases: 1e, 1e-, e10, 1.5.6

3.4 Silent Behavior Check

Look for code that discards input without warning:

// In FormulaSupport.expandTerms():
if (expression instanceof FormulaExpression.NumberLiteral) {
  return [] as LinkedHashSet<FormulaTerm>  // SILENT DROP - flag this!
}

Question: Should this warn? Throw? Is this expected per R semantics?


Phase 4: Constructor Safety Audit

For every public constructor in the PR:

class ParsedFormula {
  ParsedFormula(String source, FormulaExpression response, FormulaExpression predictors) {
    // Check: Are these validated?
    this.source = source        // Could be null
    this.response = response    // Could be null
    this.predictors = predictors // Could be null
  }
}

Checklist:

  • Null validation on all reference parameters
  • Empty string validation where applicable
  • Range validation on numeric parameters
  • Defensive copying of mutable inputs (List.copyOf(), .clone())
  • Immutable fields marked final

Phase 5: API Surface Validation

5.1 String Interpolation Traps

Check for dangerous string interpolation:

// In Formula.of():
static ParsedFormula of(String response, String predictors) {
  parse("${response} ~ ${predictors}")  // If response is null, becomes "null ~ x"!
}

Verify: Does this handle null gracefully or produce garbage?

5.2 Fluent/Chaining API Safety

For builder patterns or fluent APIs:

builder.setX(null).setY(5).build()  // Does setX(null) throw or accept?

5.3 Exception Quality

Verify exceptions are:

  • Descriptive (not raw NPE)
  • Include context (position in string, invalid value)
  • Use appropriate type (IllegalArgumentException, IllegalStateException, custom)

Example of good:

throw new FormulaParseException(
  "Expected ')' to close grouped expression at position ${pos}\n${source}\n${' ' * pos}^", 
  pos
)

Phase 6: Test Coverage Verification

6.1 Run All Tests

./gradlew :module:test --tests 'package.*' -g ./.gradle-user

6.2 Identify Untested Public API

For each public method, verify:

  • Happy path tested
  • Null input tested (or rejected at API boundary)
  • Empty input tested
  • Invalid format tested (if applicable)
  • Boundary values tested

6.3 Common Test Gaps

Look for missing tests of:

  • toString() implementations
  • equals()/hashCode() contracts
  • Exception message content (not just type)
  • Resource cleanup (closeables, streams)
  • Thread safety (if applicable)

6.4 Semantic Test Gaps (CRITICAL)

Test that invalid inputs are rejected, not silently converted:

// BAD: This test passes but behavior is wrong
void testInvalidOperation() {
  def result = parser.parse('invalid - operation')
  assertEquals('simplified', result.toString())  // Silent data loss!
}

// GOOD: Invalid input should error
void testRejectsInvalidOperation() {
  assertThrows(ParseException) {
    parser.parse('invalid - operation')
  }
}

Key rule: Tests should verify that:

  1. Valid inputs produce correct outputs
  2. Invalid inputs produce errors (not garbage outputs)
  3. Edge cases are handled consistently

Phase 7: Architecture & Design Review

7.1 Immutability

Check mutable state:

grep -n "List<\|Map<\|Set<" src/main/groovy/**/*.groovy | grep -v "final"
  • Public fields should be rare
  • Returned collections should be unmodifiable or defensive copies
  • Internal state changes are thread-safe (or documented as not thread-safe)

7.2 Encapsulation

  • Internal classes marked @PackageScope or package-private
  • No implementation details leaked in public API
  • Utility classes have private constructors

7.3 Logging

Per AGENTS.md: Use se.alipsa.matrix.core.util.Logger, not println/log4j/SLF4J:

grep -rn "println\|System.out\|System.err\|LoggerFactory\|log4j" src/main/groovy/

Phase 8: Domain-Specific Validation

8.1 Parser/Grammar Implementations

Additional checks for parser/grammar implementations:

// 1. Operator precedence correctness
// Example: 'a + b * c' should parse as a + (b * c), not (a + b) * c

// 2. Associativity
// Example: 'a - b - c' should be (a - b) - c, not a - (b - c)

// 3. Whitespace handling
// Test both: 'a+b' and 'a + b' should be equivalent

// 4. Comment/documentation examples actually work
// Copy examples from GroovyDoc and verify they produce expected output

8.2 Statistical Implementations

For statistical classes:

  • Rejects insufficient data (n < minimum required)
  • Handles constant/zero-variance data appropriately
  • Validates probability inputs in [0, 1] range
  • Checks for numerical instability (division by zero, log(0))

8.3 Chart/Visualization

For charting code:

  • Never use Object for aesthetic parameters (use typed overloads)
  • Color handling uses shared utilities
  • SVG output tests use direct object access when possible
  • Scale implementations handle NaN, null, Infinity

Phase 9: Security Spot Check

Quick security audit:

  • No eval() or dynamic code execution
  • No external command execution
  • Path traversal protection (if file paths accepted)
  • No sensitive data in logs/exceptions
  • Input size limits (if processing user input)

Review Output Format

Output format:

  • Only list findings that require action
  • Group related issues by type (e.g., "Documentation Defects", "Null Safety Issues")
  • Provide exact file paths and line numbers
  • Include ready-to-paste code snippets for fixes
  • Include exact test commands to verify fixes
  • Omit "Good", "Correct", "Well done" assessments entirely
  • Omit summary sections

Structure findings as:

## Issue Group: [Descriptive Type]

**Impact:** Important Issues (N found) | Minor Issues (N found)

### Issue 1: [Brief name]
- Location: `File.groovy:line`
- Problem: One-line description
- Fix:
  ```groovy
  // ready-to-paste code

Test command:

./gradlew :module:test --tests 'TestClass.testMethod' -g ./.gradle-user

Issue 2: [Brief name with test coverage gap]

  • Location: File.groovy:line
  • Problem: Missing test for edge case X
  • Test to add:
    @Test
    void testXYZ() {
      // test code
    }

Test command:

./gradlew :module:test --tests 'TestClass.testXYZ' -g ./.gradle-user

---

## Self-Correction Checklist

Before submitting review, verify:

- [ ] Did I check EVERY public constructor for null validation?
- [ ] Did I test null inputs on EVERY public method?
- [ ] Did I search for AGENTS.md anti-patterns?
- [ ] Did I question any silent data dropping?
- [ ] Did I verify GroovyDoc on ALL public types?
- [ ] **Did I verify that invalid inputs are REJECTED, not silently converted?**
- [ ] **Did I test against real-world usage patterns (dotted names, update syntax)?**
- [ ] **Did I verify response-side vs predictor-side semantics are correctly handled?**
- [ ] **Did I verify ALL parallel data sources stay aligned after filtering (subset, NA handling)?**
- [ ] **Did I test zero-dimension edge cases (empty matrices, single-level categoricals)?**
- [ ] Did I actually run the tests?
- [ ] Did I verify build passes with spotless/codenarc?

### Plan/Implementation Review Additional Checks

When reviewing implementation plans or roadmap documents:

- [ ] **Cross-reference with actual source code** - When a plan references existing types (enums, classes, methods, constants), verify they exist with the exact names/values described. Example: If a plan says `NaAction.DROP`, check the enum to confirm the actual value is `OMIT` not `DROP`.
- [ ] **Compare against established patterns** - New sections should follow conventions established in earlier sections. Example: If sections 1-3 consistently use `./gradlew :matrix-stats:spotlessApply :matrix-stats:test`, verify section 17 maintains this pattern.
- [ ] **Verify quality gate consistency** - If earlier sections include `codenarcMain` and `codenarcTest`, ensure later sections include both, not just one.
- [ ] **Check command syntax** - Verify Gradle commands, especially those with multiple flags or filters, follow working patterns from earlier in the project.
- [ ] **Cross-reference plan items with actual changes** - Verify that every file/task listed in the implementation plan was actually modified. Check the PR diff against the plan to identify gaps (e.g., DesignMatrixBuilder listed for GroovyDoc updates but not touched).
- [ ] **Verify sub-task traceability** - When parent tasks are marked complete but sub-tasks remain unchecked, verify whether: (1) sub-tasks were completed but not checked, (2) existing coverage was sufficient, or (3) the sub-tasks were skipped. Require explicit documentation of which case applies.
- [ ] **Analyze test data changes** - When test data is modified (new rows, changed values), determine if it's: (1) new test coverage, (2) a bug fix to existing test data, or (3) unrelated noise. Bug fixes to test data should be called out in commit messages.
- [ ] **Verify dependency scope** - When new dependencies are added, verify the Gradle scope is correct: `implementation` for runtime dependencies that shouldn't leak to consumers, `api` for types exposed in public API, `compileOnly` for compile-time only. Match the pattern of existing dependencies in the build file.
- [ ] **Check for implementation leakage into API specs** - API specification sections should not reference implementation details (specific libraries, internal types). Keep API contracts abstract ("computations in double") separate from implementation choices ("use EJML DMatrixRMaj").
- [ ] **Inventory existing capabilities vs new facade** - When creating a facade over existing code, compare what exists internally against what the facade will expose. Note gaps (existing capabilities not exposed) and justify omissions.
- [ ] **Verify package naming rationale** - When introducing packages with similar names (e.g., `linear` and `linalg`), document the intentional separation or consider consolidation to avoid contributor confusion.
- [ ] **Cross-reference interop tests with all callers** - When listing regression/interop tests, verify the complete set of callers that could be affected. Don't assume the listed tests cover all risk.

---

## Project-Specific Context

### matrix-stats Patterns

When reviewing matrix-stats:
- Statistical classes often need to reject constant/zero-variance data
- **CRITICAL:** R-style formulas have complex context-sensitive semantics
  - Response-side vs predictor-side validation differs
  - Operators interact (e.g., dot expansion + power)
  - Multiple syntactic forms for same concept (multi-response)
- **CRITICAL:** Formula pipeline data alignment
  - When subset/NA filtering drops rows, ALL data sources must be filtered: matrix columns, environment variables, weights, offset
  - Test: `ModelFrame.of('y ~ x + z', data).environment(env).subset(mask).evaluate()` - verify env values align with surviving rows
- **CRITICAL:** Empty design matrix edge cases
  - When all predictors drop (single-level categoricals), design matrix must preserve row count (N rows, 0 columns)
  - Test: `y ~ group` where group has 1 level - verify `result.data.rowCount() == result.response.size()`
- Distribution classes need validation for probability range [0,1]
- All new code should remove/replace commons-math3 dependencies

### matrix-charts/matrix-ggplot Patterns

When reviewing charting code:
- Never use `Object` for aesthetic parameters (use typed overloads)
- Color handling should use shared utilities
- SVG output tests should use direct object access, not `toXml()` when possible
- Scale implementations must handle `NaN`, `null`, and `Infinity`

---

## Tools to Use

```bash
# Run all checks before finalizing review
./gradlew :module:spotlessCheck :module:codenarcMain :module:test -g ./.gradle-user

# Find potential issues
grep -rn "TODO\|FIXME\|XXX" src/main/groovy/  # Unresolved items
grep -rn "throw new RuntimeException" src/main/groovy/  # Should be specific type

# Not in matrix-stats as all classes are compiled statically per default in that module
find src/main/groovy -name "*.groovy" -exec grep -L "@CompileStatic" {} \;  # Missing annotation

# Test semantic edge cases (add to test file temporarily)
echo "Testing semantic edge cases..."

Post-Review Analysis: Learning from Misses

When other reviewers (Codex, Claude, human) find issues you missed, analyze why:

Common Miss Patterns

Miss Type Example Why It Happened Prevention
Happy path bias Only tested y + z ~ x (multiple responses), missed y - z ~ x (invalid operation) Focused on "does it work" not "does it fail correctly" Test ALL operators on both sides of ~
Syntax variation gap Tested . ~ . - z + w but missed ~ . - z + w Assumed one form was sufficient Test ALL documented syntax variations
Real-world disconnect Tested `foo bar` but missed sepal.length Used synthetic examples, not dataset patterns Use actual column names from famous datasets
Silent acceptance Didn't catch that invalid inputs produced valid-looking outputs Only checked for crashes, not correctness Verify invalid inputs produce errors, not garbage
Feature interaction blindspot Tested . and ^ separately, missed .^2 Tested features in isolation Phase 2.8: Test combinations of features
Specification compliance gap Accepted .^2 rejection without checking if roadmap requires it Didn't cross-reference with requirements Phase 2.1: Roadmap/requirements cross-reference
Rejection policy gap Tested y1 + y2 ~ x but missed cbind(y1, y2) ~ x Only tested one form of invalid input Phase 2.9: Test ALL syntactic forms of invalid input
Data alignment after filtering Subset/NA filtering updated matrix but not env vars Assumed all data sources filtered together Phase 3.x: Verify ALL data sources stay aligned
Empty collection edge cases Empty predictor matrix lost row count (0 rows vs N observations) Didn't verify empty structures maintain dimensions Phase 3.x: Test zero-dimension edge cases
Asymmetric DSL semantics Tested log(x) on RHS but not log(y) on LHS Assumed symmetry in DSL sides Phase 2.4: Test BOTH sides of DSL operators
Documentation-to-code naming drift Plan said NaAction.DROP, code had NaAction.OMIT Didn't verify enum values against actual source Cross-reference ALL named types with source code
Pattern consistency gap Missed missing spotlessApply in new section Didn't compare new commands against established conventions Compare new sections against patterns in earlier sections
Quality gate regression New test section omitted codenarcMain Didn't verify quality gates remained consistent Check that all quality gates (spotless, codenarcMain, codenarcTest) are present
Plan-to-implementation gap DesignMatrixBuilder listed for GroovyDoc but not touched Didn't cross-reference plan items with actual PR changes Verify every planned file/task appears in the diff
Sub-task traceability gap Parent tasks marked done but sub-tasks unchecked Assumed parent completion implied sub-task completion Verify sub-task status or document why skipped
Test data change analysis gap New test row was actually a bug fix to existing test Didn't analyze if data changes were bug fixes vs new coverage Analyze test data modifications for intent
Dependency scope gap EJML scope not specified (should be implementation) Didn't consider Gradle dependency scope implications Verify dependency scope against existing patterns
API-implementation leakage EJML types mentioned in API contract section Mixed API design with implementation decisions Keep API contracts abstract, implementation separate
Package naming confusion Both linear and linalg packages without rationale Didn't question similar package names coexisting Document intentional package separation or consolidate
Incomplete caller inventory Interop tests missed 4 of 6 MatrixAlgebra callers Didn't verify complete set of affected components Cross-reference all callers when listing interop tests
Facade coverage gap MatrixAlgebra has transpose/multiply but facade doesn't Didn't inventory existing vs exposed capabilities Compare internal capabilities against facade scope

Analysis Template

When an issue is found post-review, document:

  1. The bug: What was the incorrect behavior?
  2. The test that would have caught it: What specific input triggers it?
  3. Why it was missed: Which checklist item was insufficient?
  4. The fix: What was added to prevent recurrence?

Example Analysis:

Bug Test That Would Catch It Miss Type Prevention
Invalid operation accepted on wrong side of DSL Test ALL operators in ALL contexts Context sensitivity gap Phase 2.4: Context-sensitive semantics
Feature combination loses semantics Test features together, not just separately Feature interaction blindspot Phase 2.8: Feature interaction testing
Feature rejects valid combination per spec Check if roadmap/requirements say it should work Spec compliance gap Phase 2.1: Roadmap/requirements cross-reference
Invalid input accepted in alternative syntax Test ALL syntactic forms of invalid input Rejection policy gap Phase 2.9: Rejection policy completeness
Shorthand syntax rejected Test ALL documented syntax variations Syntax variation gap Phase 2.6: Syntax variation testing
Real-world patterns fail Use actual dataset patterns, not synthetic Real-world disconnect Phase 2.7: Real-world pattern testing

Detailed Example from PR #275 (formula parsing):

Example 1: Context-sensitive semantics

  • Bug: Formula.normalize('y - z ~ x') produced 'y ~ 1 + x' instead of erroring
  • Root cause: Response-side validation only checked for + (multi-response), not other operators
  • Why missed: Happy path bias - tested what should work, not what should fail
  • Fix: Added comprehensive response-side validation and testRejectsInvalidResponseSideExpressions()

Example 2: Specification compliance

  • Bug: Roadmap claimed . and ^ were supported, but .^2 was rejected
  • Root cause: Didn't verify feature combinations against requirements
  • Why missed: Assumed rejection was correct without checking roadmap
  • Fix: Added Phase 2.1 to require roadmap cross-reference before accepting rejections

Detailed Example from PR #277 (Design Matrix) - Codex findings that Claude missed:

Miss 1: Asymmetric DSL semantics (Response-side expression evaluation)

  • Bug: ModelFrame.of('log(y) ~ x', data) returned untransformed y values but metadata indicated transformed response
  • Root cause: extractResponseName() unwrapped log(y) to y for column lookup, but extractResponseColumn() didn't evaluate the expression
  • Why missed: Assumed ExpressionEvaluator was used for both sides; didn't trace response-side path separately
  • Fix: Add explicit test testTransformedResponseEvaluated() and either implement LHS expression evaluation or reject non-variable responses

Miss 2: Data alignment after filtering

  • Bug: Environment variables returned wrong values after .subset() or naAction(OMIT) because env wasn't filtered to match surviving rows
  • Root cause: Matrix rows filtered at line 256, but original env passed unchanged to DesignMatrixBuilder at line 280
  • Why missed: Focused on matrix data flow; didn't consider environment variables as parallel data source needing filtering
  • Fix: Filter env alongside matrix/weights/offset when subsetting, or document that environment variables must be pre-aligned

Miss 3: Empty structure dimensions

  • Bug: When all predictors dropped (e.g., single-level categorical), result.data had 0 rows but result.response had N rows
  • Root cause: DesignMatrixBuilder.buildMatrix() returned empty matrix with no rows when names.isEmpty()
  • Why missed: Test testTermsMetadataForDroppedTerm verified metadata but not matrix dimensions
  • Fix: Ensure empty design matrix preserves row count: Matrix.builder().rows(rowCount).columnNames([]).build()

Remember

The goal is not to find something to criticize. The goal is to ensure no preventable bugs reach production.

Silent data corruption is worse than a crash. Always verify that invalid inputs are rejected, not converted to something else.

Be thorough. Be skeptical. Be helpful.