Feat/user evals #18418

Open
PClmnt wants to merge 23 commits into master from feat/user-evals
Collaborator

@PClmnt PClmnt commented Mar 30, 2026

Description

This PR adds the ability to run evaluations of multiple types against your agent instructions.

The following evaluation methods are available:

  • Exact match
  • Contains
  • LLM as a Judge
  • Tool used
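
As a rough illustration of how the three deterministic methods differ, here is a minimal sketch. The `Reviewer` and `RunResult` shapes, the `contains` and `tool_used` type names, and the `evaluateReviewer` helper are assumptions made for this example and are not taken from the PR. The LLM judge is left out because it needs a model call.

```typescript
// Hypothetical shapes for illustration only; the PR's real types may differ.
type Reviewer =
  | { type: "exact_match"; expected: string }
  | { type: "contains"; text: string }
  | { type: "tool_used"; toolName: string }

interface RunResult {
  finalResponse: string
  toolCalls: string[] // names of the tools the agent invoked during the run
}

// Returns true when the run satisfies the reviewer.
function evaluateReviewer(reviewer: Reviewer, run: RunResult): boolean {
  switch (reviewer.type) {
    case "exact_match":
      // Assumption: strict equality, with no normalisation of case or whitespace.
      return run.finalResponse === reviewer.expected
    case "contains":
      return run.finalResponse.includes(reviewer.text)
    case "tool_used":
      return run.toolCalls.includes(reviewer.toolName)
  }
}

const run: RunResult = {
  finalResponse: "Your order ships on Friday.",
  toolCalls: ["lookup_order"],
}

const containsVerdict = evaluateReviewer({ type: "contains", text: "Friday" }, run)
const toolVerdict = evaluateReviewer({ type: "tool_used", toolName: "lookup_order" }, run)
```

Modelling reviewers as a discriminated union lets the compiler check that every reviewer type is handled in the `switch`.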

All tests are run against the instructions present in the configuration tab at the time of the run.

It's currently gated behind the AI_TESTS feature flag.

Screenshots

image

Launchcontrol

  • Adds the ability to run evaluations on your agent prompt and tools

Summary by cubic

Adds agent tests with a new Tests tab to create cases, run them, and review verdicts, final responses, and tool usage. The feature is behind AI_TESTS and includes UI, API, run pipeline, validation, and logs updates.

  • New Features

    • Tests tab in the builder to add/edit cases (input + optional context) and reviewers, run a single test, duplicate or delete tests, and view latest verdicts and the final response; visible only when AI_TESTS is enabled.
    • Reviewer types: exact match, contains text, tool used, and LLM judge (rubric-based); shared via @budibase/shared-core REVIEWERS helpers with input validation.
    • API and frontend-core client: GET/PUT /api/agent/:agentId/tests, POST /api/agent/:agentId/tests/run (optional caseId); server validates inputs and returns 403 when the feature is disabled.
    • Server run snapshots agent/model config, streams a response, evaluates reviewers (LLM judge uses structured output), records tool calls and request IDs, and indexes sessions; Logs classify “Test” sessions separately.
  • Bug Fixes

    • Preserve reviewer IDs when saving suites to keep result mappings stable.
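
For orientation, a client call to the run endpoint listed above might be assembled roughly as follows. This is a sketch under assumptions: the `buildRunRequest` helper is hypothetical, and sending `caseId` in a JSON body is a guess, since the summary only says the parameter is optional.

```typescript
interface RunRequest {
  method: "POST"
  url: string
  body: string
}

// Builds a request for POST /api/agent/:agentId/tests/run.
// Assumption: caseId travels in the JSON body and is omitted to run every case.
function buildRunRequest(agentId: string, caseId?: string): RunRequest {
  return {
    method: "POST",
    url: `/api/agent/${encodeURIComponent(agentId)}/tests/run`,
    body: JSON.stringify(caseId ? { caseId } : {}),
  }
}

// The summary states that the server answers 403 when the feature is disabled.
function isFeatureDisabled(status: number): boolean {
  return status === 403
}

const singleCase = buildRunRequest("agent_123", "case_1")
const wholeSuite = buildRunRequest("agent_123")
```

A caller would pass the resulting `url`, `method`, and `body` to whatever HTTP client it uses and check the response status with `isFeatureDisabled` before reading verdicts.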

Written for commit eada0c1. Summary will update on new commits.

@github-actions github-actions bot added firestorm Data/Infra/Revenue Team size/xl labels Mar 30, 2026
Comment thread packages/shared-core/src/duration.ts Outdated
Collaborator Author

This was just moved from another location to be in shared-core.

@PClmnt PClmnt marked this pull request as ready for review March 30, 2026 13:36
@PClmnt PClmnt requested a review from a team as a code owner March 30, 2026 13:36
@PClmnt PClmnt removed the request for review from a team March 30, 2026 13:36
@PClmnt PClmnt requested a review from adrinr March 30, 2026 13:36
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e86592388e

@github-actions github-actions bot added the stale label Apr 6, 2026
@github-actions github-actions bot removed the stale label Apr 20, 2026
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 48 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/types/src/sdk/featureFlag.ts">

<violation number="1" location="packages/types/src/sdk/featureFlag.ts:6">
P2: The feature-flag identifier was renamed to `AI_TESTS`, which conflicts with the documented `AI_EVALS` rollout gate and can leave evals disabled when operators enable the documented flag.</violation>
</file>

<file name="packages/builder/src/pages/builder/workspace/[application]/agent/[agentId]/tests.svelte">

<violation number="1" location="packages/builder/src/pages/builder/workspace/[application]/agent/[agentId]/tests.svelte:94">
P2: Guard async suite-load responses against agent switches; otherwise stale responses can overwrite state with another agent’s tests.</violation>
</file>
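
One common way to guard against the stale-response problem flagged for `tests.svelte` is a request token that is checked before state is written. The sketch below assumes a hypothetical `createSuiteGuard` helper and is not the code in the PR.

```typescript
// Tracks the most recent load so that late responses from earlier loads are ignored.
function createSuiteGuard<T>() {
  let latest = 0
  let suite: T | undefined
  return {
    // Call when a load starts; returns a token identifying this request.
    begin(): number {
      latest += 1
      return latest
    },
    // Call when a response arrives; applies it only if no newer load has begun.
    accept(token: number, data: T): boolean {
      if (token !== latest) {
        return false
      }
      suite = data
      return true
    },
    current(): T | undefined {
      return suite
    },
  }
}

// Simulate switching from agent A to agent B before A's response arrives.
const guard = createSuiteGuard<string[]>()
const tokenA = guard.begin()
const tokenB = guard.begin()
const appliedB = guard.accept(tokenB, ["tests for agent B"])
const appliedA = guard.accept(tokenA, ["tests for agent A"]) // stale, rejected
```

In a component, `begin()` would run just before the fetch and `accept()` just after the `await`, so a response for a previously selected agent never overwrites the current one.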

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread packages/types/src/sdk/featureFlag.ts
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 20 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/shared-core/src/agentTests.ts">

<violation number="1" location="packages/shared-core/src/agentTests.ts:82">
P2: `exact_match` is not actually exact: it lowercases and normalizes whitespace before comparison, so non-identical responses can pass.</violation>
</file>
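
To make the distinction in this finding concrete, the sketch below contrasts a strict comparison with a normalised one. Both helpers are hypothetical and only approximate the behaviour the review describes (lowercasing and whitespace normalisation).

```typescript
// Strict comparison: the response must be identical to the expected text.
function strictMatch(actual: string, expected: string): boolean {
  return actual === expected
}

// Normalised comparison: lowercases, collapses whitespace runs, and trims both sides.
function normalize(value: string): string {
  return value.toLowerCase().replace(/\s+/g, " ").trim()
}

function normalizedMatch(actual: string, expected: string): boolean {
  return normalize(actual) === normalize(expected)
}

const response = "  Hello   World "
const expected = "hello world"
const strictVerdict = strictMatch(response, expected)
const normalizedVerdict = normalizedMatch(response, expected)
```

Here the two strings differ in case and spacing, so the strict comparison fails while the normalised one passes, which is the mismatch between the reviewer's name and its behaviour that the finding points out.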

Comment thread packages/shared-core/src/agentTests.ts Outdated
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/shared-core/src/agentTests.ts">

<violation number="1" location="packages/shared-core/src/agentTests.ts:137">
P2: Whitespace-only reviewer content now passes required validation because the check no longer trims input.</violation>
</file>

<file name="packages/server/src/sdk/workspace/ai/tests/crud.ts">

<violation number="1" location="packages/server/src/sdk/workspace/ai/tests/crud.ts:42">
P2: Keep the name fallback here; `trim()` on an untrusted request field can throw if the case arrives without a name.</violation>

<violation number="2" location="packages/server/src/sdk/workspace/ai/tests/crud.ts:45">
P2: Preserve reviewer IDs when saving the suite; otherwise runs can emit `reviewerId: undefined` for persisted reviewers.</violation>
</file>
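
The three findings above (trimmed required checks, a safe name fallback, and preserved reviewer IDs) could be handled together when a case is saved. The following is a sketch only: `CaseInput`, `ReviewerInput`, `sanitizeCase`, the `"Untitled test"` default, and the assumption that every reviewer carries string content are all hypothetical.

```typescript
interface ReviewerInput {
  id?: string
  type: string
  content?: unknown
}

interface CaseInput {
  name?: unknown
  reviewers?: ReviewerInput[]
}

// Treats non-strings and whitespace-only strings as missing.
function isBlank(value: unknown): boolean {
  return typeof value !== "string" || value.trim().length === 0
}

function sanitizeCase(input: CaseInput, newId: () => string) {
  // Fall back to a default instead of calling trim() on a possibly missing name.
  const name =
    typeof input.name === "string" && !isBlank(input.name)
      ? input.name.trim()
      : "Untitled test"

  const reviewers = (input.reviewers ?? []).map(reviewer => {
    if (isBlank(reviewer.content)) {
      throw new Error("Reviewer content is required")
    }
    // Keep an existing id so stored results still map to this reviewer.
    return { ...reviewer, id: reviewer.id ?? newId() }
  })

  return { name, reviewers }
}

const saved = sanitizeCase(
  { reviewers: [{ id: "rev_1", type: "contains", content: "refund" }] },
  () => "generated_id"
)
```

Passing the ID generator in as a parameter keeps the helper deterministic, which makes the preserved-ID behaviour easy to check.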

Comment thread packages/shared-core/src/agentTests.ts
Comment thread packages/server/src/sdk/workspace/ai/tests/crud.ts Outdated
Comment thread packages/server/src/sdk/workspace/ai/tests/crud.ts Outdated
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/server/src/sdk/workspace/ai/tests/run.ts">

<violation number="1" location="packages/server/src/sdk/workspace/ai/tests/run.ts:384">
P1: Returning the raw run object drops persistence for test executions, so run history cannot be reconstructed after the request finishes.</violation>
</file>

Comment thread packages/server/src/sdk/workspace/ai/tests/run.ts
Labels

firestorm Data/Infra/Revenue Team size/xl

1 participant