Feat/user evals by PClmnt · Pull Request #18418 · Budibase/budibase

PClmnt · 2026-03-30T13:11:57Z

Description

This PR adds the ability to run evaluations of multiple types against your agent instructions.

The following evaluation methods are available:

- Exact match
- Contains
- LLM as a Judge
- Tool used
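To make the deterministic methods concrete, here is a minimal sketch of how three of them might be evaluated against a finished run. The `Reviewer` and `RunOutput` shapes and the `evaluateReviewer` function are hypothetical names chosen for illustration, not the PR's actual types. The LLM judge is left out because it needs a model call.

```typescript
// Hypothetical shapes for illustration; the PR's real types may differ.
type Reviewer =
  | { type: "exact_match"; expected: string }
  | { type: "contains"; text: string }
  | { type: "tool_used"; toolName: string }

interface RunOutput {
  finalResponse: string
  toolCalls: string[] // names of the tools the agent invoked during the run
}

// Returns true when the run satisfies the reviewer.
function evaluateReviewer(reviewer: Reviewer, output: RunOutput): boolean {
  switch (reviewer.type) {
    case "exact_match":
      return output.finalResponse === reviewer.expected
    case "contains":
      return output.finalResponse.includes(reviewer.text)
    case "tool_used":
      return output.toolCalls.includes(reviewer.toolName)
  }
}

const output: RunOutput = {
  finalResponse: "Your order #42 has shipped.",
  toolCalls: ["lookupOrder"],
}

// One verdict per reviewer, in the order the reviewers are listed.
const verdicts = [
  evaluateReviewer({ type: "contains", text: "shipped" }, output),
  evaluateReviewer({ type: "tool_used", toolName: "lookupOrder" }, output),
  evaluateReviewer({ type: "exact_match", expected: "Shipped." }, output),
]
```

With this sample run, the first two reviewers pass and the exact-match reviewer fails.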

All tests are run against the instructions present in the configuration tab at the time of the run.

It's currently gated behind the AI_TESTS feature flag.

Screenshots

Launchcontrol

Adds the ability to run evaluations on your agent prompt and tools

Summary by cubic

Adds agent tests with a new Tests tab to create cases, run them, and review verdicts, final responses, and tool usage. The feature is behind AI_TESTS and includes UI, API, run pipeline, validation, and logs updates.

New Features
- Tests tab in the builder to add/edit cases (input + optional context) and reviewers, run a single test, duplicate or delete tests, and view latest verdicts and the final response; visible only when AI_TESTS is enabled.
- Reviewer types: exact match, contains text, tool used, and LLM judge (rubric-based); shared via @budibase/shared-core REVIEWERS helpers with input validation.
- API and frontend-core client: GET/PUT /api/agent/:agentId/tests, POST /api/agent/:agentId/tests/run (optional caseId); server validates inputs and returns 403 when the feature is disabled.
- Server run snapshots agent/model config, streams a response, evaluates reviewers (LLM judge uses structured output), records tool calls and request IDs, and indexes sessions; Logs classify “Test” sessions separately.
Bug Fixes
- Preserve reviewer IDs when saving suites to keep result mappings stable.
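
As a rough illustration of the run endpoint listed under New Features, the sketch below posts to `/api/agent/:agentId/tests/run` and treats a 403 as "feature disabled". Everything beyond the route and the 403 behaviour is an assumption: the `runAgentTests` helper, the `FetchLike` type, sending `caseId` in a JSON body, and the error messages are all hypothetical.

```typescript
// Minimal fetch-like signature so the sketch does not depend on DOM typings.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string }
) => Promise<{ status: number; ok: boolean; json(): Promise<unknown> }>

// Hypothetical client helper for POST /api/agent/:agentId/tests/run.
// Assumption: `caseId` travels in the JSON body and is left out when absent.
async function runAgentTests(
  fetchFn: FetchLike,
  agentId: string,
  caseId?: string
): Promise<unknown> {
  const res = await fetchFn(
    `/api/agent/${encodeURIComponent(agentId)}/tests/run`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(caseId ? { caseId } : {}),
    }
  )
  // The server returns 403 when the feature flag is off.
  if (res.status === 403) {
    throw new Error("AI_TESTS is not enabled")
  }
  if (!res.ok) {
    throw new Error(`Test run failed with status ${res.status}`)
  }
  return res.json()
}
```

Passing the fetch function in as a parameter keeps the helper easy to exercise with a stub, without a running server.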

Written for commit eada0c1. Summary will update on new commits.

PClmnt · 2026-03-30T13:17:30Z

This was just moved from another location to be in shared-core.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e86592388e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

cubic-dev-ai

2 issues found across 48 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/types/src/sdk/featureFlag.ts">

<violation number="1" location="packages/types/src/sdk/featureFlag.ts:6">
P2: The feature-flag identifier was renamed to `AI_TESTS`, which conflicts with the documented `AI_EVALS` rollout gate and can leave evals disabled when operators enable the documented flag.</violation>
</file>

<file name="packages/builder/src/pages/builder/workspace/[application]/agent/[agentId]/tests.svelte">

<violation number="1" location="packages/builder/src/pages/builder/workspace/[application]/agent/[agentId]/tests.svelte:94">
P2: Guard async suite-load responses against agent switches; otherwise stale responses can overwrite state with another agent’s tests.</violation>
</file>
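
One common way to guard against the stale-response problem described in the second violation is a "latest request wins" counter. The sketch below is not taken from the PR; `createLatestOnlyLoader`, `load`, and `apply` are hypothetical names, and the real component may track the current agent differently.

```typescript
// Hypothetical loader wrapper: only the most recent request may write state.
function createLatestOnlyLoader<T>(
  load: (agentId: string) => Promise<T>,
  apply: (agentId: string, result: T) => void
) {
  let latestRequest = 0
  return async (agentId: string): Promise<void> => {
    const requestId = ++latestRequest
    const result = await load(agentId)
    // A newer call started while this one was in flight: drop the result.
    if (requestId !== latestRequest) {
      return
    }
    apply(agentId, result)
  }
}
```

If the user switches from one agent to another before the first suite has loaded, the first response is discarded when it eventually arrives, so it cannot overwrite the second agent's tests.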

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

cubic-dev-ai

1 issue found across 20 files (changes from recent commits).


<file name="packages/shared-core/src/agentTests.ts">

<violation number="1" location="packages/shared-core/src/agentTests.ts:82">
P2: `exact_match` is not actually exact: it lowercases and normalizes whitespace before comparison, so non-identical responses can pass.</violation>
</file>
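
The difference the reviewer is pointing at can be shown with two small comparison functions. This is only a sketch: `exactMatch` and `normalizedMatch` are hypothetical names, and the normalization shown here (trim, lowercase, collapse whitespace runs) is an assumption about what the flagged code does.

```typescript
// Strict comparison: passes only when the two strings are identical.
function exactMatch(actual: string, expected: string): boolean {
  return actual === expected
}

// Lenient comparison: ignores case, surrounding whitespace, and repeated
// whitespace inside the text.
function normalizedMatch(actual: string, expected: string): boolean {
  const normalize = (value: string) =>
    value.trim().toLowerCase().replace(/\s+/g, " ")
  return normalize(actual) === normalize(expected)
}

const actual = "  Hello   World "
const expected = "hello world"
const strictResult = exactMatch(actual, expected) // false
const lenientResult = normalizedMatch(actual, expected) // true
```

Either behaviour can be reasonable for agent output. The concern is that the lenient behaviour is labelled `exact_match`.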

cubic-dev-ai

3 issues found across 9 files (changes from recent commits).


<file name="packages/shared-core/src/agentTests.ts">

<violation number="1" location="packages/shared-core/src/agentTests.ts:137">
P2: Whitespace-only reviewer content now passes required validation because the check no longer trims input.</violation>
</file>
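
A required-field check that rejects whitespace-only content could look like the sketch below. `hasRequiredText` is a hypothetical helper, not a function from the PR.

```typescript
// Hypothetical helper: a required text field is valid only when it is a
// string with at least one non-whitespace character.
function hasRequiredText(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0
}

const blankIsValid = hasRequiredText("   \n\t") // false
const rubricIsValid = hasRequiredText("Answer must cite the order ID") // true
```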

<file name="packages/server/src/sdk/workspace/ai/tests/crud.ts">

<violation number="1" location="packages/server/src/sdk/workspace/ai/tests/crud.ts:42">
P2: Keep the name fallback here; `trim()` on an untrusted request field can throw if the case arrives without a name.</violation>

<violation number="2" location="packages/server/src/sdk/workspace/ai/tests/crud.ts:45">
P2: Preserve reviewer IDs when saving the suite; otherwise runs can emit `reviewerId: undefined` for persisted reviewers.</violation>
</file>
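
Both `crud.ts` points can be illustrated with one save-time normalization step. The sketch below is an assumption-laden illustration, not the PR's code: `normalizeCase`, the input shapes, the injected `newId` generator, and the `"Untitled test"` fallback are all hypothetical.

```typescript
interface ReviewerInput {
  id?: string
  type: string
}

interface CaseInput {
  name?: unknown // untrusted: may be missing or not a string
  reviewers?: ReviewerInput[]
}

// Hypothetical save-time normalization. `newId` is injected so the sketch
// does not assume a particular ID generator.
function normalizeCase(input: CaseInput, newId: () => string) {
  // Check the type before trimming so a missing name cannot throw.
  const name =
    typeof input.name === "string" && input.name.trim()
      ? input.name.trim()
      : "Untitled test"
  const reviewers = (input.reviewers ?? []).map(reviewer => ({
    ...reviewer,
    // Keep an existing ID so earlier results still map to this reviewer.
    id: reviewer.id ?? newId(),
  }))
  return { name, reviewers }
}

const saved = normalizeCase(
  { reviewers: [{ id: "r1", type: "contains" }, { type: "exact_match" }] },
  () => "generated"
)
```

Here the case arrives without a name and with one reviewer that has no ID. The result uses the fallback name, keeps `r1`, and assigns an ID only to the new reviewer.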

cubic-dev-ai

1 issue found across 10 files (changes from recent commits).


<file name="packages/server/src/sdk/workspace/ai/tests/run.ts">

<violation number="1" location="packages/server/src/sdk/workspace/ai/tests/run.ts:384">
P1: Returning the raw run object drops persistence for test executions, so run history cannot be reconstructed after the request finishes.</violation>
</file>

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

Made-with: Cursor

…eat/user-evals

PClmnt added 5 commits March 25, 2026 15:40

evals

8164c21

Add LLM judge support to agent evals

dc2d27b

improved session handling of evluations:

9be3513

massive refactor and feature flag

5052569

Fix ordering bug

9aa5041

github-actions bot added firestorm Data/Infra/Revenue Team size/xl labels Mar 30, 2026

revert accidental deletion

b4638ca

PClmnt commented Mar 30, 2026

View reviewed changes

clean up

3d3cc17

PClmnt marked this pull request as ready for review March 30, 2026 13:36

PClmnt requested a review from a team as a code owner March 30, 2026 13:36

PClmnt removed the request for review from a team March 30, 2026 13:36

Merge branch 'master' into feat/user-evals

e865923

PClmnt requested a review from adrinr March 30, 2026 13:36

chatgpt-codex-connector bot reviewed Mar 30, 2026

View reviewed changes

Comment thread packages/builder/src/pages/builder/workspace/[application]/agent/[agentId]/evals.svelte Outdated

PClmnt added 2 commits March 30, 2026 15:45

fix eval crud spec mock initialization

e128e71

use real docIds in eval crud spec

32ce9b2

github-actions bot added the stale label Apr 6, 2026

PClmnt added 2 commits April 20, 2026 08:35

Merge remote-tracking branch 'origin/master' into feat/user-evals

c577384

refactor names

69c4ab3

github-actions bot removed the stale label Apr 20, 2026

cubic-dev-ai bot reviewed Apr 20, 2026

View reviewed changes

Comment thread packages/types/src/sdk/featureFlag.ts

Comment thread packages/builder/src/pages/builder/workspace/[application]/agent/[agentId]/tests.svelte

massive clean up and tidy up

53b7639

cubic-dev-ai bot reviewed Apr 20, 2026

View reviewed changes

Comment thread packages/shared-core/src/agentTests.ts Outdated

remove unnecessary lambdas:

10c4675

cubic-dev-ai bot reviewed Apr 20, 2026

View reviewed changes

Comment thread packages/shared-core/src/agentTests.ts

Comment thread packages/server/src/sdk/workspace/ai/tests/crud.ts Outdated

Comment thread packages/server/src/sdk/workspace/ai/tests/crud.ts Outdated

remove test history for now

ff12991

cubic-dev-ai bot reviewed Apr 20, 2026

View reviewed changes

Comment thread packages/server/src/sdk/workspace/ai/tests/run.ts

clean up normalisation

f5a5f2b

PClmnt and others added 7 commits April 20, 2026 15:19

lint

03e956b

Update packages/server/src/sdk/workspace/ai/tests/crud.ts

00defca

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

fix(ai-tests): preserve reviewer IDs when saving suite

1ae4b3d

Made-with: Cursor

Merge branch 'master' into feat/user-evals

8869b22

lint

fe504c3

Merge branch 'feat/user-evals' of github.com:Budibase/budibase into f…

1b6e9ba

…eat/user-evals

Merge branch 'master' into feat/user-evals

eada0c1

Feat/user evals #18418
PClmnt wants to merge 23 commits into master from feat/user-evals

1 participant
