|`pnpm --filter @superdoc-testing/evals run eval`| Run deterministic evals (reading + argument tests) |~$0.30 |
|`pnpm --filter @superdoc-testing/evals run eval:reading`| Run reading tool tests only |~$0.15 |
-|`pnpm --filter @superdoc-testing/evals run eval:gdpval`| Run GDPval benchmark (Model+SuperDoc vs Model-Only) |~$1-2 |
|`pnpm --filter @superdoc-testing/evals run eval:view`| Open Promptfoo web UI with results | Free |
|`pnpm --filter @superdoc-testing/evals run baseline:save <label>`| Save versioned results snapshot | Free |
Tool definitions are extracted from `packages/sdk/tools/` via `evals/tools/extract.mjs`. Run `pnpm run generate:all` first if SDK artifacts are missing.
-Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (Level 1: tool selection + argument accuracy, not execution).
+Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (tool selection + argument accuracy, not execution).
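
To illustrate what "tool selection + argument accuracy, not execution" can look like in practice, here is a minimal sketch of a structural check. The helper name `checkToolCall`, the tool name `example_tool`, and the `{ name, arguments }` call shape are all assumptions made for illustration; the actual assertions in `evals/tests/` and the shape of the model output may differ.

```javascript
// Hypothetical helper, not taken from the repository.
// Assumes each tool call looks like { name: string, arguments: string | object }.
function checkToolCall(toolCalls, expected) {
  if (!Array.isArray(toolCalls) || toolCalls.length === 0) {
    return { pass: false, reason: "no tool call emitted" };
  }

  // Tool selection: only the first call is inspected in this sketch.
  const call = toolCalls[0];
  if (call.name !== expected.name) {
    return { pass: false, reason: `expected tool ${expected.name}, got ${call.name}` };
  }

  // Argument accuracy: arguments may arrive as a JSON string or as an object.
  let args;
  try {
    args = typeof call.arguments === "string" ? JSON.parse(call.arguments) : call.arguments;
  } catch {
    return { pass: false, reason: "arguments are not valid JSON" };
  }

  // Compare only the expected keys; the tool itself is never executed.
  for (const [key, value] of Object.entries(expected.args)) {
    if (JSON.stringify(args?.[key]) !== JSON.stringify(value)) {
      return { pass: false, reason: `argument ${key} does not match` };
    }
  }
  return { pass: true, reason: "tool and arguments match" };
}

// Usage with a made-up tool call.
const result = checkToolCall(
  [{ name: "example_tool", arguments: '{"query":"termination clause"}' }],
  { name: "example_tool", args: { query: "termination clause" } }
);
console.log(result.pass); // true
```

Returning a `{ pass, reason }` object rather than a bare boolean keeps the failure message next to the verdict, which tends to make a failing row easier to read in a results viewer.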
The system prompt at `evals/prompts/agent.txt` is a copy of the proven prompt from `examples/eval-demo/lib/agent.ts`. Update both when changing the prompt.
+### Level 2: GDPval Benchmark (Model+SuperDoc vs Model-Only)
+
+| Command | What it does | Cost |
+|---------|-------------|------|
+|`pnpm --filter @superdoc-testing/evals run eval:gdpval`| Run GDPval benchmark |~$1-2 |
+
+### Level 3: DOCX Agent Benchmark (real agents, real documents)
+
+Runs actual Claude Code and Codex CLIs against DOCX tasks, comparing their performance with and without SuperDoc tools. 4 conditions x 2 agents x N tasks.
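
The "4 conditions x 2 agents x N tasks" matrix can be sketched as a plain Cartesian product. This is only an illustration of how the run count grows: `buildRunMatrix`, the agent identifiers, the task ids, and every condition name other than `baseline` are hypothetical placeholders, not names used by the benchmark.

```javascript
// Only "baseline" comes from the README; the other three names are placeholders.
const CONDITIONS = ["baseline", "condition-2", "condition-3", "condition-4"];

// Hypothetical identifiers for the two agent CLIs (Claude Code and Codex).
const AGENTS = ["claude-code", "codex"];

// Build one run descriptor per (task, agent, condition) combination.
function buildRunMatrix(conditions, agents, tasks) {
  const runs = [];
  for (const task of tasks) {
    for (const agent of agents) {
      for (const condition of conditions) {
        runs.push({ task, agent, condition });
      }
    }
  }
  return runs;
}

// With N = 3 made-up tasks the matrix has 4 * 2 * 3 = 24 runs.
const tasks = ["task-a", "task-b", "task-c"];
const runs = buildRunMatrix(CONDITIONS, AGENTS, tasks);
console.log(runs.length); // 24
```

Because every task is run 8 times (4 conditions for each of 2 agents), the total number of agent invocations grows linearly with the number of tasks.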
+
+**Conditions:**
+
+| Condition | What the agent gets |
+|-----------|-------------------|
+| baseline | No skill, agent figures out DOCX on its own |