A config-driven framework for benchmarking generative models on multivariate time-series return scenarios.
The framework is built around one core rule:
the benchmark owns the dataset, the train/test split, the protocol definition, the rolling window extraction policy, the preprocessing pipeline, and the runtime selection; models only implement the generation logic.
That makes runs easier to reproduce and makes model comparisons much fairer.
- synthetic regime-switching factor stochastic-volatility scenario generation
- external dataset support from CSV and Parquet files
- benchmark-owned branched protocol settings:
  - common: kind, horizon, n_model_scenarios, n_reference_scenarios
  - forecast: forecast.train_size, forecast.test_size, forecast.context_length, forecast.eval_stride, forecast.train_stride
  - unconditional windowed: unconditional_windowed.train_size, unconditional_windowed.test_size, unconditional_windowed.eval_stride, unconditional_windowed.train_stride
  - unconditional path dataset: unconditional_path_dataset.n_train_paths, unconditional_path_dataset.n_realized_paths
- explicit preprocessing pipelines per model
- built-in models:
- historical bootstrap
- historical bootstrap with stochastic volatility
- plugin discovery for external model packages
- plugin manifests and capability metadata for UI / CLI discovery
- CLI runner
- notebook API for browsing benchmarks, swapping datasets, injecting local models, and inspecting results
- Streamlit UI with dataset and device selection
- metrics and saved benchmark artifacts tagged with dataset metadata
A benchmark is most useful when the comparison contract is stable.
In this project, the stable pieces are owned by the benchmark:
- dataset loading
- split logic
- rolling evaluation origins
- context length and forecast horizon
- supervised training-window extraction stride
- preprocessing visibility
- metrics
- reporting and artifact storage
The model side is intentionally small:
- external model authors implement the structural contract in ts_benchmark.model_contract
- notebook users work through ts_benchmark.notebook
- optionally publish a plugin manifest so the UI and CLI can describe the model clearly
This lets a model under development live in its own repository and still be benchmarked against all existing models.
From the project root:
python -m pip install -e .
Optional PyTorch-backed helpers:
python -m pip install -e .[torch]
Optional UI dependencies:
python -m pip install -r requirements-ui.txt
Optional MLflow tracking dependencies:
python -m pip install -e .[tracking]
Official adapter subproject:
python -m pip install -e ./official_adapters
TimeGrad backend dependencies for the official adapter package:
python -m pip install -e ./official_adapters[timegrad]
The core ts-benchmark package is intentionally installable without PyTorch.
That keeps benchmark browsing, config loading, dataset inspection, metrics,
results, notebook workflows, and CPU-only built-in baselines lightweight.
Install the optional torch extra only when you need PyTorch-backed helpers or
device-aware acceleration in the core package.
Likewise, install official adapter backend extras only when you actually want
to execute those adapter models in a given environment. In notebook workflows,
the recommended path is often to keep the main notebook env light and use
ts_benchmark.notebook.provision_adapter_venv(...) to create a dedicated
subprocess environment for heavyweight adapters such as TimeGrad.
The repo is structured as a small monorepo:
- root project: ts-benchmark
- sibling plugin project: official_adapters/ (ts-benchmark-official-adapters)
Validate a config:
ts-benchmark validate smoke_test
Run a benchmark:
ts-benchmark run synthetic_basic_benchmark
List built-in and discovered model plugins:
ts-benchmark plugins
List them as JSON, including manifests and capability metadata:
ts-benchmark plugins --json
Mutable UI/notebook workspace state is stored under
$TS_BENCHMARK_HOME when set, otherwise under the XDG-style default
~/.local/share/ts-benchmark/.
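The lookup rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior: the function name `resolve_workspace_home` is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_workspace_home(env=None) -> Path:
    """Return the workspace directory: $TS_BENCHMARK_HOME if set, else the XDG-style default."""
    env = os.environ if env is None else env
    override = env.get("TS_BENCHMARK_HOME")
    if override:  # assumption: an empty value is treated like an unset variable
        return Path(override).expanduser()
    return Path.home() / ".local" / "share" / "ts-benchmark"
```

Passing a plain dict instead of `os.environ` makes the rule easy to check without touching the real environment.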
Launch the UI:
streamlit run streamlit_app.py
The supported public API is split by audience.
Use import ts_benchmark for the small high-level surface:
- BenchmarkConfig
- Protocol
- BenchmarkResults
- BenchmarkDiagnostics
- BenchmarkRunArtifacts
- load_benchmark_config
- validate_benchmark_config
- dump_benchmark_config
- list_benchmark_summaries
- summarize_benchmark
- run_benchmark_from_config
These are also public and intended for direct use:
- ts_benchmark.benchmark: Benchmark definitions, protocol, config IO, and shipped benchmark browsing.
- ts_benchmark.run: Programmatic run execution and run configuration types.
- ts_benchmark.notebook: Notebook-first browsing, execution, dataset injection, model injection, and result inspection.
- ts_benchmark.results: Result objects and reporting helpers.
- ts_benchmark.metrics: Metric config, selection, and ranking helpers.
- ts_benchmark.model_contract: The public structural contract for external model authors.
- ts_benchmark.tracking: Optional MLflow integration helpers.
The following are importable but not part of the stable public API:
- ts_benchmark.ui.*
- ts_benchmark.model.wrappers.*
- ts_benchmark.model.contracts
- ts_benchmark.run.evaluator
- ts_benchmark.run.storage
- ts_benchmark.dataset.providers.*
from ts_benchmark.benchmark import list_benchmark_summaries
from ts_benchmark.notebook import (
    dataset_frame,
    entrypoint_model,
    provision_adapter_venv,
    run_benchmark,
)

benchmarks = list_benchmark_summaries()
run = run_benchmark(
    "smoke_test",
    include=["scenarios", "diagnostics"],
    with_model=entrypoint_model(
        "my_local_model",
        "/path/to/model.py:build_estimator",
        ridge=0.1,
    ),
)
metrics = run.metrics()
band = run.scenario_band("my_local_model", evaluation_window=0, asset=0)
dataset = dataset_frame("smoke_test").frame
For dataset-first notebook workflows that rerun heavyweight official adapters, you can provision a dedicated subprocess env instead of installing those dependencies into the main notebook env:
timegrad_env = provision_adapter_venv(
    "outputs/venvs/timegrad",
    "pytorchts_timegrad",
)
The benchmark controls:
- protocol.kind
- horizon
- n_model_scenarios
- n_reference_scenarios
- branch-specific fields inside:
  - forecast
  - unconditional_windowed
  - unconditional_path_dataset
These live in the top-level protocol block of the JSON config and are passed into models through the runtime Python contract.
This means a model config should not duplicate these values inside model.params.
The loader validates this and rejects configs that try to redefine benchmark-owned protocol fields in the model block.
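As an illustration of that validation rule, the following sketch scans model blocks for benchmark-owned keys. It is not the loader's actual code: `BENCHMARK_OWNED_FIELDS` and `find_protocol_overrides` are hypothetical names, and the set of leaf field names is assembled from the protocol fields listed in this document.

```python
# Hypothetical list of benchmark-owned leaf names, taken from the protocol
# fields documented in this README; the real loader may check more or fewer.
BENCHMARK_OWNED_FIELDS = frozenset({
    "kind", "horizon", "n_model_scenarios", "n_reference_scenarios",
    "train_size", "test_size", "context_length", "eval_stride",
    "train_stride", "n_train_paths", "n_realized_paths",
})


def find_protocol_overrides(config: dict) -> dict:
    """Map each offending model name to the protocol fields it redefines in params."""
    offenders = {}
    for model in config.get("benchmark", {}).get("models", []):
        params = model.get("params", {})
        clashes = sorted(BENCHMARK_OWNED_FIELDS.intersection(params))
        if clashes:
            offenders[model.get("name", "<unnamed>")] = clashes
    return offenders
```

A loader following this pattern would reject the config whenever the returned mapping is non-empty.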
Sequence models often turn a training history into many supervised windows.
If each model extracts those windows differently, the benchmark stops being comparable.
The optional benchmark-level train_stride field lets the benchmark define a common training-window extraction stride for sequence models. Models read that value from train_data.protocol.train_stride instead of choosing their own hidden default.
For unconditional benchmarks, the benchmark also owns how training paths are constructed:
- kind = "unconditional_windowed" means the benchmark cuts one long path into benchmark-owned training paths, using horizon as the path length and unconditional_windowed.train_stride as the spacing between path starts
- kind = "unconditional_path_dataset" means the dataset already provides multiple independent training paths, with unconditional_path_dataset.n_train_paths training paths and unconditional_path_dataset.n_realized_paths held-out realized paths
For the default unconditional benchmark task, path length is fixed:
- unconditional_windowed: training path length equals horizon
- unconditional_path_dataset: training path length and held-out realized path length equal horizon
Today path_dataset is implemented for synthetic datasets. External CSV/Parquet datasets use windowed_path.
In both cases, the runtime normalizes unconditional training into a dataset of paths before handing it to public-contract models.
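A minimal sketch of the windowed construction, assuming windows start at row 0 and only full-length windows are kept. The helper name `cut_training_paths` is hypothetical and the shipped implementation may handle edges differently.

```python
import numpy as np


def cut_training_paths(values, horizon: int, train_stride: int) -> np.ndarray:
    """Cut one long [time, target_dim] path into [n_paths, horizon, target_dim] windows."""
    values = np.asarray(values, dtype=float)
    n_rows, n_assets = values.shape
    # Window starts are spaced by train_stride; incomplete trailing windows are dropped.
    starts = range(0, n_rows - horizon + 1, train_stride)
    paths = [values[s:s + horizon] for s in starts]
    if not paths:
        return np.empty((0, horizon, n_assets))
    return np.stack(paths)
```

With 10 rows, horizon 4 and stride 3, this yields three paths starting at rows 0, 3 and 6.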
A benchmark config controls:
- benchmark identity
- dataset source and dataset-specific parameters
- protocol settings
- metric definitions
- model definitions
- run settings
- output options
{
"version": "1.0",
"benchmark": {
"name": "...",
"dataset": {...},
"protocol": {...},
"models": [...]
}
}
Supported provider kinds:
- synthetic
- csv
- parquet
Synthetic example:
{
"benchmark": {
"name": "synthetic_regime_sv",
"dataset": {
"name": "synthetic_regime_sv",
"provider": {
"kind": "synthetic",
"config": {
"generator": "regime_switching_factor_sv",
"params": {"n_assets": 4, "seed": 11}
}
},
"schema": {
"layout": "tensor",
"frequency": "B"
},
"semantics": {},
"metadata": {}
}
}
}
External CSV example:
{
"benchmark": {
"name": "my_returns",
"dataset": {
"name": "my_returns",
"provider": {
"kind": "csv",
"config": {
"path": "../data/my_returns.csv",
"dropna": "any"
}
},
"schema": {
"layout": "wide",
"time_column": "date",
"target_columns": ["SPX", "SX5E", "NKY"],
"frequency": "B"
},
"semantics": {
"target_kind": "returns"
},
"metadata": {}
}
}
}
External price-file example:
{
"benchmark": {
"name": "my_prices",
"dataset": {
"name": "my_prices",
"provider": {
"kind": "csv",
"config": {
"path": "../data/my_prices.csv"
}
},
"schema": {
"layout": "wide",
"time_column": "date",
"frequency": "B"
},
"semantics": {
"target_kind": "prices",
"return_kind": "log"
},
"metadata": {}
},
"protocol": {...},
"models": [...]
}
}
The protocol block is benchmark-owned and common to all models in the run:
{
"benchmark": {
"protocol": {
"kind": "forecast",
"horizon": 4,
"n_model_scenarios": 16,
"n_reference_scenarios": 32,
"forecast": {
"train_size": 180,
"test_size": 80,
"context_length": 12,
"eval_stride": 20,
"train_stride": 4
}
}
}
}
Interpretation:
- kind: selects the protocol branch, one of forecast, unconditional_windowed, or unconditional_path_dataset
- horizon: number of points each model must generate per evaluation case
- forecast.train_size: number of rows used for model fitting in forecast mode
- forecast.test_size: number of rows reserved for rolling evaluation in forecast mode
- forecast.context_length: length of the conditioning history
- forecast.eval_stride: spacing between evaluation origins in the held-out future region
- forecast.train_stride: spacing used when the benchmark converts the training region into supervised forecast examples
- unconditional_windowed.train_size: number of rows used for unconditional fitting when the benchmark windows one long path
- unconditional_windowed.test_size: length of the held-out future region used for unconditional rolling evaluation
- unconditional_windowed.eval_stride: spacing between unconditional evaluation origins
- unconditional_windowed.train_stride: spacing used when the benchmark cuts unconditional training paths from one long path
- unconditional_path_dataset.n_train_paths: number of benchmark-provided training paths
- unconditional_path_dataset.n_realized_paths: number of held-out realized paths used at evaluation
- n_model_scenarios: number of scenarios requested from each model per evaluation window
- n_reference_scenarios: number of reference scenarios drawn when the dataset provides a true generator
Metrics are configured as metric-definition objects. In config files, the usual pattern is to select built-in metrics by name:
{
"benchmark": {
"metrics": [
{"name": "crps"},
{"name": "energy_score"},
{"name": "cross_correlation_error"}
]
}
}
If the metrics block is omitted, the benchmark uses its built-in default metric set and automatically drops metrics that are not applicable to the current dataset, such as reference-scenario metrics on external CSV/Parquet data.
Each model carries its own preprocessing pipeline definition.
Example:
{
"pipeline": {
"name": "standardized",
"steps": [
{"type": "standard_scale", "params": {"with_mean": true, "with_std": true}},
{"type": "clip", "params": {"min_value": -5.0, "max_value": 5.0}}
]
}
}
Built-in transforms include:
- identity
- demean
- standard_scale
- min_max_scale
- robust_scale
- clip
- winsorize
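To make the pipeline definition concrete, here is a rough sketch of how a list of steps could be applied to a training matrix. It covers only identity, standard_scale and clip, and the exact semantics (population standard deviation, per-column statistics) are assumptions rather than the benchmark's documented behavior. The function `fit_apply_pipeline` is hypothetical.

```python
import numpy as np


def fit_apply_pipeline(train, steps):
    """Apply pipeline steps in order to a [time, target_dim] training matrix."""
    x = np.asarray(train, dtype=float)
    for step in steps:
        params = step.get("params", {})
        kind = step["type"]
        if kind == "identity":
            continue
        if kind == "standard_scale":
            # Assumed semantics: per-column statistics estimated on the training data.
            mean = x.mean(axis=0) if params.get("with_mean", True) else 0.0
            std = x.std(axis=0) if params.get("with_std", True) else 1.0
            x = (x - mean) / std
        elif kind == "clip":
            lo, hi = params.get("min_value"), params.get("max_value")
            if lo is not None or hi is not None:
                x = np.clip(x, lo, hi)
        else:
            raise ValueError(f"unsupported step type in this sketch: {kind!r}")
    return x
```

A real pipeline would also store the fitted statistics so the same transform can be applied to evaluation contexts and inverted on generated scenarios; that part is omitted here.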
There are three supported model reference kinds:
- built-in: builtin
- discovered external: plugin
- direct Python: entrypoint
Built-in example:
{
"name": "historical_bootstrap",
"reference": {
"kind": "builtin",
"value": "historical_bootstrap"
},
"params": {
"block_size": 3
},
"pipeline": {
"name": "raw",
"steps": []
}
}
Plugin example:
{
"name": "my_research_model",
"reference": {
"kind": "plugin",
"value": "my_research_model"
},
"params": {
"hidden_size": 64,
"dropout": 0.1
},
"pipeline": {
"name": "raw",
"steps": []
}
}
Direct entrypoint example:
{
"name": "entrypoint_model",
"reference": {
"kind": "entrypoint",
"value": "my_package.my_module:build_model"
},
"params": {
"hidden_size": 64
},
"pipeline": {
"name": "raw",
"steps": []
}
}
Important: do not place kind, horizon, n_model_scenarios, n_reference_scenarios, or branch-owned protocol fields such as forecast.context_length, forecast.train_size, unconditional_windowed.train_stride, or unconditional_path_dataset.n_train_paths inside model.params. Those belong to protocol.
Use the three model-reference modes for different stages of model maturity:
| Mode | Best for | Config value | Needs packaging/install | Shows up in ts-benchmark plugins / UI plugin discovery | Typical owner |
|---|---|---|---|---|---|
| built-in | stable models maintained by this benchmark repo | reference.kind = "builtin" | no extra install beyond the benchmark itself | yes | benchmark repo |
| external plugin | models you want to install, share, and discover cleanly across environments | reference.kind = "plugin" | yes | yes | external model repo |
| direct entrypoint | local development and rapid iteration against in-progress code | reference.kind = "entrypoint" | not necessarily, but the module must be importable | no | external model repo or local workspace |
Practical guidance:
- use entrypoint while a model is still under active development and you want the lowest-friction benchmark loop
- use plugin once you want the model to be installable, discoverable in the CLI/UI, and runnable by short name across machines
- use builtin only for models that are intentionally shipped as part of the benchmark package itself
Recommended workflow:
- start with entrypoint while you are iterating on model code and benchmark compatibility
- package the model as an external plugin once you want clean installation, short-name configs, and CLI/UI discovery
- promote the model to built-in only if the benchmark repo intends to ship, document, test, and maintain it as part of the benchmark itself
Important behavioral difference:
- plugin models appear in ts-benchmark plugins and in the Streamlit plugin discovery panel
- entrypoint models do not appear there; they are loaded only when a config explicitly references their Python import path
- in the Streamlit UI, entrypoint models still appear in the "Models declared in current config" panel once they are present in the loaded JSON config
Promoting a model from external plugin to built-in is mainly a product and maintenance decision, not a capability requirement.
Advantages of built-in status:
- users can run the model with a short built-in reference value without separately installing a model package
- the benchmark repo can test, document, version, and release that model together with the rest of the framework
- example configs, CLI/UI discovery, and default benchmark workflows work out of the box for all benchmark users
- benchmark maintainers can add benchmark-owned wiring when needed, such as special construction logic or default runtime propagation
Tradeoff:
- once a model is built-in, the benchmark repo effectively owns ongoing compatibility, dependency management, and user support for that model
An entrypoint is just a direct Python import path of the form:
package.module:ClassOrFactory
This is the recommended development-time path when your model lives outside the benchmark repo and is not yet packaged as a plugin.
Operational implications:
- the Python process running the benchmark must be able to import that module
- this usually means one of:
  - the model repo is installed in the environment
  - the model repo is on PYTHONPATH
  - the model code lives directly in the same importable workspace
- because entrypoint models are not plugin-discovered, they will not show up in the plugin listing commands or the UI plugin browser, but the Streamlit UI can still show them in the config-model summary for the currently loaded config
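Resolving a package.module:ClassOrFactory string needs nothing beyond the standard library. The sketch below shows one common way to do it; `load_entrypoint` is a hypothetical helper, not the benchmark's own loader, and it handles only importable module paths (not file-path forms).

```python
import importlib


def load_entrypoint(spec: str):
    """Import 'package.module:attr' and return the referenced object."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'package.module:ClassOrFactory', got {spec!r}")
    # Fails with ImportError if the module is not importable in this process.
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj
```

For example, `load_entrypoint("json:dumps")` returns the standard library's `json.dumps`; an `ImportError` here is the usual symptom of a model repo that is neither installed nor on PYTHONPATH.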
A plugin is an installable Python package that:
- registers a model factory through the benchmark's ts_benchmark.models entry-point group
- ships a packaged ts_benchmark_plugin.toml metadata file next to the builder module
Operational implications:
- the plugin must be installed into the same Python environment used to run the benchmark CLI or Streamlit UI
- once installed, the model can be referenced by a short name like "reference": {"kind": "plugin", "value": "my_model"}
- the packaged plugin metadata file makes the model discoverable in the CLI/UI and enriches saved run metadata
- if you install or update a plugin while the Streamlit UI is already running, restart the UI process so discovery metadata is refreshed
By default, every model runs in the same Python environment as the benchmark process.
When one model family needs a conflicting dependency stack, you can keep the benchmark in its normal environment and move only that model into a dedicated subprocess/venv with a per-model execution block:
{
"name": "deepvar_external",
"reference": {
"kind": "plugin",
"value": "gluonts_deepvar"
},
"execution": {
"mode": "subprocess",
"venv": "eqbench-mxnet"
},
"params": {
"epochs": 1,
"batch_size": 8
},
"pipeline": {
"name": "raw",
"steps": []
}
}
Key points:
- this does not change how the model is referenced; only model.reference identifies the model
- it only changes where that model executes
- the benchmark still passes the same benchmark-owned task and data semantics
- benchmark-level device selection still applies; the selected device is forwarded into the external runner through runtime metadata
- models without an execution block still run in-process in the current environment
Use this mode when:
- two model stacks need incompatible package versions
- one model family needs its own venv/container
- you want to keep the benchmark UI/CLI in a stable main environment while isolating a problematic backend
The shipped MXNet examples use this pattern for gluonts_deepvar and gluonts_gpvar.
{
"run": {
"seed": 21,
"execution": {
"device": "cuda:0",
"scheduler": "auto"
},
"output": {
"keep_scenarios": true
}
}
}
The selected device is recorded in benchmark outputs and passed through runtime metadata to models.
This section is meant to answer:
- what data formats are supported
- how the benchmark interprets a file
- how rolling windows are extracted
- how synthetic and external data differ in the reported metrics
Synthetic datasets are useful for controlled experiments.
The included synthetic baseline is a regime-switching factor stochastic-volatility generator designed to reproduce common time-series stylized facts:
- heavy tails
- cross-asset dependence
- volatility clustering
- calm/stress regime changes
- mild leverage-like asymmetry
A simplified form is:
r_{t,i} = μ_{z_t} + β_i σ^m_t ε^m_t + s_i σ^{id}_{t,i} ε^{id}_{t,i}
where:
- z_t is a latent regime state
- σ^m_t is a market-level volatility process
- σ^{id}_{t,i} is an idiosyncratic volatility process
- factor and idiosyncratic shocks jointly generate cross-sectional dependence and clustered volatility
Because the synthetic benchmark controls the true conditional data-generating process, it can also draw reference conditional scenarios from the same generator. That enables richer distributional metrics beyond realized-path scoring.
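For intuition, the simplified equation can be simulated with a toy two-regime model. Everything numeric below (regime means, volatility levels, persistence, the AR(1) log-volatility dynamics) is an illustrative assumption and not the shipped regime_switching_factor_sv generator.

```python
import numpy as np


def simulate_toy_returns(n_steps: int, n_assets: int, seed: int = 0) -> np.ndarray:
    """Toy version of r_{t,i} = mu_{z_t} + beta_i*sigma^m_t*eps^m_t + s_i*sigma^{id}_{t,i}*eps^{id}_{t,i}."""
    rng = np.random.default_rng(seed)
    mu = np.array([0.0005, -0.0010])       # regime means: calm, stress (assumed values)
    market_vol = np.array([0.01, 0.03])    # regime volatility levels (assumed values)
    stay_prob = np.array([0.98, 0.90])     # probability of staying in the current regime
    beta = rng.uniform(0.5, 1.5, n_assets)  # factor loadings beta_i
    s = rng.uniform(0.5, 1.0, n_assets)     # idiosyncratic scales s_i
    z, log_vm, log_vid = 0, 0.0, np.zeros(n_assets)
    out = np.empty((n_steps, n_assets))
    for t in range(n_steps):
        if rng.random() > stay_prob[z]:
            z = 1 - z                       # switch regime
        # AR(1) log-volatilities produce volatility clustering.
        log_vm = 0.95 * log_vm + 0.2 * rng.normal()
        log_vid = 0.90 * log_vid + 0.2 * rng.normal(size=n_assets)
        sigma_m = market_vol[z] * np.exp(log_vm)
        sigma_id = 0.005 * np.exp(log_vid)
        out[t] = mu[z] + beta * sigma_m * rng.normal() + s * sigma_id * rng.normal(size=n_assets)
    return out
```

The shared market shock is what creates cross-asset dependence, while the regime switch and persistent log-volatilities give heavier tails and clustered volatility than an i.i.d. Gaussian model.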
External datasets are for benchmarking on real or user-provided data.
Supported file types:
- CSV
- Parquet
The benchmark can ingest either:
- return series directly, or
- price series which it converts into returns
The simplest layout is:
- one date column
- one column per asset
- one row per timestamp
Example:
| date | SPX | SX5E | NKY |
|---|---|---|---|
| 2020-01-02 | 0.0071 | 0.0084 | 0.0012 |
| 2020-01-03 | -0.0068 | -0.0091 | -0.0047 |
If the file contains prices instead of returns, set:
"dataset": {
"semantics": {
"target_kind": "prices"
}
}
and choose either:
"dataset": {
"semantics": {
"return_kind": "simple"
}
}
or:
"dataset": {
"semantics": {
"return_kind": "log"
}
}
- dataset.provider.config.path
- dataset.provider.config.dropna
- dataset.provider.config.read_kwargs
- dataset.schema.time_column
- dataset.schema.target_columns
- dataset.semantics.target_kind (returns or prices)
- dataset.semantics.return_kind (simple or log)
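The two return_kind options correspond to the standard simple and log return definitions. A small sketch, with a hypothetical helper name and no claim about how the benchmark handles missing values or the first row:

```python
import numpy as np


def prices_to_returns(prices, return_kind: str = "simple") -> np.ndarray:
    """Convert a [time, target_dim] price matrix into returns with one fewer row."""
    p = np.asarray(prices, dtype=float)
    if return_kind == "simple":
        return p[1:] / p[:-1] - 1.0           # P_t / P_{t-1} - 1
    if return_kind == "log":
        return np.diff(np.log(p), axis=0)     # log(P_t) - log(P_{t-1})
    raise ValueError(f"return_kind must be 'simple' or 'log', got {return_kind!r}")
```

Log returns add up across time, which is convenient for multi-step horizons; simple returns aggregate across assets in a portfolio. For small moves the two are nearly identical.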
Given the loaded return matrix, the benchmark:
- takes the first train_size rows as the fit sample
- takes the next test_size rows as the evaluation region
- rolls forecast origins through the evaluation region using eval_stride
- exposes context_length rows before each origin as the conditioning context
- builds benchmark-owned forecast fit examples whose history contains all past rows up to the target origin and whose context is the trailing conditioning suffix
- compares generated horizon-step scenarios to the realized future path
This means the benchmark defines the evaluation windows once and all models are tested on the same windows.
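The window extraction can be sketched as follows. The helper names are hypothetical, and one detail is an assumption: an origin is kept only if a full horizon of realized rows fits inside the evaluation region.

```python
import numpy as np


def forecast_origins(train_size: int, test_size: int, horizon: int, eval_stride: int) -> list:
    """Row indices at which forecasts start, spaced by eval_stride inside the evaluation region."""
    last_origin = train_size + test_size - horizon  # assumed: full horizon must be observed
    return list(range(train_size, last_origin + 1, eval_stride))


def evaluation_windows(returns, *, train_size, test_size, context_length, horizon, eval_stride):
    """Yield (context, realized_future) pairs shared by every model."""
    returns = np.asarray(returns, dtype=float)
    for origin in forecast_origins(train_size, test_size, horizon, eval_stride):
        context = returns[origin - context_length:origin]
        future = returns[origin:origin + horizon]
        yield context, future
```

With the protocol example from this README (train_size 180, test_size 80, horizon 4, eval_stride 20), this sketch produces origins at rows 180, 200, 220 and 240.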
These compare sampled predictive distributions to realized future paths:
- crps
- energy_score
- predictive_mean_mse
- coverage_90_error
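As a reference point for what a sample-based score looks like, the standard empirical CRPS estimator for a single scalar outcome is E|X - y| - 0.5 E|X - X'|, computed over the model's samples. The sketch below implements that textbook formula; how the benchmark aggregates it over horizon steps and assets is not shown and may differ.

```python
import numpy as np


def sample_crps(samples, observed: float) -> float:
    """Empirical CRPS for one scalar observation given an ensemble of samples."""
    s = np.asarray(samples, dtype=float).ravel()
    # Average distance between samples and the realized value.
    accuracy = np.abs(s - observed).mean()
    # Average pairwise distance between samples (rewards sharp ensembles).
    spread = np.abs(s[:, None] - s[None, :]).mean()
    return float(accuracy - 0.5 * spread)
```

Lower is better: an ensemble concentrated exactly on the realized value scores 0.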
These are available on synthetic datasets because the benchmark can sample the true conditional future distribution:
- mean_error
- volatility_error
- skew_error
- excess_kurtosis_error
- cross_correlation_error
- autocorrelation_error
- squared_autocorrelation_error
- var_95_error
- es_95_error
- max_drawdown_error
- mmd_rbf
On external datasets, the benchmark defaults to realized-path scoring metrics because there is no true latent generator available.
Saved benchmark tables include dataset metadata such as:
- dataset_name
- dataset_source
- device
- has_reference_scenarios
- protocol_kind
- generation_mode
- path_construction
- context_length
- horizon
- eval_stride
- train_stride
- n_train_paths
- n_realized_paths
This makes it much easier to compare runs across multiple datasets.
The recommended model-author API is:
from ts_benchmark.model_contract import ...
The key objects are:
- TSGeneratorEstimator
- FittedTSGenerator
- DataSchema
- TSSeries
- TrainExample
- TrainData
- TaskSpec
- GenerationRequest
- GenerationResult
- ModelCapabilities
- FitReport
Important shape conventions:
- TSSeries.values: [time, target_dim]
- GenerationResult.samples: [num_samples, generated_time, target_dim]
Important task semantics:
- task.mode = "forecast":
  - task.horizon is the forecast horizon
  - fit(train=...) receives TrainData(examples=...) with history, context, and target
- task.mode = "unconditional":
  - task.horizon is the desired generated sequence length
  - fit(train=...) receives TrainData(examples=...) with history=None and context=None
Minimal estimator/generator shape:
estimator.fit(train, *, schema, task, valid=None, runtime=None) -> (generator, fit_report)
generator.capabilities() -> ModelCapabilities
generator.sample(request) -> GenerationResult
The benchmark also has an internal runtime ABI based on ScenarioModel,
TrainingData, and ScenarioRequest, but that is an internal integration
surface for built-ins, wrappers, and legacy adapters. External model authors
should prefer ts_benchmark.model_contract.
A packaged plugin manifest is optional but strongly recommended.
The benchmark uses manifests to populate the CLI, the Streamlit UI, and saved run metadata with information such as:
- display name
- model family
- version
- supported dataset sources
- device hints
- whether the model is multivariate
- whether it produces probabilistic samples
- whether it uses the benchmark device setting directly
Plugin metadata lives in a packaged ts_benchmark_plugin.toml resource. For a single-model package, the top-level shape is:
[manifest]
display_name = "My research model"
version = "0.1.0"
family = "diffusion"
description = "Research prototype for probabilistic time-series scenarios."
runtime_device_hints = ["cpu", "cuda"]
supported_dataset_sources = ["synthetic", "csv", "parquet"]
required_input = "returns"
default_pipeline = "standardized"
tags = ["research", "diffusion"]
[manifest.capabilities]
multivariate = true
probabilistic_sampling = true
benchmark_protocol_contract = true
explicit_preprocessing = true
uses_benchmark_device = true
For multi-model packages, use:
[plugins.my_model.manifest]
display_name = "My research model"
default_pipeline = "standardized"
This section is the intended onboarding path for anyone who wants to benchmark a new model against the models already included.
Recommended structure:
my-model-plugin/
  pyproject.toml
  src/
    my_model_plugin/
      __init__.py
      plugin.py
Your model does not need to be added under src/ts_benchmark/model.
Decision rule:
- if you only want to run benchmarks against your local development code, entrypoint is usually enough
- if you want short-name configs, CLI/UI discovery, and a cleaner install story, package the model as an external plugin
- only move a model into the benchmark repo itself when you want it maintained as a built-in benchmark model
Implement the public contract from ts_benchmark.model_contract.
from __future__ import annotations

import numpy as np

from ts_benchmark.model_contract import (
    FitReport,
    GenerationMode,
    GenerationRequest,
    GenerationResult,
    ModelCapabilities,
)


class MyGenerator:
    def __init__(self, mean: np.ndarray, std: np.ndarray):
        self.mean = np.asarray(mean, dtype=float)
        self.std = np.asarray(std, dtype=float)

    def capabilities(self) -> ModelCapabilities:
        return ModelCapabilities(
            supported_modes=frozenset({GenerationMode.FORECAST, GenerationMode.UNCONDITIONAL}),
            supports_multivariate_targets=True,
        )

    def sample(self, request: GenerationRequest) -> GenerationResult:
        # Conditioning history; this i.i.d. Gaussian toy model reads it but does not use it.
        values = np.asarray(request.series.values, dtype=float)
        horizon = int(request.task.horizon or 1)
        rng = np.random.default_rng(None if request.runtime is None else request.runtime.seed)
        # Output shape: [num_samples, generated_time, target_dim].
        draws = rng.normal(
            loc=self.mean[None, None, :],
            scale=self.std[None, None, :],
            size=(request.num_samples, horizon, self.mean.shape[0]),
        )
        return GenerationResult(samples=draws)

    def save(self, path):
        raise NotImplementedError


class MyEstimator:
    def fit(self, train, *, schema, task, valid=None, runtime=None):
        # task is still needed below, so only the unused arguments are discarded.
        del schema, valid, runtime
        if getattr(task, "mode", None) == GenerationMode.UNCONDITIONAL:
            # Unconditional examples carry no history; pool the target paths.
            values = np.concatenate(
                [np.asarray(example.target.values, dtype=float) for example in train.examples],
                axis=0,
            )
        else:
            # Forecast examples: pool history and target rows.
            values = np.concatenate(
                [
                    np.concatenate(
                        [
                            np.asarray(example.history.values, dtype=float),
                            np.asarray(example.target.values, dtype=float),
                        ],
                        axis=0,
                    )
                    for example in train.examples
                ],
                axis=0,
            )
        x = values.reshape(-1, values.shape[-1])
        generator = MyGenerator(mean=x.mean(axis=0), std=x.std(axis=0, ddof=1) + 1e-6)
        return generator, FitReport(train_metrics={"n_rows": float(x.shape[0])})


def build_estimator(**params):
    del params
    return MyEstimator()
If you are writing an in-repo model or an internal wrapper, the older
ts_benchmark.model ScenarioModel contract still exists, but it is not the
recommended public integration path.
Use:
mode = task.mode
horizon = task.horizon
num_samples = request.num_samples
history = request.series.values
Do not make benchmark-owned task settings a second hidden model config.
In particular, do not duplicate:
- generation mode
- horizon
- requested sample count
inside model.params.
The benchmark treats preprocessing as part of the experiment definition.
That means:
- your model should expect the benchmark to control preprocessing
- your model should not silently standardize or winsorize data unless that is explicitly part of the benchmark configuration or clearly documented in your plugin manifest / implementation
If your method genuinely requires a specific input normalization, document that in the manifest and in your model documentation.
Use:
- default_pipeline for the recommended pipeline
- required_pipeline only when the model truly cannot be benchmarked correctly with any other pipeline
Recommended approach:
[manifest]
display_name = "My research model"
version = "0.1.0"
family = "gaussian"
description = "Example plugin description."
runtime_device_hints = ["cpu", "cuda"]
supported_dataset_sources = ["synthetic", "csv", "parquet"]
required_input = "returns"
default_pipeline = "raw"
[manifest.capabilities]
multivariate = true
probabilistic_sampling = true
benchmark_protocol_contract = true
explicit_preprocessing = true
uses_benchmark_device = true
Add required_pipeline="raw" only if a different pipeline would make the model
semantically invalid rather than merely suboptimal.
In pyproject.toml:
[project.entry-points."ts_benchmark.models"]
my_model = "my_model_plugin.plugin:build_estimator"
and in plugin.py:
def build_estimator(**params):
    return MyEstimator(**params)
This step is what turns your model from "importable Python code" into a discoverable benchmark plugin.
Without this step, you can still benchmark the model through a direct config reference.kind = "entrypoint", but it will not appear in plugin listings or the UI plugin browser.
Add a ts_benchmark_plugin.toml file next to your plugin module and include it in package data.
In pyproject.toml:
[tool.setuptools.package-data]
my_model_plugin = ["ts_benchmark_plugin.toml"]
Example ts_benchmark_plugin.toml:
[manifest]
display_name = "My model"
default_pipeline = "raw"
[manifest.capabilities]
multivariate = true
probabilistic_sampling = true
This is how discoverable plugin metadata is surfaced in the UI and CLI.
python -m pip install -e /path/to/my-model-plugin
Editable install is the recommended workflow while your model is still under development.
Important:
- install the plugin into the same environment where you run ts-benchmark or streamlit run streamlit_app.py
- if the UI is already running when you install or update the plugin, restart the UI process
ts-benchmark plugins
or:
ts-benchmark plugins --json
Your model should appear with its manifest and capability metadata.
Reference the plugin by name:
{
"benchmark": {
"models": [
{
"name": "my_model_run",
"reference": {
"kind": "plugin",
"value": "my_model"
},
"params": {
"hidden_size": 64,
"dropout": 0.1
},
"pipeline": {
"name": "raw",
"steps": []
}
}
]
}
}
If you are still in the pre-plugin phase, the equivalent development-time config is:
{
"benchmark": {
"models": [
{
"name": "my_model_run",
"reference": {
"kind": "entrypoint",
"value": "my_model_plugin.plugin:build_estimator"
},
"params": {
"hidden_size": 64,
"dropout": 0.1
},
"pipeline": {
"name": "raw",
"steps": []
}
}
]
}
}
That approach is often the fastest way to test a new model against the benchmark before you decide whether packaging it as a plugin is worth the extra ceremony.
Remember:
- model hyperparameters live in model.params
- benchmark protocol values live in benchmark.protocol
- preprocessing lives in model.pipeline
ts-benchmark run my_config.json
Outputs include:
- metrics
- ranks
- config snapshot
- model infos
- plugin manifest metadata
- summary metadata
- optionally saved scenarios
For fair comparisons, keep these choices fixed across models unless the whole experiment is explicitly about varying them:
- dataset
- protocol.kind
- horizon
- forecast branch fields such as forecast.train_size, forecast.test_size, forecast.context_length, forecast.eval_stride, forecast.train_stride
- unconditional branch fields such as unconditional_windowed.train_size, unconditional_windowed.test_size, unconditional_windowed.eval_stride, unconditional_windowed.train_stride, unconditional_path_dataset.n_train_paths, unconditional_path_dataset.n_realized_paths
- scenario counts
- preprocessing pipeline
- benchmark seed
The benchmark has a benchmark-level runtime device field.
Model authors should read the runtime device through train_data.runtime.device and request.runtime.device if their implementation can use it.
If a model ignores device selection, document that in the manifest by setting uses_benchmark_device = false under [manifest.capabilities].
Current runtime behavior:
- device: null or UI auto means:
  - use all visible CUDA GPUs if available
  - otherwise use mps if available
  - otherwise fall back to cpu
- device: "cuda:0" pins the run to one specific GPU
- device: "cuda:0,cuda:1" restricts the run to a specific GPU subset
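The resolution rules above can be sketched as a small pure function. This is an illustration of the described behavior, not the benchmark's code: the function name is hypothetical, and hardware availability is passed in as arguments so the snippet does not depend on any particular backend.

```python
def resolve_devices(device, cuda_count, mps_available):
    """Map a device setting to a list of device strings.

    device: None, "auto", a single device, or a comma-separated GPU subset.
    cuda_count / mps_available: what the host offers (supplied by the caller).
    """
    if device is None or device == "auto":
        if cuda_count > 0:
            return [f"cuda:{i}" for i in range(cuda_count)]
        if mps_available:
            return ["mps"]
        return ["cpu"]
    # Explicit setting: one device, or several separated by commas.
    return [part.strip() for part in device.split(",") if part.strip()]


print(resolve_devices(None, cuda_count=2, mps_available=False))
print(resolve_devices("cuda:0,cuda:1", cuda_count=4, mps_available=False))
```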
When more than one CUDA device is available and the benchmark contains more than one model, the benchmark schedules models round-robin across the selected GPUs in separate worker processes.
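Round-robin assignment can be pictured with the following sketch. The function name is hypothetical and the real scheduler also manages worker processes, which this example leaves out.

```python
def assign_round_robin(model_names, devices):
    """Assign each model to a device, cycling through devices in order."""
    if not devices:
        raise ValueError("at least one device is required")
    return {
        name: devices[index % len(devices)]
        for index, name in enumerate(model_names)
    }


assignment = assign_round_robin(
    ["model_a", "model_b", "model_c"],
    ["cuda:0", "cuda:1"],
)
print(assignment)
```

With three models and two GPUs, the first and third models share the first GPU and the second model gets the other one.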
If a model is configured with a per-model external execution block, that device assignment is still forwarded into the child process. The child model may still fall back internally if its backend cannot actually honor that device.
A model can optionally implement:
def model_info(self) -> dict:
    ...
This information is saved in run artifacts and is useful for debugging and reporting resolved configuration details.
Typical things to include:
- architecture sizes
- diffusion steps
- backend name
- resolved device
- training loss summary
- any derived internal settings that help users interpret the run
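For illustration, a model_info implementation covering several of these items might look like the sketch below. The class, its attributes, and the dictionary keys are all hypothetical; the benchmark does not prescribe them here. The example keeps values to plain JSON-friendly types so they can be written to run artifacts.

```python
import json


class TinyModel:
    """Hypothetical model used only to illustrate model_info."""

    def __init__(self, hidden_size=64, dropout=0.1, device="cpu"):
        self.hidden_size = hidden_size
        self.dropout = dropout
        self.device = device
        self.train_losses = []  # filled in during training

    def model_info(self) -> dict:
        losses = self.train_losses
        return {
            "backend": "pure-python",
            "hidden_size": self.hidden_size,
            "dropout": self.dropout,
            "resolved_device": self.device,
            "n_train_steps": len(losses),
            "final_train_loss": losses[-1] if losses else None,
        }


model = TinyModel(device="cuda:0")
model.train_losses = [0.9, 0.5, 0.3]
print(json.dumps(model.model_info(), indent=2))
```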
This repository includes a complete external plugin example under:
plugin_examples/eqbench_demo_gaussian_plugin/
and an example config under:
plugin_examples/demo_gaussian_config.json
That example now includes both a model entry point and a manifest entry point.
The historical bootstrap model resamples historical return vectors from the training set.
Strengths:
- preserves empirical marginal behavior
- preserves same-date cross-sectional dependence
- very simple baseline
Limitation:
- does not explicitly model dynamic volatility
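The idea can be sketched in a few lines of standard-library Python. This is a simplified illustration that draws whole rows independently with replacement; the built-in model may differ in details such as array types or sampling scheme, and the function name is hypothetical.

```python
import random


def bootstrap_scenarios(train_returns, horizon, n_scenarios, seed=0):
    """Sample whole return vectors (rows) with replacement.

    train_returns: list of rows, one per date, each a list of asset returns.
    Returns n_scenarios paths, each a list of horizon rows.
    """
    rng = random.Random(seed)
    n_dates = len(train_returns)
    return [
        [list(train_returns[rng.randrange(n_dates)]) for _ in range(horizon)]
        for _ in range(n_scenarios)
    ]


# Three dates, two assets; values are made up for the example.
train = [[0.01, -0.02], [0.00, 0.03], [-0.01, 0.01]]
paths = bootstrap_scenarios(train, horizon=4, n_scenarios=5, seed=1)
print(len(paths), len(paths[0]), len(paths[0][0]))
```

Because each draw copies an entire row, the returns of all assets on a sampled date stay together, which is what preserves same-date cross-sectional dependence.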
The stochastic-volatility variant of the historical bootstrap uses EWMA volatility scaling, bootstraps standardized residuals, and simulates future volatility before re-inflating the residuals.
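A single-asset sketch of those steps follows. It is an illustration under stated assumptions, not the built-in implementation: the decay of 0.94, the initial variance level, and the choice to roll volatility forward by feeding simulated returns back into the EWMA recursion are all assumptions, and the function names are hypothetical.

```python
import math
import random


def ewma_variances(returns, decay=0.94):
    """Return (per-step variance forecasts, forecast for the next unseen step)."""
    var = sum(r * r for r in returns) / len(returns)  # assumed starting level
    if var == 0.0:
        raise ValueError("returns must not all be zero")
    variances = []
    for r in returns:
        variances.append(var)  # forecast made before observing r
        var = decay * var + (1.0 - decay) * r * r
    return variances, var


def simulate_path(returns, horizon, decay=0.94, seed=0):
    """Simulate one future return path for a single asset."""
    rng = random.Random(seed)
    variances, var = ewma_variances(returns, decay)
    # Standardize: divide each return by its forecast volatility.
    residuals = [r / math.sqrt(v) for r, v in zip(returns, variances)]
    path = []
    for _ in range(horizon):
        z = rng.choice(residuals)      # bootstrap a standardized residual
        r = z * math.sqrt(var)         # re-inflate with current volatility
        path.append(r)
        var = decay * var + (1.0 - decay) * r * r  # simulate next volatility
    return path


history = [0.01, -0.02, 0.015, -0.005, 0.03, -0.01]
print(simulate_path(history, horizon=5, seed=42))
```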
Validate:
ts-benchmark validate smoke_test
Run:
ts-benchmark run synthetic_basic_benchmark
List plugins:
ts-benchmark plugins
List plugins as structured JSON:
ts-benchmark plugins --json
The UI lets you:
- load an example config
- upload a config JSON
- choose a bundled dataset or upload an external dataset
- select an execution device
- inspect discovered model plugins and their manifests
- inspect models declared in the current config, including direct entrypoint models
- run the benchmark and inspect metrics, ranks, dataset summary, and saved model metadata
- browse MLflow experiments, runs, and logged benchmark artifacts from a tracking URI when mlflow is installed
Launch it with:
streamlit run streamlit_app.py
Benchmark runs can be logged to MLflow through an optional run.tracking.mlflow block.
Example:
{
"run": {
"execution": {
"scheduler": "auto"
},
"output": {},
"tracking": {
"mlflow": {
"enabled": true,
"tracking_uri": "sqlite:///mlflow.db",
"experiment_name": "ts-benchmark-dev",
"run_name": "constant-vol-smoke",
"tags": {
"owner": "research"
},
"log_artifacts": true,
"log_model_info": true,
"log_diagnostics": true,
"log_scenarios": false
}
}
}
}
When enabled, the benchmark logs:
- flattened benchmark/protocol/model parameters
- per-model metrics and average ranks
- functional smoke pass/fail summary when diagnostics are enabled
- benchmark artifacts such as metrics.csv, ranks.csv, config JSON, summary JSON, model info, and optional diagnostics/scenarios
Tracking stays optional:
- if run.tracking.mlflow.enabled is false, the benchmark behaves exactly as before
- subprocess model workers do not create their own MLflow runs; only the parent benchmark run is logged
For new setups, prefer a database-backed tracking URI such as sqlite:///mlflow.db. MLflow's filesystem backend still works, but current MLflow versions warn that it is deprecated.
A typical run can save:
- metrics.csv
- ranks.csv
- benchmark_config.json
- run.json
- model_results.json
- summary.json
- scenarios.npz (optional)
For each model result, the saved metadata includes:
- model config reference object (kind and value)
- model params
- pipeline summary
- execution info
- declared plugin manifest
- fitted model info, when provided by the model
- runtime-discovered manifest, when available
- metric results and ranks
- scenario output shape when scenarios were kept
At minimum, run metadata includes:
- dataset name
- dataset source
- selected device
- whether reference scenarios exist
- protocol_kind
- context_length
- horizon
- eval_stride
- train_stride
- path_construction
- n_train_paths
Useful starting points:
- smoke_test
- synthetic_basic_benchmark
- plugin_examples/demo_gaussian_config.json
- Synthetic data is only the starting point; the benchmark is designed to support real external datasets as first-class inputs.
- The benchmark contract is intentionally small so that external researchers can add new models with minimal friction.
- The plugin manifest layer is descriptive and developer-facing: it improves discoverability without forcing model code to live inside the benchmark repository.