Cross-Validation & Backtesting

Time series cross-validation requires special handling because data has temporal ordering. These functions help you create proper train/test splits, forecast on folds, and prevent data leakage.

Function	Description	Category
`ts_cv_folds_by`	Create CV folds by number of folds (primary)	Fold Creation
`ts_cv_split_by`	Create splits with explicit cutoff dates	Fold Creation
`ts_cv_split_folds_by`	View fold date ranges (train/test boundaries)	Fold Creation
`ts_cv_split_index_by`	Memory-efficient split (index columns only)	Fold Creation
`ts_cv_forecast_by`	Generate forecasts for all CV folds	Forecasting
`ts_cv_hydrate_by`	Add features to folds with automatic masking	Hydration
`ts_fill_unknown_by`	Fill unknown future values safely	Leakage Prevention
`ts_mark_unknown_by`	Mark rows as known/unknown	Leakage Prevention

Showing 8 of 8

Fold Creation

ts_cv_folds_by

Create train/test splits for backtesting. This is the primary fold creation function.

ts_cv_folds_by(
    source VARCHAR,
    group_col COLUMN,
    date_col COLUMN,
    target_col COLUMN,
    n_folds BIGINT,
    horizon BIGINT,
    params MAP
) → TABLE

Source must be a table name string, NOT a CTE. Data must be pre-cleaned (no gaps).

Parameters:

Parameter	Type	Required	Description
`source`	VARCHAR	Yes	Table name (quoted string)
`group_col`	COLUMN	Yes	Series identifier (unquoted)
`date_col`	COLUMN	Yes	Date/timestamp column (unquoted)
`target_col`	COLUMN	Yes	Target value column (unquoted)
`n_folds`	BIGINT	Yes	Number of CV folds
`horizon`	BIGINT	Yes	Number of test periods per fold
`params`	MAP	No	Configuration (use `MAP{}` for defaults)

Params MAP Options:

Key	Type	Default	Description
`gap`	BIGINT	`0`	Periods between train end and test start
`embargo`	BIGINT	`0`	Periods excluded from training after previous test
`window_type`	VARCHAR	`'expanding'`	`'expanding'`, `'fixed'`, or `'sliding'`
`min_train_size`	BIGINT	`1`	Min training size (fixed/sliding only)
`initial_train_size`	BIGINT	auto	Periods before first fold
`skip_length`	BIGINT	`horizon`	Periods between folds (1=dense)
`clip_horizon`	BOOLEAN	`false`	Allow partial test windows

Returns: Exactly 5 columns:

Column	Type	Description
`group_col`	(input)	Series identifier
`date_col`	(input)	Timestamp
`target_col`	DOUBLE	Target value
`fold_id`	BIGINT	Fold number
`split`	VARCHAR	`'train'` or `'test'`

caution

Output has exactly 5 columns. Features are NOT passed through — use ts_cv_hydrate_by to add features to folds.

Example:

-- Create 3 folds with 6-day horizon
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 6, MAP{});

-- Fixed window with gap
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 5, 14,
    MAP{'window_type': 'fixed', 'gap': '2'});

ts_cv_split_by

Split using explicit cutoff dates instead of automatic fold generation.

ts_cv_split_by(
    source VARCHAR,
    group_col IDENTIFIER,
    date_col IDENTIFIER,
    target_col IDENTIFIER,
    training_end_times DATE[],
    horizon BIGINT,
    params MAP
) → TABLE

Returns: group_col, date_col, target_col, fold_id, split

Example:

-- Custom quarterly cutoffs
CREATE TABLE cv_splits AS
SELECT * FROM ts_cv_split_by('sales', store_id, date, revenue,
    ['2024-03-31'::DATE, '2024-06-30'::DATE, '2024-09-30'::DATE], 30, MAP{});

ts_cv_split_folds_by

View fold date ranges before running a full split. Useful for verifying fold boundaries.

ts_cv_split_folds_by(
    source VARCHAR,
    group_col VARCHAR,
    date_col VARCHAR,
    training_end_times DATE[],
    horizon BIGINT,
    frequency VARCHAR
) → TABLE(fold_id, train_start, train_end, test_start, test_end, horizon)

Example:

SELECT * FROM ts_cv_split_folds_by(
    'sales_data', 'store_id', 'date',
    ['2024-01-10'::DATE, '2024-01-15'::DATE, '2024-01-20'::DATE],
    5, '1d'
);

ts_cv_split_index_by

Memory-efficient alternative returning only index columns without duplicating data.

ts_cv_split_index_by(
    source VARCHAR,
    group_col IDENTIFIER,
    date_col IDENTIFIER,
    training_end_times DATE[],
    horizon BIGINT,
    frequency VARCHAR,
    params MAP
) → TABLE(group_col, date_col, fold_id, split)

Example:

CREATE TABLE cv_index AS
SELECT * FROM ts_cv_split_index_by(
    'sales_data', store_id, date,
    ['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);

Forecasting on Folds

ts_cv_forecast_by

Generate univariate forecasts for all CV folds. Univariate only — no exogenous support.

ts_cv_forecast_by(
    cv_folds VARCHAR,
    group_col COLUMN,
    date_col COLUMN,
    target_col COLUMN,
    method VARCHAR,
    params MAP
) → TABLE

caution

Input table must have fold_id and split columns (from ts_cv_folds_by or ts_cv_split_by). Passing raw data throws a clear error. Horizon is inferred from test rows — no frequency parameter needed.

Returns:

Column	Type	Description
`fold_id`	BIGINT	Fold number
`group_col`	(input)	Series identifier
`date_col`	(input)	Timestamp
`target_col`	DOUBLE	Actual value (from test set)
`split`	VARCHAR	Always `'test'`
`yhat`	DOUBLE	Point forecast
`yhat_lower`	DOUBLE	Lower prediction interval
`yhat_upper`	DOUBLE	Upper prediction interval
`model_name`	VARCHAR	Model used

Example:

-- Create folds first
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 7, MAP{});

-- Forecast on folds
CREATE TABLE cv_results AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'AutoETS',
    MAP{'seasonal_period': '7'});

Hydration

ts_cv_hydrate_by

Add feature columns to CV folds with automatic masking: train rows get actual values, test rows get filled values (preventing data leakage).

ts_cv_hydrate_by(
    cv_folds VARCHAR,
    source VARCHAR,
    group_col COLUMN,
    date_col COLUMN,
    unknown_features VARCHAR[],
    params MAP
) → TABLE

Parameters:

Parameter	Type	Required	Description
`cv_folds`	VARCHAR	Yes	Table with fold assignments
`source`	VARCHAR	Yes	Table containing features
`group_col`	COLUMN	Yes	Group column (unquoted)
`date_col`	COLUMN	Yes	Date column (unquoted)
`unknown_features`	VARCHAR[]	Yes	Feature columns to mask in test sets
`params`	MAP	No	Fill strategy configuration

Params MAP Options:

Key	Type	Default	Description
`strategy`	VARCHAR	`'last_value'`	`'last_value'`, `'null'`, or `'default'`
`fill_value`	VARCHAR	`''`	Value when `strategy='default'`

Features are output as direct columns — no MAP extraction needed.

Example:

SELECT store_id, date, revenue, fold_id, split, competitor_sales, actual_temp
FROM ts_cv_hydrate_by('cv_folds', 'sales', store_id, date,
    ['competitor_sales', 'actual_temp'], MAP{'strategy': 'last_value'});

Leakage Prevention

ts_fill_unknown_by

Fill unknown future values in test sets to prevent data leakage.

ts_fill_unknown_by(
    source VARCHAR,
    group_col COLUMN,
    date_col COLUMN,
    value_col COLUMN,
    cutoff_date DATE,
    params MAP
) → TABLE

Params MAP Options:

Key	Type	Default	Description
`strategy`	VARCHAR	`'last_value'`	`'last_value'`, `'null'`, or `'default'`
`fill_value`	DOUBLE	`0.0`	Value when `strategy='default'`

Example:

-- Forward fill temperature beyond cutoff
SELECT * FROM ts_fill_unknown_by(
    'backtest_data', category, date, temperature,
    '2023-06-01'::DATE, MAP{}
);

-- Set unknown to NULL
SELECT * FROM ts_fill_unknown_by(
    'backtest_data', category, date, temperature,
    '2023-06-01'::DATE, MAP{'strategy': 'null'}
);

ts_mark_unknown_by

Mark rows as known/unknown based on cutoff date without modifying values.

ts_mark_unknown_by(
    source VARCHAR,
    group_col COLUMN,
    date_col COLUMN,
    cutoff_date DATE
) → TABLE(*, is_unknown BOOLEAN, last_known_date TIMESTAMP)

Returns: All source columns plus:

is_unknown: TRUE for future rows
last_known_date: Per-group last known timestamp

Example:

SELECT
    category, date, temperature,
    is_unknown, last_known_date,
    CASE
        WHEN is_unknown THEN AVG(temperature) OVER (PARTITION BY category)
        ELSE temperature
    END AS temperature_filled
FROM ts_mark_unknown_by(
    'backtest_data', category, date,
    '2023-06-01'::DATE
)
ORDER BY category, date;

Complete Workflow Example

-- 1. Create folds
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 7, MAP{});

-- 2. Forecast per fold (for each model)
CREATE TABLE cv_naive AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'Naive', MAP{});

CREATE TABLE cv_autoets AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'AutoETS',
    MAP{'seasonal_period': '7'});

-- 3. Compare metrics (use scalar functions with GROUP BY)
SELECT 'Naive' AS model,
    ts_mae(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)) AS mae,
    ts_rmse(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)) AS rmse
FROM cv_naive GROUP BY ALL
UNION ALL
SELECT 'AutoETS',
    ts_mae(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)),
    ts_rmse(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds))
FROM cv_autoets GROUP BY ALL;

Fold Creation​

ts_cv_folds_by​

ts_cv_split_by​

ts_cv_split_folds_by​

ts_cv_split_index_by​

Forecasting on Folds​

ts_cv_forecast_by​

Hydration​

ts_cv_hydrate_by​

Leakage Prevention​

ts_fill_unknown_by​

ts_mark_unknown_by​

Complete Workflow Example​

Fold Creation

ts_cv_folds_by

ts_cv_split_by

ts_cv_split_folds_by

ts_cv_split_index_by

Forecasting on Folds

ts_cv_forecast_by

Hydration

ts_cv_hydrate_by

Leakage Prevention

ts_fill_unknown_by

ts_mark_unknown_by

Complete Workflow Example