Cross-Validation & Backtesting
Time series cross-validation requires special handling because data has temporal ordering. These functions help you create proper train/test splits, forecast on folds, and prevent data leakage.
| Function | Description | Category |
|---|---|---|
ts_cv_folds_by | Create CV folds by number of folds (primary) | Fold Creation |
ts_cv_split_by | Create splits with explicit cutoff dates | Fold Creation |
ts_cv_split_folds_by | View fold date ranges (train/test boundaries) | Fold Creation |
ts_cv_split_index_by | Memory-efficient split (index columns only) | Fold Creation |
ts_cv_forecast_by | Generate forecasts for all CV folds | Forecasting |
ts_cv_hydrate_by | Add features to folds with automatic masking | Hydration |
ts_fill_unknown_by | Fill unknown future values safely | Leakage Prevention |
ts_mark_unknown_by | Mark rows as known/unknown | Leakage Prevention |
Fold Creation
ts_cv_folds_by
Create train/test splits for backtesting. This is the primary fold creation function.
ts_cv_folds_by(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
target_col COLUMN,
n_folds BIGINT,
horizon BIGINT,
params MAP
) → TABLE
Source must be a table name string, NOT a CTE. Data must be pre-cleaned (no gaps).
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
source | VARCHAR | Yes | Table name (quoted string) |
group_col | COLUMN | Yes | Series identifier (unquoted) |
date_col | COLUMN | Yes | Date/timestamp column (unquoted) |
target_col | COLUMN | Yes | Target value column (unquoted) |
n_folds | BIGINT | Yes | Number of CV folds |
horizon | BIGINT | Yes | Number of test periods per fold |
params | MAP | No | Configuration (use MAP{} for defaults) |
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
gap | BIGINT | 0 | Periods between train end and test start |
embargo | BIGINT | 0 | Periods excluded from training after previous test |
window_type | VARCHAR | 'expanding' | 'expanding', 'fixed', or 'sliding' |
min_train_size | BIGINT | 1 | Min training size (fixed/sliding only) |
initial_train_size | BIGINT | auto | Periods before first fold |
skip_length | BIGINT | horizon | Periods between folds (1=dense) |
clip_horizon | BOOLEAN | false | Allow partial test windows |
Returns: Exactly 5 columns:
| Column | Type | Description |
|---|---|---|
group_col | (input) | Series identifier |
date_col | (input) | Timestamp |
target_col | DOUBLE | Target value |
fold_id | BIGINT | Fold number |
split | VARCHAR | 'train' or 'test' |
Output has exactly 5 columns. Features are NOT passed through — use ts_cv_hydrate_by to add features to folds.
Example:
-- Create 3 folds with 6-day horizon
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 6, MAP{});
-- Fixed window with gap
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 5, 14,
MAP{'window_type': 'fixed', 'gap': '2'});
ts_cv_split_by
Split using explicit cutoff dates instead of automatic fold generation.
ts_cv_split_by(
source VARCHAR,
group_col IDENTIFIER,
date_col IDENTIFIER,
target_col IDENTIFIER,
training_end_times DATE[],
horizon BIGINT,
params MAP
) → TABLE
Returns: group_col, date_col, target_col, fold_id, split
Example:
-- Custom quarterly cutoffs
CREATE TABLE cv_splits AS
SELECT * FROM ts_cv_split_by('sales', store_id, date, revenue,
['2024-03-31'::DATE, '2024-06-30'::DATE, '2024-09-30'::DATE], 30, MAP{});
ts_cv_split_folds_by
View fold date ranges before running a full split. Useful for verifying fold boundaries.
ts_cv_split_folds_by(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR
) → TABLE(fold_id, train_start, train_end, test_start, test_end, horizon)
Example:
SELECT * FROM ts_cv_split_folds_by(
'sales_data', 'store_id', 'date',
['2024-01-10'::DATE, '2024-01-15'::DATE, '2024-01-20'::DATE],
5, '1d'
);
ts_cv_split_index_by
Memory-efficient alternative returning only index columns without duplicating data.
ts_cv_split_index_by(
source VARCHAR,
group_col IDENTIFIER,
date_col IDENTIFIER,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR,
params MAP
) → TABLE(group_col, date_col, fold_id, split)
Example:
CREATE TABLE cv_index AS
SELECT * FROM ts_cv_split_index_by(
'sales_data', store_id, date,
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);
Forecasting on Folds
ts_cv_forecast_by
Generate univariate forecasts for all CV folds. Univariate only — no exogenous support.
ts_cv_forecast_by(
cv_folds VARCHAR,
group_col COLUMN,
date_col COLUMN,
target_col COLUMN,
method VARCHAR,
params MAP
) → TABLE
Input table must have fold_id and split columns (from ts_cv_folds_by or ts_cv_split_by). Passing raw data throws a clear error. Horizon is inferred from test rows — no frequency parameter needed.
Returns:
| Column | Type | Description |
|---|---|---|
fold_id | BIGINT | Fold number |
group_col | (input) | Series identifier |
date_col | (input) | Timestamp |
target_col | DOUBLE | Actual value (from test set) |
split | VARCHAR | Always 'test' |
yhat | DOUBLE | Point forecast |
yhat_lower | DOUBLE | Lower prediction interval |
yhat_upper | DOUBLE | Upper prediction interval |
model_name | VARCHAR | Model used |
Example:
-- Create folds first
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 7, MAP{});
-- Forecast on folds
CREATE TABLE cv_results AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'AutoETS',
MAP{'seasonal_period': '7'});
Hydration
ts_cv_hydrate_by
Add feature columns to CV folds with automatic masking: train rows get actual values, test rows get filled values (preventing data leakage).
ts_cv_hydrate_by(
cv_folds VARCHAR,
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
unknown_features VARCHAR[],
params MAP
) → TABLE
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
cv_folds | VARCHAR | Yes | Table with fold assignments |
source | VARCHAR | Yes | Table containing features |
group_col | COLUMN | Yes | Group column (unquoted) |
date_col | COLUMN | Yes | Date column (unquoted) |
unknown_features | VARCHAR[] | Yes | Feature columns to mask in test sets |
params | MAP | No | Fill strategy configuration |
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
strategy | VARCHAR | 'last_value' | 'last_value', 'null', or 'default' |
fill_value | VARCHAR | '' | Value when strategy='default' |
Features are output as direct columns — no MAP extraction needed.
Example:
SELECT store_id, date, revenue, fold_id, split, competitor_sales, actual_temp
FROM ts_cv_hydrate_by('cv_folds', 'sales', store_id, date,
['competitor_sales', 'actual_temp'], MAP{'strategy': 'last_value'});
Leakage Prevention
ts_fill_unknown_by
Fill unknown future values in test sets to prevent data leakage.
ts_fill_unknown_by(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
value_col COLUMN,
cutoff_date DATE,
params MAP
) → TABLE
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
strategy | VARCHAR | 'last_value' | 'last_value', 'null', or 'default' |
fill_value | DOUBLE | 0.0 | Value when strategy='default' |
Example:
-- Forward fill temperature beyond cutoff
SELECT * FROM ts_fill_unknown_by(
'backtest_data', category, date, temperature,
'2023-06-01'::DATE, MAP{}
);
-- Set unknown to NULL
SELECT * FROM ts_fill_unknown_by(
'backtest_data', category, date, temperature,
'2023-06-01'::DATE, MAP{'strategy': 'null'}
);
ts_mark_unknown_by
Mark rows as known/unknown based on cutoff date without modifying values.
ts_mark_unknown_by(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
cutoff_date DATE
) → TABLE(*, is_unknown BOOLEAN, last_known_date TIMESTAMP)
Returns: All source columns plus:
is_unknown: TRUE for future rowslast_known_date: Per-group last known timestamp
Example:
SELECT
category, date, temperature,
is_unknown, last_known_date,
CASE
WHEN is_unknown THEN AVG(temperature) OVER (PARTITION BY category)
ELSE temperature
END AS temperature_filled
FROM ts_mark_unknown_by(
'backtest_data', category, date,
'2023-06-01'::DATE
)
ORDER BY category, date;
Complete Workflow Example
-- 1. Create folds
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 7, MAP{});
-- 2. Forecast per fold (for each model)
CREATE TABLE cv_naive AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'Naive', MAP{});
CREATE TABLE cv_autoets AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'AutoETS',
MAP{'seasonal_period': '7'});
-- 3. Compare metrics (use scalar functions with GROUP BY)
SELECT 'Naive' AS model,
ts_mae(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)) AS mae,
ts_rmse(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)) AS rmse
FROM cv_naive GROUP BY ALL
UNION ALL
SELECT 'AutoETS',
ts_mae(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)),
ts_rmse(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds))
FROM cv_autoets GROUP BY ALL;