Skip to main content

Cross-Validation & Backtesting

Time series cross-validation requires special handling because data has temporal ordering. These functions help you create proper train/test splits, forecast on folds, and prevent data leakage.

FunctionDescriptionCategory
ts_cv_folds_byCreate CV folds by number of folds (primary)Fold Creation
ts_cv_split_byCreate splits with explicit cutoff datesFold Creation
ts_cv_split_folds_byView fold date ranges (train/test boundaries)Fold Creation
ts_cv_split_index_byMemory-efficient split (index columns only)Fold Creation
ts_cv_forecast_byGenerate forecasts for all CV foldsForecasting
ts_cv_hydrate_byAdd features to folds with automatic maskingHydration
ts_fill_unknown_byFill unknown future values safelyLeakage Prevention
ts_mark_unknown_byMark rows as known/unknownLeakage Prevention
Showing 8 of 8

Fold Creation

ts_cv_folds_by

Create train/test splits for backtesting. This is the primary fold creation function.

ts_cv_folds_by(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
target_col COLUMN,
n_folds BIGINT,
horizon BIGINT,
params MAP
)TABLE

Source must be a table name string, NOT a CTE. Data must be pre-cleaned (no gaps).

Parameters:

ParameterTypeRequiredDescription
sourceVARCHARYesTable name (quoted string)
group_colCOLUMNYesSeries identifier (unquoted)
date_colCOLUMNYesDate/timestamp column (unquoted)
target_colCOLUMNYesTarget value column (unquoted)
n_foldsBIGINTYesNumber of CV folds
horizonBIGINTYesNumber of test periods per fold
paramsMAPNoConfiguration (use MAP{} for defaults)

Params MAP Options:

KeyTypeDefaultDescription
gapBIGINT0Periods between train end and test start
embargoBIGINT0Periods excluded from training after previous test
window_typeVARCHAR'expanding''expanding', 'fixed', or 'sliding'
min_train_sizeBIGINT1Min training size (fixed/sliding only)
initial_train_sizeBIGINTautoPeriods before first fold
skip_lengthBIGINThorizonPeriods between folds (1=dense)
clip_horizonBOOLEANfalseAllow partial test windows

Returns: Exactly 5 columns:

ColumnTypeDescription
group_col(input)Series identifier
date_col(input)Timestamp
target_colDOUBLETarget value
fold_idBIGINTFold number
splitVARCHAR'train' or 'test'
caution

Output has exactly 5 columns. Features are NOT passed through — use ts_cv_hydrate_by to add features to folds.

Example:

-- Create 3 folds with 6-day horizon
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 6, MAP{});

-- Fixed window with gap
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 5, 14,
MAP{'window_type': 'fixed', 'gap': '2'});

ts_cv_split_by

Split using explicit cutoff dates instead of automatic fold generation.

ts_cv_split_by(
source VARCHAR,
group_col IDENTIFIER,
date_col IDENTIFIER,
target_col IDENTIFIER,
training_end_times DATE[],
horizon BIGINT,
params MAP
)TABLE

Returns: group_col, date_col, target_col, fold_id, split

Example:

-- Custom quarterly cutoffs
CREATE TABLE cv_splits AS
SELECT * FROM ts_cv_split_by('sales', store_id, date, revenue,
['2024-03-31'::DATE, '2024-06-30'::DATE, '2024-09-30'::DATE], 30, MAP{});

ts_cv_split_folds_by

View fold date ranges before running a full split. Useful for verifying fold boundaries.

ts_cv_split_folds_by(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR
)TABLE(fold_id, train_start, train_end, test_start, test_end, horizon)

Example:

SELECT * FROM ts_cv_split_folds_by(
'sales_data', 'store_id', 'date',
['2024-01-10'::DATE, '2024-01-15'::DATE, '2024-01-20'::DATE],
5, '1d'
);

ts_cv_split_index_by

Memory-efficient alternative returning only index columns without duplicating data.

ts_cv_split_index_by(
source VARCHAR,
group_col IDENTIFIER,
date_col IDENTIFIER,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR,
params MAP
)TABLE(group_col, date_col, fold_id, split)

Example:

CREATE TABLE cv_index AS
SELECT * FROM ts_cv_split_index_by(
'sales_data', store_id, date,
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);

Forecasting on Folds

ts_cv_forecast_by

Generate univariate forecasts for all CV folds. Univariate only — no exogenous support.

ts_cv_forecast_by(
cv_folds VARCHAR,
group_col COLUMN,
date_col COLUMN,
target_col COLUMN,
method VARCHAR,
params MAP
)TABLE
caution

Input table must have fold_id and split columns (from ts_cv_folds_by or ts_cv_split_by). Passing raw data throws a clear error. Horizon is inferred from test rows — no frequency parameter needed.

Returns:

ColumnTypeDescription
fold_idBIGINTFold number
group_col(input)Series identifier
date_col(input)Timestamp
target_colDOUBLEActual value (from test set)
splitVARCHARAlways 'test'
yhatDOUBLEPoint forecast
yhat_lowerDOUBLELower prediction interval
yhat_upperDOUBLEUpper prediction interval
model_nameVARCHARModel used

Example:

-- Create folds first
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 7, MAP{});

-- Forecast on folds
CREATE TABLE cv_results AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'AutoETS',
MAP{'seasonal_period': '7'});

Hydration

ts_cv_hydrate_by

Add feature columns to CV folds with automatic masking: train rows get actual values, test rows get filled values (preventing data leakage).

ts_cv_hydrate_by(
cv_folds VARCHAR,
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
unknown_features VARCHAR[],
params MAP
)TABLE

Parameters:

ParameterTypeRequiredDescription
cv_foldsVARCHARYesTable with fold assignments
sourceVARCHARYesTable containing features
group_colCOLUMNYesGroup column (unquoted)
date_colCOLUMNYesDate column (unquoted)
unknown_featuresVARCHAR[]YesFeature columns to mask in test sets
paramsMAPNoFill strategy configuration

Params MAP Options:

KeyTypeDefaultDescription
strategyVARCHAR'last_value''last_value', 'null', or 'default'
fill_valueVARCHAR''Value when strategy='default'

Features are output as direct columns — no MAP extraction needed.

Example:

SELECT store_id, date, revenue, fold_id, split, competitor_sales, actual_temp
FROM ts_cv_hydrate_by('cv_folds', 'sales', store_id, date,
['competitor_sales', 'actual_temp'], MAP{'strategy': 'last_value'});

Leakage Prevention

ts_fill_unknown_by

Fill unknown future values in test sets to prevent data leakage.

ts_fill_unknown_by(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
value_col COLUMN,
cutoff_date DATE,
params MAP
)TABLE

Params MAP Options:

KeyTypeDefaultDescription
strategyVARCHAR'last_value''last_value', 'null', or 'default'
fill_valueDOUBLE0.0Value when strategy='default'

Example:

-- Forward fill temperature beyond cutoff
SELECT * FROM ts_fill_unknown_by(
'backtest_data', category, date, temperature,
'2023-06-01'::DATE, MAP{}
);

-- Set unknown to NULL
SELECT * FROM ts_fill_unknown_by(
'backtest_data', category, date, temperature,
'2023-06-01'::DATE, MAP{'strategy': 'null'}
);

ts_mark_unknown_by

Mark rows as known/unknown based on cutoff date without modifying values.

ts_mark_unknown_by(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
cutoff_date DATE
)TABLE(*, is_unknown BOOLEAN, last_known_date TIMESTAMP)

Returns: All source columns plus:

  • is_unknown: TRUE for future rows
  • last_known_date: Per-group last known timestamp

Example:

SELECT
category, date, temperature,
is_unknown, last_known_date,
CASE
WHEN is_unknown THEN AVG(temperature) OVER (PARTITION BY category)
ELSE temperature
END AS temperature_filled
FROM ts_mark_unknown_by(
'backtest_data', category, date,
'2023-06-01'::DATE
)
ORDER BY category, date;

Complete Workflow Example

-- 1. Create folds
CREATE TABLE cv_folds AS
SELECT * FROM ts_cv_folds_by('sales', store_id, date, revenue, 3, 7, MAP{});

-- 2. Forecast per fold (for each model)
CREATE TABLE cv_naive AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'Naive', MAP{});

CREATE TABLE cv_autoets AS
SELECT * FROM ts_cv_forecast_by('cv_folds', store_id, date, revenue, 'AutoETS',
MAP{'seasonal_period': '7'});

-- 3. Compare metrics (use scalar functions with GROUP BY)
SELECT 'Naive' AS model,
ts_mae(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)) AS mae,
ts_rmse(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)) AS rmse
FROM cv_naive GROUP BY ALL
UNION ALL
SELECT 'AutoETS',
ts_mae(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds)),
ts_rmse(LIST(y ORDER BY ds), LIST(yhat ORDER BY ds))
FROM cv_autoets GROUP BY ALL;
🍪 Cookie Settings