Skip to main content

Cross-Validation & Backtesting

Time series cross-validation requires special handling because data has temporal ordering. These functions help you create proper train/test splits, handle unknown features during backtesting, and prevent data leakage.

FunctionDescriptionCategory
ts_backtest_autoOne-liner backtest - complete CV in a single callCore
ts_cv_generate_foldsAuto-generate fold boundaries based on data rangeCore
ts_cv_split_foldsView fold date ranges (train/test boundaries)Core
ts_cv_splitCreate train/test splits with fold assignmentsCore
ts_cv_split_indexMemory-efficient alternative (index only)Core
ts_cv_forecast_byGenerate forecasts for all CV folds in parallelCore
ts_hydrate_splitJoin CV splits with single masked columnHydration
ts_hydrate_split_fullJoin CV splits with all columnsHydration
ts_hydrate_featuresHydrate CV splits with external featuresHydration
ts_hydrate_split_strictMetadata-only hydration (fail-safe)Hydration
ts_check_leakageAudit CV pipeline for data leakageLeakage Prevention
ts_prepare_regression_inputPrepare data for regression modelsLeakage Prevention
ts_fill_unknownFill unknown future values safelyLeakage Prevention
ts_mark_unknownMark rows as known/unknownLeakage Prevention
Showing 14 of 14

One-Liner Backtest

ts_backtest_auto

Complete backtesting in a single function call. Combines fold generation, data splitting, forecasting, and evaluation.

ts_backtest_auto(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
target_col COLUMN,
horizon BIGINT,
folds BIGINT,
frequency VARCHAR,
params MAP,
features VARCHAR[],
metric VARCHAR
)TABLE

Parameters:

ParameterTypeDescription
sourceVARCHARTable containing time series data
group_colCOLUMNColumn identifying each series (unquoted)
date_colCOLUMNDate/timestamp column (unquoted)
target_colCOLUMNTarget value to forecast (unquoted)
horizonBIGINTNumber of periods to forecast ahead
foldsBIGINTNumber of CV folds
frequencyVARCHARData frequency ('1d', '1h', '1w', '1mo', '1q', '1y')
paramsMAPModel and CV parameters
featuresVARCHAR[]Optional regressor columns (default NULL)
metricVARCHARMetric for fold_metric_score (default 'rmse')

Params MAP Options:

KeyTypeDefaultDescription
methodVARCHAR'AutoETS'Forecasting model
gapBIGINT0Periods between train end and test start
embargoBIGINT0Periods to exclude from training
window_typeVARCHAR'expanding''expanding', 'fixed', or 'sliding'
min_train_sizeBIGINT1Minimum training periods
initial_train_sizeBIGINT50% of dataFirst fold training size
skip_lengthBIGINThorizonPeriods between fold start times
clip_horizonBOOLEANfalseInclude partial test windows

Output Columns:

ColumnTypeDescription
fold_idBIGINTCV fold number
group_colANYSeries identifier
dateTIMESTAMPForecast date
forecastDOUBLEPoint forecast
actualDOUBLEActual value
errorDOUBLEforecast - actual
abs_errorDOUBLE|forecast - actual|
lower_90DOUBLELower 90% prediction interval
upper_90DOUBLEUpper 90% prediction interval
model_nameVARCHARModel used
fold_metric_scoreDOUBLECalculated metric for fold

Example:

-- Basic backtest with 5 folds
SELECT * FROM ts_backtest_auto(
'sales_data', store_id, date, revenue,
7, 5, '1d', MAP{}
);

-- Custom parameters with Theta model
SELECT * FROM ts_backtest_auto(
'sales_data', store_id, date, revenue, 7, 5, '1d',
MAP{'method': 'Theta', 'gap': '2', 'window_type': 'fixed'}
);

-- Aggregate results by model
SELECT
model_name,
AVG(fold_metric_score) AS avg_rmse,
AVG(abs_error) AS mae
FROM ts_backtest_auto(
'sales_data', store_id, date, revenue,
7, 5, '1d', MAP{'method': 'AutoETS'}
)
GROUP BY model_name;

Fold Generation

ts_cv_generate_folds

Automatically generate fold boundaries based on data range.

ts_cv_generate_folds(
source VARCHAR,
date_col VARCHAR,
n_folds BIGINT,
horizon BIGINT,
frequency VARCHAR,
params MAP
)TABLE(training_end_times DATE[])

Parameters:

ParameterTypeDescription
sourceVARCHARTable containing time series data
date_colVARCHARDate/timestamp column name
n_foldsBIGINTNumber of CV folds to generate
horizonBIGINTNumber of periods per test set
frequencyVARCHARData frequency string
paramsMAPOptional parameters

Params MAP Options:

KeyTypeDefaultDescription
initial_train_sizeBIGINT50% of dataFirst fold training size
skip_lengthBIGINThorizonPeriods between fold start times
clip_horizonBOOLEANfalseInclude partial test windows

Example:

-- Generate 3 folds (default: 50% initial training)
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 3, 5, '1d', MAP{});

-- Custom initial training size
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 3, 5, '1d',
MAP{'initial_train_size': '10'});

-- Dense folds (1 period spacing)
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 10, 7, '1d',
MAP{'skip_length': '1'});

ts_cv_split_folds

Generate fold boundaries showing train/test date ranges.

ts_cv_split_folds(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR
)TABLE(fold_id, train_start, train_end, test_start, test_end, horizon)

Example:

SELECT * FROM ts_cv_split_folds(
'sales_data', 'store_id', 'date',
['2024-01-10'::DATE, '2024-01-15'::DATE, '2024-01-20'::DATE],
5, '1d'
);

Creating Splits

ts_cv_split

Split time series data into train/test sets for cross-validation.

ts_cv_split(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
target_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR,
params MAP
)TABLE(group_col, date_col, target_col, fold_id BIGINT, split VARCHAR)

Parameters:

ParameterTypeDescription
sourceVARCHARTable name
group_colVARCHARGrouping column name
date_colVARCHARDate/timestamp column
target_colVARCHARTarget value column
training_end_timesDATE[]Dates marking end of training periods
horizonBIGINTNumber of test periods per fold
frequencyVARCHARData frequency string
paramsMAPOptional parameters

Params MAP Options:

KeyTypeDefaultDescription
window_typeVARCHAR'expanding''expanding', 'fixed', or 'sliding'
min_train_sizeBIGINT1Minimum periods in training set

Example:

-- Expanding window (default)
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);

-- Fixed window (8-day training)
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d',
MAP{'window_type': 'fixed', 'min_train_size': '8'}
);

ts_cv_split_index

Memory-efficient alternative returning only index columns without duplicating data.

ts_cv_split_index(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR,
params MAP
)TABLE(group_col, date_col, fold_id BIGINT, split VARCHAR)

Params MAP Options:

KeyTypeDefaultDescription
window_typeVARCHAR'expanding'Training window type
min_train_sizeBIGINT1Minimum training periods
gapBIGINT0Gap between train and test
embargoBIGINT0Embargo after test set

Example:

CREATE TABLE cv_index AS
SELECT * FROM ts_cv_split_index(
'sales_data', 'store_id', 'date',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);

Forecasting on Splits

ts_cv_forecast_by

Generate point forecasts and prediction intervals for all CV folds.

ts_cv_forecast_by(
cv_splits VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
target_col VARCHAR,
method VARCHAR,
horizon BIGINT,
params MAP,
frequency VARCHAR
)TABLE(fold_id, id, forecast_step, date, point_forecast, lower_90, upper_90, model_name)

Example:

-- Create CV splits
CREATE TABLE cv_data AS
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);

-- Generate forecasts for each fold
SELECT * FROM ts_cv_forecast_by(
'(SELECT * FROM cv_data WHERE split = ''train'')',
'store_id', 'date', 'sales',
'AutoETS', 5, MAP{'seasonal_period': '7'}, '1d'
);

Hydration Functions

Hydration functions safely join CV splits with additional data while preventing data leakage.

ts_hydrate_split

Safely join CV splits with source data, masking a single column in the test set.

ts_hydrate_split(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
unknown_col EXPR,
params MAP
)TABLE(group_col, date_col, fold_id, split, unknown_col)

Params MAP Options:

KeyTypeDefaultDescription
strategyVARCHAR'null''null', 'last_value', or 'default'
fill_valueDOUBLE0.0Value when strategy='default'

Example:

SELECT * FROM ts_hydrate_split(
'cv_index', 'weather_data',
store_id, date, temperature,
MAP{'strategy': 'last_value'}
);

ts_hydrate_split_full

Join CV splits with ALL source columns, adding split metadata.

ts_hydrate_split_full(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
params MAP
)TABLE(fold_id, split, _is_test, _train_cutoff, <all source columns>)

Returns: All source columns plus:

  • fold_id: CV fold identifier
  • split: 'train' or 'test'
  • _is_test: Boolean flag for masking
  • _train_cutoff: Training cutoff timestamp

Example:

SELECT
fold_id, split, store_id, date, sales,
day_of_week,
CASE WHEN _is_test THEN NULL ELSE competitor_sales END AS competitor_sales
FROM ts_hydrate_split_full(
'cv_index', 'features_table', store_id, date, MAP{}
);

ts_hydrate_features

Hydrate CV splits with additional features from source.

ts_hydrate_features(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
params MAP
)TABLE

Example:

SELECT
fold_id, split, group_col, date_col, target_col,
promotion_planned,
CASE WHEN _is_test THEN NULL ELSE weather_actual END AS weather
FROM ts_hydrate_features(
'cv_data', 'features_table', store_id, date, MAP{}
);

ts_hydrate_split_strict

Maximally fail-safe join returning ONLY metadata columns.

ts_hydrate_split_strict(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
params MAP
)TABLE(fold_id, split, group_col, date_col, _is_test, _train_cutoff)

Example:

SELECT
hs.fold_id, hs.split, hs.group_col, hs.date_col,
src.sales,
CASE WHEN hs._is_test THEN NULL ELSE src.competitor_sales END
FROM ts_hydrate_split_strict(
'cv_index', 'sales_data', store_id, date, MAP{}
) hs
JOIN sales_data src
ON hs.group_col = src.store_id
AND hs.date_col = src.date;

Data Leakage Prevention

ts_check_leakage

Audit a CV pipeline result for potential data leakage.

ts_check_leakage(
source VARCHAR,
is_test_col EXPR,
params MAP
)TABLE(status, test_row_count, train_row_count, recommendation)

Example:

SELECT * FROM ts_check_leakage(
'prepared_cv_data', _is_test, MAP{}
);

ts_prepare_regression_input

Prepare data for regression models in CV backtest, masking target column for test rows.

ts_prepare_regression_input(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
target_col EXPR,
params MAP
)TABLE(fold_id, split, group_col, date_col, masked_target, _is_test, <source columns>)

Params MAP Options:

KeyTypeDescription
include_featuresVARCHAR[]Feature columns to include
known_featuresVARCHAR[]Available at forecast time

Example:

CREATE TABLE regression_input AS
SELECT * FROM ts_prepare_regression_input(
'cv_index', 'sales_data',
store_id, date, sales, MAP{}
);

ts_fill_unknown

Fill unknown future values in test sets to prevent data leakage.

ts_fill_unknown(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
value_col VARCHAR,
cutoff_date DATE,
params MAP
)TABLE(group_col, date_col, value_col)

Params MAP Options:

KeyTypeDefaultDescription
strategyVARCHAR'last_value''last_value', 'null', or 'default'
fill_valueDOUBLE0.0Value when strategy='default'

Example:

-- Forward fill temperature
SELECT * FROM ts_fill_unknown(
'backtest_data', 'category', 'date', 'temperature',
'2023-06-01'::DATE, MAP{}
);

-- Set unknown to NULL
SELECT * FROM ts_fill_unknown(
'backtest_data', 'category', 'date', 'temperature',
'2023-06-01'::DATE, MAP{'strategy': 'null'}
);

-- Set unknown to constant
SELECT * FROM ts_fill_unknown(
'backtest_data', 'category', 'date', 'temperature',
'2023-06-01'::DATE,
MAP{'strategy': 'default', 'fill_value': '65.0'}
);

ts_mark_unknown

Mark rows as known/unknown based on cutoff date without modifying values.

ts_mark_unknown(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
cutoff_date DATE
)TABLE(*, is_unknown BOOLEAN, last_known_date TIMESTAMP)

Returns: All source columns plus:

  • is_unknown: TRUE for future rows
  • last_known_date: Per-group last known timestamp

Example:

SELECT
category, date, temperature,
is_unknown, last_known_date,
CASE
WHEN is_unknown THEN AVG(temperature) OVER (PARTITION BY category)
ELSE temperature
END AS temperature_filled
FROM ts_mark_unknown(
'backtest_data', 'category', 'date',
'2023-06-01'::DATE
)
ORDER BY category, date;

Complete Workflow Example

Here's a complete backtesting workflow using the individual functions:

-- 1. Generate fold boundaries
WITH folds AS (
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 3, 7, '1d', MAP{})
),

-- 2. Create train/test splits
cv_splits AS (
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
(SELECT training_end_times FROM folds), 7, '1d', MAP{}
)
),

-- 3. Generate forecasts
forecasts AS (
SELECT * FROM ts_cv_forecast_by(
'(SELECT * FROM cv_splits WHERE split = ''train'')',
'store_id', 'date', 'sales',
'AutoETS', 7, MAP{}, '1d'
)
),

-- 4. Join with actuals
results AS (
SELECT
f.fold_id,
f.store_id,
f.date,
f.point_forecast AS forecast,
c.sales AS actual,
ABS(f.point_forecast - c.sales) AS abs_error
FROM forecasts f
JOIN cv_splits c
ON f.store_id = c.store_id
AND f.date = c.date
AND f.fold_id = c.fold_id
WHERE c.split = 'test'
)

-- 5. Aggregate metrics
SELECT
fold_id,
AVG(abs_error) AS mae,
SQRT(AVG(abs_error * abs_error)) AS rmse
FROM results
GROUP BY fold_id;
🍪 Cookie Settings