Cross-Validation & Backtesting
Time series cross-validation requires special handling because data has temporal ordering. These functions help you create proper train/test splits, handle unknown features during backtesting, and prevent data leakage.
| Function | Description | Category |
|---|---|---|
ts_backtest_auto | One-liner backtest - complete CV in a single call | Core |
ts_cv_generate_folds | Auto-generate fold boundaries based on data range | Core |
ts_cv_split_folds | View fold date ranges (train/test boundaries) | Core |
ts_cv_split | Create train/test splits with fold assignments | Core |
ts_cv_split_index | Memory-efficient alternative (index only) | Core |
ts_cv_forecast_by | Generate forecasts for all CV folds in parallel | Core |
ts_hydrate_split | Join CV splits with single masked column | Hydration |
ts_hydrate_split_full | Join CV splits with all columns | Hydration |
ts_hydrate_features | Hydrate CV splits with external features | Hydration |
ts_hydrate_split_strict | Metadata-only hydration (fail-safe) | Hydration |
ts_check_leakage | Audit CV pipeline for data leakage | Leakage Prevention |
ts_prepare_regression_input | Prepare data for regression models | Leakage Prevention |
ts_fill_unknown | Fill unknown future values safely | Leakage Prevention |
ts_mark_unknown | Mark rows as known/unknown | Leakage Prevention |
One-Liner Backtest
ts_backtest_auto
Complete backtesting in a single function call. Combines fold generation, data splitting, forecasting, and evaluation.
ts_backtest_auto(
source VARCHAR,
group_col COLUMN,
date_col COLUMN,
target_col COLUMN,
horizon BIGINT,
folds BIGINT,
frequency VARCHAR,
params MAP,
features VARCHAR[],
metric VARCHAR
) → TABLE
Parameters:
| Parameter | Type | Description |
|---|---|---|
source | VARCHAR | Table containing time series data |
group_col | COLUMN | Column identifying each series (unquoted) |
date_col | COLUMN | Date/timestamp column (unquoted) |
target_col | COLUMN | Target value to forecast (unquoted) |
horizon | BIGINT | Number of periods to forecast ahead |
folds | BIGINT | Number of CV folds |
frequency | VARCHAR | Data frequency ('1d', '1h', '1w', '1mo', '1q', '1y') |
params | MAP | Model and CV parameters |
features | VARCHAR[] | Optional regressor columns (default NULL) |
metric | VARCHAR | Metric for fold_metric_score (default 'rmse') |
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
method | VARCHAR | 'AutoETS' | Forecasting model |
gap | BIGINT | 0 | Periods between train end and test start |
embargo | BIGINT | 0 | Periods to exclude from training |
window_type | VARCHAR | 'expanding' | 'expanding', 'fixed', or 'sliding' |
min_train_size | BIGINT | 1 | Minimum training periods |
initial_train_size | BIGINT | 50% of data | First fold training size |
skip_length | BIGINT | horizon | Periods between fold start times |
clip_horizon | BOOLEAN | false | Include partial test windows |
Output Columns:
| Column | Type | Description |
|---|---|---|
fold_id | BIGINT | CV fold number |
group_col | ANY | Series identifier |
date | TIMESTAMP | Forecast date |
forecast | DOUBLE | Point forecast |
actual | DOUBLE | Actual value |
error | DOUBLE | forecast - actual |
abs_error | DOUBLE | |forecast - actual| |
lower_90 | DOUBLE | Lower 90% prediction interval |
upper_90 | DOUBLE | Upper 90% prediction interval |
model_name | VARCHAR | Model used |
fold_metric_score | DOUBLE | Calculated metric for fold |
Example:
-- Basic backtest with 5 folds
SELECT * FROM ts_backtest_auto(
'sales_data', store_id, date, revenue,
7, 5, '1d', MAP{}
);
-- Custom parameters with Theta model
SELECT * FROM ts_backtest_auto(
'sales_data', store_id, date, revenue, 7, 5, '1d',
MAP{'method': 'Theta', 'gap': '2', 'window_type': 'fixed'}
);
-- Aggregate results by model
SELECT
model_name,
AVG(fold_metric_score) AS avg_rmse,
AVG(abs_error) AS mae
FROM ts_backtest_auto(
'sales_data', store_id, date, revenue,
7, 5, '1d', MAP{'method': 'AutoETS'}
)
GROUP BY model_name;
Fold Generation
ts_cv_generate_folds
Automatically generate fold boundaries based on data range.
ts_cv_generate_folds(
source VARCHAR,
date_col VARCHAR,
n_folds BIGINT,
horizon BIGINT,
frequency VARCHAR,
params MAP
) → TABLE(training_end_times DATE[])
Parameters:
| Parameter | Type | Description |
|---|---|---|
source | VARCHAR | Table containing time series data |
date_col | VARCHAR | Date/timestamp column name |
n_folds | BIGINT | Number of CV folds to generate |
horizon | BIGINT | Number of periods per test set |
frequency | VARCHAR | Data frequency string |
params | MAP | Optional parameters |
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
initial_train_size | BIGINT | 50% of data | First fold training size |
skip_length | BIGINT | horizon | Periods between fold start times |
clip_horizon | BOOLEAN | false | Include partial test windows |
Example:
-- Generate 3 folds (default: 50% initial training)
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 3, 5, '1d', MAP{});
-- Custom initial training size
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 3, 5, '1d',
MAP{'initial_train_size': '10'});
-- Dense folds (1 period spacing)
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 10, 7, '1d',
MAP{'skip_length': '1'});
ts_cv_split_folds
Generate fold boundaries showing train/test date ranges.
ts_cv_split_folds(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR
) → TABLE(fold_id, train_start, train_end, test_start, test_end, horizon)
Example:
SELECT * FROM ts_cv_split_folds(
'sales_data', 'store_id', 'date',
['2024-01-10'::DATE, '2024-01-15'::DATE, '2024-01-20'::DATE],
5, '1d'
);
Creating Splits
ts_cv_split
Split time series data into train/test sets for cross-validation.
ts_cv_split(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
target_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR,
params MAP
) → TABLE(group_col, date_col, target_col, fold_id BIGINT, split VARCHAR)
Parameters:
| Parameter | Type | Description |
|---|---|---|
source | VARCHAR | Table name |
group_col | VARCHAR | Grouping column name |
date_col | VARCHAR | Date/timestamp column |
target_col | VARCHAR | Target value column |
training_end_times | DATE[] | Dates marking end of training periods |
horizon | BIGINT | Number of test periods per fold |
frequency | VARCHAR | Data frequency string |
params | MAP | Optional parameters |
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
window_type | VARCHAR | 'expanding' | 'expanding', 'fixed', or 'sliding' |
min_train_size | BIGINT | 1 | Minimum periods in training set |
Example:
-- Expanding window (default)
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);
-- Fixed window (8-day training)
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d',
MAP{'window_type': 'fixed', 'min_train_size': '8'}
);
ts_cv_split_index
Memory-efficient alternative returning only index columns without duplicating data.
ts_cv_split_index(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
training_end_times DATE[],
horizon BIGINT,
frequency VARCHAR,
params MAP
) → TABLE(group_col, date_col, fold_id BIGINT, split VARCHAR)
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
window_type | VARCHAR | 'expanding' | Training window type |
min_train_size | BIGINT | 1 | Minimum training periods |
gap | BIGINT | 0 | Gap between train and test |
embargo | BIGINT | 0 | Embargo after test set |
Example:
CREATE TABLE cv_index AS
SELECT * FROM ts_cv_split_index(
'sales_data', 'store_id', 'date',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);
Forecasting on Splits
ts_cv_forecast_by
Generate point forecasts and prediction intervals for all CV folds.
ts_cv_forecast_by(
cv_splits VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
target_col VARCHAR,
method VARCHAR,
horizon BIGINT,
params MAP,
frequency VARCHAR
) → TABLE(fold_id, id, forecast_step, date, point_forecast, lower_90, upper_90, model_name)
Example:
-- Create CV splits
CREATE TABLE cv_data AS
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
['2024-01-10'::DATE, '2024-01-15'::DATE], 5, '1d', MAP{}
);
-- Generate forecasts for each fold
SELECT * FROM ts_cv_forecast_by(
'(SELECT * FROM cv_data WHERE split = ''train'')',
'store_id', 'date', 'sales',
'AutoETS', 5, MAP{'seasonal_period': '7'}, '1d'
);
Hydration Functions
Hydration functions safely join CV splits with additional data while preventing data leakage.
ts_hydrate_split
Safely join CV splits with source data, masking a single column in the test set.
ts_hydrate_split(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
unknown_col EXPR,
params MAP
) → TABLE(group_col, date_col, fold_id, split, unknown_col)
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
strategy | VARCHAR | 'null' | 'null', 'last_value', or 'default' |
fill_value | DOUBLE | 0.0 | Value when strategy='default' |
Example:
SELECT * FROM ts_hydrate_split(
'cv_index', 'weather_data',
store_id, date, temperature,
MAP{'strategy': 'last_value'}
);
ts_hydrate_split_full
Join CV splits with ALL source columns, adding split metadata.
ts_hydrate_split_full(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
params MAP
) → TABLE(fold_id, split, _is_test, _train_cutoff, <all source columns>)
Returns: All source columns plus:
fold_id: CV fold identifiersplit: 'train' or 'test'_is_test: Boolean flag for masking_train_cutoff: Training cutoff timestamp
Example:
SELECT
fold_id, split, store_id, date, sales,
day_of_week,
CASE WHEN _is_test THEN NULL ELSE competitor_sales END AS competitor_sales
FROM ts_hydrate_split_full(
'cv_index', 'features_table', store_id, date, MAP{}
);
ts_hydrate_features
Hydrate CV splits with additional features from source.
ts_hydrate_features(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
params MAP
) → TABLE
Example:
SELECT
fold_id, split, group_col, date_col, target_col,
promotion_planned,
CASE WHEN _is_test THEN NULL ELSE weather_actual END AS weather
FROM ts_hydrate_features(
'cv_data', 'features_table', store_id, date, MAP{}
);
ts_hydrate_split_strict
Maximally fail-safe join returning ONLY metadata columns.
ts_hydrate_split_strict(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
params MAP
) → TABLE(fold_id, split, group_col, date_col, _is_test, _train_cutoff)
Example:
SELECT
hs.fold_id, hs.split, hs.group_col, hs.date_col,
src.sales,
CASE WHEN hs._is_test THEN NULL ELSE src.competitor_sales END
FROM ts_hydrate_split_strict(
'cv_index', 'sales_data', store_id, date, MAP{}
) hs
JOIN sales_data src
ON hs.group_col = src.store_id
AND hs.date_col = src.date;
Data Leakage Prevention
ts_check_leakage
Audit a CV pipeline result for potential data leakage.
ts_check_leakage(
source VARCHAR,
is_test_col EXPR,
params MAP
) → TABLE(status, test_row_count, train_row_count, recommendation)
Example:
SELECT * FROM ts_check_leakage(
'prepared_cv_data', _is_test, MAP{}
);
ts_prepare_regression_input
Prepare data for regression models in CV backtest, masking target column for test rows.
ts_prepare_regression_input(
cv_splits VARCHAR,
source VARCHAR,
src_group_col EXPR,
src_date_col EXPR,
target_col EXPR,
params MAP
) → TABLE(fold_id, split, group_col, date_col, masked_target, _is_test, <source columns>)
Params MAP Options:
| Key | Type | Description |
|---|---|---|
include_features | VARCHAR[] | Feature columns to include |
known_features | VARCHAR[] | Available at forecast time |
Example:
CREATE TABLE regression_input AS
SELECT * FROM ts_prepare_regression_input(
'cv_index', 'sales_data',
store_id, date, sales, MAP{}
);
ts_fill_unknown
Fill unknown future values in test sets to prevent data leakage.
ts_fill_unknown(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
value_col VARCHAR,
cutoff_date DATE,
params MAP
) → TABLE(group_col, date_col, value_col)
Params MAP Options:
| Key | Type | Default | Description |
|---|---|---|---|
strategy | VARCHAR | 'last_value' | 'last_value', 'null', or 'default' |
fill_value | DOUBLE | 0.0 | Value when strategy='default' |
Example:
-- Forward fill temperature
SELECT * FROM ts_fill_unknown(
'backtest_data', 'category', 'date', 'temperature',
'2023-06-01'::DATE, MAP{}
);
-- Set unknown to NULL
SELECT * FROM ts_fill_unknown(
'backtest_data', 'category', 'date', 'temperature',
'2023-06-01'::DATE, MAP{'strategy': 'null'}
);
-- Set unknown to constant
SELECT * FROM ts_fill_unknown(
'backtest_data', 'category', 'date', 'temperature',
'2023-06-01'::DATE,
MAP{'strategy': 'default', 'fill_value': '65.0'}
);
ts_mark_unknown
Mark rows as known/unknown based on cutoff date without modifying values.
ts_mark_unknown(
source VARCHAR,
group_col VARCHAR,
date_col VARCHAR,
cutoff_date DATE
) → TABLE(*, is_unknown BOOLEAN, last_known_date TIMESTAMP)
Returns: All source columns plus:
is_unknown: TRUE for future rowslast_known_date: Per-group last known timestamp
Example:
SELECT
category, date, temperature,
is_unknown, last_known_date,
CASE
WHEN is_unknown THEN AVG(temperature) OVER (PARTITION BY category)
ELSE temperature
END AS temperature_filled
FROM ts_mark_unknown(
'backtest_data', 'category', 'date',
'2023-06-01'::DATE
)
ORDER BY category, date;
Complete Workflow Example
Here's a complete backtesting workflow using the individual functions:
-- 1. Generate fold boundaries
WITH folds AS (
SELECT training_end_times
FROM ts_cv_generate_folds('sales_data', 'date', 3, 7, '1d', MAP{})
),
-- 2. Create train/test splits
cv_splits AS (
SELECT * FROM ts_cv_split(
'sales_data', 'store_id', 'date', 'sales',
(SELECT training_end_times FROM folds), 7, '1d', MAP{}
)
),
-- 3. Generate forecasts
forecasts AS (
SELECT * FROM ts_cv_forecast_by(
'(SELECT * FROM cv_splits WHERE split = ''train'')',
'store_id', 'date', 'sales',
'AutoETS', 7, MAP{}, '1d'
)
),
-- 4. Join with actuals
results AS (
SELECT
f.fold_id,
f.store_id,
f.date,
f.point_forecast AS forecast,
c.sales AS actual,
ABS(f.point_forecast - c.sales) AS abs_error
FROM forecasts f
JOIN cv_splits c
ON f.store_id = c.store_id
AND f.date = c.date
AND f.fold_id = c.fold_id
WHERE c.split = 'test'
)
-- 5. Aggregate metrics
SELECT
fold_id,
AVG(abs_error) AS mae,
SQRT(AVG(abs_error * abs_error)) AS rmse
FROM results
GROUP BY fold_id;