Diagnostics
Model validation, assumption checking, and residual analysis.
Quick Reference
| Function | Description | SQL Signature |
|---|---|---|
| VIF | Multicollinearity detection | anofox_stats_vif(x) -> LIST(DOUBLE) |
| VIF (agg) | VIF from row-wise data | anofox_stats_vif_agg(x) -> LIST(DOUBLE) |
| Residuals | Comprehensive residual analysis | anofox_stats_residuals_diagnostics_agg(y, y_hat, [x]) -> STRUCT |
| AIC | Akaike Information Criterion | anofox_stats_aic(rss, n, k) -> DOUBLE |
| BIC | Bayesian Information Criterion | anofox_stats_bic(rss, n, k) -> DOUBLE |
VIF (Variance Inflation Factor)
Detect multicollinearity among predictors.
Scalar Version
For matrix input.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
x | LIST(LIST(DOUBLE)) | Yes | - | Predictor matrix |
Example
SELECT anofox_stats_vif(
[[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
) as vif_values;
Aggregate Version
For row-wise input.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
x | LIST(DOUBLE) | Yes | - | Predictors per row |
Output
Returns LIST(DOUBLE) - VIF for each predictor.
Interpretation
| VIF Range | Multicollinearity |
|---|---|
| < 5 | Acceptable |
| 5-10 | Moderate |
| > 10 | Severe (consider removing/combining) |
Example
WITH vif_result AS (
SELECT anofox_stats_vif_agg([price, promotion, competitor_price]) as vif
FROM sales_data
)
SELECT
vif[1] as price_vif,
vif[2] as promotion_vif,
vif[3] as competitor_vif,
CASE WHEN max(v) > 10 THEN 'SEVERE'
WHEN max(v) > 5 THEN 'MODERATE'
ELSE 'OK' END as status
FROM vif_result, (SELECT unnest(vif) as v FROM vif_result) t
GROUP BY vif[1], vif[2], vif[3];
Residual Diagnostics
Comprehensive residual analysis including leverage and influence.
Scalar Version
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
y | LIST(DOUBLE) | Yes | - | Actual values |
y_hat | LIST(DOUBLE) | Yes | - | Predicted values |
x | LIST(LIST(DOUBLE)) | No | - | Predictor matrix |
options | MAP | No | - | Configuration options |
Aggregate Version
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
y | DOUBLE | Yes | - | Actual value |
y_hat | DOUBLE | Yes | - | Predicted value |
x | LIST(DOUBLE) | No | - | Predictors |
Output
| Field | Type | Description |
|---|---|---|
residuals | LIST(DOUBLE) | Raw residuals (y - y_hat) |
standardized_residuals | LIST(DOUBLE) | Standardized residuals |
studentized_residuals | LIST(DOUBLE) | Studentized residuals |
leverage | LIST(DOUBLE) | Hat values (diagonal of H matrix) |
cooks_distance | LIST(DOUBLE) | Cook's D (influence measure) |
dffits | LIST(DOUBLE) | DFFITS (influence on fitted values) |
outliers | LIST(BOOLEAN) | Flagged outliers |
Interpretation
| Metric | Threshold | Meaning |
|---|---|---|
| Leverage | > 2(k+1)/n | High leverage point |
| Cook's D | > 4/n | Influential observation |
| |Studentized residual| | > 3 | Potential outlier |
Example
WITH model AS (
SELECT
y,
anofox_stats_predict([x1, x2], coefficients, intercept) as y_hat,
[x1, x2] as x
FROM sales_data, fitted_model
),
diag AS (
SELECT anofox_stats_residuals_diagnostics_agg(y, y_hat, x) as d
FROM model
)
SELECT
row_number() OVER () as obs,
d.residuals[obs] as residual,
d.leverage[obs] as leverage,
d.cooks_distance[obs] as cooks_d,
d.outliers[obs] as is_outlier
FROM diag
WHERE d.cooks_distance[obs] > 4.0 / count(*) OVER ();
Information Criteria
Model selection metrics.
AIC (Akaike Information Criterion)
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
rss | DOUBLE | Yes | - | Residual sum of squares |
n | BIGINT | Yes | - | Number of observations |
k | BIGINT | Yes | - | Number of parameters |
Formula: AIC = n × log(RSS/n) + 2k
Example
SELECT anofox_stats_aic(rss, n, k) as aic;
BIC (Bayesian Information Criterion)
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
rss | DOUBLE | Yes | - | Residual sum of squares |
n | BIGINT | Yes | - | Number of observations |
k | BIGINT | Yes | - | Number of parameters |
Formula: BIC = n × log(RSS/n) + k × log(n)
Example
SELECT anofox_stats_bic(rss, n, k) as bic;
Usage Guidelines
- Lower values = better model
- BIC penalizes complexity more than AIC
- Use AIC for prediction, BIC for explanation
Model Comparison Example
WITH models AS (
SELECT
'Model 1' as name,
anofox_stats_ols_fit_agg(y, [x1]) as m
FROM data
UNION ALL
SELECT
'Model 2',
anofox_stats_ols_fit_agg(y, [x1, x2])
FROM data
UNION ALL
SELECT
'Model 3',
anofox_stats_ols_fit_agg(y, [x1, x2, x3])
FROM data
)
SELECT
name,
m.r_squared,
m.aic,
m.bic,
ROW_NUMBER() OVER (ORDER BY m.bic) as bic_rank
FROM models
ORDER BY m.bic;
Prediction
Generate predictions from fitted coefficients.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
x | LIST(LIST(DOUBLE)) | Yes | - | New predictor values |
coefficients | LIST(DOUBLE) | Yes | - | Fitted coefficients |
intercept | DOUBLE | Yes | - | Intercept term |
Returns: LIST(DOUBLE) - predicted values
Example
WITH fitted AS (
SELECT
(anofox_stats_ols_fit_agg(y, [x1, x2])).coefficients as coef,
(anofox_stats_ols_fit_agg(y, [x1, x2])).intercept as intercept
FROM training_data
)
SELECT
anofox_stats_predict([[10, 5], [15, 8], [20, 10]], coef, intercept) as forecasts
FROM fitted;
Normality Tests
For testing residual normality, see Parametric Tests:
anofox_stats_shapiro_wilk_agg- Shapiro-Wilk testanofox_stats_jarque_bera_agg- Jarque-Bera testanofox_stats_dagostino_k2_agg- D'Agostino K² test
Complete Diagnostic Workflow
Full model validation workflow:
-- 1. Fit model
WITH fitted AS (
SELECT anofox_stats_ols_fit_agg(
revenue,
[marketing_spend, seasonality, competitor_activity],
true, true, 0.95
) as model
FROM sales_data
),
-- 2. Check multicollinearity
vif_check AS (
SELECT anofox_stats_vif_agg([marketing_spend, seasonality, competitor_activity]) as vif
FROM sales_data
),
-- 3. Get predictions
predictions AS (
SELECT
s.revenue as actual,
anofox_stats_predict(
[[s.marketing_spend, s.seasonality, s.competitor_activity]],
f.model.coefficients,
f.model.intercept
)[1] as predicted
FROM sales_data s, fitted f
),
-- 4. Test residual normality
normality AS (
SELECT anofox_stats_jarque_bera_agg(actual - predicted) as jb
FROM predictions
)
SELECT
'R-squared' as metric, round(model.r_squared, 3)::VARCHAR as value FROM fitted
UNION ALL
SELECT 'Max VIF', round(max(v), 2)::VARCHAR FROM vif_check, unnest(vif) as v
UNION ALL
SELECT 'Normality p-value', round(jb.p_value, 4)::VARCHAR FROM normality
UNION ALL
SELECT 'Model Valid',
CASE WHEN model.r_squared > 0.5
AND (SELECT max(v) FROM vif_check, unnest(vif) as v) < 10
AND jb.p_value > 0.05
THEN 'YES' ELSE 'NO' END
FROM fitted, normality;