Predicting values 🎱

Tip

Solution to the following problem:

predicting-values.md

Result Set

Regardless of the database, the result set should look like:

x	dataset_1	dataset_2	dataset_3	dataset_4
16	11.0	11.0	11.0	11.0
17	11.5	11.5	11.5	11.5
18	12.0	12.0	12.0	12.0

This is one of the interesting things about Anscombe's quartet (and is the reason Anscombe created it): the four datasets have the same line of best fit, but look very different when plotted!

Expand for the DDL

solution(x, dataset_1, dataset_2, dataset_3, dataset_4) as (
    values
        (16, 11.0, 11.0, 11.0, 11.0),
        (17, 11.5, 11.5, 11.5, 11.5),
        (18, 12.0, 12.0, 12.0, 12.0)
)

Solution

Some SQL solutions per database are provided below.

DuckDB (using regr_slope and regr_intercept)

with

unpivoted as (
    unpivot anscombes_quartet
    on
        (dataset_1__x, dataset_1__y) as dataset_1,
        (dataset_2__x, dataset_2__y) as dataset_2,
        (dataset_3__x, dataset_3__y) as dataset_3,
        (dataset_4__x, dataset_4__y) as dataset_4,
    into
        name dataset
        value x, y
),

coefficients as (
    select
        dataset,
        regr_slope(y, x) as m,
        regr_intercept(y, x) as c,
    from unpivoted
    group by dataset
),

predictions as (
    select
        dataset,
        x,
        round(m * x + c, 1) as y
    from coefficients
        cross join (values (16), (17), (18)) as v(x)
)

pivot predictions
on dataset
using any_value(y)
order by x

DuckDB (manual calculations)

with

unpivoted as (
    unpivot anscombes_quartet
    on
        (dataset_1__x, dataset_1__y) as dataset_1,
        (dataset_2__x, dataset_2__y) as dataset_2,
        (dataset_3__x, dataset_3__y) as dataset_3,
        (dataset_4__x, dataset_4__y) as dataset_4,
    into
        name dataset
        value x, y
),

coefficients as (
    select
        dataset,
        avg(x) as avg_x,
        avg(y) as avg_y,
        avg(x * x) as avg_xx,
        avg(x * y) as avg_xy,
        (avg_x * avg_y - avg_xy) / (avg_x * avg_x - avg_xx) as m,
        avg_y - m * avg_x as c,
    from unpivoted
    group by dataset
),

predictions as (
    select
        dataset,
        x,
        round(m * x + c, 1) as y
    from coefficients
        cross join (values (16), (17), (18)) as v(x)
)

pivot predictions
on dataset
using any_value(y)
order by x