Skip to content

Predicting values 🎱

Tip

Solution to the following problem:

Result Set

Regardless of the database, the result set should look like:

x dataset_1 dataset_2 dataset_3 dataset_4
16 11.0 11.0 11.0 11.0
17 11.5 11.5 11.5 11.5
18 12.0 12.0 12.0 12.0

This is one of the interesting things about Anscombe's quartet (and is the reason Anscombe created it): the four datasets have the same line of best fit, but look very different when plotted!

Expand for the DDL
solution(x, dataset_1, dataset_2, dataset_3, dataset_4) as (
    values
    (16, 11.0, 11.0, 11.0, 11.0),
    (17, 11.5, 11.5, 11.5, 11.5),
    (18, 12.0, 12.0, 12.0, 12.0)
)

Solution

Some SQL solutions per database are provided below.

DuckDB

Here's a solution using the regr_slope and regr_intercept functions:

with

unpivoted as (
    unpivot anscombes_quartet
    on
        (dataset_1__x, dataset_1__y) as dataset_1,
        (dataset_2__x, dataset_2__y) as dataset_2,
        (dataset_3__x, dataset_3__y) as dataset_3,
        (dataset_4__x, dataset_4__y) as dataset_4,
    into
        name dataset
        value x, y
),

coefficients as (
    select
        dataset,
        regr_slope(y, x) as m,
        regr_intercept(y, x) as c,
    from unpivoted
    group by dataset
),

predictions as (
    select
        dataset,
        x,
        round(m * x + c, 1) as y
    from coefficients
        cross join (values (16), (17), (18)) as v(x)
)

pivot predictions
on dataset
using any_value(y)
order by x

...and one doing this manually:

with

unpivoted as (
    unpivot anscombes_quartet
    on
        (dataset_1__x, dataset_1__y) as dataset_1,
        (dataset_2__x, dataset_2__y) as dataset_2,
        (dataset_3__x, dataset_3__y) as dataset_3,
        (dataset_4__x, dataset_4__y) as dataset_4,
    into
        name dataset
        value x, y
),

coefficients as (
    select
        dataset,
        avg(x) as avg_x,
        avg(y) as avg_y,
        avg(x * x) as avg_xx,
        avg(x * y) as avg_xy,
        (avg_x * avg_y - avg_xy) / (avg_x * avg_x - avg_xx) as m,
        avg_y - m * avg_x as c,
    from unpivoted
    group by dataset
),

predictions as (
    select
        dataset,
        x,
        round(m * x + c, 1) as y
    from coefficients
        cross join (values (16), (17), (18)) as v(x)
)

pivot predictions
on dataset
using any_value(y)
order by x