Predicting values 🎱
Result Set
Regardless of the database, the result set should look like:
x | dataset_1 | dataset_2 | dataset_3 | dataset_4 |
---|---|---|---|---|
16 | 11.0 | 11.0 | 11.0 | 11.0 |
17 | 11.5 | 11.5 | 11.5 | 11.5 |
18 | 12.0 | 12.0 | 12.0 | 12.0 |
This is one of the interesting things about Anscombe's quartet (and is the reason Anscombe created it): the four datasets have the same line of best fit, but look very different when plotted!
Expand for the DDL
solution(x, dataset_1, dataset_2, dataset_3, dataset_4) as (
values
(16, 11.0, 11.0, 11.0, 11.0),
(17, 11.5, 11.5, 11.5, 11.5),
(18, 12.0, 12.0, 12.0, 12.0)
)
Solution
Some SQL solutions per database are provided below.
DuckDB
Here's a solution using the regr_slope
and regr_intercept
functions:
with
unpivoted as (
unpivot anscombes_quartet
on
(dataset_1__x, dataset_1__y) as dataset_1,
(dataset_2__x, dataset_2__y) as dataset_2,
(dataset_3__x, dataset_3__y) as dataset_3,
(dataset_4__x, dataset_4__y) as dataset_4,
into
name dataset
value x, y
),
coefficients as (
select
dataset,
regr_slope(y, x) as m,
regr_intercept(y, x) as c,
from unpivoted
group by dataset
),
predictions as (
select
dataset,
x,
round(m * x + c, 1) as y
from coefficients
cross join (values (16), (17), (18)) as v(x)
)
pivot predictions
on dataset
using any_value(y)
order by x
...and one doing this manually:
with
unpivoted as (
unpivot anscombes_quartet
on
(dataset_1__x, dataset_1__y) as dataset_1,
(dataset_2__x, dataset_2__y) as dataset_2,
(dataset_3__x, dataset_3__y) as dataset_3,
(dataset_4__x, dataset_4__y) as dataset_4,
into
name dataset
value x, y
),
coefficients as (
select
dataset,
avg(x) as avg_x,
avg(y) as avg_y,
avg(x * x) as avg_xx,
avg(x * y) as avg_xy,
(avg_x * avg_y - avg_xy) / (avg_x * avg_x - avg_xx) as m,
avg_y - m * avg_x as c,
from unpivoted
group by dataset
),
predictions as (
select
dataset,
x,
round(m * x + c, 1) as y
from coefficients
cross join (values (16), (17), (18)) as v(x)
)
pivot predictions
on dataset
using any_value(y)
order by x