Predicting values 🎱
Scenario
Some students are studying Anscombe's quartet and have been asked to predict the y
values for a given set of x
values for each of the four datasets using linear regression.
Question
For each of the four datasets in Anscombe's quartet, use linear regression to predict the y
values for x
values 16
, 17
, and 18
.
The output should have a row for each x
value (16
, 17
, 18
), with the columns:
x
dataset_1
as the predicted value for dataset 1, rounded to 1 decimal placedataset_2
as the predicted value for dataset 2, rounded to 1 decimal placedataset_3
as the predicted value for dataset 3, rounded to 1 decimal placedataset_4
as the predicted value for dataset 4, rounded to 1 decimal place
Order the output by x
.
Expand for the DDL
create table anscombes_quartet (
dataset_1__x int,
dataset_1__y decimal(5, 2),
dataset_2__x int,
dataset_2__y decimal(5, 2),
dataset_3__x int,
dataset_3__y decimal(5, 2),
dataset_4__x int,
dataset_4__y decimal(5, 2),
);
insert into anscombes_quartet
values
(10, 8.04, 10, 9.14, 10, 7.46, 8, 6.58),
( 8, 6.95, 8, 8.14, 8, 6.77, 8, 5.76),
(13, 7.58, 13, 8.74, 13, 12.74, 8, 7.71),
( 9, 8.81, 9, 8.77, 9, 7.11, 8, 8.84),
(11, 8.33, 11, 9.26, 11, 7.81, 8, 8.47),
(14, 9.96, 14, 8.10, 14, 8.84, 8, 7.04),
( 6, 7.24, 6, 6.13, 6, 6.08, 8, 5.25),
( 4, 4.26, 4, 3.10, 4, 5.39, 19, 12.50),
(12, 10.84, 12, 9.13, 12, 8.15, 8, 5.56),
( 7, 4.82, 7, 7.26, 7, 6.42, 8, 7.91),
( 5, 5.68, 5, 4.74, 5, 5.73, 8, 6.89)
;
There are plenty of resources online that walk through the maths behind linear regression, such as:
The solution can be found at:
Sample input
Use linear regression to predict the y
values for x
values 6
and 8
, using the following datasets:
dataset_1__x | dataset_1__y | dataset_2__x | dataset_2__y |
---|---|---|---|
1 | 2.00 | 1 | 9.12 |
2 | 4.00 | 3 | 31.18 |
3 | 6.00 | 5 | 55.27 |
4 | 8.00 | 7 | 61.12 |
with datasets(dataset_1__x, dataset_1__y, dataset_2__x, dataset_2__y) as (
values
(1, 2.00, 1, 9.12),
(2, 4.00, 3, 31.18),
(3, 6.00, 5, 55.27),
(4, 8.00, 7, 61.12)
)
Sample output
x | dataset_1 | dataset_2 |
---|---|---|
6 | 12 | 57.2 |
8 | 16 | 75.2 |
solution(x, dataset_1, dataset_2) as (
values
(6, 12, 57.2),
(8, 16, 75.2)
)
Hint 1
Unpivot the datasets so that you have a table with headers dataset
, x
, and y
, then apply the linear regression, and finally pivot the results back.
Hint 2
For databases that support them, use the regr_slope
and regr_intercept
functions (or equivalent) to calculate the slope and intercept of the regression line. Otherwise, you'll need to calculate these manually 😄