Skip to content

Predicting values 🎱

Scenario

Some students are studying Anscombe's quartet and have been asked to predict the y values for a given set of x values for each of the four datasets using linear regression.

Question

For each of the four datasets in Anscombe's quartet, use linear regression to predict the y values for x values 16, 17, and 18.

The output should have a row for each x value (16, 17, 18), with the columns:

  • x
  • dataset_1 as the predicted value for dataset 1, rounded to 1 decimal place
  • dataset_2 as the predicted value for dataset 2, rounded to 1 decimal place
  • dataset_3 as the predicted value for dataset 3, rounded to 1 decimal place
  • dataset_4 as the predicted value for dataset 4, rounded to 1 decimal place

Order the output by x.

Expand for the DDL
create table anscombes_quartet (
    dataset_1__x int,
    dataset_1__y decimal(5, 2),
    dataset_2__x int,
    dataset_2__y decimal(5, 2),
    dataset_3__x int,
    dataset_3__y decimal(5, 2),
    dataset_4__x int,
    dataset_4__y decimal(5, 2),
);
insert into anscombes_quartet
values
    (10,  8.04, 10, 9.14, 10,  7.46,  8,  6.58),
    ( 8,  6.95,  8, 8.14,  8,  6.77,  8,  5.76),
    (13,  7.58, 13, 8.74, 13, 12.74,  8,  7.71),
    ( 9,  8.81,  9, 8.77,  9,  7.11,  8,  8.84),
    (11,  8.33, 11, 9.26, 11,  7.81,  8,  8.47),
    (14,  9.96, 14, 8.10, 14,  8.84,  8,  7.04),
    ( 6,  7.24,  6, 6.13,  6,  6.08,  8,  5.25),
    ( 4,  4.26,  4, 3.10,  4,  5.39, 19, 12.50),
    (12, 10.84, 12, 9.13, 12,  8.15,  8,  5.56),
    ( 7,  4.82,  7, 7.26,  7,  6.42,  8,  7.91),
    ( 5,  5.68,  5, 4.74,  5,  5.73,  8,  6.89)
;

There are plenty of resources online that walk through the maths behind linear regression, such as:

The solution can be found at:


Sample input

Use linear regression to predict the y values for x values 6 and 8, using the following datasets:

dataset_1__x dataset_1__y dataset_2__x dataset_2__y
1 2.00 1 9.12
2 4.00 3 31.18
3 6.00 5 55.27
4 8.00 7 61.12
with datasets(dataset_1__x, dataset_1__y, dataset_2__x, dataset_2__y) as (
    values
        (1, 2.00, 1,  9.12),
        (2, 4.00, 3, 31.18),
        (3, 6.00, 5, 55.27),
        (4, 8.00, 7, 61.12)
)
Sample output
x dataset_1 dataset_2
6 12 57.2
8 16 75.2
solution(x, dataset_1, dataset_2) as (
    values
        (6, 12, 57.2),
        (8, 16, 75.2)
)
Hint 1

Unpivot the datasets so that you have a table with headers dataset, x, and y, then apply the linear regression, and finally pivot the results back.

Hint 2

For databases that support them, use the regr_slope and regr_intercept functions (or equivalent) to calculate the slope and intercept of the regression line. Otherwise, you'll need to calculate these manually 😄