Skip to content

Temperature anomaly detection 🔍

Scenario

Some scientists are studying temperature data from various sites.

They are interested in identifying temperature readings that are significantly higher than the surrounding readings.

Question

Given the temperature data below, select the temperature readings that are at least 10% higher than the average of the previous 2 and following 2 readings for the same site.

Do not include the current reading in the average calculation, and use the calculated average temperature as the denominator for the 10% calculation.

If there are fewer than 2 readings before or 2 after the current reading, do not include the reading in the output.

The output should only have the temperature readings above the threshold, with the columns:

  • site_id
  • reading_datetime
  • temperature
  • average_temperature as the average of the 4 readings around the current reading (2 each side), rounded to 4 decimal places
  • percentage_increase as the percentage increase of the current reading over the average_temperature, rounded to 4 decimal places

Order the output by site_id and reading_datetime.

Expand for the DDL
create table readings (
    site_id          integer,
    reading_datetime timestamp,
    temperature      decimal(5, 2) not null,
    primary key (site_id, reading_datetime)
);
insert into readings
values
    (1, '2021-01-01 03:26:23', 20.02),
    (1, '2021-01-01 19:52:46', 20.17),
    (1, '2021-01-02 02:01:17', 22.43),
    (1, '2021-01-02 21:02:34', 19.91),
    (1, '2021-01-03 04:12:56', 20.11),
    (1, '2021-01-03 20:23:12', 20.22),
    (1, '2021-01-04 05:34:23', 20.04),
    (1, '2021-01-04 21:45:34', 22.69),
    (1, '2021-01-05 06:56:45', 20.50),
    (1, '2021-01-05 22:07:56', 20.32),
    (1, '2021-01-06 07:18:07', 20.17),
    (1, '2021-01-06 23:29:18', 23.58),
    (2, '2021-01-01 04:30:10', 21.52),
    (2, '2021-01-01 17:12:25', 21.48),
    (2, '2021-01-02 01:59:43', 23.10),
    (2, '2021-01-02 20:05:53', 18.19),
    (2, '2021-01-03 03:17:12', 21.23),
    (2, '2021-01-03 19:25:20', 21.27),
    (2, '2021-01-04 04:33:34', 21.51),
    (2, '2021-01-04 20:41:45', 21.49),
    (2, '2021-01-05 05:49:56', 21.52),
    (2, '2021-01-05 21:58:07', 21.48),
    (2, '2021-01-06 07:06:18', 21.50),
    (2, '2021-01-06 23:14:29', 21.52)
;

The solution can be found at:


Sample input
site_id reading_datetime temperature
1 2021-06-01 02:12:31 26.17
1 2021-06-01 21:17:12 26.32
1 2021-06-02 01:19:56 29.58
1 2021-06-02 19:35:32 27.06
1 2021-06-03 03:14:53 26.26
1 2021-06-03 20:47:42 28.37
with readings(site_id, reading_datetime, temperature) as (
    values
        (1, '2021-06-01 02:12:31'::timestamp, 26.17),
        (1, '2021-06-01 21:17:12'::timestamp, 26.32),
        (1, '2021-06-02 01:19:56'::timestamp, 29.58),
        (1, '2021-06-02 19:35:32'::timestamp, 27.06),
        (1, '2021-06-03 03:14:53'::timestamp, 26.26),
        (1, '2021-06-03 20:47:42'::timestamp, 28.37)
)
Sample output
site_id reading_datetime temperature average_temperature percentage_increase
1 2021-06-02 01:19:56 29.58 26.4525 11.8231
solution(site_id, reading_datetime, temperature, average_temperature, percentage_increase) as (
    values
        (1, '2021-06-02 01:19:56'::timestamp, 29.58, 26.4525, 11.8231)
)
Hint 1

Use a window function (or two!) to calculate the average temperature of the surrounding readings.

Hint 2

Use another window function to identify rows with at least 4 surrounding readings (2 before and 2 after).