Temperature anomaly detection 🔍
Scenario
Some scientists are studying temperature data from various sites.
They are interested in identifying temperature readings that are significantly higher than the surrounding readings.
Question
Given the temperature data below, select the temperature readings that are at least 10% higher than the average of the previous 2 and following 2 readings for the same site.
Do not include the current reading in the average calculation, and use the calculated average temperature as the denominator for the 10% calculation.
If there are fewer than 2 readings before or 2 after the current reading, do not include the reading in the output.
The output should only have the temperature readings above the threshold, with the columns:
site_id
reading_datetime
temperature
average_temperature
as the average of the 4 readings around the current reading (2 each side), rounded to 4 decimal placespercentage_increase
as the percentage increase of the current reading over theaverage_temperature
, rounded to 4 decimal places
Order the output by site_id
and reading_datetime
.
Expand for the DDL
create table readings (
site_id integer,
reading_datetime timestamp,
temperature decimal(5, 2) not null,
primary key (site_id, reading_datetime)
);
insert into readings
values
(1, '2021-01-01 03:26:23', 20.02),
(1, '2021-01-01 19:52:46', 20.17),
(1, '2021-01-02 02:01:17', 22.43),
(1, '2021-01-02 21:02:34', 19.91),
(1, '2021-01-03 04:12:56', 20.11),
(1, '2021-01-03 20:23:12', 20.22),
(1, '2021-01-04 05:34:23', 20.04),
(1, '2021-01-04 21:45:34', 22.69),
(1, '2021-01-05 06:56:45', 20.50),
(1, '2021-01-05 22:07:56', 20.32),
(1, '2021-01-06 07:18:07', 20.17),
(1, '2021-01-06 23:29:18', 23.58),
(2, '2021-01-01 04:30:10', 21.52),
(2, '2021-01-01 17:12:25', 21.48),
(2, '2021-01-02 01:59:43', 23.10),
(2, '2021-01-02 20:05:53', 18.19),
(2, '2021-01-03 03:17:12', 21.23),
(2, '2021-01-03 19:25:20', 21.27),
(2, '2021-01-04 04:33:34', 21.51),
(2, '2021-01-04 20:41:45', 21.49),
(2, '2021-01-05 05:49:56', 21.52),
(2, '2021-01-05 21:58:07', 21.48),
(2, '2021-01-06 07:06:18', 21.50),
(2, '2021-01-06 23:14:29', 21.52)
;
The solution can be found at:
Sample input
site_id | reading_datetime | temperature |
---|---|---|
1 | 2021-06-01 02:12:31 | 26.17 |
1 | 2021-06-01 21:17:12 | 26.32 |
1 | 2021-06-02 01:19:56 | 29.58 |
1 | 2021-06-02 19:35:32 | 27.06 |
1 | 2021-06-03 03:14:53 | 26.26 |
1 | 2021-06-03 20:47:42 | 28.37 |
with readings(site_id, reading_datetime, temperature) as (
values
(1, '2021-06-01 02:12:31'::timestamp, 26.17),
(1, '2021-06-01 21:17:12'::timestamp, 26.32),
(1, '2021-06-02 01:19:56'::timestamp, 29.58),
(1, '2021-06-02 19:35:32'::timestamp, 27.06),
(1, '2021-06-03 03:14:53'::timestamp, 26.26),
(1, '2021-06-03 20:47:42'::timestamp, 28.37)
)
Sample output
site_id | reading_datetime | temperature | average_temperature | percentage_increase |
---|---|---|---|---|
1 | 2021-06-02 01:19:56 | 29.58 | 26.4525 | 11.8231 |
solution(site_id, reading_datetime, temperature, average_temperature, percentage_increase) as (
values
(1, '2021-06-02 01:19:56'::timestamp, 29.58, 26.4525, 11.8231)
)
Hint 1
Use a window function (or two!) to calculate the average temperature of the surrounding readings.
Hint 2
Use another window function to identify rows with at least 4 surrounding readings (2 before and 2 after).