Metric correlation 🔀
Result Set
Regardless of the database (since they should all use a similar collation), the result set should look like:
segment | metric_pair | correlation |
---|---|---|
0 | metric_1, metric_4 | 0.8195 |
1 | metric_2, metric_4 | 0.6741 |
2 | metric_2, metric_4 | 0.8539 |
3 | metric_3, metric_5 | 0.9975 |
4 | metric_1, metric_4 | 0.9122 |
5 | metric_3, metric_5 | 0.9985 |
6 | metric_3, metric_5 | 0.9961 |
7 | metric_2, metric_4 | 0.5686 |
8 | metric_2, metric_4 | 0.8405 |
9 | metric_1, metric_5 | 0.9989 |
10 | metric_2, metric_5 | 0.8042 |
Expand for the DDL
solution(segment, metric_pair, correlation) as (
values
( 0, 'metric_1, metric_4', 0.8195),
( 1, 'metric_2, metric_4', 0.6741),
( 2, 'metric_2, metric_4', 0.8539),
( 3, 'metric_3, metric_5', 0.9975),
( 4, 'metric_1, metric_4', 0.9122),
( 5, 'metric_3, metric_5', 0.9985),
( 6, 'metric_3, metric_5', 0.9961),
( 7, 'metric_2, metric_4', 0.5686),
( 8, 'metric_2, metric_4', 0.8405),
( 9, 'metric_1, metric_5', 0.9989),
(10, 'metric_2, metric_5', 0.8042)
)
Solution
Some SQL solutions per database are provided below.
DuckDB
with
unpivoted as (
unpivot metrics
on columns(* exclude (segment, customer_id))
into
name metric_name
value metric_value
),
joined as (
select
l.customer_id,
l.segment,
l.metric_name as l_metric_name,
r.metric_name as r_metric_name,
l.metric_value as l_metric_value,
r.metric_value as r_metric_value
from unpivoted as l
inner join unpivoted as r
on l.customer_id = r.customer_id
and l.segment = r.segment
and l.metric_name < r.metric_name /* No point in pairing metrics twice */
)
select
segment,
l_metric_name || ', ' || r_metric_name as metric_pair,
round(corr(l_metric_value, r_metric_value), 4) as correlation
from joined
group by segment, l_metric_name, r_metric_name
qualify correlation = max(correlation) over (partition by segment)
order by segment, metric_pair