Combining dimensions 🔗
Scenario
A human resources department has several dimension tables with information about their employees.
For reporting purposes, they need to combine these dimensions into a single dimension table which shows all the information about each employee.
Question
Write a query which combines the three dimension tables below into a single dimension table.
The output should have the columns from the three tables combined:
employee_id
valid_from
valid_until
date_of_birth
gender
ethnicity
job_title
salary
email
phone
The number of rows in the output depends on the number of changes that need to be accounted for, and the valid_from
and valid_until
columns should be recalculated as necessary to account for these changes.
Order the output by employee_id
and valid_from
.
Expand for the DDL
create table dim_employee_demographics (
employee_id int,
valid_from date,
valid_until date,
date_of_birth date,
gender varchar,
ethnicity varchar,
primary key (employee_id, valid_from)
);
insert into dim_employee_demographics
values
(1, '2021-06-13', '2022-06-08', '2004-02-18', 'Female', 'Malaysian'),
(2, '2021-10-19', '9999-12-31', '1963-12-12', 'Female', 'Navajo'),
(3, '2022-01-29', '9999-12-31', '2000-10-17', 'Genderqueer', 'White'),
(4, '2022-04-28', '2024-03-12', '1987-12-13', 'Male', 'Black'),
(1, '2022-06-09', '9999-12-31', '2004-02-18', 'Non-binary', 'Malaysian'),
(5, '2022-08-31', '9999-12-31', '1999-09-10', 'Female', 'Asian'),
(4, '2024-03-13', '9999-12-31', '1987-12-13', 'Female', 'Black')
;
create table dim_employee_career (
employee_id int,
valid_from date,
valid_until date,
job_title varchar,
salary decimal(10, 2),
primary key (employee_id, valid_from)
);
insert into dim_employee_career
values
(1, '2021-06-13', '2023-05-27', 'Teacher', 5000.00),
(1, '2023-05-28', '9999-12-31', 'Teacher', 6000.00),
(2, '2021-10-19', '2023-11-26', 'Data Analyst', 4000.00),
(2, '2023-11-27', '2024-03-01', 'Data Analyst', 6500.00),
(2, '2024-03-02', '9999-12-31', 'Engineering Manager', 7000.00),
(3, '2022-01-29', '2023-04-02', 'Software Engineer', 6000.00),
(3, '2023-04-03', '9999-12-31', 'Software Engineer', 8000.00),
(4, '2022-06-12', '9999-12-31', 'Founder', null)
;
create table dim_employee_contact (
employee_id int,
valid_from date,
valid_until date,
email varchar,
phone varchar,
primary key (employee_id, valid_from)
);
insert into dim_employee_contact
values
(1, '2021-06-13', '2024-01-29', 'c.perot0@gmail.com', null),
(1, '2024-01-30', '9999-12-31', 'c.perot0@gmail.com', '1986474151'),
(2, '2021-10-19', '2024-04-04', null, null),
(2, '2024-04-05', '9999-12-31', 'hpicard1@bing.com', null),
(4, '2022-06-12', '2022-12-01', 'tbayford3@hotmail.co.uk', '01246 209863'),
(4, '2022-12-02', '2023-11-11', 'tbayford3@hotmail.co.uk', '01752 492269'),
(4, '2023-11-12', '9999-12-31', 'tmacalinden@hotmail.co.uk', '01270 530950'),
(5, '2023-02-17', '9999-12-31', null, null)
;
The "dimensions" in this context are Slowly Changing Dimensions of type 2 from the Star Schema modelling framework.
The solution can be found at:
Sample input
Employee Demographics Dimension
employee_id | valid_from | valid_until | date_of_birth | gender | ethnicity |
---|---|---|---|---|---|
1 | 2021-07-12 | 9999-12-31 | 1995-02-24 | Female | Black |
2 | 2023-12-07 | 9999-12-31 | 1999-12-12 | Male | Asian |
Employee Career Dimension
employee_id | valid_from | valid_until | job_title | salary |
---|---|---|---|---|
1 | 2021-07-12 | 2023-02-18 | Student | 0.00 |
1 | 2023-02-19 | 9999-12-31 | Pianist | 2000.00 |
2 | 2023-12-08 | 9999-12-31 | Paramedic | 4000.00 |
Employee Contact Dimension
employee_id | valid_from | valid_until | phone | |
---|---|---|---|---|
1 | 2021-07-12 | 9999-12-31 | abcde@gmail.com | 123-456-789 |
2 | 2023-12-08 | 2023-12-31 | null | 01234 567890 |
2 | 2024-01-01 | 9999-12-31 | something@mail.net | 0300 123 456 |
with
dim_employee_demographics(employee_id, valid_from, valid_until, date_of_birth, gender, ethnicity) as (
values
(1, '2021-07-12'::date, '9999-12-31'::date, '1995-02-24'::date, 'Female', 'White'),
(2, '2023-12-07'::date, '9999-12-31'::date, '1999-12-12'::date, 'Male', 'Asian')
),
dim_employee_career(employee_id, valid_from, valid_until, job_title, salary) as (
values
(1, '2021-07-12'::date, '2023-02-18'::date, 'Student', 0.00),
(1, '2023-02-19'::date, '9999-12-31'::date, 'Pianist', 2000.00),
(2, '2023-12-08'::date, '9999-12-31'::date, 'Paramedic', 4000.00)
),
dim_employee_contact(employee_id, valid_from, valid_until, email, phone) as (
values
(1, '2021-07-12'::date, '9999-12-31'::date, 'abcde@gmail.com', '123-456-789'),
(2, '2023-12-08'::date, '2023-12-31'::date, null, '01234 567890'),
(2, '2024-01-01'::date, '9999-12-31'::date, 'something@mail.net', '0300 123 456')
)
Sample output
employee_id | valid_from | valid_until | date_of_birth | gender | ethnicity | job_title | salary | phone | |
---|---|---|---|---|---|---|---|---|---|
1 | 2021-07-12 | 2023-02-18 | 1995-02-24 | Female | Black | Student | 0.00 | abcde@gmail.com | 123-456-789 |
1 | 2023-02-19 | 9999-12-31 | 1995-02-24 | Female | Black | Pianist | 2000.00 | abcde@gmail.com | 123-456-789 |
2 | 2023-12-07 | 2023-12-07 | 1999-12-12 | Male | Asian | null | null | null | null |
2 | 2023-12-08 | 2023-12-31 | 1999-12-12 | Male | Asian | Paramedic | 4000.00 | null | 01234 567890 |
2 | 2024-01-01 | 9999-12-31 | 1999-12-12 | Male | Asian | Paramedic | 4000.00 | something@mail.net | 0300 123 456 |
solution(employee_id, valid_from, valid_until, date_of_birth, gender, ethnicity, job_title, salary, email, phone) as (
values
(1, '2021-07-12'::date, '2023-02-18'::date, '1995-02-24'::date, 'Female', 'Black', 'Student', 0.00, 'abcde@gmail.com', '123-456-789'),
(1, '2023-02-19'::date, '9999-12-31'::date, '1995-02-24'::date, 'Female', 'Black', 'Pianist', 2000.00, 'abcde@gmail.com', '123-456-789'),
(2, '2023-12-07'::date, '2023-12-07'::date, '1999-12-12'::date, 'Male', 'Asian', null, null, null, null),
(2, '2023-12-08'::date, '2023-12-31'::date, '1999-12-12'::date, 'Male', 'Asian', 'Paramedic', 4000.00, null, '01234 567890'),
(2, '2024-01-01'::date, '9999-12-31'::date, '1999-12-12'::date, 'Male', 'Asian', 'Paramedic', 4000.00, 'something@mail.net', '0300 123 456')
)
Hint 1
The valid_from
values in the output will be each of the distinct valid_from
values from the three tables for the given employee, so start by constructing an "axis" which is the distinct combination of all employee_id
and valid_from
values across the three tables.
Hint 2
For databases that support it, use an ASOF
join to add the latest information from each dimension to the "axis" for each employee_id
and valid_from
combination.
For databases that don't support ASOF
joins, calculate the valid_until
values for the valid_from
values in the "axis" and then join the dimension tables to the axis by where the axis valid_from
-valid_until
range in contained within the dimension valid_from
-valid_until
range.