In this exercise we will process and explore the content from streaming services such as Netflix and Disney+. We will cover joining data from multiple sources, exploratory data analysis, and no-code visualizations.
To start off, I use a PostgreSQL integration with specific credentials. I then
merge all into one and add a service column as an identifier. UNION is used to not have duplicates
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'disney' AS service
FROM disney
UNION
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'hulu' AS service
FROM hulu
/* Selects all service data */
WITH service_table AS (
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'disney' AS service
FROM disney
UNION
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'hulu' AS service
FROM hulu
)
/* Join genres to service data */
SELECT *
FROM service_table t
LEFT JOIN genres g ON t.title = g.film
Add the genre information in our previous query with left join.
I use common table expression because it breaks apart the logic of the query a bit more, which makes it more readable.
We use left join to match data on the service data. If we used inner-right join we would then sacrifice information of service data.
/* Selects all service data */
WITH service_table AS (
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'disney' AS service
FROM disney
UNION
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'hulu' AS service
FROM hulu
),
/* Join genres to service data */
all_data AS (
SELECT *
FROM service_table t
LEFT JOIN genres g ON t.title = g.film
)
/* Handle missing values*/
SELECT
SUM(CASE WHEN imdb IS NULL THEN 1 ELSE 0 END) AS imbd_missing,
SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) AS age_missing,
SUM(CASE WHEN rotten_tomatoes IS NULL THEN 1 ELSE 0 END) AS rt_missing
FROM all_data