Analyzing Streaming Service Content

In this exercise we will process and explore the content from streaming services such as Netflix and Disney+. We will cover joining data from multiple sources, exploratory data analysis, and no-code visualizations.

Στιγμιότυπο 2022-11-23, 12.45.18 πμ.png

To start off, I use a PostgreSQL integration with specific credentials. I then

merge all into one and add a service column as an identifier. UNION is used to not have duplicates

SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'disney' AS service
FROM disney
UNION 
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'hulu' AS service
FROM hulu

Στιγμιότυπο 2022-11-23, 4.45.53 μμ.png

/* Selects all service data */ 
WITH service_table AS (
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'disney' AS service
FROM disney
UNION 
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'hulu' AS service
FROM hulu
)
/* Join genres to service data */ 

SELECT *
FROM service_table t
LEFT JOIN genres g ON t.title = g.film

Στιγμιότυπο 2022-11-23, 4.46.51 μμ.png

Στιγμιότυπο 2022-11-23, 4.46.32 μμ.png

Add the genre information in our previous query with left join.

I use common table expression because it breaks apart the logic of the query a bit more, which makes it more readable.

We use left join to match data on the service data. If we used inner-right join we would then sacrifice information of service data.

/* Selects all service data */ 
WITH service_table AS (
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'disney' AS service
FROM disney
UNION 
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'hulu' AS service
FROM hulu
),
/* Join genres to service data */ 
all_data AS (
    SELECT *
	FROM service_table t
	LEFT JOIN genres g ON t.title = g.film
			)
/* Handle missing values*/          
SELECT 
SUM(CASE WHEN imdb IS NULL THEN 1 ELSE 0 END) AS imbd_missing,
SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) AS age_missing,
SUM(CASE WHEN rotten_tomatoes IS NULL THEN 1 ELSE 0 END) AS rt_missing
FROM all_data