IMDb 10,000 Netflix Movies and Shows¶
End-to-End data analysis project by Mayank Sharma
Table of contents¶
- Analysis Overview
- Introduction
- Required libraries
- The problem domain
- Step 1: Answering the question
- Step 2: Checking the data
- Step 3: Tidying the data
- String Formatting Error
- Inconsistent Year Formats
- One-Line Columns
- Restructuring Genre Column
- Missing Values
- Exploratory analysis
- Conclusions
Overview¶
import warnings
warnings.filterwarnings("ignore") # For sns styling-related warnings
top_10_views()
bottom_10_views()
top_10_rating()
bottom_10_rating()
series_length()
Introduction¶
In the time it takes to scroll through Netflix and choose a show, thousands of new ratings, reviews, and streaming interactions are generated across the globe. Platforms like Netflix and IMDb continuously accumulate vast amounts of data about movies and TV shows — from audience ratings and genres to release years, runtimes, and popularity trends. Hidden within this data are powerful insights about viewer preferences, industry patterns, and the evolution of entertainment over time.
As the volume of entertainment data grows, so does the importance of Data Science in transforming raw information into meaningful stories. Drawing from disciplines such as statistics, programming, and domain knowledge in media analytics, Data Science allows us to clean, structure, analyze, and interpret complex datasets in a way that supports informed decision-making.
In this project, I will analyze and clean an extracted dataset of IMDb’s Netflix Top 10,000 movies and TV shows. The goal is to walk through a complete data analysis workflow — from data cleaning and preprocessing to exploratory analysis and insight generation — demonstrating how raw entertainment data can be transformed into actionable understanding.
This notebook is a structured, end-to-end analysis learning project. If you identify areas for improvement, alternative interpretations, or additional insights worth exploring, comments and PRs are welcome.
Required Libraries¶
The primary libraries used are:
- NumPy: Provides a fast numerical array structure and helper functions.
- pandas: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
- scikit-learn: The essential Machine Learning package in Python.
- matplotlib: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
- Seaborn: Advanced statistical plotting library.
import kagglehub
path = kagglehub.dataset_download("bharatnatrayn/movies-dataset-for-feature-extracion-prediction")
print("Path to dataset:", path)
Using Colab cache for faster access to the 'movies-dataset-for-feature-extracion-prediction' dataset. Path to dataset: /kaggle/input/movies-dataset-for-feature-extracion-prediction
The Problem Domain¶
Our task is exploring and extracting insights from a large dataset of the IMDb Netflix Top 10,000 movies and TV shows. The dataset contains information such as title, year, genre, rating, votes, runtime, gross earnings, cast, and brief descriptions.
We are working with data sourced from IMDb and content available on Netflix. However, the raw dataset is not analysis-ready — it contains missing values, inconsistent formatting (newline characters in genres and descriptions), mixed data types (years with variants like “2010–2022”, "2010-", "2010"), and null values in several fields. Therefore, a major component of this project involves data cleaning and preprocessing before performing meaningful analysis.
The project objective is not to build a predictive model, but rather to:
- Clean and standardize the dataset
- Determine useful features
- Explore trends in ratings, genres, and release years
- Analyze relationships between ratings, votes, runtime, and gross revenue
Step 1: Shaping Goal¶
The first step is to clearly define the problem and determine what the target is.
The type of data analytic question?
This project is primarily:
Exploratory – Understanding distributions of ratings, genres, runtimes, and release years.
Descriptive – Summarizing trends in Netflix’s most popular or highest-rated content.
The metric for success?
Since this is an exploratory data analysis (EDA) project, success will be measured by:
- Successfully cleaning and structuring messy raw data
- Reducing missing or inconsistent values
- Producing clear summary statistics and visualizations
- Identifying meaningful, data-supported insights
The context for the question and its business application?
Streaming platforms like Netflix rely heavily on data-driven decision-making. Insights derived from IMDb rating trends, vote counts, genre distributions, and runtime preferences can help inform:
- Content acquisition strategies
- Investment in productions
- Marketing decisions
- Understanding audience engagement patterns
Step 2: Checking the data¶
Let's take a look at the data we're working with.
Questions to answer:
- Is there anything wrong with the data?
- Are there any quirks with the data?
- Do we need to fix or remove any of the data?
Loading dataset into a pandas dataframe:
import pandas as pd
movies = pd.read_csv(f"{path}/movies.csv")
movies.head()
| MOVIES | YEAR | GENRE | RATING | ONE-LINE | STARS | VOTES | RunTime | Gross | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Blood Red Sky | (2021) | \nAction, Horror, Thriller | 6.1 | \nA woman with a mysterious illness is forced ... | \n Director:\nPeter Thorwarth\n| \n Star... | 21,062 | 121.0 | NaN |
| 1 | Masters of the Universe: Revelation | (2021– ) | \nAnimation, Action, Adventure | 5.0 | \nThe war for Eternia begins again in what may... | \n \n Stars:\nChris Wood, \nSara... | 17,870 | 25.0 | NaN |
| 2 | The Walking Dead | (2010–2022) | \nDrama, Horror, Thriller | 8.2 | \nSheriff Deputy Rick Grimes wakes up from a c... | \n \n Stars:\nAndrew Lincoln, \n... | 885,805 | 44.0 | NaN |
| 3 | Rick and Morty | (2013– ) | \nAnimation, Adventure, Comedy | 9.2 | \nAn animated series that follows the exploits... | \n \n Stars:\nJustin Roiland, \n... | 414,849 | 23.0 | NaN |
| 4 | Army of Thieves | (2021) | \nAction, Crime, Horror | NaN | \nA prequel, set before the events of Army of ... | \n Director:\nMatthias Schweighöfer\n| \n ... | NaN | NaN | NaN |
The dataset is in a tabular format, and the first row defines the column headers:
MOVIES, YEAR, GENRE, RATING, ONE-LINE, STARS, VOTES, RunTime, Gross
which are descriptive enough to understand what each feature represents without needing documentation. Each row corresponds to a single title, either a movie or a TV show.
However, the dataset is not perfectly curated and requires careful inspection before analysis.
Next, it's always a good idea to look at the distribution of our data — especially the outliers.
Let's print out some summary statistics about the data set.
movies.describe()
| RATING | RunTime | |
|---|---|---|
| count | 8179.000000 | 7041.000000 |
| mean | 6.921176 | 68.688539 |
| std | 1.220232 | 47.258056 |
| min | 1.100000 | 1.000000 |
| 25% | 6.200000 | 36.000000 |
| 50% | 7.100000 | 60.000000 |
| 75% | 7.800000 | 95.000000 |
| max | 9.900000 | 853.000000 |
We can see several useful values from this table. For example, the count row shows 8179 values for RATING but only 7041 for RunTime, so entries are missing in both columns, and more of them in RunTime.
Other than that, a table like this is rarely useful unless we know that the data should fall in a particular range.
It's better to visualize the data in some way. Visualization makes outliers and errors immediately stand out, whereas they might go unnoticed in a large table of numbers.
Step 3: Tidying the data¶
Now that we've identified several errors in the data set, we need to fix them before we proceed with the analysis.
Let's walk through the issues one-by-one.
1. Handling Duplicates¶
We start by checking for duplicate rows in the dataset. Removing them early saves compute and trouble later.
movies.duplicated().sum()
np.int64(431)
There are 431 duplicate rows in the dataset, likely due to erroneous scraping. We can use DataFrame.drop_duplicates() to remove them.
movies = movies.drop_duplicates()
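As a minimal sketch of what these two calls do, here is a toy frame with one exact duplicate row. The data is hypothetical, and `reset_index(drop=True)` is an optional extra that the notebook itself does not apply (it keeps the original row labels).

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "MOVIES": ["Blood Red Sky", "Dexter", "Blood Red Sky"],
    "RATING": [6.1, 8.6, 6.1],
})

# duplicated() marks rows that are identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Without the `reset_index` call the surviving rows keep their original labels, which is why gaps can appear in the index later on.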
2. String Formatting Error¶
Fields like 'GENRE', 'STARS', and 'ONE-LINE' contain unexpected '\n' characters, possibly due to issues in scraping the data.
We can use pandas string methods to fix this.
movies['GENRE'][0]
'\nAction, Horror, Thriller '
movies['STARS'][0]
'\n Director:\nPeter Thorwarth\n| \n Stars:\nPeri Baumeister, \nCarl Anton Koch, \nAlexander Scheer, \nKais Setti\n'
movies['ONE-LINE'][0]
'\nA woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.'
We use the pandas string methods str.replace() to remove newlines and str.strip() to remove leading and trailing spaces.
movies_format = movies.copy()
cols = ['GENRE','STARS','ONE-LINE']
for col in cols:
    movies_format[col] = movies_format[col].str.replace('\n','').str.strip()
print(movies_format['GENRE'][0])
print(movies_format['STARS'][0])
print(movies_format['ONE-LINE'][0])
Action, Horror, Thriller
Director:Peter Thorwarth| Stars:Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti
A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.
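Removing '\n' outright also joins neighbouring tokens (for example `Director:Peter Thorwarth`). A possible alternative, sketched here under the assumption that collapsing every whitespace run into a single space is acceptable, uses a regular expression instead. The sample strings are shortened from the rows shown above.

```python
import pandas as pd

# Shortened sample values in the raw, scraped style.
raw = pd.Series([
    "\nAction, Horror, Thriller ",
    "\n    Director:\nPeter Thorwarth\n| \n    Stars:\nPeri Baumeister, \nCarl Anton Koch\n",
])

# Collapse any run of whitespace (spaces, newlines, tabs) into one space,
# then trim the ends.
clean = raw.str.replace(r"\s+", " ", regex=True).str.strip()

print(clean.tolist())
```

This keeps a space after `Director:` and `Stars:`, which may make later regular expressions on the column slightly easier to read.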
3. Inconsistent Year Formats¶
The YEAR column contains multiple formats:
- (2001) → A movie
- (2001–2007) → A completed TV series
- (2021– ) or (2021–) → An ongoing TV series
Before analysis, we must:
- Remove parentheses
- Strip extra spaces
- Standardize dash formatting
- Engineer more useful features
movies['YEAR'][0:3]
| YEAR | |
|---|---|
| 0 | (2021) |
| 1 | (2021– ) |
| 2 | (2010–2022) |
Instead of keeping YEAR as a complex string, we create structured features:
Start_Year
Extract the first year (e.g. 2001 from '2001-2006').
End_Year
If format is 2001–2007 → End_Year = 2007
If format is 2021– → End_Year = 2021
If format is 2001 → End_Year = 2001
Is_Series
True if a range exists (e.g., 2001–2007 or 2021–)
False if format is a single year (e.g., 2001)
Is_Ongoing
True if format is 2021– (no end year)
False otherwise
This transformation converts a messy string into structured, analysis-ready features that enable:
- Comparing movies vs series
- Analyzing trends in ongoing content
- Studying longevity of shows
- Examining production patterns over time
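The feature definitions above can be sketched in a single pass with a regular expression. This is an illustrative alternative on hypothetical sample values, not the code this notebook uses; the variable names are made up for the example.

```python
import pandas as pd

# Sample values in the three formats described above, plus a missing one.
years = pd.Series(["(2021)", "(2021– )", "(2010–2022)", None])

# Group 1: first four-digit year. Group 2: optional year after an en dash.
parts = years.str.extract(r"(\d{4})(?:\s*–\s*(\d{4})?)?")

start = pd.to_numeric(parts[0], errors="coerce")
end = pd.to_numeric(parts[1], errors="coerce")

is_series = years.str.contains("–", na=False)   # a dash means a range
is_ongoing = is_series & end.isna()              # a range with no end year

# Movies and ongoing series get End_Year = Start_Year.
end = end.fillna(start)

print(pd.DataFrame({"start": start, "end": end,
                    "is_series": is_series, "is_ongoing": is_ongoing}))
```

Rows with a missing YEAR stay missing in both year columns, so they can still be handled explicitly in the missing-value step.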
Let's start by removing the parentheses '(' and ')' and any surrounding whitespace, using replace and strip.
movies_year = movies_format.copy()
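# regex=False treats '(' and ')' as literal characters on every pandas version
movies_year['YEAR'] = movies_year['YEAR'].str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.strip()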
movies_year['YEAR'][0:3]
| YEAR | |
|---|---|
| 0 | 2021 |
| 1 | 2021– |
| 2 | 2010–2022 |
Creating new features: Start_Year and End_Year
movies_year[['Start_Year', 'End_Year']] = (movies_year['YEAR'].str.split('–', expand=True))
movies_year['Start_Year'] = pd.to_numeric(movies_year['Start_Year'], errors='coerce')
movies_year['End_Year'] = pd.to_numeric(movies_year['End_Year'], errors='coerce')
Creating new Features Is_Series and Is_Ongoing
movies_year['Is_Series'] = movies_year['YEAR'].str.contains('–', na=False)
movies_year['Is_Ongoing'] = (movies_year['Is_Series'] & movies_year['End_Year'].isna())
We set End_Year = Start_Year for movies and ongoing series to minimize NA/NULL values. We also drop the string YEAR column, since the useful features have already been extracted.
movies_year.loc[movies_year["Is_Series"].eq(False), "End_Year"] = movies_year["Start_Year"]
movies_year.loc[movies_year["Is_Series"] & movies_year["Is_Ongoing"], "End_Year"] = movies_year["Start_Year"]
movies_year.drop(columns=['YEAR'], inplace=True)
4. ONE-LINE & Votes Column¶
The ONE-LINE column contains short textual descriptions of each title. While it can be useful for NLP-based sentiment or keyword analysis, it does not fit well into this project.
Decision: We will drop the ONE-LINE column to simplify the dataset and reduce unnecessary dimensionality.
The VOTES column, on the other hand, contains commas and is stored as a string instead of the numeric format needed for analysis.
movies_year = movies_year.drop(columns=['ONE-LINE'])
movies_year['VOTES'] = movies_year['VOTES'].str.replace(',', '', regex=False)
movies_year['VOTES'] = pd.to_numeric(movies_year['VOTES'], errors='coerce')
5. Unstructured STARS Column¶
The STARS column contains mixed-format strings such as:
"Director: Augustine Frizzell | Stars: Shailene Woodley, Joe Alwyn, Wendy Nottingham, Felicity Jones"
Or,
"Stars: Chase Stokes, Madelyn Cline, Madison Bailey, Jonathan Daviss"
Some issues in above format are:
- Director and actors are mentioned in single string.
- The format is inconsistent, sometimes including Director.
For meaningful analysis, this structure is not ideal. Therefore, we must decompose this column into structured features.
Extracting Director Column¶
If the string contains "Director:", we will extract the director’s name else insert NA. This gives us a clean Director column suitable for:
- Grouping by director
- Counting titles per director
- Average rating per director
movies_star = movies_year.copy()
movies_star['Director'] = movies_star['STARS'].str.extract(r'Director:\s*([^|]+)')
movies_star['Director'] = movies_star['Director'].str.strip()
movies_star.loc[movies_star['Director'].isna(), 'Director'] = 'NA'
Extracting Stars as Column¶
From the "Stars:" portion of the string, we can extract list of Stars in the movie or show, creating a structured Stars column containing a list/array of actor names.
This transformation allows:
- Grouping by Actor
- Studying most frequent actors
- Examining actor-rating relationships
# Removing director
movies_star['Stars_Clean'] = movies_star['STARS'].str.replace(r'Director:.*?\|', '', regex=True)
# Removing 'Stars:'
movies_star['Stars_Clean'] = movies_star['Stars_Clean'].str.replace('Stars:', '', regex=False).str.strip()
movies_star['Stars_List'] = movies_star['Stars_Clean'].str.split(',')
# Stripping whitespace
movies_star['Stars_List'] = movies_star['Stars_List'].apply(
lambda x: [actor.strip() for actor in x] if isinstance(x, list) else x
)
movies_star.drop(columns=['STARS','Stars_Clean'], inplace=True)
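The decomposition can be tried on a couple of sample strings before running it on the full column. The sketch below is a variation rather than the notebook's exact code: it assumes, hypothetically, that plural or singular label variants such as `Directors:` or `Star:` may occur, and it leaves a missing director as NaN instead of the string 'NA'.

```python
import pandas as pd

# Sample strings in the cleaned STARS format (shortened).
stars = pd.Series([
    "Director:Augustine Frizzell| Stars:Shailene Woodley, Joe Alwyn",
    "Stars:Chase Stokes, Madelyn Cline, Madison Bailey",
])

# Everything after 'Director:' (or 'Directors:') up to the '|' separator.
director = stars.str.extract(r"Directors?:\s*([^|]+)")[0].str.strip()

# Drop the director part and the 'Stars:' label, then split on commas.
cast = (
    stars.str.replace(r"Directors?:.*?\|", "", regex=True)
         .str.replace(r"Stars?:", "", regex=True)
         .str.strip()
         .str.split(r"\s*,\s*", regex=True)
)

star_count = cast.str.len()
print(director.tolist(), cast.tolist(), star_count.tolist())
```

Keeping NaN for titles without a director means `groupby` on the director column skips them by default, whereas a literal 'NA' string would form its own group.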
Creating Star Count Feature¶
Once we have a clean list of actors, we compute the star count per title.
This engineered feature enables analysis such as:
- Do titles with larger casts receive higher ratings?
- Are movies associated with different cast sizes than series?
- Is there any relationship between star count and votes?
movies_star['Star_Count'] = movies_star['Stars_List'].apply(
lambda x: len(x) if isinstance(x, list) else 0
)
movies_star.head()
| MOVIES | GENRE | RATING | VOTES | RunTime | Gross | Start_Year | End_Year | Is_Series | Is_Ongoing | Director | Stars_List | Star_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Blood Red Sky | Action, Horror, Thriller | 6.1 | 21062.0 | 121.0 | NaN | 2021.0 | 2021.0 | False | False | Peter Thorwarth | [Peri Baumeister, Carl Anton Koch, Alexander S... | 4 |
| 1 | Masters of the Universe: Revelation | Animation, Action, Adventure | 5.0 | 17870.0 | 25.0 | NaN | 2021.0 | 2021.0 | True | True | NA | [Chris Wood, Sarah Michelle Gellar, Lena Heade... | 4 |
| 2 | The Walking Dead | Drama, Horror, Thriller | 8.2 | 885805.0 | 44.0 | NaN | 2010.0 | 2022.0 | True | False | NA | [Andrew Lincoln, Norman Reedus, Melissa McBrid... | 4 |
| 3 | Rick and Morty | Animation, Adventure, Comedy | 9.2 | 414849.0 | 23.0 | NaN | 2013.0 | 2013.0 | True | True | NA | [Justin Roiland, Chris Parnell, Spencer Gramme... | 4 |
| 4 | Army of Thieves | Action, Crime, Horror | NaN | NaN | NaN | NaN | 2021.0 | 2021.0 | False | False | Matthias Schweighöfer | [Matthias Schweighöfer, Nathalie Emmanuel, Rub... | 4 |
6. Restructuring GENRE Column¶
Now that we have cleaned newline characters from the GENRE column, we encounter another structural issue: genres are stored as a comma-separated string.
Such as,
"Action, Horror, Thriller"
"Drama, Romance"
"Animation, Action, Adventure"
While readable, this format is not ideal for analysis. In its current form:
- We cannot easily count how many titles belong to each genre.
- We cannot compute average rating per genre.
- Multi-genre titles remain compressed into a single cell.
For platforms like Netflix multi-label categorical variables are common. However, data cleaning principles require each categorical value to be represented in a structured way.
Let's convert GENRE column into a List
movies_star['GENRE'] = movies_star['GENRE'].str.split(',')
# whitespace
movies_star['GENRE'] = movies_star['GENRE'].apply(
lambda x: [genre.strip() for genre in x] if isinstance(x, list) else x
)
For genre-level analysis, we need a dataset where:
- Each row represents one title–genre pair
- Multi-genre titles appear multiple times (once per genre)
This process is called exploding the dataset. Let's do this on a different DataFrame meant for genre-level analysis only.
movies_genre = movies_star.copy()
movies_genre = movies_genre.explode('GENRE')
movies_genre.head()
| MOVIES | GENRE | RATING | VOTES | RunTime | Gross | Start_Year | End_Year | Is_Series | Is_Ongoing | Director | Stars_List | Star_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Blood Red Sky | Action | 6.1 | 21062.0 | 121.0 | NaN | 2021.0 | 2021.0 | False | False | Peter Thorwarth | [Peri Baumeister, Carl Anton Koch, Alexander S... | 4 |
| 0 | Blood Red Sky | Horror | 6.1 | 21062.0 | 121.0 | NaN | 2021.0 | 2021.0 | False | False | Peter Thorwarth | [Peri Baumeister, Carl Anton Koch, Alexander S... | 4 |
| 0 | Blood Red Sky | Thriller | 6.1 | 21062.0 | 121.0 | NaN | 2021.0 | 2021.0 | False | False | Peter Thorwarth | [Peri Baumeister, Carl Anton Koch, Alexander S... | 4 |
| 1 | Masters of the Universe: Revelation | Animation | 5.0 | 17870.0 | 25.0 | NaN | 2021.0 | 2021.0 | True | True | NA | [Chris Wood, Sarah Michelle Gellar, Lena Heade... | 4 |
| 1 | Masters of the Universe: Revelation | Action | 5.0 | 17870.0 | 25.0 | NaN | 2021.0 | 2021.0 | True | True | NA | [Chris Wood, Sarah Michelle Gellar, Lena Heade... | 4 |
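Once the frame is exploded, per-genre statistics reduce to an ordinary `groupby`. A minimal sketch on hypothetical toy data (titles A, B, C and their ratings are invented for illustration):

```python
import pandas as pd

# Hypothetical titles with list-valued genres.
titles = pd.DataFrame({
    "MOVIES": ["A", "B", "C"],
    "GENRE": [["Action", "Horror"], ["Drama"], ["Action", "Drama"]],
    "RATING": [6.0, 8.0, 7.0],
})

# One row per title-genre pair.
exploded = titles.explode("GENRE")

# Count of titles and average rating for each genre.
per_genre = (
    exploded.groupby("GENRE")["RATING"]
            .agg(["count", "mean"])
            .sort_values("count", ascending=False)
)
print(per_genre)
```

Because multi-genre titles are counted once per genre, the counts sum to more than the number of titles, so this frame should only be used for genre-level questions.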
import seaborn as sns
import matplotlib.pyplot as plt
genre_counts = movies_genre['GENRE'].value_counts().reset_index()
genre_counts.columns = ['Genre', 'Count']
plt.figure(figsize=(10,6))
sns.barplot(
    data=genre_counts,
    x='Genre',
    y='Count',
    hue='Genre',      # assigning hue avoids the seaborn palette deprecation warning
    legend=False,
    palette='Set2'
)
plt.xticks(rotation=45)
plt.title('Frequency of Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
7. Missing Values¶
After tidying and restructuring the dataset, the next critical step in the data analysis pipeline is assessing missing values.
Missing data is not just a technical inconvenience; it can introduce bias, distort statistical summaries, and weaken model performance if not handled thoughtfully.
movies_star.isnull().sum()
| 0 | |
|---|---|
| MOVIES | 0 |
| GENRE | 78 |
| RATING | 1400 |
| VOTES | 1400 |
| RunTime | 2560 |
| Gross | 9108 |
| Start_Year | 1579 |
| End_Year | 1574 |
| Is_Series | 0 |
| Is_Ongoing | 0 |
| Director | 0 |
| Stars_List | 0 |
| Star_Count | 0 |
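Raw counts are easier to judge as a share of all rows, and it can also help to check whether two columns are missing on the same rows. A small sketch on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries.
toy = pd.DataFrame({
    "RATING": [6.1, np.nan, 8.2, 9.2],
    "VOTES": [21062.0, np.nan, 885805.0, 414849.0],
    "Gross": [np.nan, np.nan, np.nan, 1.5],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

# True only if RATING and VOTES are missing on exactly the same rows.
missing_together = bool((toy["RATING"].isna() == toy["VOTES"].isna()).all())

print(missing_share)
print(missing_together)
```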
Reasons for Missingness¶
Instead of randomly filling or dropping values, we must first understand why they are missing.
GENRE (78)
This is relatively small compared to the dataset size. Possible causes:
Incomplete IMDb tagging
Since genre is critical for our analysis, we may:
- Drop rows if proportion is very small
- label them as "Unknown"
movies_star['GENRE'] = movies_star['GENRE'].fillna('Unknown')
RATING and VOTES (1400 each)
These two columns are missing together, which suggests the titles may not yet have sufficient user ratings, for example recently released or low-engagement titles.
Since the dataset is already missing most of its revenue data and RATING is critical to the primary analytical goal, we will drop the rows where it is missing, as they will not be useful for this project.
movies_star = movies_star.dropna(subset=['RATING', 'VOTES'])
RunTime (2560)
Missing runtime could occur because:
- TV shows may not have consistent episode durations listed
- Data scraping inconsistencies
- Older or limited-release titles
For runtime analysis, we will impute missing movie runtimes with the mean runtime of movies, and missing series runtimes with the mean runtime of series.
movies_star.groupby('Is_Series')['RunTime'].mean()
| RunTime | |
|---|---|
| Is_Series | |
| False | 90.492119 |
| True | 39.310864 |
Movies have an average RunTime of ~90.49 minutes and series ~39.31 minutes (per episode). Using mean imputation:
movies_star.loc[(movies_star['Is_Series'] == False) & (movies_star['RunTime'].isna()), 'RunTime'] = 90.492119 # Movies
movies_star.loc[(movies_star['Is_Series'] == True) & (movies_star['RunTime'].isna()), 'RunTime'] = 39.310864 # Shows
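The same group-wise imputation can be written without hard-coding the means, so it stays correct if the data changes. A minimal sketch on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing runtime per content type.
toy = pd.DataFrame({
    "Is_Series": [False, False, False, True, True, True],
    "RunTime": [100.0, 80.0, np.nan, 40.0, 30.0, np.nan],
})

# transform('mean') returns each row's own group mean, aligned to the frame.
group_mean = toy.groupby("Is_Series")["RunTime"].transform("mean")

# Fill only the missing entries with the mean of their group.
toy["RunTime"] = toy["RunTime"].fillna(group_mean)

print(toy)
```

Mean imputation leaves each group's average unchanged but shrinks its variance, which is worth keeping in mind when reading runtime distributions later.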
Gross (9108)
This is the most significant case of missingness. The majority of titles lack gross revenue because the data source does not list it for most entries.
Given the extremely high missing proportion, Gross cannot be used as a primary analytical variable. Financial analysis using this column would be highly biased, so we exclude it from the analysis.
movies_clean = movies_star.copy()
movies_clean.drop(columns=['Gross'], inplace=True)
Start_Year and End_Year (1579 and 1574)
These were engineered from the YEAR column. In some cases, both Start_Year and End_Year are missing, likely due to missing values in the original YEAR column.
Rows where both values are missing can be explicitly marked as 0 (instead of remaining NaN) by updating them conditionally.
This preserves rows while clearly flagging incomplete temporal information.
# Condition mask
mask = movies_clean['Start_Year'].isna() & movies_clean['End_Year'].isna()
movies_clean.loc[mask, ['Start_Year', 'End_Year']] = 0
# Condition: Start_Year is NA and End_Year is not NA
mask = movies_clean['Start_Year'].isna() & movies_clean['End_Year'].notna()
movies_clean.loc[mask, 'Start_Year'] = movies_clean.loc[mask, 'End_Year']
We have removed or flagged all missing values and now have a clean dataset.
Important to note:
- Rows missing both Start_Year and End_Year have been marked as 0 (at most 1574 rows, going by the End_Year count in the table above).
- The 78 rows missing GENRE have been marked as 'Unknown'.
8. Outliers¶
After handling missing values, the next critical step in data preprocessing is identifying and treating outliers.
Outliers are observations that deviate significantly from the majority of the data. They may arise due to:
- Data entry errors
- Measurement errors
- Experimental anomalies
- Natural but rare extreme values
- Genuine variability in the population
Outliers are not inherently “bad.” However, if left unchecked, they can:
- Skew mean and standard deviation
- Distort statistical inference
- Negatively affect distance-based models like KNN.
At the same time, removing outliers blindly may eliminate valuable information—especially in domains like finance, healthcare, or fraud detection where rare events are meaningful.
plt.figure(figsize=(6, 4))
movies_clean[['Start_Year', 'End_Year']].boxplot()
plt.xticks(rotation=45)
plt.title("Start & End Year Outlier Detection")
plt.show()
No unexpected values exist in the Start_Year and End_Year columns. Removing outliers in VOTES and RunTime would hide valuable exceptional cases.
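Since the extreme values are kept, one option is to flag them instead of removing them. The sketch below uses the common 1.5 × IQR rule on hypothetical runtime values; the rule and its threshold are an assumption for illustration, not something this notebook applies.

```python
import pandas as pd

# Hypothetical runtimes in minutes, including one extreme value.
runtime = pd.Series([23.0, 25.0, 44.0, 50.0, 53.0, 110.0, 114.0, 121.0, 853.0])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean flag that can be stored as a column; no rows are dropped.
is_outlier = (runtime < lower) | (runtime > upper)
flagged = runtime[is_outlier]

print(lower, upper, flagged.tolist())
```

A flag column keeps the exceptional titles available for inspection while still allowing summaries to be computed with and without them.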
Visualization¶
Dataset¶
Movies: 53.6% Series: 46.4%
This indicates the dataset is fairly balanced between Movies and Series, with a slight dominance of movies.
- The dataset represents both formats well.
- Comparative analysis (Movies vs Series) is meaningful because neither class overwhelmingly dominates.
- Slight bias toward movies may slightly influence overall averages.
Average Runtime: Movies vs Series¶
This visualization compares the average runtime of movies and series. Runtime behavior is structurally different for both formats, so this helps validate content-type differences.
Movie average runtime ≈ 90 minutes
Series average runtime ≈ 38 minutes
This shows movies are more than twice as long as individual series episodes.
Movies are long-form content (~1.5 hours average).
Series episodes are shorter (~40 minutes), consistent with episodic structure.
Runtime is strongly dependent on content type.
Runtime can be a predictive feature for distinguishing movies from series.
Average Rating: Movies vs Series¶
This compares the average audience rating between Movies and Series. It helps determine which format tends to receive higher audience appreciation.
avg_rating = movies_clean.groupby("Is_Series")["RATING"].mean()
avg_rating.index = avg_rating.index.map({True: 'Series', False: 'Movie'})
plt.figure(figsize=(6,4))
ax = sns.barplot(x=avg_rating.index, y=avg_rating.values)
plt.title("Average Rating: Movies vs Series")
plt.ylabel("Average Rating")
plt.xlabel("Content Type")
for i, v in enumerate(avg_rating.values):
ax.text(i, v + 0.05, f"{v:.2f}", ha='center')
plt.show()
Result
Movies: 6.5
Series: 7.4
We can see that series have noticeably higher average ratings than movies.
Series tend to be rated more favorably.
Possible reasons:
Longer character development
More audience engagement over time
Rating bias (only successful series survive multiple seasons)
Content type influences rating behavior.
Ongoing vs Completed Series¶
This compares the number of ongoing series versus completed series.
series_df = movies_clean[movies_clean['Is_Series'] == True]
ongoing_counts = series_df['Is_Ongoing'].value_counts()
ongoing_counts.index = ongoing_counts.index.map({True: 'Ongoing', False: 'Completed'})
plt.figure(figsize=(7,5))
ax = sns.barplot(x=ongoing_counts.index,
y=ongoing_counts.values,
hue=ongoing_counts.index,
legend=False,
palette="Set3")
plt.title("Ongoing vs Completed Series")
plt.xlabel("Series Status")
plt.ylabel("Count")
for i, value in enumerate(ongoing_counts.values):
ax.text(i,
value + 1,
str(value),
ha='center',
va='bottom')
plt.show()
Result
Ongoing: 2444
Completed: 1350
There are significantly more ongoing series than completed ones.
The dataset is skewed toward currently running content.
Modern streaming platforms produce long-running series.
Ongoing shows may have:
Inflated ratings due to current hype
Incomplete lifecycle data
Lifecycle status (Is_Ongoing) can influence ratings and vote patterns.
Distribution of Audience Rating¶
This visualization represents the frequency distribution of ratings across all titles. It helps us understand:
Central tendency (where most ratings lie)
Spread (variability of ratings)
Skewness (bias toward high or low ratings)
Presence of outliers (extremely high/low-rated titles)
It also reveals how audiences generally evaluate content in this dataset
plt.figure()
sns.histplot(movies_clean["RATING"].dropna(), bins=20, kde=True)
plt.title("Distribution of Audience Ratings")
plt.xlabel("Rating")
plt.ylabel("Count of Titles")
plt.show()
Result
Majority of ratings lie between 6.5 and 8
Peak around 7.5
Interpretation
- Central Tendency
The clustering around 7–7.5 indicates that most content is perceived as above average but not exceptional.
IMDb-style rating systems typically show:
5 = average
6–7 = good
8+ = excellent
This dataset suggests most content is rated as “good”.
Popularity vs Rating¶
This plot analyzes the relationship between Votes and audience rating.
It helps determine:
Whether higher-rated content attracts more votes
Whether titles with fewer votes are rated less critically
plt.figure(figsize=(8,6))
sns.scatterplot(data=movies_clean, x="VOTES", y="RATING")  # movies_clean holds numeric VOTES
plt.xscale("log")
plt.title("Rating vs Votes")
plt.show()
correlation = movies_clean[["RATING", "VOTES"]].corr()
print("Correlation between Rating and Votes:")
print(correlation)
Correlation between Rating and Votes:
RATING VOTES
RATING 1.000000 0.103792
VOTES 0.103792 1.000000
This indicates a very weak positive relationship between RATING and VOTES (r ≈ 0.10), suggesting only a slight tendency for titles with more votes to be rated higher.
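Vote counts are heavily skewed, and the Pearson coefficient only measures linear association. As a hedged cross-check, a rank-based (Spearman) correlation can be computed as the Pearson correlation of the ranks. The toy values below are hypothetical and chosen so that rating rises steadily with votes:

```python
import pandas as pd

# Hypothetical toy data: a monotonic but strongly non-linear relationship.
toy = pd.DataFrame({
    "VOTES": [10, 100, 1_000, 10_000, 1_000_000],
    "RATING": [5.0, 6.0, 6.5, 7.0, 7.2],
})

# Pearson on the raw values is pulled down by the one huge vote count.
pearson = toy["VOTES"].corr(toy["RATING"])

# Spearman correlation is Pearson correlation computed on the ranks.
spearman = toy["VOTES"].rank().corr(toy["RATING"].rank())

print(round(pearson, 3), round(spearman, 3))
```

If the two coefficients differ a lot on the real data, the relationship is probably monotonic but not linear, and the log-scaled scatter plot is the better guide.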
Runtime vs Rating¶
This plot analyzes the relationship between content length and audience rating.
It helps determine:
Whether longer content is perceived as higher quality
Whether shorter content underperforms
Whether runtime impacts audience satisfaction
corr_runtime_rating = movies_clean[["RunTime", "RATING"]].corr()
corr_runtime_rating
| RunTime | RATING | |
|---|---|---|
| RunTime | 1.000000 | -0.215801 |
| RATING | -0.215801 | 1.000000 |
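This indicates a weak negative correlation between RunTime and RATING (r ≈ -0.22), suggesting that longer titles tend to receive slightly lower ratings. Part of this may reflect content type rather than length itself, since series episodes are shorter and series are rated higher on average.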
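One way to check whether content type drives the overall coefficient is to compute the correlation separately for movies and series. The toy data below is hypothetical and deliberately constructed so that the pooled and within-group results disagree:

```python
import pandas as pd

# Hypothetical toy data: within each type, longer titles are rated higher,
# but series are both shorter and rated higher than movies.
toy = pd.DataFrame({
    "Is_Series": [False] * 4 + [True] * 4,
    "RunTime": [90.0, 100.0, 110.0, 120.0, 20.0, 30.0, 40.0, 50.0],
    "RATING": [6.0, 6.2, 6.4, 6.6, 7.0, 7.2, 7.4, 7.6],
})

# Pooled correlation across all titles.
overall = toy["RunTime"].corr(toy["RATING"])

# Correlation within each content type.
within = {
    bool(is_series): group["RunTime"].corr(group["RATING"])
    for is_series, group in toy.groupby("Is_Series")
}

print(round(overall, 3), within)
```

Here the pooled coefficient is negative even though both groups show a positive relationship, so a pooled value alone cannot show that length lowers ratings.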
top_runtime_records = movies_clean.sort_values(by='RunTime', ascending=False).head(1)
top_runtime_records
| MOVIES | GENRE | RATING | VOTES | RunTime | Start_Year | End_Year | Is_Series | Is_Ongoing | Director | Stars_List | Star_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1902 | El tiempo entre costuras | [Adventure, Drama, History] | 8.3 | 3876.0 | 853.0 | 2013.0 | 2014.0 | True | False | NA | [Adriana Ugarte, Mari Carmen Sánchez, Tristán ... | 4 |
Many outliers represent genuinely long titles, so it would not be ideal to drop them.
movies_clean[movies_clean['Is_Series'] == True].head()
| MOVIES | GENRE | RATING | VOTES | RunTime | Start_Year | End_Year | Is_Series | Is_Ongoing | Director | Stars_List | Star_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Masters of the Universe: Revelation | [Animation, Action, Adventure] | 5.0 | 17870.0 | 25.0 | 2021.0 | 2021.0 | True | True | NA | [Chris Wood, Sarah Michelle Gellar, Lena Heade... | 4 |
| 2 | The Walking Dead | [Drama, Horror, Thriller] | 8.2 | 885805.0 | 44.0 | 2010.0 | 2022.0 | True | False | NA | [Andrew Lincoln, Norman Reedus, Melissa McBrid... | 4 |
| 3 | Rick and Morty | [Animation, Adventure, Comedy] | 9.2 | 414849.0 | 23.0 | 2013.0 | 2013.0 | True | True | NA | [Justin Roiland, Chris Parnell, Spencer Gramme... | 4 |
| 5 | Outer Banks | [Action, Crime, Drama] | 7.6 | 25858.0 | 50.0 | 2020.0 | 2020.0 | True | True | NA | [Chase Stokes, Madelyn Cline, Madison Bailey, ... | 4 |
| 7 | Dexter | [Crime, Drama, Mystery] | 8.6 | 665387.0 | 53.0 | 2006.0 | 2013.0 | True | False | NA | [Michael C. Hall, Jennifer Carpenter, David Za... | 4 |
Masters of the Universe: Revelation https://www.imdb.com/title/tt10826054/
The Walking Dead https://www.imdb.com/title/tt1520211/
Rick and Morty https://www.imdb.com/title/tt2861424/
movies_clean.head(10)
| MOVIES | GENRE | RATING | VOTES | RunTime | Start_Year | End_Year | Is_Series | Is_Ongoing | Director | Stars_List | Star_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Blood Red Sky | [Action, Horror, Thriller] | 6.1 | 21062.0 | 121.0 | 2021.0 | 2021.0 | False | False | Peter Thorwarth | [Peri Baumeister, Carl Anton Koch, Alexander S... | 4 |
| 1 | Masters of the Universe: Revelation | [Animation, Action, Adventure] | 5.0 | 17870.0 | 25.0 | 2021.0 | 2021.0 | True | True | NA | [Chris Wood, Sarah Michelle Gellar, Lena Heade... | 4 |
| 2 | The Walking Dead | [Drama, Horror, Thriller] | 8.2 | 885805.0 | 44.0 | 2010.0 | 2022.0 | True | False | NA | [Andrew Lincoln, Norman Reedus, Melissa McBrid... | 4 |
| 3 | Rick and Morty | [Animation, Adventure, Comedy] | 9.2 | 414849.0 | 23.0 | 2013.0 | 2013.0 | True | True | NA | [Justin Roiland, Chris Parnell, Spencer Gramme... | 4 |
| 5 | Outer Banks | [Action, Crime, Drama] | 7.6 | 25858.0 | 50.0 | 2020.0 | 2020.0 | True | True | NA | [Chase Stokes, Madelyn Cline, Madison Bailey, ... | 4 |
| 6 | The Last Letter from Your Lover | [Drama, Romance] | 6.8 | 5283.0 | 110.0 | 2021.0 | 2021.0 | False | False | Augustine Frizzell | [Shailene Woodley, Joe Alwyn, Wendy Nottingham... | 4 |
| 7 | Dexter | [Crime, Drama, Mystery] | 8.6 | 665387.0 | 53.0 | 2006.0 | 2013.0 | True | False | NA | [Michael C. Hall, Jennifer Carpenter, David Za... | 4 |
| 8 | Never Have I Ever | [Comedy] | 7.9 | 34530.0 | 30.0 | 2020.0 | 2020.0 | True | True | NA | [Maitreyi Ramakrishnan, Poorna Jagannathan, Da... | 4 |
| 9 | Virgin River | [Drama, Romance] | 7.4 | 27279.0 | 44.0 | 2019.0 | 2019.0 | True | True | NA | [Alexandra Breckenridge, Martin Henderson, Col... | 4 |
| 10 | Gunpowder Milkshake | [Action, Adventure, Thriller] | 6.0 | 17989.0 | 114.0 | 2021.0 | 2021.0 | False | False | Navot Papushado | [Karen Gillan, Lena Headey, Carla Gugino, Mich... | 4 |
decade_avg = (
movies_clean
.loc[movies_clean["Start_Year"] != 0]
.assign(Decade=(movies_clean["Start_Year"] // 10) * 10)
.groupby("Decade", as_index=False)["RATING"]
.mean()
.sort_values("Decade")
)
plt.figure(figsize=(10, 6))
sns.lineplot(data=decade_avg, x="Decade", y="RATING", marker="o")
plt.title("Average Rating by Decade")
plt.xlabel("Decade")
plt.ylabel("Average Rating")
plt.ylim(5.5, 7.5)
plt.tight_layout()
plt.show()
Titles that started in the 2000s appear to have the highest average rating, and ratings generally trend upward over time.
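A decade average can rest on only a handful of titles, so it may help to look at how many titles fall into each bucket before reading too much into the trend. The sketch below uses made-up start years (not values from the dataset) and a hypothetical helper, `to_decade`, to show how the floor-division bucketing works and how bucket sizes could be counted.

```python
from collections import Counter

# Hypothetical start years, for illustration only
start_years = [1994.0, 2006.0, 2010.0, 2013.0, 2019.0, 2020.0, 2021.0, 2021.0]

def to_decade(year):
    """Map a year to the first year of its decade, e.g. 2013 -> 2010."""
    return int(year // 10) * 10

# Count how many titles land in each decade bucket
decade_counts = Counter(to_decade(y) for y in start_years)
for decade in sorted(decade_counts):
    print(decade, decade_counts[decade])
```

Decades with very few titles could then be flagged or dropped before plotting the average.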
year_count = movies_clean["Start_Year"].loc[movies_clean["Start_Year"] != 0].value_counts().sort_index()
plt.figure(figsize=(10,6))
year_count.plot()
plt.title("Movies over Years")
plt.show()
year_count = movies_clean["Start_Year"].loc[movies_clean["Start_Year"] >1995].value_counts().sort_index()
plt.figure(figsize=(10,6))
year_count.plot()
plt.title("Number of Movies Featured")
plt.show()
Insights¶
Movies vs Series¶
df = movies_clean.copy()
df["Content_Type"] = df["Is_Series"].map({
False: "Movies",
True: "Series"
})
performance = df.groupby("Content_Type").agg({
"RATING":"mean",
"VOTES":"mean"
}).reset_index()
plt.figure(figsize=(8,5))
sns.barplot(data=performance, x="Content_Type", y="VOTES")
plt.title("Average Votes: Movies vs Series")
plt.xlabel("Content Type")
plt.ylabel("Average Votes")
plt.show()
performance
| | Content_Type | RATING | VOTES |
|---|---|---|---|
| 0 | Movies | 6.489209 | 18840.200046 |
| 1 | Series | 7.415999 | 10883.643121 |
Observation 1: Series Are Rated Higher on Average
In this dataset, series average a rating of 7.42 versus 6.49 for movies, a gap of about 0.93 points. This suggests that long-form storytelling may foster deeper emotional engagement and stronger viewer satisfaction.
Why might this happen?
- More time for character development
- Multi-episode narrative arcs
- Stronger fan communities and viewer loyalty across seasons
- Possibly stronger appeal to younger audiences
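Whether a gap like this is statistically meaningful, rather than noise, could be checked with a permutation test. The sketch below is one possible approach, not something the notebook runs: `permutation_p_value` is a hypothetical helper, and the ratings are made-up values rather than rows from the dataset.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Share of random relabelings whose mean gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # Split the shuffled pool into two groups of the original sizes
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical ratings, not taken from the dataset
series_ratings = [7.8, 8.1, 7.2, 7.5, 8.4, 6.9]
movie_ratings = [6.2, 6.7, 5.9, 6.5, 7.1, 6.0]

p = permutation_p_value(series_ratings, movie_ratings)
print(f"p-value: {p:.4f}")
```

A small p-value would indicate that a gap this large rarely appears when the series/movie labels are assigned at random. On the real data, the two inputs would be the `RATING` values for each `Content_Type`.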
Observation 2: Movies Generate More Votes on Average
Movies receive about 73% more votes per title on average (roughly 18,840 versus 10,884).
Despite lower ratings, movies attract higher average vote counts, indicating broader reach but potentially shallower audience engagement per title.
This suggests:
- Movies have wider reach.
- They are easier to consume (single sitting).
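The 73% figure follows directly from the two averages in the `performance` table. A short worked calculation, with the values copied from that table:

```python
# Average votes per title, copied from the performance table
avg_votes_movies = 18840.200046
avg_votes_series = 10883.643121

# Relative gap: how much larger the movie average is than the series average
relative_gap = avg_votes_movies / avg_votes_series - 1
print(f"Movies receive {relative_gap:.0%} more votes on average")
```

Because a few very popular titles can pull a mean upward, comparing medians as well could be a useful robustness check.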
Genres¶
By Popularity¶
def top_10_views():
    # One row per (title, genre) pair, so a multi-genre title counts toward each of its genres
    df_exploded = df.explode("GENRE")
    genre_votes_sum = (
        df_exploded
        .groupby("GENRE")
        .agg({
            "VOTES": "sum",
            "RATING": "mean",
            "MOVIES": "count"
        })
        .sort_values("VOTES", ascending=False)
    )
    top_genres_votes = genre_votes_sum.head(10)
    plt.figure(figsize=(9,6))
    sns.barplot(
        x=top_genres_votes["VOTES"],
        y=top_genres_votes.index,
        hue=top_genres_votes.index,  # assigning hue avoids the seaborn palette FutureWarning
        palette="Set2",
        legend=False
    )
    plt.title("Top 10 Genres by Total Audience Engagement")
    plt.xlabel("Total Votes")
    plt.ylabel("Genre")
    plt.show()
    # Expose the table so the next line of the cell can display it
    global x
    x = top_genres_votes
top_10_views()
x
| GENRE | VOTES | RATING | MOVIES |
|---|---|---|---|
| Drama | 70167787.0 | 7.091197 | 3499 |
| Action | 43766718.0 | 7.097396 | 1843 |
| Adventure | 35370287.0 | 7.295955 | 1335 |
| Crime | 33788787.0 | 7.080233 | 1376 |
| Comedy | 31948275.0 | 6.825589 | 2419 |
| Thriller | 19654895.0 | 6.333719 | 777 |
| Mystery | 13553679.0 | 7.089417 | 737 |
| Animation | 12697131.0 | 7.377619 | 1403 |
| Sci-Fi | 12224640.0 | 6.582593 | 270 |
| Fantasy | 11527266.0 | 6.996264 | 455 |
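Because `explode("GENRE")` produces one row per (title, genre) pair, a title with several genres contributes its full vote count to each of them. The toy example below, built on a made-up two-row frame rather than the real data, sketches that effect:

```python
import pandas as pd

# Hypothetical mini-dataset, for illustration only
toy = pd.DataFrame({
    "MOVIES": ["Title A", "Title B"],
    "GENRE": [["Action", "Drama"], ["Drama"]],
    "VOTES": [100, 50],
})

# One row per (title, genre) pair
exploded = toy.explode("GENRE")
votes_by_genre = exploded.groupby("GENRE")["VOTES"].sum()
print(votes_by_genre)

# Title A's votes are counted under both Action and Drama,
# so the per-genre totals add up to more than the real total
print(votes_by_genre.sum(), "vs", toy["VOTES"].sum())
```

The per-genre totals are therefore best read as "votes on titles tagged with this genre", not as shares of a single overall total.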
def bottom_10_views():
    # Build the exploded frame here instead of relying on a variable local to another function
    df_exploded = df.explode("GENRE")
    genre_votes_sum = (
        df_exploded
        .groupby("GENRE")
        .agg({
            "VOTES": "sum",
            "RATING": "mean",
            "MOVIES": "count"
        })
        .sort_values("VOTES", ascending=True)
    )
    # Skip the lowest row, the "Unknown" placeholder genre, and keep the next 10
    bottom_genres_votes = genre_votes_sum.head(11).iloc[1:]
    plt.figure(figsize=(9,6))
    sns.barplot(
        x=bottom_genres_votes["VOTES"],
        y=bottom_genres_votes.index,
        hue=bottom_genres_votes.index,  # assigning hue avoids the seaborn palette FutureWarning
        palette="Set2",
        legend=False
    )
    plt.title("Bottom 10 Genres by Total Audience Engagement")
    plt.xlabel("Total Votes")
    plt.ylabel("Genre")
    plt.show()
    # Expose the table so the next line of the cell can display it
    global x
    x = bottom_genres_votes
bottom_10_views()
x
| GENRE | VOTES | RATING | MOVIES |
|---|---|---|---|
| News | 29672.0 | 7.061111 | 18 |
| Talk-Show | 36987.0 | 6.960870 | 23 |
| Game-Show | 77194.0 | 6.369792 | 96 |
| Film-Noir | 154485.0 | 7.016667 | 12 |
| Western | 258652.0 | 6.720000 | 20 |
| Reality-TV | 432936.0 | 6.626437 | 348 |
| Short | 449292.0 | 6.747753 | 178 |
| Sport | 897016.0 | 6.852500 | 160 |
| War | 1009873.0 | 6.986667 | 45 |
| Musical | 1109089.0 | 7.001923 | 52 |
By Average Rating¶
def top_10_rating():
    # One row per (title, genre) pair
    df_exploded = df.explode("GENRE")
    genre_perf_rating = (
        df_exploded
        .groupby("GENRE")
        .agg({
            "RATING": "mean",
            "VOTES": "mean"
        })
        .sort_values("RATING", ascending=False)
    )
    top_genres_rating = genre_perf_rating.head(10)
    plt.figure(figsize=(8,6))
    sns.barplot(
        x=top_genres_rating["RATING"],
        y=top_genres_rating.index,
        hue=top_genres_rating.index,  # assigning hue avoids the seaborn palette FutureWarning
        palette="Set2",
        legend=False
    )
    plt.title("Top 10 Genres by Average Rating")
    plt.xlabel("Average Rating")
    plt.ylabel("Genre")
    plt.xlim(7, 7.5)  # zoomed axis: bar lengths exaggerate small differences
    plt.show()
    # Expose the table so the next line of the cell can display it
    global x
    x = top_genres_rating
top_10_rating()
x
| GENRE | RATING | VOTES |
|---|---|---|
| Animation | 7.377619 | 9049.986458 |
| History | 7.326172 | 9132.699219 |
| Adventure | 7.295955 | 26494.597004 |
| Documentary | 7.175704 | 2576.133803 |
| Action | 7.097396 | 23747.540966 |
| Drama | 7.091197 | 20053.668763 |
| Mystery | 7.089417 | 18390.337856 |
| Crime | 7.080233 | 24555.804506 |
| News | 7.061111 | 1648.444444 |
| Biography | 7.054104 | 23733.220149 |
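Some genres in this ranking rest on very few titles (News, for instance, has only 18 according to the bottom-10 votes table), so their averages are noisier than those of large genres. One possible safeguard, sketched below on made-up data with a hypothetical `MIN_TITLES` threshold, is to require a minimum number of titles before a genre is ranked:

```python
import pandas as pd

# Hypothetical exploded data: one row per (title, genre) pair
toy = pd.DataFrame({
    "GENRE": ["News", "Drama", "Drama", "Drama", "Comedy", "Comedy"],
    "RATING": [9.0, 7.0, 7.5, 8.0, 6.0, 7.0],
})

MIN_TITLES = 2  # hypothetical threshold; a real analysis would likely use a larger value

# Mean rating and number of titles per genre
genre_stats = toy.groupby("GENRE")["RATING"].agg(["mean", "count"])

# Keep only genres backed by enough titles, then rank by mean rating
reliable = genre_stats[genre_stats["count"] >= MIN_TITLES]
ranked = reliable.sort_values("mean", ascending=False)
print(ranked)
```

In this toy case News has the highest mean but only one title, so it drops out of the ranking.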
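def bottom_10_rating():
    # Build the exploded frame here instead of relying on a variable local to another function
    df_exploded = df.explode("GENRE")
    genre_perf_rating = (
        df_exploded
        .groupby("GENRE")
        .agg({
            "RATING": "mean",
            "VOTES": "mean"
        })
        .sort_values("RATING", ascending=True)
    )
    # 11 rows: one of them is the "Unknown" placeholder, leaving 10 named genres
    bottom_genres_rating = genre_perf_rating.head(11)
    plt.figure(figsize=(8,6))
    sns.barplot(
        x=bottom_genres_rating["RATING"],
        y=bottom_genres_rating.index,
        hue=bottom_genres_rating.index,  # assigning hue avoids the seaborn palette FutureWarning
        palette="Set2",
        legend=False
    )
    plt.title("Bottom 10 Genres by Average Rating")
    plt.xlabel("Average Rating")
    plt.ylabel("Genre")
    plt.xlim(5, 7)
    plt.show()
    # Expose the table so the next line of the cell can display it
    global x
    x = bottom_genres_rating
bottom_10_rating()
x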
| GENRE | RATING | VOTES |
|---|---|---|
| Horror | 5.860998 | 22722.299320 |
| Thriller | 6.333719 | 25295.875161 |
| Game-Show | 6.369792 | 804.104167 |
| Unknown | 6.563636 | 86.909091 |
| Sci-Fi | 6.582593 | 45276.444444 |
| Reality-TV | 6.626437 | 1244.068966 |
| Western | 6.720000 | 12932.600000 |
| Short | 6.747753 | 2524.112360 |
| Family | 6.774933 | 11522.326146 |
| Romance | 6.800786 | 14382.218873 |
| Comedy | 6.825589 | 13207.224060 |
Do Longer Series Perform Better?¶
def series_length():
    # observed=False keeps the current behavior for the categorical column
    # and silences the pandas FutureWarning
    length_perf = completed_series.groupby("Length_Category", observed=False).agg({
        "RATING": "mean",
        "VOTES": "sum"
    }).reset_index()
    sns.set_style("whitegrid")
    sns.set_palette("Set2")
    fig, axes = plt.subplots(1, 2, figsize=(12,5))
    # Rating Plot
    sns.barplot(
        data=length_perf,
        x="Length_Category",
        y="RATING",
        ax=axes[0]
    )
    axes[0].set_title("Average Rating by Series Length",
                      fontsize=13)
    axes[0].set_xlabel("Series Length")
    axes[0].set_ylabel("Average Rating")
    axes[0].set_ylim(7.4, 8.2)
    # Label each bar with its average rating
    for i, v in enumerate(length_perf["RATING"]):
        axes[0].text(i, v + 0.02, f"{v:.2f}",
                     ha='center', fontsize=11)
    # Votes Plot
    sns.barplot(
        data=length_perf,
        x="Length_Category",
        y="VOTES",
        ax=axes[1]
    )
    axes[1].set_yscale("log")
    axes[1].set_title("Total Audience Engagement by Series Length (Log Scale)",
                      fontsize=13)
    axes[1].set_xlabel("Series Length")
    axes[1].set_ylabel("Total Votes (log scale)")
    # Label each bar with its total votes in millions
    for i, v in enumerate(length_perf["VOTES"]):
        axes[1].text(i, v,
                     f"{v/1e6:.1f}M",
                     ha='center',
                     va='bottom',
                     fontsize=11)
    plt.suptitle("Impact of Series Longevity on Performance",
                 fontsize=16)
    plt.tight_layout()
    plt.show()
series_length()
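Grouping by `Length_Category` involves a categorical column, which is where the `observed` argument of `groupby` comes into play. The sketch below shows one way such a column could be built with `pd.cut` and how `observed` changes the grouped result. The bin edges, the labels, the `Years_Running` column and the sample rows are all assumptions for illustration; the notebook's actual categories may be defined differently.

```python
import pandas as pd

# Hypothetical completed series, for illustration only
toy = pd.DataFrame({
    "Start_Year": [2010, 2006, 2015, 2018],
    "End_Year": [2022, 2013, 2017, 2019],
    "RATING": [8.2, 8.6, 7.1, 6.9],
})
toy["Years_Running"] = toy["End_Year"] - toy["Start_Year"]

# Assumed bin edges and labels: up to 2 years, 3 to 5 years, 6+ years
toy["Length_Category"] = pd.cut(
    toy["Years_Running"],
    bins=[-1, 2, 5, float("inf")],
    labels=["Short", "Medium", "Long"],
)

# observed=False keeps categories that have no rows (here: "Medium")
with_empty = toy.groupby("Length_Category", observed=False)["RATING"].mean()
# observed=True drops categories that never occur in the data
only_seen = toy.groupby("Length_Category", observed=True)["RATING"].mean()
print(with_empty)
print(only_seen)
```

Passing `observed` explicitly makes the choice visible and keeps the result stable across pandas versions.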
Conclusions¶
Series tend to be rated higher by audiences (average rating of 7.42 versus 6.49 for movies)
Movies seem to reach a wider audience (about 73% more votes per title on average)
Top 5 Genres by Popularity
- Drama
- Action
- Adventure
- Crime
- Comedy
Top 5 Genres by Rating
- Animation
- History
- Adventure
- Documentary
- Action
Long series (6+ years) attracted the most votes and earned higher ratings