IMDb 10,000 Netflix Movies and Shows¶

End-to-End data analysis project by Mayank Sharma

Table of contents¶

  1. Analysis Overview
  2. Introduction
  3. Required libraries
  4. The problem domain
  5. Step 1: Answering the question
  6. Step 2: Checking the data
  7. Step 3: Tidying the data
    • String Formatting Error
    • Inconsistent Year Formats
    • One-Line Columns
    • Restructuring Genre Column
    • Missing Values
  8. Exploratory analysis
  9. Conclusions

Overview¶

In [ ]:
import warnings
warnings.filterwarnings("ignore") # For sns styling-related warnings
top_10_views()
[Figure: output of top_10_views()]
In [ ]:
bottom_10_views()
[Figure: output of bottom_10_views()]
In [ ]:
top_10_rating()
[Figure: output of top_10_rating()]
In [ ]:
bottom_10_rating()
[Figure: output of bottom_10_rating()]
In [ ]:
series_length()
[Figure: output of series_length()]

Introduction¶

In the time it takes to scroll through Netflix and choose a show, thousands of new ratings, reviews, and streaming interactions are generated across the globe. Platforms like Netflix and IMDb continuously accumulate vast amounts of data about movies and TV shows — from audience ratings and genres to release years, runtimes, and popularity trends. Hidden within this data are powerful insights about viewer preferences, industry patterns, and the evolution of entertainment over time.

As the volume of entertainment data grows, so does the importance of Data Science in transforming raw information into meaningful stories. Drawing from disciplines such as statistics, programming, and domain knowledge in media analytics, Data Science allows us to clean, structure, analyze, and interpret complex datasets in a way that supports informed decision-making.

In this project, I will analyze and clean an extracted dataset of IMDb’s Netflix Top 10,000 movies and TV shows. The goal is to walk through a complete data analysis workflow — from data cleaning and preprocessing to exploratory analysis and insight generation — demonstrating how raw entertainment data can be transformed into actionable understanding.

Link to dataset

This notebook is a structured, end-to-end analysis learning project. If you identify areas for improvement, alternative interpretations, or additional insights worth exploring, comments and PRs are welcome.

Required Libraries¶

The primary libraries used are:

  • NumPy: Provides a fast numerical array structure and helper functions.
  • pandas: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
  • scikit-learn: The essential Machine Learning package in Python.
  • matplotlib: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
  • Seaborn: Advanced statistical plotting library.
In [192]:
import kagglehub
path = kagglehub.dataset_download("bharatnatrayn/movies-dataset-for-feature-extracion-prediction")
print("Path to dataset:", path)
Using Colab cache for faster access to the 'movies-dataset-for-feature-extracion-prediction' dataset.
Path to dataset: /kaggle/input/movies-dataset-for-feature-extracion-prediction

The Problem Domain¶

Our task is exploring and extracting insights from a large dataset of the IMDb Netflix Top 10,000 movies and TV shows. The dataset contains information such as title, year, genre, rating, votes, runtime, gross earnings, cast, and brief descriptions.

We are working with data sourced from IMDb and content available on Netflix. However, the raw dataset is not analysis-ready — it contains missing values, inconsistent formatting (newline characters in genres and descriptions), mixed data types (years with variants like “2010–2022”, "2010-", "2010"), and null values in several fields. Therefore, a major component of this project involves data cleaning and preprocessing before performing meaningful analysis.

The project objective is not to build a predictive model, but rather to:

  • Clean and standardize the dataset
  • Determine useful features
  • Explore trends in ratings, genres, and release years
  • Analyze relationships between ratings, votes, runtime, and gross revenue

Step 1: Shaping Goal¶

The first step is to clearly define the problem and determine what the target is.

The type of data analytic question?

This project is primarily:

  • Exploratory – Understanding distributions of ratings, genres, runtimes, and release years.

  • Descriptive – Summarizing trends in Netflix’s most popular or highest-rated content.

The metric for success?

Since this is an exploratory data analysis (EDA) project, success will be measured by:

  • Successfully cleaning and structuring messy raw data
  • Reducing missing or inconsistent values
  • Producing clear summary statistics and visualizations
  • Identifying meaningful, data-supported insights

The context for the question and its business application?

Streaming platforms like Netflix rely heavily on data-driven decision-making. Insights derived from IMDb rating trends, vote counts, genre distributions, and runtime preferences can help inform:

  • Content acquisition strategies
  • Investment in productions
  • Marketing decisions
  • Understanding audience engagement patterns

Step 2: Checking the data¶

Let's take a look at the data we're working with.

Questions to answer:

  • Is there anything wrong with the data?
  • Are there any quirks with the data?
  • Do we need to fix or remove any of the data?

Loading dataset into a pandas dataframe:

In [193]:
import pandas as pd
movies = pd.read_csv(f"{path}/movies.csv")
movies.head()
Out[193]:
MOVIES YEAR GENRE RATING ONE-LINE STARS VOTES RunTime Gross
0 Blood Red Sky (2021) \nAction, Horror, Thriller 6.1 \nA woman with a mysterious illness is forced ... \n Director:\nPeter Thorwarth\n| \n Star... 21,062 121.0 NaN
1 Masters of the Universe: Revelation (2021– ) \nAnimation, Action, Adventure 5.0 \nThe war for Eternia begins again in what may... \n \n Stars:\nChris Wood, \nSara... 17,870 25.0 NaN
2 The Walking Dead (2010–2022) \nDrama, Horror, Thriller 8.2 \nSheriff Deputy Rick Grimes wakes up from a c... \n \n Stars:\nAndrew Lincoln, \n... 885,805 44.0 NaN
3 Rick and Morty (2013– ) \nAnimation, Adventure, Comedy 9.2 \nAn animated series that follows the exploits... \n \n Stars:\nJustin Roiland, \n... 414,849 23.0 NaN
4 Army of Thieves (2021) \nAction, Crime, Horror NaN \nA prequel, set before the events of Army of ... \n Director:\nMatthias Schweighöfer\n| \n ... NaN NaN NaN

The dataset is in a tabular format, and the first row defines the column headers:

MOVIES, YEAR, GENRE, RATING, ONE-LINE, STARS, VOTES, RunTime, Gross

which are descriptive enough to understand what each feature represents without needing documentation. Each row corresponds to a single title, either a movie or a TV show.

However, the dataset is not perfectly curated and requires careful inspection before analysis.

Next, it's always a good idea to look at the distribution of our data — especially the outliers.

Let's print out some summary statistics about the data set.

In [194]:
movies.describe()
Out[194]:
RATING RunTime
count 8179.000000 7041.000000
mean 6.921176 68.688539
std 1.220232 47.258056
min 1.100000 1.000000
25% 6.200000 36.000000
50% 7.100000 60.000000
75% 7.800000 95.000000
max 9.900000 853.000000

We can see several useful values from this table. For example, the counts show that several entries are missing in both RATING and RunTime.

Other than that, a table like this is rarely useful unless we know that the data should fall in a particular range.

It's better to visualize the data in some way. Visualization makes outliers and errors immediately stand out, whereas they might go unnoticed in a large table of numbers.

Step 3: Tidying the data¶

Now that we've identified several errors in the data set, we need to fix them before we proceed with the analysis.

Let's walk through the issues one-by-one.

1. Handling Duplicates¶

We start by checking for duplicate rows in the dataset. Removing them early saves compute and trouble later.

In [195]:
movies.duplicated().sum()
Out[195]:
np.int64(431)

There are 431 duplicate rows in the dataset, likely due to erroneous scraping. We can use DataFrame.drop_duplicates() to remove them.

In [196]:
movies = movies.drop_duplicates()
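As a minimal sketch of what these two calls do, the toy frame below (hypothetical rows, not taken from the dataset) contains one exact duplicate. `duplicated()` marks every row that repeats an earlier one, and `drop_duplicates()` keeps only the first occurrence.

```python
import pandas as pd

# Hypothetical toy rows; the second and third rows are exact duplicates.
toy = pd.DataFrame({
    "MOVIES": ["Blood Red Sky", "Dexter", "Dexter"],
    "YEAR": ["(2021)", "(2006–2013)", "(2006–2013)"],
})

# Count rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row; resetting the index is optional.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

The notebook keeps the original index labels after dropping duplicates, which is why label-based lookups such as `movies['GENRE'][0]` still work in later cells.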

2. String Formatting Error¶

Fields like 'GENRE', 'STARS', and 'ONE-LINE' have unexpected '\n' characters, possibly due to issues in scraping the data.

We can use pandas string methods to fix this.

In [197]:
movies['GENRE'][0]
Out[197]:
'\nAction, Horror, Thriller            '
In [198]:
movies['STARS'][0]
Out[198]:
'\n    Director:\nPeter Thorwarth\n| \n    Stars:\nPeri Baumeister, \nCarl Anton Koch, \nAlexander Scheer, \nKais Setti\n'
In [199]:
movies['ONE-LINE'][0]
Out[199]:
'\nA woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.'

We use the pandas string methods str.replace() to remove newlines and str.strip() to remove leading and trailing spaces:

In [200]:
movies_format = movies.copy()
cols = ['GENRE','STARS','ONE-LINE']
for col in cols:
    movies_format[col] = movies_format[col].str.replace('\n','').str.strip()
In [201]:
print(movies_format['GENRE'][0])
print(movies_format['STARS'][0])
print(movies_format['ONE-LINE'][0])
Action, Horror, Thriller
Director:Peter Thorwarth|     Stars:Peri Baumeister, Carl Anton Koch, Alexander Scheer, Kais Setti
A woman with a mysterious illness is forced into action when a group of terrorists attempt to hijack a transatlantic overnight flight.
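The cleaned STARS value above still contains a run of spaces before "Stars:". One possible alternative, sketched here with a hypothetical helper `normalize_whitespace`, is to collapse every run of whitespace (newlines included) into a single space. This is an illustration on the sample string shown earlier, not the method the notebook uses.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Raw STARS value for the first row, as printed earlier in this section.
raw_stars = (
    "\n    Director:\nPeter Thorwarth\n| \n    Stars:\nPeri Baumeister, "
    "\nCarl Anton Koch, \nAlexander Scheer, \nKais Setti\n"
)

clean_stars = normalize_whitespace(raw_stars)
print(clean_stars)
```

On a pandas column, the same idea could be written as `.str.replace(r'\s+', ' ', regex=True).str.strip()`.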

3. Inconsistent Year Formats¶

The YEAR column contains multiple formats:

  • (2001) → A movie
  • (2001–2007) → A completed TV series
  • (2021– ) or (2021–) → An ongoing TV series

Before analysis, we must:

  • Remove parentheses

  • Strip extra spaces

  • Standardize dash formatting

  • Engineer More Useful Features

In [202]:
movies['YEAR'][0:3]
Out[202]:
YEAR
0 (2021)
1 (2021– )
2 (2010–2022)

Instead of keeping YEAR as a complex string, we create structured features:

Start_Year

Extract the first year (e.g., 2001 from '2001–2006').

End_Year

If format is 2001–2007 → End_Year = 2007

If format is 2021– → End_Year = 2021

If format is 2001 → End_Year = 2001

Is_Series

True if a range exists (e.g., 2001–2007 or 2021–)

False if the format is a single year (e.g., 2001)

Is_Ongoing

True if format is 2021– (no end year)

False otherwise

This transformation converts a messy string into structured, analysis-ready features that enable:

  • Comparing movies vs series
  • Analyzing trends in ongoing content
  • Studying longevity of shows
  • Examining production patterns over time

Let's start by removing the '(' and ')' parentheses and surrounding whitespace, using replace and strip.

In [203]:
movies_year = movies_format.copy()
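# regex=False treats '(' and ')' as literal characters on every pandas version
movies_year['YEAR'] = movies_year['YEAR'].str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.strip()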
movies_year['YEAR'][0:3]
Out[203]:
YEAR
0 2021
1 2021–
2 2010–2022

Creating new features: Start_Year and End_Year

In [204]:
movies_year[['Start_Year', 'End_Year']] = (movies_year['YEAR'].str.split('–', expand=True))
movies_year['Start_Year'] = pd.to_numeric(movies_year['Start_Year'], errors='coerce')
movies_year['End_Year'] = pd.to_numeric(movies_year['End_Year'], errors='coerce')

Creating new Features Is_Series and Is_Ongoing

In [205]:
movies_year['Is_Series'] = movies_year['YEAR'].str.contains('–', na=False)
movies_year['Is_Ongoing'] = (movies_year['Is_Series'] & movies_year['End_Year'].isna())

We set End_Year = Start_Year for movies and ongoing series to minimize NA/NULL values. We also drop the string YEAR column, since the useful features have already been extracted.

In [206]:
movies_year.loc[movies_year["Is_Series"].eq(False), "End_Year"] = movies_year["Start_Year"]
movies_year.loc[movies_year["Is_Series"] & movies_year["Is_Ongoing"], "End_Year"] = movies_year["Start_Year"]

movies_year.drop(columns=['YEAR'], inplace=True)
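The rules above can also be sketched as a single pure-Python function, which makes the four engineered features easy to check on individual strings. The names `YearInfo` and `parse_year` are hypothetical, and the regular expression is an assumption about the formats shown in this section. Unlike the pandas cells, it picks the first four-digit year it finds, so it may parse strings that the notebook coerces to NaN.

```python
import re
from typing import NamedTuple, Optional


class YearInfo(NamedTuple):
    start_year: Optional[int]
    end_year: Optional[int]
    is_series: bool
    is_ongoing: bool


# A 4-digit year, optionally followed by a dash and an optional second year.
_YEAR_RE = re.compile(r"(\d{4})(?:\s*([–-])\s*(\d{4})?)?")


def parse_year(raw):
    """Parse strings such as '(2021)', '(2021– )' or '(2010–2022)'."""
    if not isinstance(raw, str):
        return YearInfo(None, None, False, False)
    match = _YEAR_RE.search(raw)
    if match is None:
        return YearInfo(None, None, False, False)

    start = int(match.group(1))
    is_series = match.group(2) is not None          # a dash means a range
    end = int(match.group(3)) if match.group(3) else None
    is_ongoing = is_series and end is None          # range with no end year

    # Mirror the notebook: movies and ongoing series get End_Year = Start_Year.
    if end is None:
        end = start
    return YearInfo(start, end, is_series, is_ongoing)


for sample in ["(2021)", "(2021– )", "(2010–2022)"]:
    print(sample, parse_year(sample))
```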

4. ONE-LINE & VOTES Columns¶

The ONE-LINE column contains short textual descriptions of each title. Although it can be useful for NLP-based sentiment or keyword analysis, it does not fit well into this project.

Decision: We will drop the ONE-LINE column to simplify the dataset and reduce unnecessary dimensionality.

The VOTES column, on the other hand, contains thousands separators (commas) and is stored as a string instead of the numeric format we need.

In [207]:
movies_year = movies_year.drop(columns=['ONE-LINE'])

movies_year['VOTES'] = movies_year['VOTES'].str.replace(',', '', regex=False)
movies_year['VOTES'] = pd.to_numeric(movies_year['VOTES'], errors='coerce')

5. Unstructured STARS Column¶

The STARS column contains mixed-format strings such as:

"Director: Augustine Frizzell | Stars: Shailene Woodley, Joe Alwyn, Wendy Nottingham, Felicity Jones"

Or,

"Stars: Chase Stokes, Madelyn Cline, Madison Bailey, Jonathan Daviss"

Some issues in above format are:

  • Director and actors are mentioned in single string.
  • The format is inconsistent, sometimes including Director.

For meaningful analysis, this structure is not ideal. Therefore, we must decompose this column into structured features.

Extracting Director Column¶

If the string contains "Director:", we extract the director's name; otherwise we insert the placeholder string 'NA'. This gives us a clean Director column suitable for:

  • Grouping by director
  • Counting titles per director
  • Average rating per director
In [208]:
movies_star = movies_year.copy()
movies_star['Director'] = movies_star['STARS'].str.extract(r'Director:\s*([^|]+)')
movies_star['Director'] = movies_star['Director'].str.strip()
movies_star.loc[movies_star['Director'].isna(), 'Director'] = 'NA'

Extracting Stars as Column¶

From the "Stars:" portion of the string, we can extract list of Stars in the movie or show, creating a structured Stars column containing a list/array of actor names.

This transformation allows:

  • Grouping by Actor
  • Studying most frequent actors
  • Examining actor-rating relationships
In [209]:
# Removing director
movies_star['Stars_Clean'] = movies_star['STARS'].str.replace(r'Director:.*?\|', '', regex=True)

# Removing 'Stars:'
movies_star['Stars_Clean'] = movies_star['Stars_Clean'].str.replace('Stars:', '', regex=False).str.strip()
movies_star['Stars_List'] = movies_star['Stars_Clean'].str.split(',')

# Stripping whitespace
movies_star['Stars_List'] = movies_star['Stars_List'].apply(
    lambda x: [actor.strip() for actor in x] if isinstance(x, list) else x
)
movies_star.drop(columns=['STARS','Stars_Clean'], inplace=True)
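To make the parsing logic easier to test on single strings, here is a minimal pure-Python sketch of the same decomposition. The helper name `split_credits` is hypothetical. It returns `None` for a missing director rather than the 'NA' placeholder used above, and it filters out empty names.

```python
import re


def split_credits(raw):
    """Return (director, stars) parsed from a cleaned STARS string."""
    if not isinstance(raw, str):
        return None, []

    # Director: everything after 'Director:' up to the '|' separator.
    director_match = re.search(r"Director:\s*([^|]+)", raw)
    director = director_match.group(1).strip() if director_match else None

    # Stars: everything after 'Stars:', split on commas.
    stars_match = re.search(r"Stars:\s*(.*)", raw, flags=re.DOTALL)
    stars_text = stars_match.group(1) if stars_match else ""
    stars = [name.strip() for name in stars_text.split(",") if name.strip()]
    return director, stars


with_director = (
    "Director: Augustine Frizzell | Stars: Shailene Woodley, Joe Alwyn, "
    "Wendy Nottingham, Felicity Jones"
)
stars_only = "Stars: Chase Stokes, Madelyn Cline, Madison Bailey, Jonathan Daviss"

print(split_credits(with_director))
print(split_credits(stars_only))
```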

Creating Star Count Feature¶

Once we have a clean list of actors, we compute the star count per title.

This engineered feature enables analysis such as:

  • Do titles with larger casts receive higher ratings?
  • Are movies associated with different cast sizes than series?
  • Is there any relationship between star count and votes?
In [210]:
movies_star['Star_Count'] = movies_star['Stars_List'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)
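One pitfall worth checking here: splitting an empty string on commas yields a list holding one empty string, so `len` reports 1 rather than 0. The sketch below uses a hypothetical two-row Series to show the difference. Whether the dataset actually contains empty STARS strings is an assumption that should be verified.

```python
import pandas as pd

# Hypothetical cleaned STARS values: one with two names, one empty.
stars = pd.Series(["Chase Stokes, Madelyn Cline", ""])
stars_list = stars.str.split(",")

# Naive count: the empty string becomes [''] and is counted as one star.
naive_count = stars_list.apply(len)

# Robust count: only non-empty names are counted.
robust_count = stars_list.apply(
    lambda names: sum(1 for name in names if name.strip())
)

print(naive_count.tolist())   # [2, 1]
print(robust_count.tolist())  # [2, 0]
```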
In [211]:
movies_star.head()
Out[211]:
MOVIES GENRE RATING VOTES RunTime Gross Start_Year End_Year Is_Series Is_Ongoing Director Stars_List Star_Count
0 Blood Red Sky Action, Horror, Thriller 6.1 21062.0 121.0 NaN 2021.0 2021.0 False False Peter Thorwarth [Peri Baumeister, Carl Anton Koch, Alexander S... 4
1 Masters of the Universe: Revelation Animation, Action, Adventure 5.0 17870.0 25.0 NaN 2021.0 2021.0 True True NA [Chris Wood, Sarah Michelle Gellar, Lena Heade... 4
2 The Walking Dead Drama, Horror, Thriller 8.2 885805.0 44.0 NaN 2010.0 2022.0 True False NA [Andrew Lincoln, Norman Reedus, Melissa McBrid... 4
3 Rick and Morty Animation, Adventure, Comedy 9.2 414849.0 23.0 NaN 2013.0 2013.0 True True NA [Justin Roiland, Chris Parnell, Spencer Gramme... 4
4 Army of Thieves Action, Crime, Horror NaN NaN NaN NaN 2021.0 2021.0 False False Matthias Schweighöfer [Matthias Schweighöfer, Nathalie Emmanuel, Rub... 4

6. Restructuring GENRE Column¶

Now that we have cleaned newline characters from the GENRE column, we encounter another structural issue: genres are stored as a comma-separated string.

Such as,

"Action, Horror, Thriller"

"Drama, Romance"

"Animation, Action, Adventure"

While readable, this format is not ideal for analysis. In its current form:

  • We cannot easily count how many titles belong to each genre.
  • We cannot compute average rating per genre.
  • Multi-genre titles remain compressed into a single cell.

For platforms like Netflix, multi-label categorical variables are common. For analysis, however, each categorical value should be represented in a structured way.

Let's convert the GENRE column into a list:

In [212]:
movies_star['GENRE'] = movies_star['GENRE'].str.split(',')

# whitespace
movies_star['GENRE'] = movies_star['GENRE'].apply(
    lambda x: [genre.strip() for genre in x] if isinstance(x, list) else x
)

For genre-level analysis, we need a dataset where:

  • Each row represents one title–genre pair
  • Multi-genre titles appear multiple times (once per genre)

This process is called exploding the dataset. Let's do this on a different DataFrame meant for genre-level analysis only.

In [213]:
movies_genre = movies_star.copy()

movies_genre = movies_genre.explode('GENRE')
movies_genre.head()
Out[213]:
MOVIES GENRE RATING VOTES RunTime Gross Start_Year End_Year Is_Series Is_Ongoing Director Stars_List Star_Count
0 Blood Red Sky Action 6.1 21062.0 121.0 NaN 2021.0 2021.0 False False Peter Thorwarth [Peri Baumeister, Carl Anton Koch, Alexander S... 4
0 Blood Red Sky Horror 6.1 21062.0 121.0 NaN 2021.0 2021.0 False False Peter Thorwarth [Peri Baumeister, Carl Anton Koch, Alexander S... 4
0 Blood Red Sky Thriller 6.1 21062.0 121.0 NaN 2021.0 2021.0 False False Peter Thorwarth [Peri Baumeister, Carl Anton Koch, Alexander S... 4
1 Masters of the Universe: Revelation Animation 5.0 17870.0 25.0 NaN 2021.0 2021.0 True True NA [Chris Wood, Sarah Michelle Gellar, Lena Heade... 4
1 Masters of the Universe: Revelation Action 5.0 17870.0 25.0 NaN 2021.0 2021.0 True True NA [Chris Wood, Sarah Michelle Gellar, Lena Heade... 4
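Once the data is in this long form, per-genre statistics reduce to a plain `groupby`. The following is a minimal sketch on a two-title toy frame (titles and ratings copied from the tables in this notebook). The output column names `titles` and `avg_rating` are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "MOVIES": ["Blood Red Sky", "The Last Letter from Your Lover"],
    "GENRE": [["Action", "Horror", "Thriller"], ["Drama", "Romance"]],
    "RATING": [6.1, 6.8],
})

# One row per title–genre pair; other columns are repeated for each genre.
long_form = toy.explode("GENRE")

# Count of titles and mean rating per genre.
genre_stats = (
    long_form.groupby("GENRE")["RATING"]
    .agg(titles="count", avg_rating="mean")
    .sort_values("avg_rating", ascending=False)
)

print(long_form)
print(genre_stats)
```

Because each title is repeated once per genre, averages computed on the exploded frame should only be read per genre, never as dataset-wide figures.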
In [214]:
import seaborn as sns
import matplotlib.pyplot as plt

genre_counts = movies_genre['GENRE'].value_counts().reset_index()
genre_counts.columns = ['Genre', 'Count']

plt.figure(figsize=(10,6))
sns.barplot(
    data=genre_counts,
    x='Genre',
    y='Count',
    hue='Genre',      # assign hue explicitly so the palette applies without a FutureWarning
    palette='Set2',
    legend=False
)

plt.xticks(rotation=45)
plt.title('Frequency of Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
[Figure: bar chart "Frequency of Genres" (x: Genre, y: Count)]

7. Missing Values¶

After tidying and restructuring the dataset, the next critical step in the data analysis pipeline is assessing missing values.

Missing data is not just a technical inconvenience, it can introduce bias, distort statistical summaries, and weaken model performance if not handled thoughtfully.

In [215]:
movies_star.isnull().sum()
Out[215]:
0
MOVIES 0
GENRE 78
RATING 1400
VOTES 1400
RunTime 2560
Gross 9108
Start_Year 1579
End_Year 1574
Is_Series 0
Is_Ongoing 0
Director 0
Stars_List 0
Star_Count 0

Reasons for Missingness¶

Instead of randomly filling or dropping values, we must first understand why they are missing.

  1. GENRE (78)

This is relatively small compared to the dataset size. A possible cause is incomplete IMDb tagging.

Since genre is critical for our analysis, we may:

  • Drop rows if the proportion is very small
  • Label them as "Unknown"
In [216]:
movies_star['GENRE'] = movies_star['GENRE'].fillna('Unknown')
  2. RATING and VOTES (1,400 each)

These two columns are missing together, which suggests the titles may not yet have sufficient user ratings, for example recently released or low-engagement titles.

Since RATING is critical to the primary analytical goal (and most of the revenue data is missing anyway), we will drop the rows without a rating, as they are not useful for this project.

In [217]:
movies_star = movies_star.dropna(subset=['RATING', 'VOTES'])
  3. RunTime (2,560)

Missing runtime could occur because:

  • TV shows may not have consistent episode durations listed
  • Data scraping inconsistencies
  • Older or limited-release titles

For runtime analysis, we will impute missing movie runtimes with the mean runtime of movies, and missing series runtimes with the mean runtime of series.

In [218]:
movies_star.groupby('Is_Series')['RunTime'].mean()
Out[218]:
RunTime
Is_Series
False 90.492119
True 39.310864

Movies have an average RunTime of ~90.49 minutes and series ~39.31 minutes (per episode). Using mean imputation:

In [219]:
movies_star.loc[(movies_star['Is_Series'] == False) & (movies_star['RunTime'].isna()), 'RunTime'] = 90.492119 # Movies
movies_star.loc[(movies_star['Is_Series'] == True) & (movies_star['RunTime'].isna()), 'RunTime'] = 39.310864 # Shows
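The same group-wise mean imputation can be written without hard-coding the two means, which keeps the fill values in sync if the data changes. This is a sketch on a hypothetical six-row frame. The column name `RunTime_Imputed` is illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Is_Series": [False, False, False, True, True, True],
    "RunTime": [121.0, 110.0, np.nan, 44.0, 23.0, np.nan],
})

# transform('mean') returns, for every row, the mean of that row's group
# (NaN values are ignored when the mean is computed).
group_mean = toy.groupby("Is_Series")["RunTime"].transform("mean")

# Fill each missing runtime with the mean of its own content type.
toy["RunTime_Imputed"] = toy["RunTime"].fillna(group_mean)

print(toy)
```

Here the missing movie runtime becomes 115.5 (the mean of 121 and 110) and the missing series runtime becomes 33.5 (the mean of 44 and 23).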
  4. Gross (9,108)

This is the most significant missingness. The majority of titles lack gross revenue, largely because the data source does not list it.

Given the extremely high missing proportion, Gross cannot be used as a primary analytical variable. Financial analysis using this column would be highly biased, so we drop it.

In [220]:
movies_clean = movies_star.copy()
movies_clean.drop(columns=['Gross'], inplace=True)
  5. Start_Year and End_Year (1,579 and 1,574)

These were engineered from the YEAR column. In some cases, both Start_Year and End_Year are missing, likely due to missing or unparseable values in the original YEAR column.

Rows where both values are missing are explicitly marked as 0 (instead of remaining NaN), which we can do with a conditional update.

This preserves rows while clearly flagging incomplete temporal information.

In [221]:
# Condition mask
mask = movies_clean['Start_Year'].isna() & movies_clean['End_Year'].isna()  # both columns from movies_clean
movies_clean.loc[mask, ['Start_Year', 'End_Year']] = 0


# Condition: Start_Year is NA and End_Year is not NA
mask = movies_clean['Start_Year'].isna() & movies_clean['End_Year'].notna()
movies_clean.loc[mask, 'Start_Year'] = movies_clean.loc[mask, 'End_Year']

We have removed or flagged all missing values, and now have a clean dataset.

Important to note:

  • Rows missing both Start_Year and End_Year have been marked as 0, so year-based analyses must filter them out.
  • Rows missing GENRE (78 before rows without ratings were dropped) have been marked as 'Unknown'.

8. Outliers¶

After handling missing values, the next critical step in data preprocessing is identifying and treating outliers.

Outliers are observations that deviate significantly from the majority of the data. They may arise due to:

  • Data entry errors
  • Measurement errors
  • Experimental anomalies
  • Natural but rare extreme values
  • Genuine variability in the population

Outliers are not inherently “bad.” However, if left unchecked, they can:

  • Skew mean and standard deviation
  • Distort statistical inference
  • Negatively affect distance-based models like KNN.

At the same time, removing outliers blindly may eliminate valuable information—especially in domains like finance, healthcare, or fraud detection where rare events are meaningful.

In [222]:
plt.figure(figsize=(6, 4))
movies_clean[['Start_Year', 'End_Year']].boxplot()
plt.xticks(rotation=45)
plt.title("Start & End Year Outlier Detection")
plt.show()
[Figure: box plots, "Start & End Year Outlier Detection"]

No unexpected values exist in the Start_Year and End_Year columns (rows flagged with 0 above are placeholders, not real years). Removing outliers in VOTES and RunTime would discard valuable exceptional cases.
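A box plot flags points that fall outside the whiskers, which by convention sit 1.5 × IQR beyond the quartiles. As a sketch of how that rule could be applied to flag (rather than drop) extreme values, the hypothetical helper `iqr_outliers` below is run on a small made-up list of runtimes. The multiplier `k=1.5` is the usual convention, not a value taken from this notebook.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the entries of a Series lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical runtimes in minutes, including one extreme value.
runtimes = pd.Series([20, 25, 30, 44, 50, 60, 90, 853])
flagged = iqr_outliers(runtimes)

print(flagged.tolist())  # [853]
```

Flagged rows can then be inspected individually before deciding whether they are errors or genuine extremes.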

Visualization¶

Dataset¶


Movies: 53.6% Series: 46.4%

This indicates the dataset is fairly balanced between movies and series, with a slight dominance of movies.

  • The dataset represents both formats well.
  • Comparative analysis (Movies vs Series) is meaningful because neither class overwhelmingly dominates.
  • Slight bias toward movies may slightly influence overall averages.

Average Runtime: Movies vs Series¶

This visualization compares the average runtime of movies and series. Runtime behavior is structurally different for both formats, so this helps validate content-type differences.

Movie average runtime ≈ 90 minutes

Series average runtime ≈ 39 minutes

This shows movies are more than twice as long as individual series episodes.

  • Movies are long-form content (~1.5 hours average).

  • Series episodes are shorter (~40 minutes), consistent with episodic structure.

  • Runtime is strongly dependent on content type.

  • Runtime can be a predictive feature for distinguishing movies from series.

Average Rating: Movies vs Series¶

This compares the average audience rating between Movies and Series. It helps determine which format tends to receive higher audience appreciation.

In [223]:
avg_rating = movies_clean.groupby("Is_Series")["RATING"].mean()
avg_rating.index = avg_rating.index.map({True: 'Series', False: 'Movie'})

plt.figure(figsize=(6,4))
ax = sns.barplot(x=avg_rating.index, y=avg_rating.values)

plt.title("Average Rating: Movies vs Series")
plt.ylabel("Average Rating")
plt.xlabel("Content Type")

for i, v in enumerate(avg_rating.values):
    ax.text(i, v + 0.05, f"{v:.2f}", ha='center')

plt.show()
[Figure: bar chart "Average Rating: Movies vs Series" (x: Content Type, y: Average Rating)]

Result

Movies: 6.5

Series: 7.2

We can see that series have noticeably higher average ratings than movies.

  • Series tend to be rated more favorably.

  • Possible reasons:

    • Longer character development

    • More audience engagement over time

    • Rating bias (only successful series survive multiple seasons)

  • Content type influences rating behavior.

Ongoing vs Completed Series¶

This compares the number of ongoing series versus completed series.

In [224]:
series_df = movies_clean[movies_clean['Is_Series'] == True]
ongoing_counts = series_df['Is_Ongoing'].value_counts()
ongoing_counts.index = ongoing_counts.index.map({True: 'Ongoing', False: 'Completed'})

plt.figure(figsize=(7,5))
ax = sns.barplot(x=ongoing_counts.index,
                 y=ongoing_counts.values,
                 hue=ongoing_counts.index,
                 legend=False,
                 palette="Set3")
plt.title("Ongoing vs Completed Series")
plt.xlabel("Series Status")
plt.ylabel("Count")

for i, value in enumerate(ongoing_counts.values):
    ax.text(i,
            value + 1,
            str(value),
            ha='center',
            va='bottom')

plt.show()
[Figure: bar chart "Ongoing vs Completed Series" (x: Series Status, y: Count)]

Result

Ongoing: 2444

Completed: 1350

There are significantly more ongoing series than completed ones.

  • The dataset is skewed toward currently running content.

  • This may reflect streaming platforms favoring long-running series.

  • Ongoing shows may have:

    • Inflated ratings due to current hype

    • Incomplete lifecycle data

  • Lifecycle status (Is_Ongoing) can influence ratings and vote patterns.

Distribution of Audience Rating¶

This visualization represents the frequency distribution of ratings across all titles. It helps us understand:

  • Central tendency (where most ratings lie)

  • Spread (variability of ratings)

  • Skewness (bias toward high or low ratings)

  • Presence of outliers (extremely high/low-rated titles)

It also reveals how audiences generally evaluate content in this dataset

In [225]:
plt.figure()
sns.histplot(movies_clean["RATING"].dropna(), bins=20, kde=True)
plt.title("Distribution of Audience Ratings")
plt.xlabel("Rating")
plt.ylabel("Count of Titles")
plt.show()
[Figure: histogram "Distribution of Audience Ratings" (x: Rating, y: Count of Titles)]

Result

The majority of ratings lie between 6.5 and 8

Peak around 7.5

Interpretation

  1. Central Tendency

The clustering around 7–7.5 indicates that most content is perceived as above average but not exceptional.

IMDb-style rating systems typically show:

5 = average

6–7 = good

8+ = excellent

The dataset suggests most content is rated as “good”.

Popularity vs Rating¶

This plot analyzes the relationship between Votes and audience rating.

It helps determine:

  • Whether higher-rated content attracts more votes

  • Whether titles with few votes are rated less critically

In [226]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=movies_clean, x="VOTES", y="RATING")  # movies_clean holds numeric VOTES
plt.xscale("log")
plt.title("Rating vs Votes")
plt.show()
[Figure: scatter plot "Rating vs Votes" (log-scaled x-axis)]
In [227]:
correlation = movies_clean[["RATING", "VOTES"]].corr()
print("Correlation between Rating and Votes:")
print(correlation)
Correlation between Rating and Votes:
          RATING     VOTES
RATING  1.000000  0.103792
VOTES   0.103792  1.000000

This indicates a very weak positive relationship between RATING and VOTES (r ≈ 0.10), suggesting at most a slight tendency for titles with more votes to be rated higher.
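Vote counts span several orders of magnitude (the scatter plot uses a log-scaled x-axis), and Pearson correlation on raw counts is dominated by a few very popular titles. One option, sketched below on made-up numbers, is to correlate RATING with the base-10 logarithm of VOTES. This is an illustration of the technique and says nothing about what the real dataset would show.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rating rises by one point for every 10x increase in votes.
toy = pd.DataFrame({
    "VOTES": [10, 100, 1_000, 10_000],
    "RATING": [5.0, 6.0, 7.0, 8.0],
})

# Pearson correlation on raw vote counts.
raw_corr = toy["RATING"].corr(toy["VOTES"])

# Pearson correlation on log10(votes), matching the log-scaled plot axis.
log_corr = toy["RATING"].corr(np.log10(toy["VOTES"]))

print(f"raw: {raw_corr:.3f}, log: {log_corr:.3f}")
```

In this toy case the relationship is perfectly linear in log space, so the log-based coefficient is higher than the raw one. Titles with zero votes would need to be excluded before taking the logarithm.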

Runtime vs Rating¶

This plot analyzes the relationship between content length and audience rating.

It helps determine:

  • Whether longer content is perceived as higher quality

  • Whether shorter content underperforms

  • Whether runtime impacts audience satisfaction

In [228]:
corr_runtime_rating = movies_clean[["RunTime", "RATING"]].corr()
corr_runtime_rating
Out[228]:
RunTime RATING
RunTime 1.000000 -0.215801
RATING -0.215801 1.000000

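This indicates a weak negative correlation between RunTime and RATING (r ≈ -0.22), suggesting longer titles tend to receive slightly lower ratings. Because series episodes are shorter and series are rated higher on average, part of this effect may reflect content type rather than runtime itself.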
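One way to check whether content type drives this correlation is to compute it separately for movies and series. The sketch below uses made-up numbers chosen so that the pooled correlation is negative even though runtime and rating rise together within each group. It illustrates the check and is not a result from the dataset.

```python
import pandas as pd

# Hypothetical data: within each group, longer titles are rated higher,
# but series are both shorter and better rated than movies.
toy = pd.DataFrame({
    "Is_Series": [False] * 4 + [True] * 4,
    "RunTime": [90, 100, 110, 120, 20, 30, 40, 50],
    "RATING": [6.0, 6.2, 6.4, 6.6, 7.5, 7.7, 7.9, 8.1],
})

# Pooled correlation across all titles.
overall = toy["RunTime"].corr(toy["RATING"])

# Correlation computed separately for each content type.
within = {
    ("Series" if is_series else "Movie"): group["RunTime"].corr(group["RATING"])
    for is_series, group in toy.groupby("Is_Series")
}

print(f"overall: {overall:.2f}")
print({name: round(value, 2) for name, value in within.items()})
```

If the within-group coefficients on the real data differ in sign or size from the pooled value, the pooled figure should be interpreted with care.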

In [229]:
top_runtime_records = movies_clean.sort_values(by='RunTime', ascending=False).head(1)

top_runtime_records
Out[229]:
MOVIES GENRE RATING VOTES RunTime Start_Year End_Year Is_Series Is_Ongoing Director Stars_List Star_Count
1902 El tiempo entre costuras [Adventure, Drama, History] 8.3 3876.0 853.0 2013.0 2014.0 True False NA [Adriana Ugarte, Mari Carmen Sánchez, Tristán ... 4

Many outliers represent genuinely long titles (the longest record above is a series), so it would not be ideal to drop them.

https://www.imdb.com/es/title/tt1864750/

In [230]:
movies_clean[movies_clean['Is_Series'] == True].head()
Out[230]:
MOVIES GENRE RATING VOTES RunTime Start_Year End_Year Is_Series Is_Ongoing Director Stars_List Star_Count
1 Masters of the Universe: Revelation [Animation, Action, Adventure] 5.0 17870.0 25.0 2021.0 2021.0 True True NA [Chris Wood, Sarah Michelle Gellar, Lena Heade... 4
2 The Walking Dead [Drama, Horror, Thriller] 8.2 885805.0 44.0 2010.0 2022.0 True False NA [Andrew Lincoln, Norman Reedus, Melissa McBrid... 4
3 Rick and Morty [Animation, Adventure, Comedy] 9.2 414849.0 23.0 2013.0 2013.0 True True NA [Justin Roiland, Chris Parnell, Spencer Gramme... 4
5 Outer Banks [Action, Crime, Drama] 7.6 25858.0 50.0 2020.0 2020.0 True True NA [Chase Stokes, Madelyn Cline, Madison Bailey, ... 4
7 Dexter [Crime, Drama, Mystery] 8.6 665387.0 53.0 2006.0 2013.0 True False NA [Michael C. Hall, Jennifer Carpenter, David Za... 4

Masters of the Universe: Revelation https://www.imdb.com/title/tt10826054/

The Walking Dead https://www.imdb.com/title/tt1520211/

Rick and Morty https://www.imdb.com/title/tt2861424/

In [231]:
movies_clean.head(10)
Out[231]:
MOVIES GENRE RATING VOTES RunTime Start_Year End_Year Is_Series Is_Ongoing Director Stars_List Star_Count
0 Blood Red Sky [Action, Horror, Thriller] 6.1 21062.0 121.0 2021.0 2021.0 False False Peter Thorwarth [Peri Baumeister, Carl Anton Koch, Alexander S... 4
1 Masters of the Universe: Revelation [Animation, Action, Adventure] 5.0 17870.0 25.0 2021.0 2021.0 True True NA [Chris Wood, Sarah Michelle Gellar, Lena Heade... 4
2 The Walking Dead [Drama, Horror, Thriller] 8.2 885805.0 44.0 2010.0 2022.0 True False NA [Andrew Lincoln, Norman Reedus, Melissa McBrid... 4
3 Rick and Morty [Animation, Adventure, Comedy] 9.2 414849.0 23.0 2013.0 2013.0 True True NA [Justin Roiland, Chris Parnell, Spencer Gramme... 4
5 Outer Banks [Action, Crime, Drama] 7.6 25858.0 50.0 2020.0 2020.0 True True NA [Chase Stokes, Madelyn Cline, Madison Bailey, ... 4
6 The Last Letter from Your Lover [Drama, Romance] 6.8 5283.0 110.0 2021.0 2021.0 False False Augustine Frizzell [Shailene Woodley, Joe Alwyn, Wendy Nottingham... 4
7 Dexter [Crime, Drama, Mystery] 8.6 665387.0 53.0 2006.0 2013.0 True False NA [Michael C. Hall, Jennifer Carpenter, David Za... 4
8 Never Have I Ever [Comedy] 7.9 34530.0 30.0 2020.0 2020.0 True True NA [Maitreyi Ramakrishnan, Poorna Jagannathan, Da... 4
9 Virgin River [Drama, Romance] 7.4 27279.0 44.0 2019.0 2019.0 True True NA [Alexandra Breckenridge, Martin Henderson, Col... 4
10 Gunpowder Milkshake [Action, Adventure, Thriller] 6.0 17989.0 114.0 2021.0 2021.0 False False Navot Papushado [Karen Gillan, Lena Headey, Carla Gugino, Mich... 4
In [232]:
decade_avg = (
    movies_clean
        .loc[movies_clean["Start_Year"] != 0]
        .assign(Decade=(movies_clean["Start_Year"] // 10) * 10)
        .groupby("Decade", as_index=False)["RATING"]
        .mean()
        .sort_values("Decade")
)

plt.figure(figsize=(10, 6))

sns.lineplot(data=decade_avg, x="Decade", y="RATING", marker="o")

plt.title("Average Rating by Decade")
plt.xlabel("Decade")
plt.ylabel("Average Rating")
plt.ylim(5.5, 7.5)
plt.tight_layout()
plt.show()
[Figure: "Average Rating by Decade" line plot; x-axis: Decade, y-axis: Average Rating]

Titles released in the 2000s appear to have the highest average rating of any decade, and average ratings generally rise over time.
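The decade bucketing used in the cell above can be illustrated on a small frame. This is a minimal sketch: the `toy` values are made up for illustration and are not taken from the dataset, and it assumes (as the filter above suggests) that a `Start_Year` of 0 marks a missing year.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "Start_Year": [1994.0, 1999.0, 2006.0, 2013.0, 0.0],
    "RATING": [8.0, 7.0, 8.6, 9.2, 5.0],
})

toy_decade_avg = (
    toy
    .loc[toy["Start_Year"] != 0]  # drop rows without a known year
    # Floor division maps 1994 -> 1990, 2006 -> 2000, 2013 -> 2010.
    .assign(Decade=lambda d: (d["Start_Year"] // 10) * 10)
    .groupby("Decade", as_index=False)["RATING"]
    .mean()
)

print(toy_decade_avg)
```

Passing a lambda to `assign` computes `Decade` from the already filtered rows, rather than relying on index alignment with the unfiltered frame. Both approaches give the same result here.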

In [233]:
year_count = movies_clean["Start_Year"].loc[movies_clean["Start_Year"] != 0].value_counts().sort_index()

plt.figure(figsize=(10,6))
year_count.plot()
plt.title("Movies over Years")
plt.show()
[Figure: "Movies over Years" line plot of the number of titles per start year]
In [234]:
year_count = movies_clean["Start_Year"].loc[movies_clean["Start_Year"] >1995].value_counts().sort_index()

plt.figure(figsize=(10,6))
year_count.plot()
plt.title("Number of Movies Featured")
plt.show()
[Figure: "Number of Movies Featured" line plot of the number of titles per start year, after 1995]

Insights¶

Movies vs Series¶

In [235]:
df = movies_clean.copy()
df["Content_Type"] = df["Is_Series"].map({
    False: "Movies",
    True: "Series"
})
performance = df.groupby("Content_Type").agg({
    "RATING":"mean",
    "VOTES":"mean"
}).reset_index()

plt.figure(figsize=(8,5))
sns.barplot(data=performance, x="Content_Type", y="VOTES")
plt.title("Average Votes: Movies vs Series")
plt.xlabel("Content Type")
plt.ylabel("Average Votes")
plt.show()

performance
[Figure: "Average Votes: Movies vs Series" bar chart; x-axis: Content Type, y-axis: Average Votes]
Out[235]:
Content_Type RATING VOTES
0 Movies 6.489209 18840.200046
1 Series 7.415999 10883.643121

Observation 1: Series Are Rated Higher on Average

In this dataset, series average a rating of 7.42 against 6.49 for movies, a gap of about 0.93 points. One possible reading is that long-form storytelling fosters deeper emotional engagement and stronger viewer satisfaction.

Why might this happen? The data does not test these explanations, but plausible factors include:

  • More time for character development

  • Multi-episode narrative arcs

  • Stronger fan communities and viewer loyalty across seasons

  • Possibly stronger appeal to younger audiences

Observation 2: Movies Generate More Votes on Average

Movies receive ~73% more votes on average (about 18,840 votes per title versus 10,884 for series).

Despite their lower ratings, movies attract substantially more votes per title, indicating broader reach but not necessarily deeper engagement.

This suggests:

  • Movies have wider reach.
  • They are easier to consume (single sitting).
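The ~73% figure follows directly from the two averages in the `performance` table. A minimal worked check, with the values copied from that table (the variable names are chosen here for illustration):

```python
# Average votes per title, copied from the `performance` table above.
avg_votes_movies = 18840.200046
avg_votes_series = 10883.643121

# Relative difference: how many percent more votes a movie receives
# than a series, on average.
pct_more = (avg_votes_movies / avg_votes_series - 1) * 100

print(f"Movies receive about {pct_more:.0f}% more votes on average")
```

Averages of vote counts can be pulled up by a few very popular titles, so comparing medians as well would be a reasonable follow-up check.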

Genres¶

By Popularity¶

In [258]:
def top_10_views():
  # One row per (title, genre) pair, so multi-genre titles count toward each genre
  df_exploded = df.explode("GENRE")

  genre_votes_sum = (
      df_exploded
      .groupby("GENRE")
      .agg({
          "VOTES": "sum",      # total votes across titles in the genre
          "RATING": "mean",    # average rating of those titles
          "MOVIES": "count"    # number of titles in the genre
      })
      .sort_values("VOTES", ascending=False)
  )

  top_genres_votes = genre_votes_sum.head(10)

  plt.figure(figsize=(9,6))
  sns.barplot(
      x=top_genres_votes["VOTES"],
      y=top_genres_votes.index,
      hue=top_genres_votes.index,  # color by genre; avoids the seaborn palette deprecation warning
      palette="Set2",
      legend=False
  )

  plt.title("Top 10 Genres by Total Audience Engagement")
  plt.xlabel("Total Votes")
  plt.ylabel("Genre")
  plt.show()

  # Expose the table so the cell can display it after the plot
  global x
  x = top_genres_votes

top_10_views()
x
[Figure: "Top 10 Genres by Total Audience Engagement" horizontal bar chart; x-axis: Total Votes, y-axis: Genre]
Out[258]:
VOTES RATING MOVIES
GENRE
Drama 70167787.0 7.091197 3499
Action 43766718.0 7.097396 1843
Adventure 35370287.0 7.295955 1335
Crime 33788787.0 7.080233 1376
Comedy 31948275.0 6.825589 2419
Thriller 19654895.0 6.333719 777
Mystery 13553679.0 7.089417 737
Animation 12697131.0 7.377619 1403
Sci-Fi 12224640.0 6.582593 270
Fantasy 11527266.0 6.996264 455
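Because each title can carry several genres, `explode` repeats a title once per genre, so its votes are added to every genre it belongs to. The sketch below uses made-up values (not from the dataset) to show that the per-genre totals therefore add up to more than the number of votes actually cast.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "MOVIES": ["Title A", "Title B"],
    "GENRE": [["Action", "Drama"], ["Drama"]],
    "VOTES": [100.0, 50.0],
})

# One row per (title, genre) pair; VOTES is repeated on each row.
toy_exploded = toy.explode("GENRE")
genre_totals = toy_exploded.groupby("GENRE")["VOTES"].sum()

print(genre_totals)
print("Votes cast:", toy["VOTES"].sum())
print("Sum of genre totals:", genre_totals.sum())
```

The genre totals are therefore best read as "votes on titles tagged with this genre" rather than as shares of a single whole.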
In [259]:
def bottom_10_views():
  # Build the exploded frame locally; the one in top_10_views() is not visible here
  df_exploded = df.explode("GENRE")

  genre_votes_sum = (
      df_exploded
      .groupby("GENRE")
      .agg({
          "VOTES": "sum",
          "RATING": "mean",
          "MOVIES": "count"
      })
      .sort_values("VOTES", ascending=True)
  )

  # Skip the single lowest-vote row and keep the next ten. The skipped row is
  # presumably the "Unknown" placeholder genre (an assumption: it is not shown
  # in the output below).
  bottom_genres_votes = genre_votes_sum.head(11).iloc[1:]

  plt.figure(figsize=(9,6))
  sns.barplot(
      x=bottom_genres_votes["VOTES"],
      y=bottom_genres_votes.index,
      hue=bottom_genres_votes.index,  # color by genre; avoids the seaborn palette deprecation warning
      palette="Set2",
      legend=False
  )

  plt.title("Bottom 10 Genres by Total Audience Engagement")
  plt.xlabel("Total Votes")
  plt.ylabel("Genre")
  plt.show()

  # Expose the table so the cell can display it after the plot
  global x
  x = bottom_genres_votes

bottom_10_views()
x
[Figure: "Bottom 10 Genres by Total Audience Engagement" horizontal bar chart; x-axis: Total Votes, y-axis: Genre]
Out[259]:
VOTES RATING MOVIES
GENRE
News 29672.0 7.061111 18
Talk-Show 36987.0 6.960870 23
Game-Show 77194.0 6.369792 96
Film-Noir 154485.0 7.016667 12
Western 258652.0 6.720000 20
Reality-TV 432936.0 6.626437 348
Short 449292.0 6.747753 178
Sport 897016.0 6.852500 160
War 1009873.0 6.986667 45
Musical 1109089.0 7.001923 52

By Average Rating¶

In [260]:
def top_10_rating():
  # One row per (title, genre) pair
  df_exploded = df.explode("GENRE")

  genre_perf_rating = (
      df_exploded
      .groupby("GENRE")
      .agg({
          "RATING": "mean",   # average rating per genre
          "VOTES": "mean"     # average votes per title in the genre
      })
      .sort_values("RATING", ascending=False)
  )

  top_genres_rating = genre_perf_rating.head(10)

  plt.figure(figsize=(8,6))
  sns.barplot(
      x=top_genres_rating["RATING"],
      y=top_genres_rating.index,
      hue=top_genres_rating.index,  # color by genre; avoids the seaborn palette deprecation warning
      palette="Set2",
      legend=False
  )

  plt.title("Top 10 Genres by Average Rating")
  plt.xlabel("Average Rating")
  plt.ylabel("Genre")
  # Zoomed axis: the top genres differ by only a few tenths of a point
  plt.xlim(7, 7.5)
  plt.show()

  # Expose the table so the cell can display it after the plot
  global x
  x = top_genres_rating

top_10_rating()
x
[Figure: "Top 10 Genres by Average Rating" horizontal bar chart; x-axis: Average Rating (7 to 7.5), y-axis: Genre]
Out[260]:
RATING VOTES
GENRE
Animation 7.377619 9049.986458
History 7.326172 9132.699219
Adventure 7.295955 26494.597004
Documentary 7.175704 2576.133803
Action 7.097396 23747.540966
Drama 7.091197 20053.668763
Mystery 7.089417 18390.337856
Crime 7.080233 24555.804506
News 7.061111 1648.444444
Biography 7.054104 23733.220149
In [267]:
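def bottom_10_rating():
  # Build the exploded frame locally; the one in top_10_rating() is not visible here
  df_exploded = df.explode("GENRE")

  genre_perf_rating = (
      df_exploded
      .groupby("GENRE")
      .agg({
          "RATING": "mean",
          "VOTES": "mean"
      })
      .sort_values("RATING", ascending=True)
  )

  # Eleven rows, so that ten named genres remain alongside the "Unknown" placeholder
  bottom_genres_rating = genre_perf_rating.head(11)

  plt.figure(figsize=(8,6))
  sns.barplot(
      x=bottom_genres_rating["RATING"],
      y=bottom_genres_rating.index,
      hue=bottom_genres_rating.index,  # color by genre; avoids the seaborn palette deprecation warning
      palette="Set2",
      legend=False
  )

  plt.title("Bottom 10 Genres by Average Rating")
  plt.xlabel("Average Rating")
  plt.ylabel("Genre")
  plt.xlim(5, 7)
  plt.show()

  # Expose the table so the cell can display it after the plot
  global x
  x = bottom_genres_rating

bottom_10_rating()
x
[Figure: "Bottom 10 Genres by Average Rating" horizontal bar chart; x-axis: Average Rating (5 to 7), y-axis: Genre]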
Out[267]:
RATING VOTES
GENRE
Horror 5.860998 22722.299320
Thriller 6.333719 25295.875161
Game-Show 6.369792 804.104167
Unknown 6.563636 86.909091
Sci-Fi 6.582593 45276.444444
Reality-TV 6.626437 1244.068966
Western 6.720000 12932.600000
Short 6.747753 2524.112360
Family 6.774933 11522.326146
Romance 6.800786 14382.218873
Comedy 6.825589 13207.224060

Do Longer Series Perform Better?¶

In [262]:
def series_length():
  # observed=False keeps the current pandas behavior for the categorical
  # Length_Category column and avoids the pandas FutureWarning about the
  # changing default.
  length_perf = completed_series.groupby("Length_Category", observed=False).agg({
      "RATING": "mean",   # average rating per length category
      "VOTES": "sum"      # total votes per length category
  }).reset_index()

  sns.set_style("whitegrid")
  sns.set_palette("Set2")

  fig, axes = plt.subplots(1, 2, figsize=(12,5))

  # Rating Plot
  sns.barplot(
      data=length_perf,
      x="Length_Category",
      y="RATING",
      ax=axes[0]
  )

  axes[0].set_title("Average Rating by Series Length",
                    fontsize=13)
  axes[0].set_xlabel("Series Length")
  axes[0].set_ylabel("Average Rating")
  axes[0].set_ylim(7.4, 8.2)

  # Label each bar with its average rating
  for i, v in enumerate(length_perf["RATING"]):
      axes[0].text(i, v + 0.02, f"{v:.2f}",
                  ha='center', fontsize=11)

  # Votes Plot
  sns.barplot(
      data=length_perf,
      x="Length_Category",
      y="VOTES",
      ax=axes[1]
  )

  axes[1].set_yscale("log")
  axes[1].set_title("Total Audience Engagement by Series Length (Log Scale)",
                    fontsize=13)
  axes[1].set_xlabel("Series Length")
  axes[1].set_ylabel("Total Votes (log scale)")

  # Label each bar with its total votes, in millions
  for i, v in enumerate(length_perf["VOTES"]):
      axes[1].text(i, v,
                  f"{v/1e6:.1f}M",
                  ha='center',
                  va='bottom',
                  fontsize=11)

  plt.suptitle("Impact of Series Longevity on Performance",
              fontsize=16)

  plt.tight_layout()
  plt.show()

series_length()
[Figure: "Impact of Series Longevity on Performance"; left panel: Average Rating by Series Length, right panel: Total Votes by Series Length (log scale)]
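`series_length()` relies on a `completed_series` frame with a categorical `Length_Category` column, both defined outside this section. The sketch below shows one way such a column could be built with `pd.cut`, and what the `observed` argument of `groupby` changes. The toy rows, the `Years_Run` column, the bin edges, and the "Short" and "Medium" labels are assumptions made for illustration; only the "6+ Years" boundary is taken from the conclusions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "Start_Year": [2010.0, 2006.0, 2019.0],
    "End_Year": [2022.0, 2013.0, 2020.0],
    "Is_Series": [True, True, True],
    "Is_Ongoing": [False, False, False],
})

# Keep only series that have ended, then measure how long each one ran.
completed = toy.loc[toy["Is_Series"] & ~toy["Is_Ongoing"]].copy()
completed["Years_Run"] = completed["End_Year"] - completed["Start_Year"]

# Assumed bins: 0-2, 3-5, and 6+ years. pd.cut returns a categorical column.
completed["Length_Category"] = pd.cut(
    completed["Years_Run"],
    bins=[-1, 2, 5, float("inf")],
    labels=["Short (0-2 Years)", "Medium (3-5 Years)", "Long (6+ Years)"],
)

# observed=False keeps empty categories; observed=True drops them.
sizes_all = completed.groupby("Length_Category", observed=False).size()
sizes_seen = completed.groupby("Length_Category", observed=True).size()

print(sizes_all)
print(sizes_seen)
```

In this toy frame no series falls into the middle bin, so `sizes_all` reports it with a count of 0 while `sizes_seen` omits it. If every category has at least one series, the two settings give the same result.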

Conclusions¶

  • Series tend to be rated higher by audiences

  • Movies seem to reach a wider audience

  • Top 5 Genres by Popularity (total votes)

    • Drama
    • Action
    • Adventure
    • Crime
    • Comedy
  • Top 5 Genres by Average Rating

    • Animation
    • History
    • Adventure
    • Documentary
    • Action
  • Long-running series (6+ years) drew the largest audience and higher ratings
