This repository contains a Jupyter notebook that implements two classic recommendation approaches in one place:
- Content-based filtering using movie genres
- Collaborative filtering using matrix factorization on user ratings
The project is intentionally small and educational, but the workflow is real: load movie data, represent similarity numerically, generate recommendations, then learn hidden preference structure from user behavior.
- `Movie_Recommendation_(content_and_collaborative_filtering).ipynb`: main notebook with the complete pipeline
- `README.md`: project documentation and workflow explanation
The notebook answers two different recommendation questions.
This part asks:
If a user likes one movie, which other movies look most similar based on their genres?
It uses the genres column from the movie dataset, converts those genre strings into TF-IDF vectors, and compares movies with cosine similarity.
This part asks:
Given ratings from many users across many movies, what hidden preference patterns can we learn from the rating matrix?
It builds a user-movie matrix, reduces it with TruncatedSVD, reconstructs the ratings, and measures reconstruction quality with RMSE.
The notebook imports:
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
```

- `numpy`: matrix math and reconstruction
- `pandas`: loading CSV files and reshaping the ratings table
- `TfidfVectorizer`: converting genre text into numeric feature vectors
- `cosine_similarity`: measuring movie-to-movie similarity
- `TruncatedSVD`: learning latent factors from the user-movie matrix
- `mean_squared_error`: computing RMSE for reconstruction quality
The notebook expects two CSV files:
- `movies.csv`
- `ratings.csv`
In the notebook they are loaded with:
```python
movies_data = pd.read_csv('/content/movies.csv')
ratings_data = pd.read_csv('/content/ratings.csv')
```

These paths are written for Google Colab. If you run the notebook locally, change them to local paths such as:

```python
movies_data = pd.read_csv('movies.csv')
ratings_data = pd.read_csv('ratings.csv')
```

This is the notebook flow in execution order.
- Import the required Python libraries.
- Load movie metadata and ratings data from CSV files.
- Clean the `genres` column by replacing `(no genres listed)` with an empty string.
- Turn movie genres into TF-IDF vectors.
- Compute cosine similarity between every pair of movies.
- Build a title-based recommendation function for content filtering.
- Test that function using `"Toy Story"`.
- Pivot the ratings table into a user-movie matrix.
- Apply truncated SVD with 20 latent components.
- Reconstruct the rating matrix from the learned latent factors.
- Compute RMSE between the original and reconstructed matrices.
```python
movies_data['genres'] = movies_data['genres'].replace('(no genres listed)', '')
```

This avoids passing placeholder text into the vectorizer when a movie has no genre metadata.
```python
tfidf = TfidfVectorizer(stop_words='english')
genre_matrix = tfidf.fit_transform(movies_data['genres'])
```

Each movie's genre string becomes a numeric vector. Movies that share genre terms end up closer in vector space.
Example idea:

- `Adventure|Animation|Children|Comedy|Fantasy`
- `Adventure|Fantasy|Comedy`

These should produce a relatively high similarity score because they overlap strongly.
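This intuition can be checked in isolation with the same sklearn tools the notebook imports (the two genre strings are the example above, not real notebook data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy genre strings; the default tokenizer splits on "|" as a non-word character
genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Fantasy|Comedy",
]

tfidf = TfidfVectorizer(stop_words='english')
vectors = tfidf.fit_transform(genres)

# Off-diagonal entry of the 2x2 similarity matrix: how alike the two movies are
score = cosine_similarity(vectors)[0, 1]
print(round(score, 3))
```

The three shared genre terms (Adventure, Comedy, Fantasy) push the score well above zero despite the extra genres in the first string.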
```python
similarity = cosine_similarity(genre_matrix)
```

This creates a square movie-to-movie similarity matrix where each value represents how similar two movies are according to their genre vectors.
```python
from difflib import get_close_matches

def recommend_movies(movie_name, num_recommendations=10):
    list_of_titles = movies_data['title'].tolist()
    close_matches = get_close_matches(movie_name, list_of_titles)
    if not close_matches:  # avoid an IndexError when no title is similar enough
        print(f"No close match found for '{movie_name}'.")
        return
    close_match = close_matches[0]
    index = movies_data[movies_data.title == close_match].index[0]
    similarity_scores = list(enumerate(similarity[index]))
    sorted_movies = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    print(f"\nMovies recommended for you (based on '{close_match}'):\n")
    for i, movie in enumerate(sorted_movies[1:num_recommendations + 1], start=1):
        print(i, ".", movies_data.iloc[movie[0]]['title'])
```

- It receives a movie title from the user.
- It uses `get_close_matches` to handle approximate title input (with a guard for the case where no title matches at all).
- It finds the matching movie row in the dataset.
- It reads that movie's similarity scores against all other movies.
- It sorts movies from most similar to least similar.
- It skips the first result because that is the same movie.
- It prints the top `n` similar titles.
```python
recommend_movies("Toy Story")
```

- It is easy to understand and fast to compute.
- It works even without user-specific history.
- It is useful for "more like this" recommendations.
This implementation only uses genres. It does not yet consider plot, cast, director, tags, or release-era similarity.
```python
user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)
```

This matrix has:
- rows = users
- columns = movies
- values = ratings
Missing ratings are filled with 0 so the matrix becomes fully numeric and can be decomposed.
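As a toy illustration (the mini ratings table below is invented, not the notebook's data), the pivot works like this:

```python
import pandas as pd

# Hypothetical mini ratings table: three users, three movies
ratings_data = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Same pivot as the notebook: users become rows, movies become columns
user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)

print(user_movie_matrix)  # 0 marks a movie the user never rated
```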
```python
svd = TruncatedSVD(n_components=20, random_state=42)
```

The model compresses the rating matrix into 20 latent factors. These factors are not named directly, but they represent hidden structure in user preference behavior.
```python
latent_matrix = svd.fit_transform(user_movie_matrix)
```

This transforms the original high-dimensional user vectors into a smaller latent space.
```python
reconstructed_matrix = np.dot(latent_matrix, svd.components_)
```

This step rebuilds an approximation of the original matrix from the learned latent factors. In a more complete recommender system, these reconstructed values can be used as predicted ratings.
```python
original = user_movie_matrix.values
rmse = np.sqrt(mean_squared_error(original, reconstructed_matrix))
rmse
```

This measures how closely the reconstructed matrix matches the original matrix.
The notebook prints a ranked list of recommended movie titles similar to the selected input title.
The notebook currently produces an RMSE value, which is useful for understanding how much information the latent-factor model preserved during reconstruction.
A few collaborative-filtering steps appear twice:
- building `user_movie_matrix`
- defining `svd`
- generating `latent_matrix`
- generating `reconstructed_matrix`
These duplicates do not change the final idea, but cleaning them would make the notebook easier to maintain.
The current RMSE is computed using the same matrix that was used to fit the SVD model. That means it is a reconstruction score, not a true evaluation on unseen data.
If we wanted a more realistic performance estimate, the next step would be:
- Split ratings into train and test sets
- Train only on the training portion
- Reconstruct predictions
- Evaluate only on held-out ratings
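One way to sketch that workflow (a minimal illustration with a random toy matrix, not the notebook's data: a subset of known ratings is masked before fitting, and RMSE is scored only on the masked entries):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Toy rating matrix standing in for user_movie_matrix.values (50 users, 40 movies)
ratings = rng.integers(1, 6, size=(50, 40)).astype(float)

# Hold out roughly 10% of the known ratings as a test set
mask = rng.random(ratings.shape) < 0.1
train = ratings.copy()
train[mask] = 0  # hide held-out ratings from the model

# Fit SVD only on the training portion
svd = TruncatedSVD(n_components=20, random_state=42)
latent = svd.fit_transform(train)
reconstructed = latent @ svd.components_

# Evaluate only on the held-out entries
test_rmse = np.sqrt(np.mean((ratings[mask] - reconstructed[mask]) ** 2))
print(round(test_rmse, 3))
```

Because the masked entries never influenced the fit, this RMSE is a (rough) estimate of predictive quality rather than a reconstruction score.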
Right now the notebook builds the collaborative model and evaluates it, but it does not yet define helpers such as:

- `recommend_for_user(user_id)`
- `top_unseen_movies(user_id, n=10)`
That would be the most natural next feature.
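A hypothetical sketch of such a helper (the name `recommend_for_user` and the toy data are illustrative; in the notebook, `user_movie_matrix` and `reconstructed_matrix` would already exist):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Toy stand-ins for the notebook's objects
ratings_data = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3],
    'movieId': [10, 20, 10, 30, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)

svd = TruncatedSVD(n_components=2, random_state=42)
latent_matrix = svd.fit_transform(user_movie_matrix)
reconstructed_matrix = latent_matrix @ svd.components_

def recommend_for_user(user_id, n=10):
    """Return up to n movieIds the user has not rated, ranked by predicted score."""
    row = user_movie_matrix.index.get_loc(user_id)
    predicted = pd.Series(reconstructed_matrix[row], index=user_movie_matrix.columns)
    already_rated = user_movie_matrix.loc[user_id] > 0
    # Exclude movies the user already rated, then rank by predicted rating
    return predicted[~already_rated].sort_values(ascending=False).head(n)

print(recommend_for_user(1))
```

The key design choice is filtering with `already_rated` before ranking, so the helper only surfaces unseen movies.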
If I were maintaining this repo as the project developer, I would describe the workflow like this:
Load the movies dataset and the ratings dataset, then shape each one for the recommendation method that needs it.
- Content filtering needs clean movie metadata.
- Collaborative filtering needs a dense user-item matrix.
Use the genres field as a lightweight content signal, vectorize it, and compare movies by cosine similarity. This gives us a straightforward movie-to-movie recommender.
Transform raw ratings into a matrix and apply SVD so we can capture hidden user-preference patterns. This lays the groundwork for personalized recommendations.
Reconstruct the matrix and compute RMSE to get a quick sense of how much rating structure was retained.
Once the concepts work, the next upgrades are clear:
- remove duplicate notebook cells
- make paths portable outside Colab
- add collaborative recommendation functions for a target user
- filter out already-rated movies
- evaluate on a held-out test set
- move the notebook logic into reusable Python functions or a small app
- It demonstrates two major recommendation strategies in one notebook.
- The code is short enough for beginners to follow.
- The content-based part already gives direct recommendation output.
- The collaborative part introduces matrix factorization in a practical way.
- Content filtering only uses genres.
- Collaborative filtering fills missing ratings with zero, which is simple but not ideal for all use cases.
- No proper train/test evaluation yet.
- No collaborative top-N recommendation function yet.
- Notebook structure can be cleaned up for reuse.
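On the zero-filling point, one common alternative (a sketch, not part of the notebook) is to subtract each user's mean rating before filling, so a missing entry reads as "neutral for this user" rather than "strongly disliked":

```python
import pandas as pd

# Hypothetical mini ratings table
ratings_data = pd.DataFrame({
    'userId':  [1, 1, 2, 2],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 2.0, 5.0, 3.0],
})
matrix = ratings_data.pivot(index='userId', columns='movieId', values='rating')

# Center each user's ratings around their own mean, then fill gaps with 0 ("average")
user_means = matrix.mean(axis=1)
centered = matrix.sub(user_means, axis=0).fillna(0)

print(centered)
```

The SVD would then factor `centered`, and predictions would add `user_means` back per row.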
- Place `movies.csv` and `ratings.csv` where the notebook can read them.
- Open `Movie_Recommendation_(content_and_collaborative_filtering).ipynb` in Jupyter or Colab.
- Update the CSV paths if you are not using Colab.
- Run the notebook from top to bottom.
Here are the most valuable next upgrades for this repo:
- Add a function that recommends top unseen movies for a specific `userId` using the reconstructed matrix.
- Exclude movies a user has already rated from collaborative recommendations.
- Replace the simple training RMSE with evaluation on held-out ratings.
- Add richer content features such as tags, overview text, cast, or director.
- Refactor the notebook into reusable `.py` modules for easier experimentation.
This repository is a compact movie recommendation project that shows both sides of recommender systems:
- content-based recommendations from item metadata
- collaborative filtering from user behavior
The notebook already demonstrates the core ideas clearly. The content-based section is usable as a simple similarity engine, and the collaborative section provides the mathematical foundation for personalized recommendations. With a few next-step improvements, this can grow from a learning notebook into a more complete recommendation project.