This repository contains a Jupyter notebook that implements two classic recommendation approaches in one place:
- Content-based filtering using movie genres
- Collaborative filtering using matrix factorization on user ratings
The project is intentionally small and educational, but the workflow is real: load movie data, represent similarity numerically, generate recommendations, then learn hidden preference structure from user behavior.
- `Movie_Recommendation_(content_and_collaborative_filtering).ipynb`: main notebook with the complete pipeline
- `README.md`: project documentation and workflow explanation
The notebook answers two different recommendation questions.
This part asks:
If a user likes one movie, which other movies look most similar based on their genres?
It uses the genres column from the movie dataset, converts those genre strings into TF-IDF vectors, and compares movies with cosine similarity.
This part asks:
Given ratings from many users across many movies, what hidden preference patterns can we learn from the rating matrix?
It builds a user-movie matrix, reduces it with TruncatedSVD, reconstructs the ratings, and measures reconstruction quality with RMSE.
The notebook imports:
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
```

- `numpy`: matrix math and reconstruction
- `pandas`: loading CSV files and reshaping the ratings table
- `TfidfVectorizer`: converting genre text into numeric feature vectors
- `cosine_similarity`: measuring movie-to-movie similarity
- `TruncatedSVD`: learning latent factors from the user-movie matrix
- `mean_squared_error`: computing RMSE for reconstruction quality
The notebook expects two CSV files:
- `movies.csv`
- `ratings.csv`
In the notebook they are loaded with:
```python
movies_data = pd.read_csv('/content/movies.csv')
ratings_data = pd.read_csv('/content/ratings.csv')
```

These paths are written for Google Colab. If you run the notebook locally, change them to local paths such as:

```python
movies_data = pd.read_csv('movies.csv')
ratings_data = pd.read_csv('ratings.csv')
```

This is the notebook flow in execution order.
- Import the required Python libraries.
- Load movie metadata and ratings data from CSV files.
- Clean the `genres` column by replacing `(no genres listed)` with an empty string.
- Turn movie genres into TF-IDF vectors.
- Compute cosine similarity between every pair of movies.
- Build a title-based recommendation function for content filtering.
- Test that function using `"Toy Story"`.
- Pivot the ratings table into a user-movie matrix.
- Apply truncated SVD with 20 latent components.
- Reconstruct the rating matrix from the learned latent factors.
- Compute RMSE between the original and reconstructed matrices.
```python
movies_data['genres'] = movies_data['genres'].replace('(no genres listed)', '')
```

This avoids passing placeholder text into the vectorizer when a movie has no genre metadata.
```python
tfidf = TfidfVectorizer(stop_words='english')
genre_matrix = tfidf.fit_transform(movies_data['genres'])
```

Each movie's genre string becomes a numeric vector. Movies that share genre terms end up closer in vector space.
Example idea:

- `Adventure|Animation|Children|Comedy|Fantasy`
- `Adventure|Fantasy|Comedy`

These should produce a relatively high similarity score because they overlap strongly.
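This intuition can be checked in isolation with the same sklearn tools the notebook imports (the two genre strings are the example above, not real notebook data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy genre strings; the default tokenizer splits on "|" as a non-word character
genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Fantasy|Comedy",
]

tfidf = TfidfVectorizer(stop_words='english')
vectors = tfidf.fit_transform(genres)

# Off-diagonal entry of the 2x2 similarity matrix: how alike the two movies are
score = cosine_similarity(vectors)[0, 1]
print(round(score, 3))
```

The three shared genre terms (Adventure, Comedy, Fantasy) push the score well above zero despite the extra genres in the first string.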
```python
similarity = cosine_similarity(genre_matrix)
```

This creates a square movie-to-movie similarity matrix where each value represents how similar two movies are according to their genre vectors.
```python
from difflib import get_close_matches

def recommend_movies(movie_name, num_recommendations=10):
    list_of_titles = movies_data['title'].tolist()
    close_matches = get_close_matches(movie_name, list_of_titles)
    if not close_matches:  # avoid an IndexError when no title is similar enough
        print(f"No close match found for '{movie_name}'.")
        return
    close_match = close_matches[0]
    index = movies_data[movies_data.title == close_match].index[0]
    similarity_scores = list(enumerate(similarity[index]))
    sorted_movies = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    print(f"\nMovies recommended for you (based on '{close_match}'):\n")
    for i, movie in enumerate(sorted_movies[1:num_recommendations + 1], start=1):
        print(i, ".", movies_data.iloc[movie[0]]['title'])
```

- It receives a movie title from the user.
- It uses `get_close_matches` to handle approximate title input (with a guard for the case where no title matches at all).
- It finds the matching movie row in the dataset.
- It reads that movie's similarity scores against all other movies.
- It sorts movies from most similar to least similar.
- It skips the first result because that is the same movie.
- It prints the top `n` similar titles.
```python
recommend_movies("Toy Story")
```

- It is easy to understand and fast to compute.
- It works even without user-specific history.
- It is useful for "more like this" recommendations.
This implementation only uses genres. It does not yet consider plot, cast, director, tags, or release-era similarity.
```python
user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)
```

This matrix has:
- rows = users
- columns = movies
- values = ratings
Missing ratings are filled with 0 so the matrix becomes fully numeric and can be decomposed.
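As a toy illustration (the mini ratings table below is invented, not the notebook's data), the pivot works like this:

```python
import pandas as pd

# Hypothetical mini ratings table: three users, three movies
ratings_data = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Same pivot as the notebook: users become rows, movies become columns
user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)

print(user_movie_matrix)  # 0 marks a movie the user never rated
```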
```python
svd = TruncatedSVD(n_components=20, random_state=42)
```

The model compresses the rating matrix into 20 latent factors. These factors are not named directly, but they represent hidden structure in user preference behavior.
```python
latent_matrix = svd.fit_transform(user_movie_matrix)
```

This transforms the original high-dimensional user vectors into a smaller latent space.
```python
reconstructed_matrix = np.dot(latent_matrix, svd.components_)
```

This step rebuilds an approximation of the original matrix from the learned latent factors. In a more complete recommender system, these reconstructed values can be used as predicted ratings.
```python
original = user_movie_matrix.values
rmse = np.sqrt(mean_squared_error(original, reconstructed_matrix))
rmse
```

This measures how closely the reconstructed matrix matches the original matrix.
The notebook prints a ranked list of recommended movie titles similar to the selected input title.
The notebook currently produces an RMSE value, which is useful for understanding how much information the latent-factor model preserved during reconstruction.
A few collaborative-filtering steps appear twice:
- building `user_movie_matrix`
- defining `svd`
- generating `latent_matrix`
- generating `reconstructed_matrix`
These duplicates do not change the final idea, but cleaning them would make the notebook easier to maintain.
The current RMSE is computed using the same matrix that was used to fit the SVD model. That means it is a reconstruction score, not a true evaluation on unseen data.
If we wanted a more realistic performance estimate, the next step would be:
- Split ratings into train and test sets
- Train only on the training portion
- Reconstruct predictions
- Evaluate only on held-out ratings
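One way to sketch that workflow (a minimal illustration with a random toy matrix, not the notebook's data: a subset of known ratings is masked before fitting, and RMSE is scored only on the masked entries):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Toy rating matrix standing in for user_movie_matrix.values (50 users, 40 movies)
ratings = rng.integers(1, 6, size=(50, 40)).astype(float)

# Hold out roughly 10% of the known ratings as a test set
mask = rng.random(ratings.shape) < 0.1
train = ratings.copy()
train[mask] = 0  # hide held-out ratings from the model

# Fit SVD only on the training portion
svd = TruncatedSVD(n_components=20, random_state=42)
latent = svd.fit_transform(train)
reconstructed = latent @ svd.components_

# Evaluate only on the held-out entries
test_rmse = np.sqrt(np.mean((ratings[mask] - reconstructed[mask]) ** 2))
print(round(test_rmse, 3))
```

Because the masked entries never influenced the fit, this RMSE is a (rough) estimate of predictive quality rather than a reconstruction score.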
Right now the notebook builds the collaborative model and evaluates it, but it does not yet define helpers such as:

- `recommend_for_user(user_id)`
- `top_unseen_movies(user_id, n=10)`
That would be the most natural next feature.
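A hypothetical sketch of such a helper (the name `recommend_for_user` and the toy data are illustrative; in the notebook, `user_movie_matrix` and `reconstructed_matrix` would already exist):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Toy stand-ins for the notebook's objects
ratings_data = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3],
    'movieId': [10, 20, 10, 30, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)

svd = TruncatedSVD(n_components=2, random_state=42)
latent_matrix = svd.fit_transform(user_movie_matrix)
reconstructed_matrix = latent_matrix @ svd.components_

def recommend_for_user(user_id, n=10):
    """Return up to n movieIds the user has not rated, ranked by predicted score."""
    row = user_movie_matrix.index.get_loc(user_id)
    predicted = pd.Series(reconstructed_matrix[row], index=user_movie_matrix.columns)
    already_rated = user_movie_matrix.loc[user_id] > 0
    # Exclude movies the user already rated, then rank by predicted rating
    return predicted[~already_rated].sort_values(ascending=False).head(n)

print(recommend_for_user(1))
```

The key design choice is filtering with `already_rated` before ranking, so the helper only surfaces unseen movies.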
If I were maintaining this repo as the project developer, I would describe the workflow like this:
Load the movies dataset and the ratings dataset, then shape each one for the recommendation method that needs it.
- Content filtering needs clean movie metadata.
- Collaborative filtering needs a dense user-item matrix.
Use the genres field as a lightweight content signal, vectorize it, and compare movies by cosine similarity. This gives us a straightforward movie-to-movie recommender.
Transform raw ratings into a matrix and apply SVD so we can capture hidden user-preference patterns. This lays the groundwork for personalized recommendations.
Reconstruct the matrix and compute RMSE to get a quick sense of how much rating structure was retained.
Once the concepts work, the next upgrades are clear:
- remove duplicate notebook cells
- make paths portable outside Colab
- add collaborative recommendation functions for a target user
- filter out already-rated movies
- evaluate on a held-out test set
- move the notebook logic into reusable Python functions or a small app
- It demonstrates two major recommendation strategies in one notebook.
- The code is short enough for beginners to follow.
- The content-based part already gives direct recommendation output.
- The collaborative part introduces matrix factorization in a practical way.
- Content filtering only uses genres.
- Collaborative filtering fills missing ratings with zero, which is simple but not ideal for all use cases.
- No proper train/test evaluation yet.
- No collaborative top-N recommendation function yet.
- Notebook structure can be cleaned up for reuse.
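On the zero-filling point, one common alternative (a sketch, not part of the notebook) is to subtract each user's mean rating before filling, so a missing entry reads as "neutral for this user" rather than "strongly disliked":

```python
import pandas as pd

# Hypothetical mini ratings table
ratings_data = pd.DataFrame({
    'userId':  [1, 1, 2, 2],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 2.0, 5.0, 3.0],
})
matrix = ratings_data.pivot(index='userId', columns='movieId', values='rating')

# Center each user's ratings around their own mean, then fill gaps with 0 ("average")
user_means = matrix.mean(axis=1)
centered = matrix.sub(user_means, axis=0).fillna(0)

print(centered)
```

The SVD would then factor `centered`, and predictions would add `user_means` back per row.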
- Place `movies.csv` and `ratings.csv` where the notebook can read them.
- Open `Movie_Recommendation_(content_and_collaborative_filtering).ipynb` in Jupyter or Colab.
- Update the CSV paths if you are not using Colab.
- Run the notebook from top to bottom.
Here are the most valuable next upgrades for this repo:
- Add a function that recommends top unseen movies for a specific `userId` using the reconstructed matrix.
- Exclude movies a user has already rated from collaborative recommendations.
- Replace the simple training RMSE with evaluation on held-out ratings.
- Add richer content features such as tags, overview text, cast, or director.
- Refactor the notebook into reusable `.py` modules for easier experimentation.
This repository is a compact movie recommendation project that shows both sides of recommender systems:
- content-based recommendations from item metadata
- collaborative filtering from user behavior
The notebook already demonstrates the core ideas clearly. The content-based section is usable as a simple similarity engine, and the collaborative section provides the mathematical foundation for personalized recommendations. With a few next-step improvements, this can grow from a learning notebook into a more complete recommendation project.