
MusfirahAther/movie-Recommendation-p2


Movie Recommendation System

This repository contains a Jupyter notebook that implements two classic recommendation approaches in one place:

  1. Content-based filtering using movie genres
  2. Collaborative filtering using matrix factorization on user ratings

The project is intentionally small and educational, but the workflow is real: load movie data, represent similarity numerically, generate recommendations, then learn hidden preference structure from user behavior.

Repository Structure

  • Movie_Recommendation_(content_and_collaborative_filtering).ipynb: main notebook with the complete pipeline
  • README.md: project documentation and workflow explanation

What This Project Does

The notebook answers two different recommendation questions.

1. Content-based filtering

This part asks:

If a user likes one movie, which other movies look most similar based on their genres?

It uses the genres column from the movie dataset, converts those genre strings into TF-IDF vectors, and compares movies with cosine similarity.

2. Collaborative filtering

This part asks:

Given how many users have rated many movies, what hidden preference patterns can we learn from the rating matrix?

It builds a user-movie matrix, reduces it with TruncatedSVD, reconstructs the ratings, and measures reconstruction quality with RMSE.

Tech Stack

The notebook imports:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error

Why these libraries are used

  • numpy: matrix math and reconstruction
  • pandas: loading CSV files and reshaping the ratings table
  • TfidfVectorizer: converting genre text into numeric feature vectors
  • cosine_similarity: measuring movie-to-movie similarity
  • TruncatedSVD: learning latent factors from the user-movie matrix
  • mean_squared_error: computing RMSE for reconstruction quality

Dataset Inputs

The notebook expects two CSV files:

  • movies.csv
  • ratings.csv

In the notebook they are loaded with:

movies_data = pd.read_csv('/content/movies.csv')
ratings_data = pd.read_csv('/content/ratings.csv')

These paths are written for Google Colab. If you run the notebook locally, change them to local paths such as:

movies_data = pd.read_csv('movies.csv')
ratings_data = pd.read_csv('ratings.csv')
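One way to make the paths portable is a small lookup helper. This is a hedged sketch, not code from the notebook; the `data/` folder and the `find_csv` name are assumptions for illustration:

```python
from pathlib import Path

def find_csv(name):
    """Return the first existing location for a CSV file, checking a
    local data/ folder, the working directory, then Colab's /content."""
    for candidate in (Path('data') / name, Path(name), Path('/content') / name):
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"Could not locate {name}")
```

With this, `pd.read_csv(find_csv('movies.csv'))` works unchanged in both Colab and a local checkout.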

End-to-End Workflow

This is the notebook flow in execution order.

  1. Import the required Python libraries.
  2. Load movie metadata and ratings data from CSV files.
  3. Clean the genres column by replacing (no genres listed) with an empty string.
  4. Turn movie genres into TF-IDF vectors.
  5. Compute cosine similarity between every pair of movies.
  6. Build a title-based recommendation function for content filtering.
  7. Test that function using "Toy Story".
  8. Pivot the ratings table into a user-movie matrix.
  9. Apply truncated SVD with 20 latent components.
  10. Reconstruct the rating matrix from the learned latent factors.
  11. Compute RMSE between the original and reconstructed matrices.

Content-Based Filtering

Step 1: Clean the genre values

movies_data['genres'] = movies_data['genres'].replace('(no genres listed)', '')

This avoids passing placeholder text into the vectorizer when a movie has no genre metadata.

Step 2: Vectorize the genres with TF-IDF

tfidf = TfidfVectorizer(stop_words='english')
genre_matrix = tfidf.fit_transform(movies_data['genres'])

Each movie's genre string becomes a numeric vector. Movies that share genre terms will end up closer in vector space.

Example idea:

  • Adventure|Animation|Children|Comedy|Fantasy
  • Adventure|Fantasy|Comedy

These should produce a relatively high similarity score because they overlap strongly.
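That claim can be checked in isolation with the two genre strings above (a standalone snippet, separate from the notebook's full pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The two example genre strings from above
genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Fantasy|Comedy",
]

tfidf = TfidfVectorizer(stop_words='english')
vectors = tfidf.fit_transform(genres)  # tokenizer splits on the "|" separators

score = cosine_similarity(vectors)[0, 1]
print(round(score, 2))  # well above 0: the three shared genres dominate
```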

Step 3: Compute pairwise cosine similarity

similarity = cosine_similarity(genre_matrix)

This creates a square movie-to-movie similarity matrix where each value represents how similar two movies are according to their genre vectors.

Step 4: Generate recommendations from a movie title

from difflib import get_close_matches

def recommend_movies(movie_name, num_recommendations=10):
    list_of_titles = movies_data['title'].tolist()

    # Guard against input with no close match, which would otherwise
    # raise an IndexError on an empty list
    matches = get_close_matches(movie_name, list_of_titles)
    if not matches:
        print(f"No close match found for '{movie_name}'.")
        return
    close_match = matches[0]

    index = movies_data[movies_data.title == close_match].index[0]
    similarity_scores = list(enumerate(similarity[index]))
    sorted_movies = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    print(f"\nMovies recommended for you (based on '{close_match}'):\n")

    # Skip the first entry: a movie is always most similar to itself
    for i, (movie_index, _) in enumerate(sorted_movies[1:num_recommendations + 1], start=1):
        print(f"{i}. {movies_data.iloc[movie_index]['title']}")

How the recommendation function works

  1. It receives a movie title from the user.
  2. It uses get_close_matches to handle approximate title input.
  3. It finds the matching movie row in the dataset.
  4. It reads that movie's similarity scores against all other movies.
  5. It sorts movies from most similar to least similar.
  6. It skips the first result because that is the same movie.
  7. It prints the top n similar titles.

Example call

recommend_movies("Toy Story")

Why this part works well

  • It is easy to understand and fast to compute.
  • It works even without user-specific history.
  • It is useful for "more like this" recommendations.

Current limitation

This implementation only uses genres. It does not yet consider plot, cast, director, tags, or release-era similarity.

Collaborative Filtering

Step 1: Build the user-movie rating matrix

user_movie_matrix = ratings_data.pivot(
    index='userId', columns='movieId', values='rating'
).fillna(0)

This matrix has:

  • rows = users
  • columns = movies
  • values = ratings

Missing ratings are filled with 0 so the matrix becomes fully numeric and can be decomposed.
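A toy pivot with made-up ids (not the real dataset) shows the reshaping:

```python
import pandas as pd

# Five ratings in long form: one row per (user, movie, rating)
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Pivot to wide form: users as rows, movies as columns, 0 for unrated
matrix = toy.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print(matrix)  # a 3x3 matrix; e.g. user 3 has 0.0 for movies 10 and 30
```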

Step 2: Create the SVD model

svd = TruncatedSVD(n_components=20, random_state=42)

The model compresses the rating matrix into 20 latent factors. These factors are not named directly, but they represent hidden structure in user preference behavior.

Step 3: Learn the latent user representation

latent_matrix = svd.fit_transform(user_movie_matrix)

This transforms the original high-dimensional user vectors into a smaller latent space.
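As a standalone illustration (random stand-in data, not the actual ratings), the transform shrinks each user row to 20 numbers, and `explained_variance_ratio_` gives a quick check of how much structure the compression kept:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
# Stand-in for a user-movie matrix: 50 users x 200 movies, ~10% "rated"
demo = rng.random((50, 200)) * (rng.random((50, 200)) < 0.1)

demo_svd = TruncatedSVD(n_components=20, random_state=42)
demo_latent = demo_svd.fit_transform(demo)

print(demo_latent.shape)                         # (50, 20): one 20-dim vector per user
print(demo_svd.explained_variance_ratio_.sum())  # fraction of variance retained
```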

Step 4: Reconstruct the approximate rating matrix

reconstructed_matrix = np.dot(latent_matrix, svd.components_)

This step rebuilds an approximation of the original matrix from the learned latent factors. In a more complete recommender system, these reconstructed values can be used as predicted ratings.

Step 5: Evaluate with RMSE

original = user_movie_matrix.values
rmse = np.sqrt(mean_squared_error(original, reconstructed_matrix))
rmse

This measures how closely the reconstructed matrix matches the original matrix.
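Because unrated cells were filled with 0, this RMSE is heavily influenced by the zeros. A variant worth knowing, sketched here on a tiny made-up matrix (not the notebook's code), scores only the observed ratings:

```python
import numpy as np

# Hypothetical 2x2 original and reconstructed matrices
original = np.array([[4.0, 0.0],
                     [0.0, 3.0]])
reconstructed = np.array([[3.8, 0.5],
                          [0.2, 2.7]])

# RMSE over observed (nonzero) ratings only, ignoring the zero fill
mask = original > 0
rmse_observed = np.sqrt(np.mean((original[mask] - reconstructed[mask]) ** 2))
print(round(rmse_observed, 3))  # 0.255: errors of 0.2 and 0.3 on the two ratings
```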

How to Read the Results

Content-based output

The notebook prints a ranked list of recommended movie titles similar to the selected input title.

Collaborative output

The notebook currently produces an RMSE value, which is useful for understanding how much information the latent-factor model preserved during reconstruction.

Important Implementation Notes

The notebook contains duplicate cells

A few collaborative-filtering steps appear twice:

  • building user_movie_matrix
  • defining svd
  • generating latent_matrix
  • generating reconstructed_matrix

These duplicates do not change the final results, but removing them would make the notebook easier to maintain.

RMSE is calculated on training data

The current RMSE is computed using the same matrix that was used to fit the SVD model. That means it is a reconstruction score, not a true evaluation on unseen data.

If we wanted a more realistic performance estimate, the next step would be:

  1. Split ratings into train and test sets
  2. Train only on the training portion
  3. Reconstruct predictions
  4. Evaluate only on held-out ratings
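Those four steps could be sketched like this, on entirely synthetic data; the cell-masking strategy shown is one common choice, not code from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic ratings in long form (hypothetical ids and values)
ratings = pd.DataFrame({
    'userId':  rng.integers(1, 30, 500),
    'movieId': rng.integers(1, 60, 500),
    'rating':  rng.integers(1, 6, 500).astype(float),
}).drop_duplicates(['userId', 'movieId'])

matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
values = matrix.to_numpy()

# 1-2. Hold out ~20% of observed cells and train only on the rest
observed = np.argwhere(values > 0)
test_idx = observed[rng.random(len(observed)) < 0.2]
train = values.copy()
train[test_idx[:, 0], test_idx[:, 1]] = 0  # hide held-out ratings from training

# 3. Reconstruct predictions from the latent factors
svd = TruncatedSVD(n_components=20, random_state=42)
pred = svd.fit_transform(train) @ svd.components_

# 4. Evaluate only on the held-out cells
err = values[test_idx[:, 0], test_idx[:, 1]] - pred[test_idx[:, 0], test_idx[:, 1]]
rmse = np.sqrt(np.mean(err ** 2))
print(rmse)  # error on ratings the model never saw during fitting
```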

Collaborative filtering is not yet exposed as a recommendation function

Right now the notebook builds the collaborative model and evaluates it, but it does not yet define a helper such as:

  • recommend_for_user(user_id)
  • top_unseen_movies(user_id, n=10)

That would be the most natural next feature.
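A minimal sketch of such a helper follows. The name and signature are hypothetical; it assumes the notebook's `user_movie_matrix` (a pandas DataFrame) and the reconstructed array, and it also handles the "filter out already-rated movies" step:

```python
import numpy as np
import pandas as pd

def recommend_for_user(user_id, user_movie_matrix, reconstructed_matrix, n=10):
    """Hypothetical helper: rank movies the user has NOT rated yet
    by their reconstructed (predicted) rating."""
    row = user_movie_matrix.index.get_loc(user_id)
    seen = user_movie_matrix.values[row] > 0
    scores = np.array(reconstructed_matrix[row], dtype=float)
    scores[seen] = -np.inf                 # exclude already-rated movies
    top = np.argsort(scores)[::-1][:n]     # highest predicted ratings first
    return user_movie_matrix.columns[top].tolist()

# Toy demo with made-up ids: 2 users x 3 movies
toy_matrix = pd.DataFrame([[4.0, 0.0, 0.0],
                           [0.0, 3.0, 0.0]],
                          index=[1, 2], columns=[10, 20, 30])
toy_pred = np.array([[3.9, 2.0, 1.0],
                     [0.5, 2.9, 4.0]])
out = recommend_for_user(1, toy_matrix, toy_pred, n=2)
print(out)  # movie 10 is excluded because user 1 already rated it
```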

Developer Workflow Explanation

If I were maintaining this repo as the project developer, I would describe the workflow like this:

Phase 1: Prepare the data

Load the movies dataset and the ratings dataset, then shape each one for the recommendation method that needs it.

  • Content filtering needs clean movie metadata.
  • Collaborative filtering needs a dense user-item matrix.

Phase 2: Build a metadata-driven recommender

Use the genres field as a lightweight content signal, vectorize it, and compare movies by cosine similarity. This gives us a straightforward movie-to-movie recommender.

Phase 3: Build a behavior-driven recommender core

Transform raw ratings into a matrix and apply SVD so we can capture hidden user-preference patterns. This lays the groundwork for personalized recommendations.

Phase 4: Evaluate what the latent model learned

Reconstruct the matrix and compute RMSE to get a quick sense of how much rating structure was retained.

Phase 5: Improve from prototype to product

Once the concepts work, the next upgrades are clear:

  1. remove duplicate notebook cells
  2. make paths portable outside Colab
  3. add collaborative recommendation functions for a target user
  4. filter out already-rated movies
  5. evaluate on a held-out test set
  6. move the notebook logic into reusable Python functions or a small app

Strengths of This Project

  • It demonstrates two major recommendation strategies in one notebook.
  • The code is short enough for beginners to follow.
  • The content-based part already gives direct recommendation output.
  • The collaborative part introduces matrix factorization in a practical way.

Limitations

  • Content filtering only uses genres.
  • Collaborative filtering fills missing ratings with zero, which is simple but not ideal for all use cases.
  • No proper train/test evaluation yet.
  • No collaborative top-N recommendation function yet.
  • Notebook structure can be cleaned up for reuse.
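On the zero-fill point, one common refinement (sketched here on a made-up 2x3 matrix, not code from the notebook) is to subtract each user's mean observed rating before factorizing, so unrated cells stop reading as strong dislikes:

```python
import numpy as np

# Hypothetical ratings: 0.0 marks an unrated cell
ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 2.0]])

mask = ratings > 0
# Per-user mean over observed ratings only (guard against all-zero rows)
user_means = np.where(mask.any(axis=1),
                      ratings.sum(axis=1) / np.maximum(mask.sum(axis=1), 1),
                      0.0)
# Center observed cells; keep unrated cells at 0 (now meaning "no signal")
centered = np.where(mask, ratings - user_means[:, None], 0.0)
print(centered)  # row 0: [1, -1, 0]; row 1: [1, 0, -1]
```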

How to Run

  1. Place movies.csv and ratings.csv where the notebook can read them.
  2. Open Movie_Recommendation_(content_and_collaborative_filtering).ipynb in Jupyter or Colab.
  3. Update the CSV paths if you are not using Colab.
  4. Run the notebook from top to bottom.

Future Improvements

Here are the most valuable next upgrades for this repo:

  1. Add a function that recommends top unseen movies for a specific userId using the reconstructed matrix.
  2. Exclude movies a user has already rated from collaborative recommendations.
  3. Replace the simple training RMSE with evaluation on held-out ratings.
  4. Add richer content features such as tags, overview text, cast, or director.
  5. Refactor the notebook into reusable .py modules for easier experimentation.

Summary

This repository is a compact movie recommendation project that shows both sides of recommender systems:

  • content-based recommendations from item metadata
  • collaborative filtering from user behavior

The notebook already demonstrates the core ideas clearly. The content-based section is usable as a simple similarity engine, and the collaborative section provides the mathematical foundation for personalized recommendations. With a few next-step improvements, this can grow from a learning notebook into a more complete recommendation project.

About

Built a movie recommendation system using content-based filtering with TF-IDF and cosine similarity, and collaborative filtering with SVD-based matrix factorization on user ratings.
