PySpark - Recommender System

Description: ALS-based movie recommendation system with item-item collaborative filtering.
Source code link: https://gitfront.io/r/ompranavagrawal/iyoZbxC7nE6z/PySpark-Recommender-System/

Introduction

This Markdown document summarizes the implementation and findings of a movie recommender system built with PySpark. The system uses Apache Spark to predict movie ratings on the MovieLens 25M dataset. It takes a hybrid approach that combines ALS (Alternating Least Squares), item-item collaborative filtering, and a supervised learning model to improve recommendation accuracy and relevance.

Objective

The project aims to design a recommender system that predicts movie ratings by constructing a utility matrix and applying Alternating Least Squares (ALS). It evaluates the system using metrics such as RMSE (Root Mean Square Error), MSE (Mean Squared Error), and MAP (Mean Average Precision), and explores a hybrid model combining ALS, item-item collaborative filtering, and a supervised learning component.
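To make the ALS idea concrete, here is a minimal pure-Python sketch on a toy utility matrix. It is not the project's Spark MLlib code: it assumes a single latent factor (rank 1) so that each alternating update has a closed form, and the names `als_rank1` and `regularized_loss`, the toy ratings, and the parameter values are all illustrative assumptions.

```python
import random


def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50, seed=0):
    """Fit rating(user, item) ~ u[user] * v[item] on the observed cells only.

    ratings maps (user_index, item_index) -> rating. Rank 1 is an assumption
    made here to keep each least-squares update a one-line formula.
    """
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]
    for _ in range(iterations):
        # Hold item factors fixed; solve each user's regularized least squares.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = reg + sum(v[j] ** 2 for (a, j) in ratings if a == i)
            u[i] = num / den
        # Hold user factors fixed; solve each item's regularized least squares.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = reg + sum(u[i] ** 2 for (i, b) in ratings if b == j)
            v[j] = num / den
    return u, v


def regularized_loss(ratings, u, v, reg):
    """Squared error on observed cells plus an L2 penalty on both factor sets."""
    fit = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return fit + reg * (sum(x * x for x in u) + sum(x * x for x in v))


# Toy utility matrix (made-up values); the cell (2, 2) is unobserved.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 2.5,
    (1, 0): 1.5, (1, 1): 3.0, (1, 2): 3.75,
    (2, 0): 2.0, (2, 1): 4.0,
}
u, v = als_rank1(ratings, n_users=3, n_items=3)
predicted = u[2] * v[2]  # estimated rating for the unobserved cell
```

Each sweep minimizes the regularized loss exactly for one set of factors while the other is held fixed, so the loss never increases from one sweep to the next. A full implementation uses a higher rank and solves a small linear system per user and per item.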

Data Description

The project uses the MovieLens 25M dataset, which contains 25 million ratings of 62,000 movies by 162,000 users. This dataset serves as the benchmark for assessing the system's performance.

Methodology Overview

Data Preprocessing: Normalizes ratings by subtracting mean ratings, which adjusts for differences in average rating level across movies.
ALS Implementation: Configures the ALS algorithm to predict ratings, with hyperparameters tuned through cross-validation.
Item-Item Collaborative Filtering: Enhances prediction by calculating cosine similarity between movie pairs, leveraging user ratings.
Supervised Learning Integration: Incorporates movie and user features using a RandomForestRegressor, aiming to improve prediction accuracy.
Hybrid Model Formation: Combines predictions from ALS, item-item CF, and supervised models, adjusting their weights to optimize performance.
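
The preprocessing and item-item steps above can be sketched in plain Python as follows. This is a small illustration rather than the project's Spark code. It assumes the mean is taken per movie, that unrated cells count as zero in the cosine, and that only positively similar movies contribute to a prediction. All function names and the toy ratings are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def mean_center_by_movie(ratings):
    """ratings: {(user, movie): rating} -> (centered ratings, per-movie means)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for (_, movie), r in ratings.items():
        totals[movie] += r
        counts[movie] += 1
    means = {m: totals[m] / counts[m] for m in totals}
    centered = {(u, m): r - means[m] for (u, m), r in ratings.items()}
    return centered, means


def cosine_similarity(centered, movie_a, movie_b):
    """Cosine of two movies' centered rating vectors; unrated cells count as 0."""
    a = {u: r for (u, m), r in centered.items() if m == movie_a}
    b = {u: r for (u, m), r in centered.items() if m == movie_b}
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_item_item(ratings, user, movie):
    """Movie mean plus a similarity-weighted average of the user's deviations."""
    centered, means = mean_center_by_movie(ratings)
    num = den = 0.0
    for (u, other), deviation in centered.items():
        if u != user or other == movie:
            continue
        sim = cosine_similarity(centered, movie, other)
        if sim > 0:  # assumption: ignore dissimilar movies
            num += sim * deviation
            den += sim
    return means.get(movie, 0.0) + (num / den if den else 0.0)


# Made-up ratings; user "u3" has not rated movie "B".
ratings = {
    ("u1", "A"): 5.0, ("u2", "A"): 1.0, ("u3", "A"): 4.0,
    ("u1", "B"): 4.0, ("u2", "B"): 2.0,
    ("u1", "C"): 1.0, ("u2", "C"): 5.0, ("u3", "C"): 2.0,
}
estimate = predict_item_item(ratings, "u3", "B")
```

In this toy data, movie A is rated much like B and movie C is rated the opposite way, so only A contributes. The estimate for B is B's mean shifted by how far "u3" rated A above A's mean.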

Evaluation and Results

The system is evaluated using RMSE, MSE, and MAP. Different weight distributions across the ALS, item-item CF, and supervised components are compared to tune the hybrid model, and the results support the effectiveness of the hybrid approach.
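
One way such a weight search could look is sketched below. This is not the project's actual procedure. It assumes non-negative weights that sum to 1, a coarse grid, and RMSE on a held-out set as the selection criterion. The helper names and every number in the example are made up for illustration and are not results from the project.

```python
from itertools import product
from math import sqrt


def mse(predicted, actual):
    """Mean squared error between two equal-length lists of ratings."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error: the square root of the MSE."""
    return sqrt(mse(predicted, actual))


def blend(weights, *model_predictions):
    """Weighted sum of the models' predictions, one blended value per example."""
    return [sum(w * p for w, p in zip(weights, row))
            for row in zip(*model_predictions)]


def search_weights(actual, als, item_cf, supervised, step=0.1):
    """Try every weight triple on the grid and keep the lowest held-out RMSE."""
    steps = round(1 / step)
    best = None
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue  # the third weight would be negative
        weights = (i / steps, j / steps, (steps - i - j) / steps)
        score = rmse(blend(weights, als, item_cf, supervised), actual)
        if best is None or score < best[1]:
            best = (weights, score)
    return best


# Hypothetical held-out ratings and per-model predictions (illustrative only).
actual = [4.0, 3.0, 5.0, 2.0, 4.5]
als = [3.8, 3.4, 4.6, 2.5, 4.0]
item_cf = [4.3, 2.6, 4.9, 1.6, 4.8]
supervised = [3.5, 3.0, 4.2, 2.9, 4.1]
weights, score = search_weights(actual, als, item_cf, supervised)
```

Because the grid includes the three corner cases where a single model gets all the weight, the selected blend can never score worse than the best individual model on the data used for the search. The weights should be chosen on a validation split and reported on a separate test split to avoid an optimistic estimate.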

[Figure: Parameter Scores vs. Weights]

[Figure: Model Comparison]

Conclusion

The report concludes that the hybrid recommender system, built on Spark's MLlib, successfully predicts movie ratings on the MovieLens dataset. Combining ALS, item-item collaborative filtering, and supervised learning improves prediction accuracy and recommendation quality.
For the detailed implementation, including code and specific model evaluations, refer to the original Python code at the source code link above.