PySpark - Topic Modeling | Pranav Agrawal

Title: PySpark - Topic Modeling
Description: Incorporating Topic Modeling using LDA to enhance tweet sentiment classification accuracy.
Source code link: https://gitfront.io/r/ompranavagrawal/y6MLKJhzSJRn/PySpark-Topic-Modeling/

Objective

The goal is to assess the impact of Topic Modeling on the accuracy of tweet sentiment classification. This involves comparing sentiment classification performance using TFIDF vectors with and without LDA-derived features, employing logistic regression and cross-validation in PySpark, and visualizing the discovered topics with pyLDAvis.

Data Description

The dataset comprises approximately 1.727 million tweets from the "US Election 2020 Tweets" dataset on Kaggle. It is used to analyze how Twitter sentiment correlates with US Presidential Election outcomes and to explore social media's influence on political processes.

Methodology

Data Preprocessing

Loading Data: Tweets are read from a CSV file into a DataFrame.
Filtering and Cleaning: Tweets are filtered to keep only those written in English alphabet characters. Sentiment analysis then labels each remaining tweet as positive or non-positive.
Tokenization and Stop Words Removal: Tweets are tokenized, and stop words are removed.
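
The filtering and tokenization steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column name `tweet`, the output column names, and the exact rule for "English alphabet characters" (here, ASCII-only text containing at least one letter) are assumptions. The sentiment labeling step is omitted because the source does not name the tool used for it.

```python
import re

# Assumed definition of "English alphabet characters": ASCII-only text
# that contains at least one letter. The source does not give the exact rule.
ASCII_ONLY = r"^[\x00-\x7F]+$"
HAS_LETTER = r"[A-Za-z]"


def is_english_text(text):
    """Pure-Python version of the filter, handy for spot checks."""
    return bool(re.search(ASCII_ONLY, text)) and bool(re.search(HAS_LETTER, text))


def preprocess(df, text_col="tweet"):
    """Filter, tokenize, and remove stop words from a Spark DataFrame."""
    # Imported here so the helper above works without a Spark installation.
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
    from pyspark.sql import functions as F

    text = F.col(text_col)
    # Rows with a null tweet evaluate to null here and are dropped as well.
    english = df.filter(text.rlike(ASCII_ONLY) & text.rlike(HAS_LETTER))

    # Split on runs of non-word characters; tokens are lowercased by default.
    tokenizer = RegexTokenizer(inputCol=text_col, outputCol="tokens", pattern=r"\W+")
    # Uses Spark's built-in English stop word list by default.
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    return remover.transform(tokenizer.transform(english))
```

Given a DataFrame loaded with something like `spark.read.csv(path, header=True, multiLine=True, escape='"')`, calling `preprocess(df)` would return the same rows with `tokens` and `filtered` columns added.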

LDA Implementation

Feature Transformation: Uses CountVectorizer to convert tokens into feature vectors.
Model Setup and Training: An LDA model is initialized to discover underlying topics within tweets, with the Spark ML Pipeline facilitating model fitting and transformation.
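
A CountVectorizer plus LDA pipeline of the kind described above might look like the sketch below. The number of topics, vocabulary size, `minDF`, iteration count, and column names are illustrative assumptions, since the source does not state the values that were used.

```python
def build_lda_pipeline(num_topics=10, vocab_size=5000, max_iter=20, seed=42):
    """Build a Pipeline that turns token lists into LDA topic distributions.

    All default values here are illustrative, not the project's settings.
    """
    # Imported here so topic_terms() below can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer

    vectorizer = CountVectorizer(
        inputCol="filtered",   # assumed name of the stop-word-free token column
        outputCol="counts",
        vocabSize=vocab_size,
        minDF=5.0,             # ignore terms seen in fewer than 5 tweets
    )
    lda = LDA(
        k=num_topics,
        maxIter=max_iter,
        seed=seed,
        featuresCol="counts",
        topicDistributionCol="topicDistribution",
    )
    return Pipeline(stages=[vectorizer, lda])


def topic_terms(vocabulary, term_indices):
    """Map the term indices reported by LDA back to vocabulary words."""
    return [vocabulary[i] for i in term_indices]


def print_topics(pipeline_model, terms_per_topic=8):
    """Print the highest-weighted words of each topic in a fitted pipeline."""
    cv_model, lda_model = pipeline_model.stages
    for row in lda_model.describeTopics(terms_per_topic).collect():
        print(row.topic, topic_terms(cv_model.vocabulary, row.termIndices))
```

Fitting would then be `model = build_lda_pipeline().fit(tokens_df)`, and `model.transform(tokens_df)` adds a `topicDistribution` vector to each tweet.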

Visualization and Classification

Visualization: Utilizes pyLDAvis for an interactive visualization of the topics discovered by the LDA model.
Classification with Logistic Regression: Employs logistic regression models, first with TFIDF vectors and then enhanced by LDA topic distribution features, to classify tweet sentiments. The models are evaluated based on accuracy, F1 score, and area under the ROC curve.
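
The comparison between the two classifiers could be set up roughly as follows. This is a sketch under stated assumptions: the input DataFrame is assumed to already contain a `counts` vector (from CountVectorizer), a `topicDistribution` vector (from LDA), and a numeric `label` column. The regularization grid, fold count, and train/test split are hypothetical choices.

```python
def feature_columns(use_topics):
    """Columns fed to the classifier: TFIDF alone, or TFIDF plus LDA topics."""
    return ["tfidf", "topicDistribution"] if use_topics else ["tfidf"]


def build_classifier(use_topics, num_folds=3):
    """Cross-validated logistic regression over the chosen feature columns."""
    # Imported here so feature_columns() can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import IDF, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Rescale raw term counts by inverse document frequency to get TFIDF.
    idf = IDF(inputCol="counts", outputCol="tfidf")
    assembler = VectorAssembler(
        inputCols=feature_columns(use_topics), outputCol="features"
    )
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
    pipeline = Pipeline(stages=[idf, assembler, lr])

    # Hypothetical grid; the source does not list the tuned parameters.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    evaluator = BinaryClassificationEvaluator(labelCol="label")  # area under ROC
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
        seed=42,
    )


def evaluate(predictions):
    """Compute accuracy, F1 score, and area under the ROC curve."""
    from pyspark.ml.evaluation import (
        BinaryClassificationEvaluator,
        MulticlassClassificationEvaluator,
    )

    multi = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction"
    )
    return {
        "accuracy": multi.evaluate(predictions, {multi.metricName: "accuracy"}),
        "f1": multi.evaluate(predictions, {multi.metricName: "f1"}),
        "auc": BinaryClassificationEvaluator(labelCol="label").evaluate(predictions),
    }


def compare(features_df):
    """Train both variants on the same split and return their test metrics."""
    train, test = features_df.randomSplit([0.8, 0.2], seed=42)
    return {
        use_topics: evaluate(build_classifier(use_topics).fit(train).transform(test))
        for use_topics in (False, True)
    }
```

Using the same split and the same grid for both variants keeps the comparison focused on the one thing that changes, namely whether the topic distribution is appended to the TFIDF vector.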

pyLDAvis Visualization
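
pyLDAvis has no built-in adapter for Spark ML models, so its inputs have to be assembled by hand. The sketch below shows one possible way to do that from a fitted CountVectorizer model and LDA model. The column names `counts` and `topicDistribution`, the sample size, and the choice to compute term frequencies from a sample of tweets are all assumptions rather than the project's documented approach.

```python
def normalize_rows(rows):
    """Scale each row of non-negative weights so that it sums to 1."""
    normalized = []
    for row in rows:
        total = float(sum(row))
        if total == 0:
            raise ValueError("cannot normalize a row that sums to zero")
        normalized.append([value / total for value in row])
    return normalized


def prepare_lda_vis(cv_model, lda_model, transformed_df, sample_size=5000):
    """Assemble pyLDAvis inputs from Spark ML models and a sample of tweets."""
    # Imported here so normalize_rows() can be used without pyLDAvis.
    import pyLDAvis

    vocab = cv_model.vocabulary

    # topicsMatrix() is vocabSize x k; pyLDAvis expects one row per topic,
    # each row being a probability distribution over the vocabulary.
    topic_rows = lda_model.topicsMatrix().toArray().T.tolist()
    topic_term_dists = normalize_rows(topic_rows)

    rows = (
        transformed_df.select("counts", "topicDistribution")
        .limit(sample_size)
        .collect()
    )
    # Skip tweets that have no in-vocabulary tokens.
    rows = [r for r in rows if r["counts"].numNonzeros() > 0]

    doc_topic_dists = normalize_rows(
        [r["topicDistribution"].toArray().tolist() for r in rows]
    )
    doc_lengths = [int(r["counts"].values.sum()) for r in rows]

    # Term frequencies over the sample; terms absent from it stay at zero.
    term_frequency = [0.0] * len(vocab)
    for r in rows:
        for index, count in zip(r["counts"].indices, r["counts"].values):
            term_frequency[int(index)] += float(count)

    return pyLDAvis.prepare(
        topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency
    )
```

The returned object can be written to a standalone page with `pyLDAvis.save_html(vis, "lda.html")`. Collecting only a sample keeps the driver's memory use bounded, at the cost of term frequencies that only approximate those of the full corpus.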

Model Comparison

Results and Conclusion

Adding LDA topic probabilities to the TFIDF vectors improved sentiment classification accuracy. This suggests that combining thematic context with traditional text vectorization benefits sentiment analysis by giving the classifier a more complete representation of each tweet.