PySpark - Machine Learning on Big Data
PySpark ML with TF-IDF, CrossValidator, ParamGrid, LogisticRegression, LinearSVC, RandomForestClassifier, MultilayerPerceptronClassifier.
Title: PySpark - Machine Learning on Big Data
Description: PySpark ML with TF-IDF, CrossValidator, ParamGrid, LogisticRegression, LinearSVC, RandomForestClassifier, MultilayerPerceptronClassifier.
Source code link: https://gitfront.io/r/ompranavagrawal/Nv5h64hkbadi/PySpark-Machine-Learning-on-big-data/
Introduction
This project aims to identify fake job postings within a dataset using PySpark, a powerful tool for handling big data. By employing different machine learning classifiers, we aim to develop a robust model that can accurately distinguish between genuine and fraudulent job listings.
Data Description
The dataset, "fake_job_postings.csv", consists of job listings with various attributes. Each listing is labeled as fraudulent or not, making this a binary classification problem.
Methodology
Data Preprocessing
- Data Loading: The dataset is loaded from an HDFS path into a Spark DataFrame.
- Missing Value Analysis: Columns with a significant percentage of missing values are identified and dropped.
- Text Cleaning: Regular expressions are used to clean text columns, removing non-alphanumeric characters and converting text to lowercase.
Feature Engineering
- Text Vectorization: The description column is tokenized and vectorized using TF-IDF to transform textual data into a numerical format suitable for machine learning models.
- Data Balancing: The dataset is balanced by undersampling the majority class to improve model performance on minority classes.
Model Training and Evaluation
Four classifiers are evaluated:
- Logistic Regression
- Linear Support Vector Classifier (LinearSVC)
- Random Forest Classifier
- Multilayer Perceptron Classifier
- A Pipeline is constructed, including preprocessing and the classifier.
- Hyperparameter tuning is performed using CrossValidator with a ParamGridBuilder.
- Model performance is assessed using accuracy and F1 score.
Results
Model performance is documented, detailing the accuracy, F1 score, and best parameters for each classifier.

Logistic Regression

Multi Layer Perceptron

Random Forest Classifier

Linear Support Vector Classifier
Conclusion
The project demonstrates the application of various machine learning models to detect fake job postings in a big data environment using PySpark. The effectiveness of each model varies, with detailed results providing insights into the best-performing models based on accuracy and F1 score.