PySpark - Machine Learning with Gradient Boost and Random Forest Classifier

Description: PySpark ML with TF-IDF, CrossValidator, ParamGridBuilder, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, and MultilayerPerceptronClassifier.
Source code link: https://gitfront.io/r/pranav/AGyTKh2Lq4kd/PySpark-Machine-Learning-with-Gradient-Boost-and-Random-Forest-Classifier/
Video Link: https://youtu.be/G-MCYTs8FpQ

Introduction & Methods

This report investigates how two ensemble methods, bagging and boosting, affect machine learning model performance on a flight delay dataset. It compares them against a neural network model, with emphasis on data preparation, model training, and validation using accuracy, F1 score, and ROC-AUC.

Data Description

The report uses the Air Flight Dataset, which details flight records, cancellations, and delays per airline from January 2018 and is derived from the TranStats "On-Time" database.

Architecture

The study employs PySpark libraries and focuses on four classifiers: Decision Tree, Multi-Layer Perceptron neural network, Gradient Boosting, and Random Forest. It outlines the code architecture and pseudo-code for data cleaning, balancing, preprocessing, and model training.

Data Preprocessing

This section describes the steps taken to remove redundant columns, recode the delay and cancellation indicators as binary values, normalize textual data, and balance the dataset through undersampling.

Results

This section presents detailed results of model training and evaluation for DecisionTreeClassifier, MultilayerPerceptronClassifier, RandomForestClassifier, and GBTClassifier, including hyperparameter tuning and its effects on accuracy, F1 scores, and ROC-AUC.

Decision Tree Classifier Evaluation

Multi-Layer Perceptron Evaluation

Random Forest Classifier Evaluation

Gradient Boosted Tree Evaluation

Performance Comparison of Different Classifiers

Conclusion

The study concludes that ensemble methods, particularly bagging and boosting, outperform the neural network model in predicting flight delays, with significant improvements in accuracy and stability. It highlights the importance of thorough data preparation and model optimization.