This self-initiated project improves on the project I submitted for the CB4247 Statistics & Computational Inference to Big Data module at NTU. After the original submission was graded, I sought my lecturer's feedback and incorporated it into this version. This repository contains the project's code and report, and showcases my skills in machine learning, data analysis, data pre-processing, Python, and report writing.
The project aims to:
- Apply data analysis and visualization techniques to analyze a real-world dataset.
- Train machine learning algorithms, including Ordinary Least Squares Regression, Random Forest Regressor, and XGBoost Regressor, on the chosen dataset.
- Conduct ANOVA and residual analysis to test the validity of Ordinary Least Squares (OLS) regression assumptions.
- Evaluate each algorithm's performance using metrics including mean absolute error (MAE), root mean squared error (RMSE), and R².
- Identify critical variables via feature importance.
- Develop a robust and accurate model by combining high-performing algorithms.
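
As a rough illustration of the evaluation and model-combination goals above, the sketch below computes MAE, RMSE, and R² in plain Python and scores an unweighted average of two models' predictions. The function names, the toy numbers, and the choice of simple averaging are assumptions made for illustration; the project's code and report may compute and combine these differently.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # Raises ZeroDivisionError if y_true is constant (R^2 is undefined).
        "r2": 1.0 - ss_res / ss_tot,
    }


def blend_predictions(*prediction_lists):
    """Unweighted average of several models' predictions (assumed scheme)."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Toy values for illustration only; not taken from the project's dataset.
actual = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 37.0]   # hypothetical predictions of model A
model_b = [9.0, 23.0, 28.0, 42.0]    # hypothetical predictions of model B

scores_a = regression_metrics(actual, model_a)
scores_b = regression_metrics(actual, model_b)
scores_blend = regression_metrics(actual, blend_predictions(model_a, model_b))

for name, scores in [("A", scores_a), ("B", scores_b), ("blend", scores_blend)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy case the two models err in opposite directions, so the average scores better than either one. That is not guaranteed in general, which is why a combined model still needs its own held-out evaluation.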
This project used a dataset of house prices in India for OLS regression analysis and for training various machine learning models. The dataset is available on Kaggle. Before regression and training, the data were pre-processed and given a thorough preliminary analysis: outliers were removed to reduce noise, and features were engineered to preserve the information in the dataset. OLS regression was then performed on the dataset, followed by residual analysis to verify the assumptions made by OLS. Next, various machine learning models were trained on the dataset. The best-performing models were selected, and feature importance analysis was carried out on them to identify the most influential variables and guide further improvement of the models. Finally, the top-performing models were combined, and the combined model was given a final test on the dataset to measure its performance.
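
The outlier removal, OLS fit, and residual steps described above can be sketched with NumPy alone. This is a minimal sketch on synthetic data: the 1.5 × IQR outlier rule, the variable names `size` and `rent`, and the helper functions are all assumptions for illustration, not the project's actual code or data.

```python
import numpy as np


def iqr_mask(values, k=1.5):
    """Boolean mask keeping values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)


def fit_ols(X, y):
    """Least-squares coefficients; the intercept is stored at index 0."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    return coef[0] + X @ coef[1:]


# Synthetic stand-in for the real dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
size = rng.uniform(300, 2000, 200)                  # e.g. floor area
rent = 5000 + 12 * size + rng.normal(0, 500, 200)   # linear signal plus noise
rent[:3] = 1e6                                      # inject three extreme outliers

# 1. Drop rows whose target falls outside the IQR fences.
mask = iqr_mask(rent)
X = size[mask].reshape(-1, 1)
y = rent[mask]

# 2. Fit OLS and compute residuals for diagnostic checks.
coef = fit_ols(X, y)
residuals = y - predict_ols(coef, X)

print("rows kept:", int(mask.sum()), "of", len(rent))
print("intercept, slope:", coef)
print("residual mean, std:", residuals.mean(), residuals.std())
```

Plotting `residuals` against the fitted values, or on a Q-Q plot, is the usual way to check the constant-variance and normality assumptions of OLS.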
The graphs below are a selection from the analysis I conducted in this project.
Figure 1 Distribution graphs of Rent
Figure 2 Heatmap of numerical variables' correlations
Figure 3 Residual analysis of the OLS regression model
Figure 4 Root mean squared errors of trained machine learning models
Figure 5 Feature importance graph of the CatBoost model
Distributed under the MIT License. See LICENSE for more information.
Thank you so much for visiting my repository! I sincerely hope this project gives you useful insights into regression analysis, machine learning models, and report writing! 😄 If you would like me to explain the project further, or want to contact me for any other reason, you can email me below or connect with me on LinkedIn!