This repository contains a project I completed for an NTU course titled CB4247 Statistics & Computational Inference to Big Data. In this project, I applied regression and machine learning techniques to predict house prices in India.

nlawira/india-house-rent-prediction

India's House Prices Prediction

Preface

This self-initiated project improves on the project I submitted for the CB4247 Statistics & Computational Inference to Big Data module at NTU. After my initial submission was graded, I sought my lecturer's feedback and incorporated it into this version. This repository contains the project's code and report and showcases my skills in machine learning, data analysis, pre-processing, Python, and report writing.

Project Overview

This self-initiated project aims to:

  • Apply data analysis and visualization techniques to analyze a real-world dataset.
  • Train machine learning algorithms on the chosen dataset, including Ordinary Least Squares Regression, Random Forest Regressor, and XGBoost Regressor.
  • Conduct ANOVA and residual analysis to test the validity of Ordinary Least Squares (OLS) regression assumptions.
  • Evaluate each algorithm's performance via metrics, including mean absolute error, root mean squared error, and R².
  • Identify critical variables via feature importance.
  • Develop a robust and accurate model by combining high-performing algorithms.
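The evaluation and model-combining goals above can be illustrated with a minimal sketch in plain Python. It is not the project's actual code: the function names, the simple averaging in `blend`, and the numbers are all hypothetical, chosen only to show how the three metrics are computed and how averaging two models' predictions can be scored.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def blend(*prediction_lists):
    """Combine models by averaging their predictions element-wise.

    Simple averaging is an assumption made for illustration; the
    project may combine its models differently.
    """
    return [sum(vals) / len(vals) for vals in zip(*prediction_lists)]


# Hypothetical targets and predictions from two hypothetical models.
y_true = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 37.0]
model_b = [9.0, 23.0, 29.0, 41.0]
blended = blend(model_a, model_b)

for name, preds in [("model_a", model_a), ("model_b", model_b), ("blend", blended)]:
    print(
        name,
        round(mean_absolute_error(y_true, preds), 3),
        round(root_mean_squared_error(y_true, preds), 3),
        round(r2_score(y_true, preds), 3),
    )
```

In this toy data the two models err in opposite directions, so the averaged predictions score better than either model alone. That is the usual motivation for combining models, but it is not guaranteed on real data.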

This project used a dataset of house prices in India for OLS regression analysis and for training various machine learning models. The dataset is available on Kaggle. Before regression and training, the data was pre-processed and thoroughly analyzed: outliers were removed to reduce noise, and features were engineered to preserve information in the dataset. OLS regression was then performed, followed by residual analysis to verify the assumptions OLS makes. Next, various machine learning models were trained on the dataset. The best-performing models were selected, and feature importance analysis was carried out on them to guide further improvement of the models. Finally, the top-performing models were combined, and the combined model's performance was tested on the dataset.
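Two of the steps described above, outlier removal and residual analysis, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's method: the 1.5 × IQR rule is one common outlier criterion that the project may or may not have used, the OLS fit is reduced to a single predictor for brevity, and the helper names and data are hypothetical.

```python
from statistics import fmean, quantiles


def iqr_filter(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def ols_residuals(x, y):
    """Fit y = intercept + slope * x by least squares; return fit and residuals."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, residuals


# Hypothetical target values with one extreme entry.
rents = [8, 9, 10, 11, 12, 13, 14, 500]
kept = iqr_filter(rents)
print(kept)  # the extreme value 500 is dropped

# Hypothetical single-predictor data for the residual check.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]
slope, intercept, residuals = ols_residuals(x, y)
print(slope, intercept, residuals)
```

With an intercept in the model, OLS residuals sum to zero by construction, so residual analysis looks at their pattern instead: roughly constant spread and no trend against the fitted values are what the OLS assumptions call for.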

The graphs below show some of the analysis I conducted in this project.

image Figure 1 Distribution graphs of Rent

image Figure 2 Heatmap of numerical variables' correlations

image Figure 3 Residual analysis of the OLS regression model

image Figure 4 Root mean squared errors of trained machine learning models

image Figure 5 Feature importance graph of the CatBoost model

Conclusion

License

Protected under the MIT License. See LICENSE for more information.

Contact me

Thank you so much for visiting my repository! I sincerely hope my project gives you useful insights into regression analysis, machine learning models, and report writing! 😄 If you would like me to explain my project further, or want to contact me for any other reason, you can email me below or connect with me on LinkedIn!

LinkedIn Email
