Using GCS, PySpark, and Streamlit to build a model that predicts the size of celestial bodies and make it publicly available

VinceDiR/celestial_body_size_predictor

Celestial Body Size Predictor

By: Nate DiRenzo

Statement of Need:

Potentially Hazardous Objects (PHOs) are near-Earth objects whose orbits can bring them into close proximity with Earth and that are large enough to cause significant damage in the event of an impact.

Asteroids larger than 35 meters in diameter can pose a threat to a city or town. However, the diameter of most small celestial objects is not well determined, as it is usually estimated from brightness and distance rather than from direct radar measurements.

Because the true size of most celestial objects is not well determined, we will strive to produce a model that can accurately estimate the diameter of objects in space, given a set of easily observable features.
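For context, a brightness-based size estimate is commonly computed from an object's absolute magnitude H and an assumed geometric albedo, using the standard relation D = 1329 / sqrt(albedo) * 10^(-0.2 * H), with D in kilometers. The sketch below is illustrative only and is not part of this project's pipeline; the function name `diameter_km` and the example values are assumptions chosen to show how strongly the result depends on the albedo guess.

```python
import math


def diameter_km(abs_magnitude: float, albedo: float) -> float:
    """Estimate diameter in km from absolute magnitude H and geometric albedo.

    Uses the standard relation D = 1329 / sqrt(albedo) * 10 ** (-0.2 * H).
    """
    if albedo <= 0:
        raise ValueError("albedo must be positive")
    return 1329.0 / math.sqrt(albedo) * 10 ** (-0.2 * abs_magnitude)


# Hypothetical example: the same brightness (H = 22.0) maps to very
# different sizes depending on the assumed albedo.
for albedo in (0.05, 0.25):
    meters = diameter_km(22.0, albedo) * 1000
    print(f"H=22.0, albedo={albedo}: about {meters:.0f} m")
```

Because albedo is rarely measured for small objects, an estimate of this kind can be off by a factor of two or more, which is the uncertainty a learned model would try to reduce.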



Goal:

The goal of this project is to productionize a model that predicts the diameter of celestial objects with some degree of accuracy. To do so, we will store a database of 800,000 measurements of celestial objects in Google Cloud Storage, build the model in Python with PySpark, and serve it through a front-end web application built with Streamlit. As a further goal, I would like to containerize the script and web application using Docker.

Success Metrics:

The primary metric for success is whether the model functions in production. A secondary metric is how well the model predicts the size of celestial objects.

Data Description:

The data is taken from the Jet Propulsion Laboratory at the California Institute of Technology. The full Small-Body Database contains 1.2 million entries with measurements of objects in our solar system. Roughly 150,000 entries contain diameter data.
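Since only a subset of entries has a measured diameter, one natural first step is to separate rows that can serve as training labels from rows that can only be predicted. The following is a minimal sketch using the standard library on a tiny made-up sample. The column names (`full_name`, `H`, `albedo`, `diameter`), the sample values, and the helper `split_by_label` are assumptions for illustration, not the project's actual schema or code.

```python
import csv
import io

# Made-up stand-in for a database export; names and values are hypothetical.
SAMPLE = """full_name,H,albedo,diameter
Object A,10.1,0.09,12.5
Object B,18.2,,
Object C,15.9,0.21,
Object D,11.3,0.10,8.0
"""


def split_by_label(rows, label="diameter"):
    """Split rows into those with and without a value in the label column."""
    labeled, unlabeled = [], []
    for row in rows:
        value = (row.get(label) or "").strip()
        (labeled if value else unlabeled).append(row)
    return labeled, unlabeled


labeled, unlabeled = split_by_label(csv.DictReader(io.StringIO(SAMPLE)))
print(len(labeled), "rows usable for training;", len(unlabeled), "rows to predict")
```

On the full dataset the same split would more likely be expressed as a filter on the diameter column in PySpark, but the idea is the same: train on labeled rows, predict for the rest.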

Tools:

  • Google Cloud Storage for Data Warehousing
  • Google Colab for Cloud-based Scripting
  • PySpark for Modelling
  • Streamlit for Frontend Application
  • Docker for Containerization

Models:

  • Gradient Boosted Tree Regression with PySpark

MVP Goal:

Basic implementation of the model running on a local machine and deployed to Streamlit. From there, I can work on expanding the data and moving the entire pipeline into the cloud.
