By Nate Cheshire
Student Scraper is a web scraping tool that uses Python and Selenium to scrape student details from MSU's student directory, and Folium/Leaflet.js for visualizations. I plan to use DeckGL in the future for better 3D visualizations.
For this stack, make sure you can execute Python scripts and have Docker Desktop installed (WSL recommended). The scraper uses Docker for the Postgres instance and Python for general-purpose scripting. Additionally, I have provided a PostgresSetup.bat script, which does all the work of setting up Postgres and creating the appropriate database and table.
After the Postgres instance is up and running in a Docker container, and you have ensured the database and table were created successfully, place your MSU username and password inside a Keys/state.key file in the format netid,password. Selenium uses these credentials to send a DUO push and obtain the cookies that allow mass POST requests to be sent. Accept the DUO push on your phone as soon as the console output says that one is being sent.
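The Keys/state.key format described above is a single line holding the netid and password separated by a comma. Below is a minimal sketch of how such a file could be read. The helper names parse_credentials and load_credentials are hypothetical, and splitting on the first comma only (so that a password containing commas survives) is an assumption for illustration; the scraper itself may read the file differently.

```python
from pathlib import Path


def parse_credentials(text):
    """Split a 'netid,password' line into its two parts.

    Hypothetical helper: splits on the first comma only, so a password
    that itself contains commas is kept intact.
    """
    line = text.strip()
    netid, sep, password = line.partition(",")
    if not sep or not netid or not password:
        raise ValueError("expected the format netid,password")
    return netid.strip(), password


def load_credentials(path="Keys/state.key"):
    """Read the key file from disk and parse it."""
    return parse_credentials(Path(path).read_text(encoding="utf-8"))
```

Calling load_credentials() with no arguments reads Keys/state.key relative to the current working directory and returns a (netid, password) tuple.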
Lastly, once setup is complete, you may invoke Poster.py, which will begin the POST sequence and the insertions into the Dockerized Postgres instance.
To obtain the lat,lon pairs that correspond to student addresses, obtain a MapQuest API key and use it with MapQuest.py to convert the stored addresses into lat,lon pairs. The MapQuest free tier only allows 15K queries, so you'll most likely need a second account for this step (use ProtonMail, Gmail, or some other quick and easy email service).
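The address-to-coordinate step could look roughly like the sketch below. It is not the contents of MapQuest.py: the endpoint URL, the key and location query parameters, and the results/locations/latLng response layout are assumptions based on MapQuest's geocoding API and should be checked against the current documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for MapQuest's geocoding API; verify against the docs.
GEOCODE_URL = "https://www.mapquestapi.com/geocoding/v1/address"


def build_geocode_url(api_key, address):
    """Build the request URL for a single address lookup."""
    return GEOCODE_URL + "?" + urlencode({"key": api_key, "location": address})


def extract_lat_lon(payload):
    """Pull the first (lat, lon) pair out of a decoded response, or None."""
    try:
        lat_lng = payload["results"][0]["locations"][0]["latLng"]
        return float(lat_lng["lat"]), float(lat_lng["lng"])
    except (KeyError, IndexError, TypeError, ValueError):
        return None


def geocode(api_key, address):
    """Perform one geocoding request (this spends one query of the quota)."""
    with urlopen(build_geocode_url(api_key, address), timeout=10) as resp:
        return extract_lat_lon(json.load(resp))
```

Because every call to geocode() counts against the free-tier limit, it is worth storing each result as soon as it comes back so that an interrupted run does not have to repeat queries.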
Using all of the lat,lon pairs output by the MapQuest.py script, I used Folium (a Python wrapper for Leaflet.js) to generate a heatmap of all students with public addresses who attended MSU during the Fall 2021 semester. The interactive visualization for this can be viewed here.
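For reference, generating a heatmap from lat,lon pairs with Folium can be sketched as follows. This is an illustration and not the code in MapGenerator.py: the function names, the output filename, the zoom level, and the choice to center the map on the mean of the points are all assumptions. Folium is a third-party package and is imported only when the map is actually built.

```python
def clean_points(pairs):
    """Keep only pairs that parse as valid latitude/longitude values."""
    points = []
    for lat, lon in pairs:
        try:
            lat, lon = float(lat), float(lon)
        except (TypeError, ValueError):
            continue
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            points.append([lat, lon])
    return points


def build_heatmap(pairs, out_path="heatmap.html"):
    """Render the points as a Folium heatmap and save it as HTML."""
    import folium  # third-party: pip install folium
    from folium.plugins import HeatMap

    points = clean_points(pairs)
    if not points:
        raise ValueError("no valid lat,lon pairs to plot")
    # Center the map on the mean of the points (an arbitrary choice).
    center = [
        sum(p[0] for p in points) / len(points),
        sum(p[1] for p in points) / len(points),
    ]
    fmap = folium.Map(location=center, zoom_start=6)
    HeatMap(points).add_to(fmap)
    fmap.save(out_path)
    return out_path
```

Opening the saved HTML file in a browser gives an interactive Leaflet map with the heat layer on top.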
As can be seen in MapGenerator.py, a method exists called generateStreetViewImage(). Using this method, which simply takes a netid, I can produce a figure showing the student a picture of their house as if I were standing outside. In testing, this works for around 70% of all students, a number I find acceptable. The lack of absolute correctness comes from Google street addressing not always being accurate, as well as some people having PO boxes or a CMRA. I plan to make the backend I've collected accessible via a Cyder account.
Input: lm2112
Output: