The goal of the project was to gain hands-on experience with multiple steps of the data lifecycle that benefit from big data infrastructure. The project is broken down into two tasks: Data Cleaning/Profiling and Semantic Profiling. All of the datasets used come from the NYC Open Data initiative (https://opendata.cityofnewyork.us/).
The code was run on the NYU Hadoop cluster using Python 3.6.5 and Spark 2.4.0.