
Uber Data Analysis Project: an end-to-end Data Engineering project, from building the data pipelines to creating the final dashboard.


Uber-Data-Analysis-ETL-PIPELINE-DATA-ANALYSIS_PROJECT


Hello everyone, I am excited to share my first complete end-to-end Data Engineering project, the Uber Data Analysis Project, covering everything from building the data pipelines to creating the final dashboard.

Step 1: Creating a Process Flow

(Process flow diagram)

Step 2: Data Modeling and creating an ER Diagram to get a better understanding of the data

(Uber data model ER diagram)
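
A rough outline of the star schema implied by the queries at the end of this README (a Python sketch, not the full model; only the columns those queries use are listed, and passenger_count_id as the fact-table foreign key is an assumption):

# Star schema outline; "..." marks attributes not listed here.
star_schema = {
    "fact_table": ["trip_id", "datetime_id", "pickup_location_id",
                   "passenger_count_id", "fare_amount", "..."],
    "datetime_dim": ["datetime_id", "pick_hour", "..."],
    "passenger_count_dim": ["passenger_count_id", "passenger_count"],
}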

Step 3: Writing the Transformation code in Python

(Video: Python preprocessing walkthrough, 00.-.python.preprocessing.mp4)
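
Below is a minimal pandas sketch of the kind of preprocessing shown in the video. It assumes a TLC-style trip CSV; the file name and the tpep_pickup_datetime column are placeholders, while trip_id, datetime_id, pick_hour, and fare_amount match the tables queried later in this README.

import pandas as pd

# Load the raw trip data, de-duplicate, and add a surrogate trip key.
df = pd.read_csv("uber_data.csv")
df = df.drop_duplicates().reset_index(drop=True)
df["trip_id"] = df.index

# Build the datetime dimension from the pickup timestamp.
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
datetime_dim = df[["tpep_pickup_datetime"]].copy()
datetime_dim["pick_hour"] = datetime_dim["tpep_pickup_datetime"].dt.hour
datetime_dim["pick_day"] = datetime_dim["tpep_pickup_datetime"].dt.day
datetime_dim["datetime_id"] = datetime_dim.index

# The fact table keeps the measures plus foreign keys into each dimension.
fact_table = df[["trip_id", "fare_amount"]].copy()
fact_table["datetime_id"] = datetime_dim["datetime_id"]

The other dimensions (pickup location, passenger count, and so on) are built the same way: slice the relevant columns, add a surrogate key, and reference that key from the fact table.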

Step 4: Create a project and a bucket on Google Cloud Platform, upload the data, choose the bucket's region, and set the appropriate permissions.

(Screenshot: GCP project and bucket setup)

Note: Project ID and Project Number are hidden.
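
The upload itself can also be scripted. Here is a hedged sketch using the official google-cloud-storage client; the bucket and object names are placeholders, and the client picks up whichever credentials your environment provides:

from google.cloud import storage

client = storage.Client()                              # uses the default credentials
bucket = client.bucket("uber-data-analysis-bucket")    # hypothetical bucket name
blob = bucket.blob("raw/uber_data.csv")
blob.upload_from_filename("uber_data.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")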

Step 5: Create a Virtual Machine instance using GCP Compute Engine.

(Screenshots: Compute Engine, the SSH session before setup, and converting tables to dictionaries in Mage)

Step 6: Connect to the VM over SSH from a Linux terminal, install the necessary dependencies, and create a Mage project.


Step 7: Create a data pipeline using Mage blocks: a data loader, a transformer, and a data exporter. Add the transformation code from Step 3 to the transformer block, with the necessary changes.
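
A minimal sketch of what the loader and transformer blocks can look like, following Mage's generated block templates; the bucket URL is a placeholder and the transform body only illustrates the idea (the full preprocessing from Step 3 goes here):

import io
import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@data_loader
def load_data(*args, **kwargs):
    # Pull the raw CSV uploaded to the GCS bucket in Step 4.
    url = 'https://storage.googleapis.com/<your-bucket>/uber_data.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text))


@transformer
def transform(df, *args, **kwargs):
    # Apply the Step 3 preprocessing; each table is returned as a plain
    # dictionary so the exporter can write it to its own BigQuery table.
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    datetime_dim = df[['tpep_pickup_datetime']].copy()
    datetime_dim['pick_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
    datetime_dim['datetime_id'] = datetime_dim.index
    return {'datetime_dim': datetime_dim.to_dict(orient='dict')}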

Step 8: Once the pipeline is ready, add your GCP credentials to the 'io_config.yaml' configuration file. You can get the credentials from the APIs & Services tab in the Google Cloud Console.
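
In io_config.yaml the relevant keys are typically GOOGLE_SERVICE_ACC_KEY (or GOOGLE_SERVICE_ACC_KEY_FILEPATH) pointing at the downloaded service-account JSON. Below is a hedged sketch of Mage's standard BigQuery data-exporter template, which is where that configuration is read; the dataset name is a placeholder and the exact import path of get_repo_path varies slightly between Mage versions:

from os import path

from pandas import DataFrame

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(data, **kwargs) -> None:
    # 'data' is the dictionary of tables returned by the transformer block.
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    for table_name, records in data.items():
        table_id = f'uber_dataset.{table_name}'   # e.g. uber_dataset.datetime_dim
        BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
            DataFrame(records),
            table_id,
            if_exists='replace',                  # recreate the table on each run
        )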

Step 9: Use BigQuery to query the data and perform further ETL so that the data is ready for analysis tasks such as dashboards and reporting.

(Video: querying the data in BigQuery, big_query.-.Made.with.Clipchamp.mp4)
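
These transformations can also be run programmatically. A small illustrative sketch using the official google-cloud-bigquery client; the tbl_analytics table name is hypothetical, while the joined columns mirror the SQL solutions below:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE `uber-big-data-analysis.uber_dataset.tbl_analytics` AS
SELECT
  f.trip_id,
  d.pick_hour,
  f.fare_amount
FROM `uber-big-data-analysis.uber_dataset.fact_table` AS f
JOIN `uber-big-data-analysis.uber_dataset.datetime_dim` AS d
  ON f.datetime_id = d.datetime_id
"""

client.query(sql).result()  # block until the query job completes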

Step 10: Finally, create a dashboard using any dashboarding/reporting tool. I used Looker Studio, but other tools such as Power BI, Tableau, or Qlik work just as well.

(Dashboard screenshots, including the cab map)

SQL query solutions to the questions asked:

-- Top 10 pickup locations by number of trips

SELECT
  pickup_location_id,
  COUNT(trip_id) AS No_of_Trips
FROM uber_dataset.fact_table
GROUP BY pickup_location_id
ORDER BY No_of_Trips DESC
LIMIT 10;

-- Total number of trips by passenger count

SELECT
  passenger_count,
  COUNT(passenger_count) AS No_of_Trips
FROM `uber-big-data-analysis.uber_dataset.passenger_count_dim`
GROUP BY passenger_count;

-- Average fare amount by hour of the day

SELECT
  d.pick_hour,
  AVG(f.fare_amount) AS Avg_Fare_Amt
FROM `uber-big-data-analysis.uber_dataset.datetime_dim` d
JOIN `uber-big-data-analysis.uber_dataset.fact_table` f
  ON d.datetime_id = f.datetime_id
GROUP BY d.pick_hour
ORDER BY AVG(f.fare_amount) DESC;
