CIS4400-HW

Coursework for CIS 4400: Data Warehouse for Analytics, which I took at Baruch College with Prof. Jefferson Bien-Aime.

Homework #1

Step 1: Data Sourcing

Your first step in the project is to get familiar with the data. You need to understand how it is structured and, most importantly, find the data dictionary associated with it. If there is none, you will have to build one. The data dictionary should contain the name of each field, its description, its data type, and any constraints associated with the field.

You will need to source the data using one or more of the following methods (see the sketch after this list for a Web API example):

  • Web Scraping
  • Web API
  • Connection to a Database
  • Connection to a Data Store (Cloud Storage)
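
As an illustration of the Web API option, here is a minimal Python sketch, assuming a JSON API and a CSV landing file; the endpoint URL, the limit query parameter, and the output file name are hypothetical placeholders, not part of the assignment.

```python
import requests
import pandas as pd

# Hypothetical API endpoint -- replace with the actual source of your dataset.
API_URL = "https://data.example.gov/resource/dataset.json"

def fetch_records(limit: int = 1000) -> pd.DataFrame:
    """Fetch up to `limit` records from the source API and return them as a DataFrame."""
    response = requests.get(API_URL, params={"limit": limit}, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return pd.DataFrame(response.json())

if __name__ == "__main__":
    df = fetch_records()
    # Keep a raw, unmodified copy of the sourced data for the later steps.
    df.to_csv("raw_data.csv", index=False)
    print(f"Saved {len(df)} rows and {len(df.columns)} columns to raw_data.csv")
```

The raw file (and the field list it exposes) is also a convenient starting point for building the data dictionary described above.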

Deliverables - 3pts

Step 2: Storage

Your next step is to choose the appropriate data store for your data. Remember that in the previous step you sourced the data using a script or a specific tool. The data stores to choose from are: a database, S3-style object storage, or MongoDB. Make sure the data is stored properly and not scattered. If need be, also record the date the data was stored. It is recommended that you watch the async videos.
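
A minimal sketch of the storage step, assuming Google Cloud Storage as the target (the storage of choice listed in the deliverables below) and the google-cloud-storage client library; the bucket name and file names are placeholders.

```python
from datetime import date

from google.cloud import storage  # pip install google-cloud-storage

# Placeholder bucket name -- use the bucket created for this project.
BUCKET_NAME = "cis4400-hw-data"

def upload_raw_file(local_path: str) -> str:
    """Upload a local file to Cloud Storage, prefixing the object name with the load date."""
    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS for authentication
    bucket = client.bucket(BUCKET_NAME)
    blob_name = f"raw/{date.today().isoformat()}/{local_path}"
    bucket.blob(blob_name).upload_from_filename(local_path)
    return blob_name

if __name__ == "__main__":
    print("Uploaded to", upload_raw_file("raw_data.csv"))
```

Prefixing the object name with the load date is one simple way to satisfy the requirement of marking when the data was stored.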

Deliverables - 3pts

  • Scripts updated from the first deliverable; update them to store the data in its chosen storage
  • Git repository updated
  • Data Stored in an orderly fashion in the storage
  • Storage of choice - Google Cloud

Step 3: Modeling

Once the data is stored, you will need to start modeling the data warehouse. Remember that a data warehouse contains two main components: a fact table and dimension tables. The fact table must have a surrogate key, as must each dimension table. Modeling can be done using any tool.
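
A minimal sketch of a script that creates the warehouse tables, assuming a PostgreSQL-compatible warehouse reachable through psycopg2; the table names, columns, and connection parameters are illustrative placeholders, not the actual model for this dataset.

```python
import psycopg2  # pip install psycopg2-binary

# Illustrative star schema: one dimension table and one fact table,
# each with its own surrogate key. All names are placeholders.
DDL = """
CREATE TABLE IF NOT EXISTS dim_date (
    date_key   SERIAL PRIMARY KEY,   -- surrogate key
    full_date  DATE NOT NULL,
    year       INT,
    quarter    INT,
    month      INT,
    day        INT
);

CREATE TABLE IF NOT EXISTS fact_measure (
    fact_key   SERIAL PRIMARY KEY,   -- surrogate key
    date_key   INT REFERENCES dim_date (date_key),
    amount     NUMERIC(12, 2)
);
"""

if __name__ == "__main__":
    # Connection parameters are placeholders -- point them at your warehouse.
    with psycopg2.connect(host="localhost", dbname="warehouse",
                          user="analyst", password="secret") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
    print("Warehouse tables created.")
```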

Deliverables - 4pts

  • Scripts that create the Data Warehouse
  • Data Warehouse accessible to everyone in the team and can be accessed through a client (DataGrip, DbSchema, SqlDBM)
  • Scripts from previous steps updated.
  • Git Repository Updated
  • Data Model documented showing the fact table and the dimension table. - CIS 4400 HW1 Fact and Dimentions.xlsx

Homework #2

You are required to use the data from Homework 1. If you want to change your data, you are free to do so; however, you will have to redo the whole of Homework 1, and it will not be graded. You are free to use any cloud provider. You are required to check the feedback from the professor.

Step 1: Transformation

Once you have stored the data, the next step is to transform it. Data should be transformed according to specific business rules. While transforming the data, you should consider the following:

  • Create a Data Mapping that will be incorporated into your data dictionary; it should contain the name of each field, its data type, its description, the source column, and the destination column.
  • Summing two or more columns
  • Adding one or more columns
  • Using the correct data type for each new fact generated
  • Verifying data against reference data (currency, state, ZIP code, county, NAICS, GICS, etc.)
  • Removing duplicate rows if necessary
  • Removing NULL values if necessary
  • Splitting dates into multiple units (Year, Quarter, Month, Day, Hour, etc.)
  • Unifying the date format to YYYY-MM-DD

This is only a limited list of what you can do; there is more. Remember also to update your data dictionary. You have two options: (1) use an ETL/transformation tool to do the transformation, or (2) create scripts that do the transformation (see the sketch below). In either case, keep your Git repository updated.
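
If you go the scripting route, a minimal pandas sketch covering a few of the rules above (duplicates, NULLs, date splitting, a unified date format, and a summed column) might look like the following; the column names event_date, amount_a, and amount_b, and the file names, are hypothetical placeholders, not fields of any assigned dataset.

```python
import pandas as pd

def transform(path: str = "raw_data.csv") -> pd.DataFrame:
    """Apply a few of the business rules listed above to the raw extract."""
    df = pd.read_csv(path)

    # Remove duplicate rows and rows with NULL values (if appropriate for the data).
    df = df.drop_duplicates().dropna()

    # Unify the date format to YYYY-MM-DD, then split it into multiple units.
    dates = pd.to_datetime(df["event_date"])
    df["event_date"] = dates.dt.strftime("%Y-%m-%d")
    df["year"] = dates.dt.year
    df["quarter"] = dates.dt.quarter
    df["month"] = dates.dt.month
    df["day"] = dates.dt.day

    # Example of a new fact: sum two columns, keeping the correct numeric type.
    df["total_amount"] = df["amount_a"].astype(float) + df["amount_b"].astype(float)
    return df

if __name__ == "__main__":
    transform().to_csv("transformed_data.csv", index=False)
```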

Deliverables - 5pts

  • Git Repository Updated
  • ETL fully created to push the data to the Data Warehouse
  • Data transformation project created in the cloud, if you are using option 1
  • Data Mapping Created. Data dictionary updated.
  • Scripts from previous steps updated.

Step 2: Modeling

Once you have done the transformation, you will need to update the data warehouse model. Remember that a data warehouse contains two main components: a fact table and dimension tables. The fact table must have a surrogate key, as must each dimension table. Modeling can be done using any tool. Your data warehouse should be in Redshift.
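
A minimal sketch of inserting the transformed data into Redshift, assuming the transformed CSV has been staged in S3, an IAM role with read access to that bucket, and the hypothetical fact_measure table from the earlier modeling sketch; the cluster endpoint, bucket, role ARN, and credentials are placeholders. (Note that in Redshift, surrogate keys are usually declared with IDENTITY rather than SERIAL.)

```python
import redshift_connector  # pip install redshift-connector

# Placeholders -- substitute your own S3 path and IAM role ARN.
COPY_SQL = """
COPY fact_measure (date_key, amount)
FROM 's3://cis4400-hw-data/transformed/transformed_data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV IGNOREHEADER 1;
"""

if __name__ == "__main__":
    # Connection details are placeholders -- point them at your cluster.
    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="warehouse", user="awsuser", password="secret",
    )
    cursor = conn.cursor()
    cursor.execute(COPY_SQL)
    conn.commit()
    conn.close()
```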

Deliverables - 1pt

  • Git Repository Updated
  • Data Warehouse accessible to everyone in the team and can be accessed through a client (DBeaver, DataGrip)
  • Data Inserted into the Data Warehouse
  • Data Warehouse Created in AWS with Redshift
  • Scripts from previous steps updated.
  • Scripts that create the Data Warehouse
  • Data Model documented showing the fact table and the dimension table.

Step 3: Serving Data

You will be using an online visualization tool to show the data that you have transformed. You should apply all the visualization practices you have seen in the sessions. The following must be part of the visual:

  • A Heat Map
  • A Line Chart
  • A Column Chart
  • A Pie Chart
  • A filtering tool by date or by dimension: when you filter by date, all charts should change based on the filter

As part of serving the data, you can also create an API that generates a CSV file containing a summary of the data (see the sketch below); this part is optional.
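
If you choose to build the optional API, a minimal Flask sketch might look like the following, assuming it reads the transformed CSV from the earlier sketch; the file name, route, and column names (year, total_amount) are placeholders, and the endpoint could just as well query the warehouse directly.

```python
import io

import pandas as pd
from flask import Flask, Response  # pip install flask

app = Flask(__name__)

@app.route("/summary.csv")
def summary_csv() -> Response:
    """Return a CSV summary of the transformed data (placeholder aggregation)."""
    df = pd.read_csv("transformed_data.csv")  # or query the data warehouse instead
    summary = df.groupby("year", as_index=False)["total_amount"].sum()
    buffer = io.StringIO()
    summary.to_csv(buffer, index=False)
    return Response(buffer.getvalue(), mimetype="text/csv")

if __name__ == "__main__":
    app.run(port=8000)
```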

Deliverables - 5pts

Notes: All deliverables must be in GitHub (this is your final report), with all the tasks you have executed, including a README that explains the project.
