CIS4400-HW

The course CIS 4400: Data Warehouse for Analytics I took at Baruch College with Prof. Bien-Aime

Homework #1

Step 1: Data Sourcing Your first step with the project is to get familiar with the data. You need to understand how it is structured and most importantly find the data dictionary associated with it. If it is not there, then you will have to build a data dictionary. The latter should contain the name of the field, the description, the datatype, and any constraints associated with the field.

You will need to source those data using one or more of the following methods: · Web Scrapping · Web API · Connection to Database · Connection to Data Store (Cloud Storage)

Deliverables - 3pts

7. Your script should be stored in a git repository that is accessible to all members of your team and the professor.
6. Git Repository Created
https://github.com/timnaimov/CIS4400-HW1/tree/main
5. Scripts that gather these data -
4. Github/AzureDevops/Jira account created - Github
3. Link that shows the data dictionary (excel, google sheets)
https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields
2. Explanation of the data (where does it come from) - FFIEC Home Mortgage Disclosure Act - The dynamic files contain the national HMDA datasets for all HMDA reporters, modified by the Bureau to protect applicant and borrower privacy, updated to include late submissions and resubmissions. The dynamic files are available to download in a pipe delimited text file format. The dynamic datasets are updated on Mondays with HMDA submissions received through the previous Sunday night.
1. Link of all data sources:
Dynamic National Loan-Level Dataset https://ffiec.cfpb.gov/data-publication/dynamic-national-loan-level-dataset/2022

Step 2: Storage Your next step is to choose the appropriate data store for your data. Remember in the previous step, you had to source the data using a script or a specific tool. The data stores of choice are the following Database, Storage S3, MongoDB. Make sure the data are properly stored and not scattered. If need be, you will need also to mark the date the data was stored as well. It is recommended you watch the async videos.

Deliverables - 3pts

Scripts updated from the first deliverables. You will need to update those scripts to store to data into their specific storage
Git repository updated
Data Stored in an orderly fashion in the storage
Storage of choice - Google Cloud

Step 3: Modeling Once you have done the storage, you will need to start the modeling of the data warehouse. Remember the Data Warehouse contains already two main aspects. A fact table and a dimension table. The fact table must have a surrogate key as well as each dimension table. Modeling can be done using any tools.

Deliverables - 4pts

Scripts that create the Data Warehouse
Data Warehouse accessible to everyone in the team and can’t be accessed through a client (DataGrip, DbSchema, SqlDBM)
Scripts from previous steps updated.
Git Repository Updated
Data Model documented showing the fact table and the dimension table. - CIS 4400 HW1 Fact and Dimentions.xlsx

Homework #2

You are required to choose the previous data from homework 1. If you want to change your data, you are free to do so. However, you will have to start the whole homework 1 and it will not be graded. You are free to use any cloud provider. You are required to check the feedback from the professor.

Step 1: Transformation Once you have stored the data, the next step would be to transform the data. Data should be transformed according to specific business rules. While transforming the data, you should consider the following.

This is only a limited version of what you can do. There is more to that. Remember also to update your data dictionary. You have the following options: 1. Use a transformation tool ETL tools to do the transformation. 2. Create scripts that do this transformation. 3. Git repository Updated.

Deliverables - 5pts

Git Repository Updated
ETL fully created to push the data to DataWarehouse
Data transformation project created on the cloud in case you are using Option One
Data Mapping Created. Data dictionary updated.
Scripts from previous steps updated.

Step 2: Modeling

Once you have done the transformation, you will need to update the modeling of the data warehouse. Remember the Data Warehouse contains already two main aspects. A fact table and a dimension table. The fact table must have a surrogate key as well as each dimension table. Modeling can be done using any tools. Your data warehouse should be in Redshift.

Deliverables - 1pts

Git Repository Updated
Data Warehouse accessible to everyone in the team and can’t be accessed through a client (DBeaver, DataGrip)
Data Inserted into the Data Warehouse
Data Warehouse Created in AWS with Redshift
Scripts from previous steps updated.
Scripts that create the Data Warehouse
Data Model documented showing the fact table and the dimension table.

Step 3: Serving Data You will be using an online visualization tool to show the data that you have to transform. You should apply all the visualization practices you have seen in all sessions. The following must be part of the Visual:

A Heat Map
A Line Chart
A Column Chart
A Pie Chart
A Filtering tool by date or by dimension: When you filter by date, all charts should change based on the filter As part of the service DATA as well, you will need to create an API that will generate a CSV file that contains a summary of the data. This is optional.

Deliverables - 5pts

Tableau Links:
Loan Amount by Race and Purpose- Highlight Table: https://public.tableau.com/views/LoanAmountbyRaceandPurpose-HighlightTable/LoanAmountbyRaceandPurpose?:language=en-US&:sid=&:display_count=n&:origin=viz_share_link
Total Loan Amount by Sex and Age - Stacked Bar Graph: https://public.tableau.com/views/TotalLoanAmountbySexandAge-StackedBarGraph/LoanAmountbySexandAge?:language=en-US&:sid=&:display_count=n&:origin=viz_share_link
Loan Term by State - Tree Map: https://public.tableau.com/views/LoanTermbyState-TreeMap/StatesbyLoanTerm?:language=en-US&:sid=&:display_count=n&:origin=viz_share_link
An API using Python script that will generate data in a CSV format which can be used by the Data Analyst.
A link using AWS Quick Sight that connects to the Data Warehouse and shows the data.
Git Repository Updated

Notes: All deliverables must be in the GitHub (this is your final report) with all the tasks you have executed including a readme that explains the project.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
analyses		analyses
data-model		data-model
dbt-data-warehouse		dbt-data-warehouse
docs		docs
scripts		scripts
tableau-insights		tableau-insights
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CIS4400-HW

About

Releases

Packages

Languages

timnaimov/CIS-4400-HW

Folders and files

Latest commit

History

Repository files navigation

CIS4400-HW

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages