
Data Cleaning

This section outlines the data manipulation and transformation steps applied to the dataset.

Pre-Modification Measures

Before any manipulation or transformation was performed, measures were taken to ensure data integrity. These measures included:

  • Creation of a Backup Table
    A backup of the "session" table, containing the exact columns of the imported dataset, was created. This served as a safeguard against unintended changes to the original records.
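A backup like this can be created in two statements (a sketch; the backup table name `session_backup` is an assumption, not from the project):

```sql
-- Sketch (MySQL): clone the structure, then copy the data.
-- `session_backup` is an assumed name for the backup table.
CREATE TABLE session_backup LIKE session;
INSERT INTO session_backup SELECT * FROM session;
```

`CREATE TABLE ... LIKE` copies column definitions and indexes but no rows, which is why the separate `INSERT ... SELECT` is needed.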

Cleaning Steps Applied:

Data Inconsistency
The DISTINCT function was used to inspect categorical variables and standardise their values; e.g. the device field contained inconsistent values such as "m" and "d", which were updated.

Duplicate Record Removal
192 duplicate records were identified and removed based on multiple criteria, including session_id, created_at, device, and status, leaving 2,118 unique records.
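For fully duplicated rows, one common approach is to rebuild the table from the distinct rows and swap it in (a sketch; the intermediate table name `session_dedup` is an assumption):

```sql
-- Sketch (MySQL): keep one copy of each distinct row.
CREATE TABLE session_dedup AS
SELECT DISTINCT * FROM session;

-- Swap the deduplicated table in place of the original.
DROP TABLE session;
RENAME TABLE session_dedup TO session;
```

Because a backup table was created beforehand, the destructive `DROP TABLE` step carries little risk here.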

Handled Null & Empty Values
Fields with null or empty values were identified and addressed. Nulls were filled with the column average (mean imputation) rather than dropped, since the dataset is small and dropping rows would lose too much data.
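Mean imputation for one column might be sketched as below; the same pattern applies to the other numeric fields. Joining a derived table sidesteps MySQL's restriction on re-reading the table being updated.

```sql
-- Sketch (MySQL): fill missing load_time values with the column mean.
-- AVG() ignores NULLs, so the average is computed over known values only.
UPDATE session s
JOIN (SELECT AVG(load_time) AS avg_lt FROM session) a
SET s.load_time = a.avg_lt
WHERE s.load_time IS NULL;
```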

Outlier Removal
Negative, zero and extreme values in the "load_time" and "view_time" variables were examined and removed to ensure data integrity.
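A removal step along these lines could be sketched as follows. The 60 000 ms upper bound is an assumed threshold for illustration; the source does not state the cutoff used for "extreme" values.

```sql
-- Sketch: drop rows with non-positive or implausibly large timings.
-- The 60 000 ms (60 s) cap is an assumption, not from the project.
DELETE FROM session
WHERE load_time <= 0 OR view_time <= 0
   OR load_time > 60000 OR view_time > 60000;
```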

Time Unit Conversion
The "load_time" and "view_time" variables were converted from milliseconds to seconds for consistency and easy interpretation.

Data Type Modification
Data types were inspected for categorical and numerical variables, and appropriate changes were made to enhance storage efficiency and query performance.

The data dictionary below summarises the changes:

| Field      | Before       | After                                   |
|------------|--------------|-----------------------------------------|
| session_id | INT          | INT NOT NULL AUTO_INCREMENT PRIMARY KEY |
| created_at | VARCHAR(255) | DATE                                    |
| device     | VARCHAR(255) | VARCHAR(10)                             |
| load_time  | VARCHAR(255) | DECIMAL(10,1)                           |
| view_time  | VARCHAR(255) | DECIMAL(10,1)                           |
| status     | VARCHAR(255) | TINYINT UNSIGNED                        |
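The type changes in the dictionary can be applied in one statement (a sketch; it assumes the created_at strings parse as dates and the numeric columns are clean after the steps above):

```sql
-- Sketch (MySQL): apply the target types from the data dictionary.
ALTER TABLE session
    MODIFY created_at DATE,
    MODIFY device     VARCHAR(10),
    MODIFY load_time  DECIMAL(10,1),
    MODIFY view_time  DECIMAL(10,1),
    MODIFY status     TINYINT UNSIGNED,
    MODIFY session_id INT NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (session_id);
```

Narrower types such as VARCHAR(10) and TINYINT UNSIGNED reduce row size, which in turn improves scan and index performance.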