This section outlines the data manipulation and transformation performed on the dataset.
Before any manipulation or transformation was performed, measures were taken to ensure data integrity. These measures included:
Creation of a Backup Table
A backup table, "session", with the exact columns of the imported dataset was created. This served as a safeguard against unintended changes to the original records.
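In MySQL, the backup step could be sketched as follows; the backup table name "session_backup" is an illustrative assumption, not a name from the source.

```sql
-- Create an empty table with the same column definitions as the original,
-- then copy every row into it ("session_backup" is a hypothetical name)
CREATE TABLE session_backup LIKE session;

INSERT INTO session_backup
SELECT * FROM session;
```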
Data Inconsistency
The DISTINCT keyword was used to inspect categorical variables and standardise their values. For example, the "device" field contained abbreviated values such as "m" and "d".
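A minimal sketch of the inspection and update in MySQL; the full labels "mobile" and "desktop" are assumptions, since the source does not state what the abbreviations were mapped to.

```sql
-- List every distinct value stored in the categorical field
SELECT DISTINCT device FROM session;

-- Standardise abbreviated values (the target labels are assumptions)
UPDATE session SET device = 'mobile'  WHERE device = 'm';
UPDATE session SET device = 'desktop' WHERE device = 'd';
```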
Duplicate Record Removal
192 duplicate records were identified and removed based on multiple criteria, including session_id, created_at, device, and status, leaving 2,118 unique records.
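One way to perform the removal described above is to rebuild the table from its distinct rows, keeping one copy of each duplicated record; this sketch assumes the six columns listed in the data dictionary.

```sql
-- Copy only distinct rows into a staging table, then swap it in
CREATE TABLE session_dedup AS
SELECT DISTINCT session_id, created_at, device, status, load_time, view_time
FROM session;

DROP TABLE session;
RENAME TABLE session_dedup TO session;
```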
Handled Null & Empty Values
Fields with null values were identified and addressed. Given the small size of the dataset, nulls were filled with the column average rather than dropped.
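Filling nulls with the column average could look like the following MySQL sketch, shown for "load_time"; the derived table works around MySQL's restriction on selecting from the table being updated.

```sql
-- Replace missing load_time values with the column average
UPDATE session s
JOIN (SELECT AVG(load_time) AS avg_lt FROM session) a
SET s.load_time = a.avg_lt
WHERE s.load_time IS NULL;
```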
Outlier Removal
Negative, zero, and implausibly extreme values in the "load_time" and "view_time" variables were examined and removed to ensure data integrity.
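A sketch of the outlier removal, assuming the values are still in milliseconds at this stage; the upper cut-off of 60000 ms is a hypothetical threshold, since the source does not state one.

```sql
-- Drop non-positive timings
DELETE FROM session WHERE load_time <= 0 OR view_time <= 0;

-- Drop implausibly large timings (60000 ms is a hypothetical cut-off)
DELETE FROM session WHERE load_time > 60000 OR view_time > 60000;
```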
Time Unit Conversion
The "load_time" and "view_time" variables were converted from milliseconds to seconds for consistency and easier interpretation.
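The conversion can be done in place:

```sql
-- Convert both timing columns from milliseconds to seconds
UPDATE session
SET load_time = load_time / 1000,
    view_time = view_time / 1000;
```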
Data Type Modification
Data types were inspected for the categorical and numerical variables, and appropriate changes were made to enhance storage efficiency and query performance.
The data dictionary below summarises the changes:
| Field | Before | After |
|---|---|---|
| session_id | INT | INT NOT NULL AUTO_INCREMENT PRIMARY KEY |
| created_at | VARCHAR(255) | DATE |
| device | VARCHAR(255) | VARCHAR(10) |
| load_time | VARCHAR(255) | DECIMAL(10,1) |
| view_time | VARCHAR(255) | DECIMAL(10,1) |
| status | VARCHAR(255) | TINYINT UNSIGNED |
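The type changes in the dictionary above map directly onto MySQL ALTER TABLE statements. Note that converting "created_at" from VARCHAR to DATE assumes the stored strings are already in a format MySQL can cast (otherwise STR_TO_DATE would be needed first).

```sql
ALTER TABLE session
  MODIFY created_at DATE,
  MODIFY device     VARCHAR(10),
  MODIFY load_time  DECIMAL(10,1),
  MODIFY view_time  DECIMAL(10,1),
  MODIFY status     TINYINT UNSIGNED;

-- AUTO_INCREMENT requires the column to be a key, so the primary key is
-- added in the same statement
ALTER TABLE session
  ADD PRIMARY KEY (session_id),
  MODIFY session_id INT NOT NULL AUTO_INCREMENT;
```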