-
Notifications
You must be signed in to change notification settings - Fork 104
4.4.1.Manually cleaning data
A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable
A file containing a chronologically ordered list of modifications made to a project
- Consider the business problem
- Consider the goal
- Consider the data
The CASE statement goes through one or more conditions and returns a value as soon as a condition is met
Example:
SELECT
customer_id,
CASE
WHEN first_name = 'Tnoy' THEN 'Tony'
WHEN first_name = 'Tmo' THEN 'Tom'
WHEN first_name = 'Rachle' THEN 'Rachel'
ELSE first_name
END AS cleaned_name
FROM
customer_data.customer_name
This reading will give you a checklist of common problems you can refer to when doing your data cleaning verification, no matter what tool you are using. When it comes to data cleaning verification, there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects. Each project has its own organization and data requirements that lead to a unique list of things to run through for verification.
Keep in mind, as you receive more data or a better understanding of the project goal(s), you might want to revisit some or all of these steps.
Make sure you identified the most common problems and corrected them, including:
- Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
- Null data: Did you search for NULLs using conditional formatting and filters?
- Misspelled words: Did you locate all misspellings?
- Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
- Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
- Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
- Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
- Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
- Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
- Misleading variable labels (columns): Did you name your columns meaningfully?
- Truncated data: Did you check for truncated or missing data that needs correction?
- Business Logic: Did you check that the data makes sense given your knowledge of the business?
Once you have finished these data cleaning tasks, it is a good idea to review the goal of your project and confirm that your data is still aligned with that goal. This is a continuous process that you will do throughout your project-- but here are three steps you can keep in mind while thinking about this:
- Confirm the business problem
- Confirm the goal of the project
- Verify that data can solve the problem and is aligned to the goal
TOTAL POINTS 4
Making sure data is properly verified is an important part of the data-cleaning process. Which of the following tasks are involved in this verification? Select all that apply.
- Considering whether the data is credible and appropriate for the project
Asking stakeholders to check and confirm the data is clean- Rechecking the data-cleaning effort
- Manually fixing any errors found in the data
Correct. The verification process confirms that data cleaning was well executed and the resulting data is accurate and reliable. To verify data, analysts recheck the data-cleaning effort, manually fix errors in the data, and consider whether the data is credible and appropriate for the project.
Fill in the blank: To count the total number of spreadsheet values within a specified range, a data analyst uses the _____ function.
WHOLETOTALSUM- COUNTA
Correct. To count the total number of spreadsheet values within a specified range, a data analyst uses the COUNTA function.
A data analyst is cleaning a dataset with inconsistent formats and repeated cases. They use the TRIM function to remove extra spaces from string variables. What other tools can they use for data cleaning? Select all that apply.
Protect sheet- Remove duplicates
Import data- Find and replace
Correct. The analyst can use TRIM*, remove duplicates, and find and replace for data cleaning.*
To correct a typo in a database column, where should you insert a CASE statement in a query?
- As a SELECT clause
As an ORDER BY clauseAs a FROM clauseAs a GROUP BY clause
Correct. You should add a CASE
statement as a SELECT
clause. A CASE
statement goes through one or more conditions and returns a value as soon as a condition is met. The typo would be a condition and the correction would be the returned value for the condition.
https://www.coursera.org/professional-certificates/google-data-analytics
© 2021 Coursera Inc. All rights reserved.