Data Cleaning: First Step into Data Analysis

Data Analysis refers to cleaning, transforming, and modeling data in order to extract information for making educated business decisions. The process involves the application of statistical and logical techniques to evaluate data. A critical point in data analysis is the kind of data we analyze. Clean data can deliver results as expected. On the contrary, businesses can end up taking false decisions based on conclusions drawn from the analysis of bad data.
What is data cleaning?

Data cleaning is the process of correcting misleading records from a dataset (database, tables). It refers to identify unreliable or irrelevant parts of data and removing them. Data cleaning can be performed in batch scripting or interactively. Data Analysts/ Data Scientists end up spending about half of their time cleaning the data, and we can infer that data cleaning is not just essential but a time-consuming process. The rise of big data makes data cleaning all the more important.

Common Data Cleaning techniques in Excel:

Get rid of extra spaces: In most of the cases, data is recorded manually. Any human involvement might lead to common errors such as extra spaces between words. For eg. “Data Cleaning”, “Data Cleaning”, “Data Cleaning” are all same 2 words but the first instance is most appropriate, while second and third are separated with unnecessary spaces. We can use: =TRIM(Text). What this function does is it would remove all the leading spaces and trailing spaces and extra spaces between words except one single space that is allowed.
Spell Check: If any data set is recorded manually, another associated human error could be spelling errors. To check and correct these, select the data and press F7 and when you do that it runs the spellcheck for you and it is the same thing that you see in Microsoft Word or PowerPoint. It will show you the text that it thinks is a spelling error and it will show you the suggestions as well so you can change these and once it is done it will show you that spellcheck is complete and you are good to go.
Text to Lower/ Upper/ Proper Case: SYNTAX: LOWER() – Converts all text into Lower Case. ex. mary jane . UPPER() – Converts all text into Upper Case. ex. MARY JANE. PROPER() – Converts all Text into Proper Case. ex. Mary Jane

More examples Removing Duplicates, Highlighting errors, Removing Formatting, etc. and can always be found online. Just Google it!

Conclusion:

Data Science is inseperable part of data analysis. We can divide the entire process into 4 part: collecting data, cleaning data, analyzing data, and presenting the data. The presentation will highly be affected in terms of quality if we skip the cleaning part in our analysis.

 

-Nishant Shristiraj (California State University, Los Angels)

Comments