*Data Cleaning Tutorial



This tutorial will take you through cleaning your data. It will use an example where students took a survey of their classmates over several weeks of what they did in the evening. Place a copy of this "What Did You Do Last Night" survey data file on your Google Drive.

Filtering Your Data

Your data may be too large to look at all at once. One way to address this problem is to filter the data so that you are only looking at some of the rows. Here's how you can do it in Google Sheets.
Do This: Follow the example in the animation below. Filter your own data on one of its columns:

More Complex Filtering

Filters may include more than one value, just check all the values you'd like to include. You can even use a conditional statement as a filter. For example, you can make a filter that will only show values if the value is less than some number you specify. It's also possible to filter on multiple columns at the same time. More complex filters help refine how you look at your data.
Do This: Add filters to two different columns. In one column choose at least two values. In the other column use a Conditional filter, as shown below.

Sorting Data

Sorting will reorder your rows of data by one of the columns. This makes it easy to see the smallest or largest value in each column. You may also notice patterns in your data once it is in sorted order.
Do This: Sort your data by at least one of the columns, both A -> Z and Z -> A:

Cleaning Data

Even when it is collected carefully (and especially if it isn't) data usually has issues that make it difficult to use for analysis. Values may have been entered incorrectly, not be in a usable format, or your data might just be poorly organized. It is important to fix these issues prior to exploring the data for trends and patterns. The process of preparing data for analysis is called "cleaning" the data

Fixing Errors

As you looked through your data you probably already noticed errors. Sometimes it's easy to know how to fix an error in the data. At other times it's not as obvious. In general, if you are certain what the data is supposed to be, you can change it by hand. Otherwise it is better to just delete it.
Do This: Use sorting and searching to find errors in your data. Are some values too large or too small to make sense? Did some users input text when they should have used a number? Correct the errors that you can and delete the values that make no sense.

Categorizing Data

In order to use computers to analyze data we usually need it to be standardized in some way. Data collected as "free form text" will be particularly susceptible to this problem. If you ask people "What did you do last night" you will likely get a different response from every single person. Making charts or tables with this information would be meaningless and confusing. Free form text data like this may be useful for a human to read but cannot easily be used by a computer. In order to fix this you might need to create new columns of data by hand which categorize free form text into data that is more useful for computation. In the example below you'll see one possible way to categorize responses to the question, "What did you do to relax last night?" In this case the data is being categorized by what the person was doing. The resulting column of data will be much easier to use when analyzing this data with a computer.

Note: This process is being done by hand using only one of many possible methods for creating categories. There are always many different ways to categorize data and there will rarely be one clearly "right" way. You'll have to make judgment calls about what makes the most sense.
Do This: Create at least one new column in your dataset that standardizes a column of data collected as "free form text". To do this you will need to invent a set of categories that applies to your data. This may take a while to do. Consult with your partners and make sure you have a generally agreed-upon set of rules as you start.





* Wording, concepts, and materials for parts of this tutorial were taken from educational material provided by code.org under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License .