Data Analysis - Processing Data


The objective of this assignment is to use spreadsheets to process the data that you collected in your survey. You will clean the data and produce a summary of it. You will use Google Sheets, a spreadsheet application comparable to Microsoft Excel.
Place all answers and screenshots in a Google Document. The steps below tell you what to put in the assignment document. Begin each entry in the assignment document with its question number from the assignment.

*Part 1: Cleaning Data

Using computational tools to analyze data has made it much easier to find trends and patterns in large datasets. When preparing data for this kind of analysis, however, it’s important to remember that the computer is much less “intelligent” than we might imagine. Small discrepancies in the data may prevent accurate interpretation of trends and patterns or can even make it impossible to use the data in computation in the first place. Cleaning data is therefore an important step in analyzing it, and in many contexts, it may actually take the largest amount of time.

You will work with the data that you collected from your survey. In this part, you do the first step: cleaning the data. You will begin by filtering and sorting the data to familiarize yourself with its contents. You will then correct errors that you find in the data by either hand-correcting invalid values or deleting them. Finally, you will categorize any free-text columns that were collected to prepare for analysis. This lesson introduces many new skills with spreadsheets and reveals the sometimes subjective nature of data analysis.

Follow this tutorial to clean data from your survey.

  1. In your answer document place a screenshot of your original spreadsheet of responses from your survey showing the column headers and at least 5 rows of data.
  2. In your answer document place a screenshot showing how you filtered data in one of your columns as described in the Filter Your Data part of the tutorial. Add a sentence or two describing what you did and how it filtered the data.
  3. In your answer document place a screenshot showing how you did more complex filtering in two different columns. In one column choose at least two values. In the other column use a Conditional filter, as described in the More Complex Filtering part of the tutorial. Add a sentence or two describing what you did and how it filtered the data.
  4. In your answer document place a screenshot showing sorted data in a column as described in the Data Sorting part of the tutorial. Add a sentence or two describing what you did and how it sorted the data.
  5. Correct the errors that you can and delete the values that make no sense, as described in the Fixing Errors part of the tutorial. In your answer document place a screenshot showing the corrected data. Add a few sentences describing the errors that you fixed and how you fixed them.
  6. Create at least one new column in your dataset that standardizes a column of data collected as "free form text", as described in the Categorizing Data part of the tutorial. To do this you will need to invent a set of categories that applies to your data. In your answer document place a screenshot showing the new column. Add a sentence or two describing how you categorized the data.
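
For readers curious how the same cleaning steps look outside a spreadsheet, here is a minimal Python sketch. It is not part of the tutorial and is not required for the assignment. The survey rows, the column names, the CORRECTIONS and CATEGORIES tables, and the 5 to 120 age range are all made-up assumptions chosen only for illustration.

```python
# Hypothetical survey rows (invented for illustration, not real data).
rows = [
    {"name": "A", "age": "16", "favorite_snack": "Chips "},
    {"name": "B", "age": "seventeen", "favorite_snack": "apple"},
    {"name": "C", "age": "15", "favorite_snack": "potato chips"},
    {"name": "D", "age": "160", "favorite_snack": "Banana"},
]

# Hand-corrections for values whose meaning is obvious (assumed).
CORRECTIONS = {"seventeen": "17"}

# An invented category scheme for the free-text snack column (assumed).
CATEGORIES = {
    "chips": "salty",
    "potato chips": "salty",
    "apple": "fruit",
    "banana": "fruit",
}


def clean_age(text):
    """Return the age as an int, or None if the value makes no sense."""
    text = text.strip().lower()
    text = CORRECTIONS.get(text, text)
    if not text.isdigit():
        return None
    age = int(text)
    # The plausible range is an assumption; pick one that fits your survey.
    return age if 5 <= age <= 120 else None


def categorize(snack):
    """Map a free-text answer to one of the invented categories."""
    return CATEGORIES.get(snack.strip().lower(), "other")


# Fix what can be fixed, delete rows that cannot, and add a category column.
cleaned = []
for row in rows:
    age = clean_age(row["age"])
    if age is None:
        continue  # delete the row: the age value makes no sense
    cleaned.append({
        "name": row["name"],
        "age": age,
        "snack_category": categorize(row["favorite_snack"]),
    })

# Sort by age, then filter to a single category.
cleaned.sort(key=lambda r: r["age"])
salty = [r for r in cleaned if r["snack_category"] == "salty"]

for r in cleaned:
    print(r)
```

Notice that deciding which values to correct, which to delete, and which categories to use is a human judgment in code just as it is in a spreadsheet.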

*Part 2: Creating Summary Tables

A summary table is a table used to compute summary statistics of a larger dataset. Summary tables are another way that computational tools can be used to look more closely at data in order to identify trends and patterns. They can often be good data visualizations on their own, and they are quite useful when trying to make charts from larger collections of data.

Vocabulary:

Follow this tutorial to create summary tables from the movie data set in the tutorial.

  1. In your answers document add a screenshot showing the pivot table that you created in the Making Summary Tables section of the tutorial.
  2. In your answers document add a screenshot showing a table that displays the average rating for every movie listed in the data set as described in the Adding Rows section of the tutorial.
  3. In your answers document add a screenshot showing a table that displays the number of ratings for each movie in the data set as described in the Let's Change the Value -- Summarize by: COUNT section of the tutorial.
  4. In your answers document add a screenshot showing a table with another Values field added as described in the Add Another Field To Values section of the tutorial.
  5. In your answers document add a screenshot showing a table with columns added that group your data by gender as described in the Add Columns section of the tutorial.
  6. In your answers document add a screenshot showing a chart of movies where the differences between male and female ratings are significant as described in the Manipulating The Pivot Table section of the tutorial.
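
The idea behind a summary (pivot) table can also be sketched in a few lines of Python. This is only an illustration of grouping and summarizing, not the tutorial's method. The ratings below are invented, and the summarize function is a hypothetical helper, not part of Google Sheets or any library.

```python
from collections import defaultdict

# Hypothetical ratings as (movie, gender, rating); invented for illustration.
ratings = [
    ("Movie A", "F", 4),
    ("Movie A", "M", 2),
    ("Movie A", "F", 5),
    ("Movie B", "M", 3),
    ("Movie B", "F", 3),
]


def summarize(records, key):
    """Group ratings by key(record) and compute COUNT and AVERAGE per group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {
        group: {"count": len(values), "average": sum(values) / len(values)}
        for group, values in groups.items()
    }


# Rows only: one summary row per movie.
by_movie = summarize(ratings, key=lambda r: r[0])

# Rows and columns: one summary cell per (movie, gender) pair.
by_movie_and_gender = summarize(ratings, key=lambda r: (r[0], r[1]))

for movie, stats in by_movie.items():
    print(movie, stats)
```

Grouping by movie alone corresponds to adding a Rows field, and grouping by the (movie, gender) pair corresponds to adding a Columns field as well.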

*Part 3: Tell A Data Story With Your Data

We collect data to learn about the world and test our hypotheses about the patterns and relationships that might otherwise be invisible. Computational tools now make it possible to quickly collect and analyze vast amounts of data, but those processes still depend on knowledgeable and thoughtful people who understand the data being used and can clearly present the conclusions drawn from it. For this part you will be doing just that with the data that you have collected.

In your answers document respond to the written prompts below. Assume that you are describing your data collection and analysis to someone who is not in the class. You need to provide detail on how you performed the steps.

  1. Describe how you collected the data for your visualization including the techniques that you used to collect it, when it was collected, and over what period of time. Your description should be understandable by someone unfamiliar with this project. This should be 80-120 words.
  2. Describe your development process, explicitly identifying the computing tools and techniques you used to create your artifact. Your description must be detailed enough so that a person unfamiliar with those tools and techniques will understand your process. This should be 80-120 words and contain screenshots showing any filtering, cleaning, sorting, or categorizing that you did.
  3. Describe your findings. This should be 180-200 words.
  4. What point about the impact of your computing innovation does your data analysis make? Make the point and support it with the data and your visualization(s). This should be 50 to 150 words.

Part 4: Package Your Data To Share

The Internet has enabled the Big Data phenomenon by facilitating sharing of data among people. In this part, you will package your data to share.

  1. The data sets that you downloaded in these tutorials were Comma Separated Value (CSV) files. CSV is the most common way that people share data because not everyone has the same applications to work on data. For instance, some people may use Excel, some may use Google Sheets, and still others may use a database program like MySQL or TinyDB to work with data. A CSV file contains only the data values as plain text, with commas separating them, so it does not depend on any particular application. Because of this, most data sets are shared as CSV files, and most applications that work with data can import and export data in CSV format. Save your survey data in a CSV file (check Google Sheets Help if you need to see how to do this) named lastname_survey.csv, where lastname is your last name.
  2. Open the CSV in a text editor like Notepad on Windows or TextEdit on Macs. Place in your answers document a screenshot of your data displayed in CSV format as shown by the text editor.
  3. Describe in a sentence or two what you see when your CSV file is displayed in the text editor.
  4. Create a README file for your data, like the one you used in a previous assignment. Download the file and name it lastname_readme.txt, where lastname is your last name.
  5. Share your Google Sheet with your cleaned data, as directed by your teacher.
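
To see what "only the data values in text" means, here is a small Python sketch that writes a few rows as CSV text and reads them back. It uses Python's standard csv module. The column names and values are invented for illustration, and the text is kept in memory rather than saved to a file.

```python
import csv
import io

# Hypothetical header and rows (invented for illustration).
header = ["name", "age", "favorite_snack"]
rows = [
    ["A", 16, "chips"],
    ["B", 17, "apples, sliced"],  # this value contains a comma
]

# Write the rows as CSV text into an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

# This is what a text editor would show for the same data.
print(csv_text)

# Any program that understands CSV can read the text back into rows.
read_back = list(csv.reader(io.StringIO(csv_text)))
print(read_back)
```

Two details are worth looking for in your own file: a value that itself contains a comma is wrapped in double quotes so it is not split in two, and every value, including numbers, comes back as plain text.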

*Part 5: The Bad Side: Data Breaches

You have been thinking primarily about the beneficial effects of collecting and analyzing data. This look at data breaches is intended to help you explore the potential harmful effects of collecting data, specifically with regard to privacy and security.

Examine this visualization of data about data breaches. In your answer document, answer:

  1. Using the criteria for good data visualizations from a previous lesson, how effective is this visualization? What is good about it? What is bad about it? Your answer should be about 50 words.
  2. Based on the information in the visualization, what kind of data is being lost? Use specific examples from the visualization. Your answer should be about 50 words.
  3. What kinds of issues could arise from this data getting into the wrong hands? Your answer should be 50-100 words.



* Wording, concepts, and materials for parts of this assignment marked with * were taken from educational material provided by code.org under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.