An exploratory data analysis lets you examine a dataset and summarize the characteristics of the data that interest you. Being able to clean a dataset and
perform an exploratory data analysis on it is a critical skill for data analysts and data scientists. You must become familiar with a dataset before making
inferences about the population from which it was drawn.
For the Final Project, you will develop an analytical report using survey data. You will submit milestones this week (Nov 14). You will submit a
completed report at the end of this month (Nov 28).
We will be using the 1994 wave of NLSY1979, a survey of young adults in the USA. The dataset and the codebooks are below.
Data 1994 + Codebook NLSY1979.zip (download codebook and dataset)
Note that a negative number usually indicates a missing observation. The majors are listed at the link below; I recommend aggregating them into fewer,
broader categories if you want to use those variables.
https://www.nlsinfo.org/content/cohorts/nlsy79/other-documentation/codebook-supplement/nlsy79-attachment-4-fields-study#business
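As a minimal sketch of handling the negative missing-value codes in R, the snippet below recodes negative values to NA on a small made-up data frame. The column names (birth_year, income) and the values are hypothetical placeholders, not names from the codebook, and the sketch assumes none of the selected variables can legitimately be negative.

```r
# Toy data standing in for a few extracted columns; the column names and
# values are made up for illustration, not taken from the codebook.
nlsy <- data.frame(
  birth_year = c(1960, 1962, -3, 1958),
  income     = c(25000, -4, 40000, -1)
)

# Recode every negative value to NA. Check the codebook first: a variable
# that can legitimately be negative must be left out of this step.
recode_negative <- function(x) {
  x[!is.na(x) & x < 0] <- NA
  x
}
nlsy[] <- lapply(nlsy, recode_negative)

# Count missing values per variable after recoding.
colSums(is.na(nlsy))
```

Recoding to NA, rather than deleting rows, keeps the sample size visible and lets functions such as mean() skip the missing values with na.rm = TRUE.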
Using the data set provided, complete an exploratory data analysis. Extract or derive the following variables:
Year of birth, country of birth, race, and sex
3 to 5 more variables of interest
Using the extracted variables:
Generate additional variables, such as age, and indicators for categorical variables such as white, black, gender, region, undergraduate major, and employment status. A black indicator variable, for example, equals 1 if the respondent is black and 0 otherwise. You will need many indicator variables.
Obtain descriptive statistics of the data (counts, proportions, means, SDs, etc.) for the entire sample and by group (by gender and by race, or employed vs. unemployed, for example).
Create visualizations to represent the data (histograms, bar charts, line charts, etc.) for the entire sample and by group.
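The three steps above could be sketched in base R roughly as follows. The data frame is made up, the column names are hypothetical, and the category codes (2 = female, 2 = black) are assumptions that should be checked against the codebook. The age calculation also assumes birth year is stored as a four-digit year and gives only an approximate age in 1994.

```r
# Toy data; column names and category codes are hypothetical placeholders.
# Confirm the actual codes in the codebook before reusing this pattern.
nlsy <- data.frame(
  birth_year = c(1960, 1962, 1958, 1964, 1961, 1959),
  sex        = c(1, 2, 2, 1, 2, 1),   # assumed: 1 = male, 2 = female
  race       = c(2, 3, 1, 2, 3, 3),   # assumed: 2 = black
  income     = c(25000, 31000, NA, 18000, 42000, 36000)
)

# Derived variables: approximate age in 1994 and 0/1 indicators.
nlsy$age    <- 1994 - nlsy$birth_year
nlsy$female <- as.integer(nlsy$sex == 2)
nlsy$black  <- as.integer(nlsy$race == 2)

# Descriptive statistics for the entire sample.
n_obs       <- nrow(nlsy)
prop_female <- mean(nlsy$female)
mean_income <- mean(nlsy$income, na.rm = TRUE)
sd_income   <- sd(nlsy$income, na.rm = TRUE)

# The same kind of statistics by group (rows with missing income are dropped).
count_by_sex  <- table(nlsy$female)
income_by_sex <- aggregate(income ~ female, data = nlsy, FUN = mean)

# Visualizations for the entire sample and by group; drawn only in an
# interactive session so the script also runs cleanly in batch mode.
if (interactive()) {
  hist(nlsy$income, main = "Income, full sample", xlab = "Income")
  barplot(tapply(nlsy$income, nlsy$female, mean, na.rm = TRUE),
          names.arg = c("Male", "Female"), ylab = "Mean income")
}
```

The same pattern extends to other indicators (region, major, employment status) by changing the comparison inside as.integer().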
As you explore your data and see it visualized, you may find that your dataset has extreme outliers, incomplete data or just wrong data. If this is the case, you
will need to clean your data to get a better understanding of the data. Once you have cleaned the data, perform the exploratory analysis again.
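One possible way to screen for extreme outliers is sketched below, using the common 1.5 x IQR rule on made-up income values. This rule is only one convention, not a method the assignment prescribes, and flagged values should be inspected before anything is changed.

```r
# Toy income values with one implausible entry; the values are made up.
income <- c(18000, 25000, 31000, 36000, 42000, 9999999)

# Flag values more than 1.5 * IQR below the first quartile or above the third.
q     <- quantile(income, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(income) & (income < q[1] - fence | income > q[2] + fence)

# Inspect the flagged values before deciding what to do with them.
income[is_outlier]

# Set flagged values to NA instead of dropping rows, then compare the mean
# before and after cleaning.
income_clean <- ifelse(is_outlier, NA, income)
mean(income, na.rm = TRUE)
mean(income_clean, na.rm = TRUE)
```

Comparing the statistics before and after cleaning is a simple way to document, in the report, how much the cleaning step changed the picture.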
In your report be sure to:
Describe your dataset.
What is the purpose of the dataset? What is your data source?
Sample size and the distribution of important variables (the distribution of income across race/ethnic categories, for example).
Provide tables for key variables and groups of interest. This should be done for categorical data, discrete data and continuous data.
Provide visualizations of key variables and groups of interest.
Provide written analysis above and beyond the graphs and tables. Explain what the tables and visualizations tell you about the data.
Present your Exploratory Data Analysis in a report. Incorporate visualizations and tables into the textual analysis of your report. If appropriate, add an
appendix of additional data tables and graphs.
The report should follow this flow:
Introduction: introduce the data, its purpose, the sources, the reason for choosing the data and what you hope to learn from the data. Incorporate a
discussion of the data cleaning methods used.
Data Analysis: This is the body of the report where you provide descriptions of the data, basic statistical measures, graphs, tables and analysis.
2 to 5 tables with written interpretation
2 to 5 charts with written interpretation
Summary: Summarize the report. Identify the key take-aways from your analysis. Describe what you want to explore further about your data. Identify
questions you want to answer with the data.
Your report should be 5 to 10 pages in length, including graphs but excluding the appendix or references.
What to Submit
You must submit 2 files:
The Exploratory Data Analysis report
The R code used to analyze the data