get the data
run the following two cells below to get the data for this exercise, then followup by reading the questions and writing your own code to answer them.
!pip install requests
import requests
url = "http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data"
request = requests.get(url)
request.raise_for_status()
with open('adult.csv', 'w') as f:
f.write(request.text)
### now the data is available in the file adult.csv.
### read the questions below
# import pandas as pd
# pd.read_csv('adult.csv')
income for adults from the 1994 census
This dataset was extracted done by Barry Becker from the 1994 Census database. source: http://mlr.cs.umass.edu/ml/datasets/Adult
Listing of attributes:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- income: >50K, <=50K.
1. load the data
- extract the column names from the description and read the csv while supplying the columns names
- rename columns with a hyphen
-
to use underscores_
insead. example:capital-gain --> capital_gain
- look at the head()
- rename columns with a hyphen
- look at info, dtype, check for nan values
- what are the value counts of the categorical variables: workclass, education, marital_status, occupation, relationship, race, sex, native_country, income?
- do you notice the extra space ‘ ‘ at the beginning of each value?
- remove the extra space
- turn ‘sex’ and ‘income’ into 0/1 fields
- replace the categorical ‘sex’ column with a numerical ‘female’ column with value 1 for females and 0 for males
- replace the categorical ‘income’ column with a numerical ‘over50k’ column with value 1 for ‘>50k’ and 0 for ‘<50K’
- use
.describe()
function to get descriptive statistics for most columns- make sure that ‘sex’ and ‘over50k’ are now numerical fields
2. explore capital gains / losses
capital_gain
- plot the histogram for capital gains
- verbally describe what you see
- for people who have
capital_gain > 0
- plot the histogram for capital gains
- how many people have capital gains over 25000?
- use
value_counts()
to look at all the values of capital_gain over 25000. - what’s weird about the data?
- use
- could the people who had capital_gain==25124 be related?
- does capital_gain over 50k mean income is over 50k?
capital_loss
- plot the histogram of capital_loss
- for people who have
capital_loss > 0
- plot the histogram for capital_loss
- how many people had both
capital_gain>0
andcapital_loss>0
? - who can afford to lose money on capital investments?
- what percent of people overall had over 50K income?
- what percent of people with 0 capital_loss? with capital_loss>0?
combining and binning
- create a new
capital_change
column that equalscapital_gain - capital_loss
- use the
qcut
function to quantize/bin/cutcapital_change
into a new columns calledcapital_change_bin
with 10 bins of equal proportions.- do not bin
capital_change==0
values as there are too many of them - to simplify using this column later, use the left side of the interval created as the label
- label rows with
capital_change==0
as havingcapital_change_bin=0
- make sure you have no null values for
capital_change_bin
- do not bin
- how many people have a non-zero capital_change?
- lets call this ‘has_capital_change’
- plot ‘has_capital_change’ over ‘over50k’
- what do you learn from this diagram
- plot
capital_change
byover50k
- what do you learn from this diagram
- plot
over50k
bycapital_change_bin
- what can you learn from this diagram?
education
- what is the mean education_num by education?
- sort the education categories by the mean_values. does it make sense
- check out other descriptive statistics to see if anything falls out of place
- turn education into a categorical ordered type
- plot education VS education_num
- what have we learned?
- plot the distribution for
education
- plot over50k by education
- what can we learn?
- plot hours_per_week by education
- what can we learn from this plot?
- now use the hue=”over50k” of seaborn to see hours_per_week by education/over50k.
- learn anything else?
- plot education_num by occupation
- sort by mean education_num
- plot education_num by workclass
- sort by mean education_num
- create a crosstab or a pivot_table of education VS occupation.
- normalize it by the education rows (each row X shows the conditional probability of having occupation Y by education level X)
- create a heatmap that shows which occpupations are most likely for each education level
- verbally describe what you’ve learned
- create a crosstab or a pivot_table of education VS workclass.
- normalize it by the education rows (each row X shows the conditional probability of having workclass Y by education level X)
- create a heatmap that shows which workclass is most likely for each education level
- verbally describe what you’ve learned
- re-run this analysis without the private sector
- plot “race” vs “education_num
- plot “relationship” vs “education_num
occupation / workclass
- how many levels of occupation?
- how many levels of worklass?
- how many combinations? potential? actual?
- plot
over50k
byoccupation
- sort by mean
over50k
- compare this to
over50k
byeducation
. which variable more strongly predicts income? - compare this to
education_num
byoccupation
. are the highest paying jobs correlated with highest earning education?
- sort by mean
- plot
over50k
byworkclass
- look at combinations of occupation / workclass
- what are the top combinations in terms of earning over50k (mean)? how many people in that category?
- how many of these combinations have more than 100 people?
- show a heatmap of the mean over50k of occupation-vs-worklass for combinations with more than 100 people. center the heatmap at the populations mean over50k for increased effect. what conclusions can you draw?
- create a numerical encoding for occupation / workclass pairs
- create a new column called “occ_class” that combines the string of the occupation and workclass
- use the library category_encoders, here’s an intro how to do it
- use the weight of evidence encoder
ce.woe.WOEEncoder
here’s an article explaining it - add the encoded occ_class as a new column called
occ_class_woe
to your dataframe
correlations
- which features are most important, which correlate?
- compute the correction matrix of features with themselves
- draw a clustermap or heatmap of this correlation
- center the cluster at 0
- annotate the plot with the correlation values
- look at the strongest correlations and draw some conclusions.
TODO:
- look at
relationship
andmarriage_status
. how meaningful are they? should we encode them? - look at
native_country
. how does immigration effect other variables? should we build further categories based on continent or on 1st/2nd/3rd world countries? should we add anis_immigrant
boolean column? - we’ve done the analysis treating each row of the data as a person, when in fact each row represents a large group of people, with the variable
fnlwgt
counting how many people are in the group. redo some of the analysis with weighted averages - look further at age. should we cut this continous variable into age groups like 18-25, 25-40 etc ?
- combine age/relationship to see if relationship effects can be explained away by age
education_num
seems to be a label encoding ofeducation
. I think some degrees are not properly captured with that encoding, likeassoc-voc
. it would be interesting to to recode it with woe againstover50k
and see if anything changes.- data quality questions:
- why are women under-represented in this data
- why are there no hispanic/latin category in race?
- compare to other interesting analysis:
- Predicting Earning Potential using the Adult Dataset https://rpubs.com/H_Zhu/235617
- related notebook on kaggle https://www.kaggle.com/uciml/adult-census-income