get the data

run the following two cells below to get the data for this exercise, then followup by reading the questions and writing your own code to answer them.

!pip install requests

Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (2.21.0)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests) (2019.9.11)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests) (1.24.3)

### important: this code downloads the data for you into "adult.data" csv file in the current directory
import requests

url = "http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data"
request = requests.get(url)
request.raise_for_status()
with open('adult.data', 'w') as f:
    f.write(request.text)

### now the data is available in the csv file adult.data. 
### read the questions below
# import pandas as pd
# pd.read_csv('adult.data')    

income for adults from the 1994 census

This dataset was extracted done by Barry Becker from the 1994 Census database. source: http://mlr.cs.umass.edu/ml/datasets/Adult

Listing of attributes:

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
income: >50K, <=50K.

import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
sns.set()

1. load the data

extract the column names from the description and read the csv while supplying the columns names
- rename columns with a hyphen - to use underscores _ insead. example: capital-gain --> capital_gain

description = """\
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
"""

columns = [line.split(':')[0] for line in description.splitlines()] + ['income']
columns = [column.replace('-', '_') for column in columns]
columns

['age',
 'workclass',
 'fnlwgt',
 'education',
 'education_num',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital_gain',
 'capital_loss',
 'hours_per_week',
 'native_country',
 'income']

df = pd.read_csv('adult.data', names=columns)
df.head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

2. look at the head, info, dtype, check for nan values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

> answer:

no null values. yippee!

3. what are the value counts of the categorical variables: workclass, education, marital_status, occupation, relationship, race, sex, native_country, income?

do you notice the extra space ‘ ‘ at the beginning of each value?
remove the extra space

for column in "workclass education marital_status occupation relationship race sex native_country income".split():
    print(column, ':')
    print(df[column].value_counts())
    print()

workclass :
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

education :
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64

marital_status :
 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: marital_status, dtype: int64

occupation :
 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: occupation, dtype: int64

relationship :
 Husband           13193
 Not-in-family      8305
 Own-child          5068
 Unmarried          3446
 Wife               1568
 Other-relative      981
Name: relationship, dtype: int64

race :
 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: race, dtype: int64

sex :
 Male      21790
 Female    10771
Name: sex, dtype: int64

native_country :
 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 Greece                           29
 France                           29
 Ecuador                          28
 Ireland                          24
 Hong                             20
 Trinadad&Tobago                  19
 Cambodia                         19
 Thailand                         18
 Laos                             18
 Yugoslavia                       16
 Outlying-US(Guam-USVI-etc)       14
 Hungary                          13
 Honduras                         13
 Scotland                         12
 Holand-Netherlands                1
Name: native_country, dtype: int64

income :
 <=50K    24720
 >50K      7841
Name: income, dtype: int64

for column in "workclass education marital_status occupation relationship race sex native_country income".split():
    print(column, ':')
    df[column] = df[column].str.strip()
    print(df[column].value_counts())
    print()

workclass :
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

education :
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

marital_status :
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital_status, dtype: int64

occupation :
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

relationship :
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: relationship, dtype: int64

race :
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

sex :
Male      21790
Female    10771
Name: sex, dtype: int64

native_country :
United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                           29
France                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Hungary                          13
Honduras                         13
Scotland                         12
Holand-Netherlands                1
Name: native_country, dtype: int64

income :
<=50K    24720
>50K      7841
Name: income, dtype: int64

4. turn ‘sex’ and ‘income’ into 0/1 fields

replace the categorical ‘sex’ column with a numerical ‘female’ column with value 1 for females and 0 for males
replace the categorical ‘income’ column with a numerical ‘over50k’ column with value 1 for ‘>50k’ and 0 for ‘<50K’

df.rename(columns={'sex': 'female', 'income' : 'over50k'}, inplace=True)
df.replace({'female' : {'Male' : 0, 'Female' : 1 }, 'over50k' : {'>50K' : 1, '<=50K' : 0}}, inplace=True)
df.head(10)

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	female	capital_gain	hours_per_week	native_country	over50k
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	0	2174	40	United-States	0
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	0	0	13	United-States	0
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	0	0	40	United-States	0
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	0	0	40	United-States	0
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	1	0	40	Cuba	0
5	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	1	0	40	United-States	0
6	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	1	0	16	Jamaica	0
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	0	0	45	United-States	1
8	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	1	14084	50	United-States	1
9	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	0	5178	40	United-States	1

5. use `.describe()` function to get descriptive statistics for most columns

make sure that ‘sex’ and ‘over50k’ are now numerical fields

df.describe()

	age	fnlwgt	education_num	female	capital_gain	capital_loss	hours_per_week	over50k
count	32561.000000	3.256100e+04	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000
mean	38.581647	1.897784e+05	10.080679	0.330795	1077.648844	87.303830	40.437456	0.240810
std	13.640433	1.055500e+05	2.572720	0.470506	7385.292085	402.960219	12.347429	0.427581
min	17.000000	1.228500e+04	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000
25%	28.000000	1.178270e+05	9.000000	0.000000	0.000000	0.000000	40.000000	0.000000
50%	37.000000	1.783560e+05	10.000000	0.000000	0.000000	0.000000	40.000000	0.000000
75%	48.000000	2.370510e+05	12.000000	1.000000	0.000000	0.000000	45.000000	0.000000
max	90.000000	1.484705e+06	16.000000	1.000000	99999.000000	4356.000000	99.000000	1.000000

2. explore capital gains / losses

capital_gain

plot the histogram for capital gains
- verbally describe what you see

sns.distplot(df.capital_gain, kde=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce2a4672b0>

png

> answer:

almost everyone has $capital_gains=0$

2. for people who have `capital_gain > 0` - plot the histogram for capital gains

sns.distplot(df[df.capital_gain > 0].capital_gain, kde=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce2a467278>

png

> remark:

what’s this weird bump at $100K ?

3. how many people have capital gains over 25000?

use value_counts() to look at all the values of capital_gain over 25000.
what’s weird about the data?

df[df.capital_gain > 25000].capital_gain.value_counts()

  159
   34
   11
    5
    4
    2
Name: capital_gain, dtype: int64

> answer:

the values seem oddly perscriptive… why do specific value repeat so much, and no values in between?

are the values here coding for something?

also 99999 is maybe a cap value of capital gains?

df[df.capital_gain == 25124]

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	female	capital_gain	hours_per_week	native_country	over50k
1562	75	?	111177	Bachelors	13	Widowed	?	Not-in-family	White	1	25124	16	United-States	1
18847	73	Private	183213	Assoc-voc	11	Widowed	Prof-specialty	Not-in-family	White	0	25124	60	United-States	1
21892	65	?	224472	Prof-school	15	Never-married	?	Not-in-family	White	0	25124	80	United-States	1
22462	68	Self-emp-inc	52052	Assoc-voc	11	Widowed	Sales	Not-in-family	White	1	25124	50	United-States	1

> answer:

very unlikely, 3 widows and 1 never-married are unlikely to share one household.
their “shared” capital_gain either codes for something, or is some kind of quantized property of a calculated field, or other fluke

5. does capital_gain over 50k mean income is over 50k?

df[df.capital_gain > 50000].over50k.mean()

1.0

> answer:

100% of people who had capital_gain>50000 had income over 50k
so inspite of the weirdness we saw in the values, we can probably keep and interpret this variable as intended

capital_loss

plot the histogram of capital_loss

sns.distplot(df.capital_loss, kde=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce28cce3c8>

png

> remark

again we see almost everybody having 0 capital losses

2. for people who have `capital_loss > 0`

plot the histogram for capital_loss

sns.distplot(df[df.capital_loss !=0].capital_loss, kde=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce28bac390>

png

3. how many people had both capital_gain>0 and capital_loss>0 ?

len(df[(df.capital_gain > 0) & (df.capital_loss >0)])

4. who can afford to lose money on capital investments?

what percent of people overall had over 50K income?
what percent of people with 0 capital_loss? with capital_loss>0?

print("proportion of people who earn over 50K:\n{}".format(df.over50k.mean()))
print()
print("proportion of people who earn over 50K, by capital_loss > 0:")
print(df.groupby(df.capital_loss >0).over50k.mean())

proportion of people who earn over 50K:
0.2408095574460244

proportion of people who earn over 50K, by capital_loss > 0:
capital_loss
False    0.227692
True     0.508887
Name: over50k, dtype: float64

sns.barplot(x="lost_capital", y="over50k", data=df.assign(lost_capital=df.capital_loss >0))
plt.axhline(y=df.over50k.mean(), linestyle="--")

<matplotlib.lines.Line2D at 0x7fce28b369b0>

png

> answer:

having capital_loss > 0 is correlated with earning more than 50K.
in fact, the proportion of over50k is twice as large in capital losers than in the general population

combining and binning

create a new capital_change column that equals capital_gain - capital_loss

df = df.assign(capital_change=df.capital_gain - df.capital_loss)
df.head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	female	capital_gain	hours_per_week	native_country	capital_change
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	0	2174	40	United-States	2174
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	0	0	13	United-States	0
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	0	0	40	United-States	0
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	0	0	40	United-States	0
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	1	0	40	Cuba	0

sns.barplot(
    y="has_capital_change", 
    x="over50k", 
    data=df.assign(has_capital_change=df.capital_change.abs() > 0)
    )

<matplotlib.axes._subplots.AxesSubplot at 0x7fce26299278>

png

2. use the qcut function to quantize/bin/cut capital_change into a new columns called capital_change_bin with 10 bins of equal proportions.

do not bin capital_change==0 values as there are too many of them
to simplify using this column later, use the left side of the interval created as the label
label rows with capital_change==0 as having capital_change_bin=0
make sure you have no null values for capital_change_bin

# do not bin capital_change==-0
capital_change_bins = pd.qcut(df[df.capital_change !=0].capital_change, 10) 

# use left side of interval as label
capital_change_bins = capital_change_bins.apply(lambda interval: interval.left) 

# add capital_change_bin to df
df['capital_change_bin'] = capital_change_bins

# look at the last 4 columns
df.iloc[:, -4:].head()

	native_country	capital_change	capital_change_bin
0	United-States	2174	-1594.0
1	United-States	0	NaN
2	United-States	0	NaN
3	United-States	0	NaN
4	Cuba	0	NaN

# label rows with capital_change==0 as having capital_change_bin=0
df.replace({'capital_change_bin' : {np.nan : 0}}, inplace=True)

# look at the last 4 columns
df.iloc[:, -4:].head()

	native_country	capital_change	capital_change_bin
0	United-States	2174	-1594
1	United-States	0	0
2	United-States	0	0
3	United-States	0	0
4	Cuba	0	0

# make sure you have no null values for capital_change_bin
df.capital_change_bin.isnull().sum()

3. how many people have a non-zero capital_change?

lets call this ‘has_capital_change’
plot ‘has_capital_change’ over ‘over50k’
what do you learn from this diagram?

sns.barplot(
    y="has_capital_change", 
    x="over50k", 
    data=df.assign(has_capital_change=df.capital_change.abs() > 0)
    )
plt.grid(True)

png

> answer:

for people earning less than 50K, only ~7.5% try to have “money work for them” in terms of capital gain/loss

for people earning above 50K, over 30% have non-zero capital change

4. plot `capital_change` by `over50k`

what do you learn from this diagram

sns.barplot(x='over50k', y='capital_change', data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce24fc1ac8>

png

> answer:

people earning over50k had the means/knowledge/opportunity to invest in capital, and earned a mean ~3500$ from capital

5. plot `over50k` by `capital_change_bin`

what can you learn from this diagram?

bars = sns.lineplot(x='capital_change_bin', y='over50k', data=df)
plt.setp(bars.get_xticklabels(), rotation=90);

png

> answer:

among people that has lost/earned roughly 2500$ or under in capital_change, there is very minor difference in proportion of over50K earnings relative to the general population

however, among those that have earned or lost in excess of ~2500$ there is a sharp increase in the proportion of over50K earners

education

what is the mean education_num by education?
- sort the education categories by the mean_values. does it make sense?
- check out other descriptive statistics to see if anything falls out of place
- turn education into a categorical ordered type
- plot education VS education_num
- what have we learned?

# - what is the mean education_num by education?
# - sort the education categories by the mean_values. does it make sense?
df.groupby('education').education_num.mean().sort_values()

education
Preschool        1
1st-4th          2
5th-6th          3
7th-8th          4
9th              5
10th             6
11th             7
12th             8
HS-grad          9
Some-college    10
Assoc-voc       11
Assoc-acdm      12
Bachelors       13
Masters         14
Prof-school     15
Doctorate       16
Name: education_num, dtype: int64

> answer:

yes, makes sense. looks like education_num is a numerical encoding for education

# - check out other descriptive statistics to see if anything falls out of place
df.groupby('education').education_num.describe()

	count	mean	std	min	25%	50%	75%	max
education
10th	933.0	6.0	0.0	6.0	6.0	6.0	6.0	6.0
11th	1175.0	7.0	0.0	7.0	7.0	7.0	7.0	7.0
12th	433.0	8.0	0.0	8.0	8.0	8.0	8.0	8.0
1st-4th	168.0	2.0	0.0	2.0	2.0	2.0	2.0	2.0
5th-6th	333.0	3.0	0.0	3.0	3.0	3.0	3.0	3.0
7th-8th	646.0	4.0	0.0	4.0	4.0	4.0	4.0	4.0
9th	514.0	5.0	0.0	5.0	5.0	5.0	5.0	5.0
Assoc-acdm	1067.0	12.0	0.0	12.0	12.0	12.0	12.0	12.0
Assoc-voc	1382.0	11.0	0.0	11.0	11.0	11.0	11.0	11.0
Bachelors	5355.0	13.0	0.0	13.0	13.0	13.0	13.0	13.0
Doctorate	413.0	16.0	0.0	16.0	16.0	16.0	16.0	16.0
HS-grad	10501.0	9.0	0.0	9.0	9.0	9.0	9.0	9.0
Masters	1723.0	14.0	0.0	14.0	14.0	14.0	14.0	14.0
Preschool	51.0	1.0	0.0	1.0	1.0	1.0	1.0	1.0
Prof-school	576.0	15.0	0.0	15.0	15.0	15.0	15.0	15.0
Some-college	7291.0	10.0	0.0	10.0	10.0	10.0	10.0	10.0

> answer

yes, we see that std is 0 and min=max.

that means there are no deviations or wrong encodings in the data. yippee!

#   - turn education into a categorical ordered type
sorted_education = df.groupby('education').education_num.mean().sort_values().index.values
print(sorted_education)
df.education = df.education.astype(pd.CategoricalDtype(sorted_education, ordered=True))

#   - plot education VS education_num
bars = sns.barplot(x='education', y='education_num', data=df)
plt.setp(bars.get_xticklabels(), rotation=90);
plt.grid(True)

#   - what have we learned?

['Preschool' '1st-4th' '5th-6th' '7th-8th' '9th' '10th' '11th' '12th'
 'HS-grad' 'Some-college' 'Assoc-voc' 'Assoc-acdm' 'Bachelors' 'Masters'
 'Prof-school' 'Doctorate']

png

> answer:

we can now interchangably use either education_num or education in our graphs, depending if we want to emphasize the categorical name of the education or numerical size

2. plot the distribution for education

bars = sns.countplot(df.education)
plt.setp(bars.get_xticklabels(), rotation=90);

png

3. plot over50k by education

what do we learn?

bars = sns.barplot(y='over50k', x='education', data=df)
plt.setp(bars.get_xticklabels(), rotation=90);

png

> answer:

over50k seems to be stringly correlated with education,

with significant improvements to outcomes at High School, Bachelors, Masters and Proffesional/Doctorate attainment levels

4. plot hours_per_week by education

what can we learn?
now use the hue=”over50k” of seaborn to see hours_per_week by education/over50k.
- learn anything else?

bars = sns.lineplot(y='hours_per_week', x='education', ci="sd", data=df)
plt.setp(bars.get_xticklabels(), rotation=90);

png

> answer:

we observe High-School gradues to have roughly a 40-hour work week

with education above highschool, we see a strong trend for more hours-per-week as education increases, above and beyond 40 hours

we see a sharp decline in work hours for highschool dropouts (9th-12th grade education). perhaps indicative of physical/mental health issues or life circumstances that make work difficult.

for education up until 8th grade we see a positive trend that more education allows for more work, reaching almost 40 hour weeks of full employment

bars = sns.lineplot(y='hours_per_week', x='education', hue='over50k', ci="sd", data=df)
plt.setp(bars.get_xticklabels(), rotation=90);

png

> answer:

those that earn above 50K work approx 45 hours per week across most education levels. with exceptions being at low education levels (where sample sizes are small so confidence intervals are large) and for “Prof-School” education (doctors?), where the mean is closer to ~50 hour weeks.

those that earn under 50K work less hours across almost all education levels, mostly around 40 hour weeks at above high-school level education (except perhaps those with Doctorates where sample sizes are small)

> note for later:

look further at relationship between hours worked and over50k

5. plot education_num by occupation

sort by mean education_num

occupation_by_education = df.groupby("occupation").education_num.mean().sort_values().index.values
bars = sns.boxplot(
    x="occupation", 
    y='education_num', 
    order=occupation_by_education, 
    data=df,
    # medianprops={"color":"white"},
    showfliers=False, 
    showmeans=True
    )
_ = plt.setp(bars.get_xticklabels(), rotation=90)

png

> remark:

clearly there seems like 5 cluster of jobs by education level:

“Priv-house-serv” (servants)

“Machine-op-inspect” to “Arned forced” (blue collar)

“Adm-clerical” to “sales” (white collar)

“Tech-support” - “Exec-managerial” (tech / management)

“Prof-specialty” (doctors/lawyers/proffesionals)

6. plot education_num by workclass

sort by mean education_num

workclass_by_education = df.groupby("workclass").education_num.mean().sort_values().index.values
bars = sns.boxplot(x="workclass", y='education_num', order=workclass_by_education,  showfliers=False, data=df)
_ = plt.setp(bars.get_xticklabels(), rotation=90)

png

7. create a crosstab or a pivot_table of education VS occupation.

normalize it by the education rows (each row X shows the conditional probability of having occupation Y by education level X)
create a heatmap that shows which occpupations are most likely for each education level
verbally describe what you’ve learned

occ_by_edu_df = pd.crosstab(
    df.education, 
    df.occupation.astype(pd.CategoricalDtype(occupation_by_education, ordered=True)), 
    normalize='index')

fig, ax = plt.subplots(figsize=(7,7))

sns.heatmap(
    occ_by_edu_df
    , robust = True
    , ax=ax
    , cmap="Oranges"
    , linewidths=.75
);

png

> answer:

the heatmap shows for each education/row, which occupations are most likely to be taken by individuals with that education attainment.

for the very lower end of education, “machine-op-inspect” or “farming-fishing” is likely

Those with high-school or below attainment, are likely to find themselves in the large “Other-service”, “Craft-repair” or “transport-moving” occupations, since these are large overall.

mid-level education “some-college” to “assoc-acdm” find a wide range of horizontal occuptions

for Bachelor and above education “Exec-managerial” and “Prof-specialty” are highly likely

8. create a crosstab or a pivot_table of education VS workclass.

normalize it by the education rows (each row X shows the conditional probability of having workclass Y by education level X)
create a heatmap that shows which workclass is most likely for each education level
verbally describe what you’ve learned
re-run this analysis without the private sector

class_by_edu_df = pd.crosstab(
    df.education, 
    df.workclass.astype(pd.CategoricalDtype(workclass_by_education, ordered=True)), 
    normalize='index')

fig, ax = plt.subplots(figsize=(7,7))

sns.heatmap(
    class_by_edu_df
    , robust = True
    , ax=ax
    , cmap="Oranges"
    , linewidths=.75
);

png

> answer:

almost everyone, regardless of education, works in the private sector. yippee!

non_private_df = df[df.workclass != 'Private']

class_by_edu_df = pd.crosstab(
    non_private_df.education, 
    non_private_df.workclass.astype(pd.CategoricalDtype(workclass_by_education, ordered=True)), 
    normalize='index')

fig, ax = plt.subplots(figsize=(7,7))

sns.heatmap(
    class_by_edu_df
    , robust = True
    , ax=ax
    , cmap="Oranges"
    , linewidths=.75
);

png

> answer:

outside the private sector:

those with lowest education, mostly are unclassified by workclass, or are self-employed

high scool grads tend to be self-employed

mid-level education tend to be self-employed or local-government

people with masters degree tend to be employed by local goverment

people with prof-school tend to be self-employed

people with doctorates tend to be employed by state-government

9. plot “race” vs “education_num

race_by_education = df.groupby('race').education_num.mean().sort_values().index.values
bars = sns.boxplot(x="race", y="education_num",  order=race_by_education, data=df)
_ = plt.setp(bars.get_xticklabels(), rotation=90)

png

10. plot “relationship” vs “education_num

rel_by_education = df.groupby('relationship').education_num.mean().sort_values().index.values
bars = sns.boxplot(x="relationship", y="education_num",  order=rel_by_education, data=df)
_ = plt.setp(bars.get_xticklabels(), rotation=90)

png

occupation / workclass

how many levels of occupation?

print(df.occupation.nunique())
print(df.occupation.unique())

15
['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' '?'
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']

2. how many levels of worklass?

print(df.workclass.nunique())
print(df.workclass.unique())

9
['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']

3. how many combinations? potential? actual?

print("potential:", df.workclass.nunique() * df.occupation.nunique())
print("actual:", len(df.groupby(['occupation', 'workclass'])))

potential: 135
actual: 83

4. plot over50k by occupation

sort by mean over50k
compare this to over50k by education. which variable more strongly predicts income?
compare tjos to education_num by occupation. are the highest paying jobs correlated with highest earning education?

fig, axs = plt.subplots(ncols=2, sharey=True, figsize=(12,6))
occ_by_50k = df.groupby('occupation').over50k.mean().sort_values().index.values
bars = sns.barplot(x='occupation', y='over50k', order=occ_by_50k, data=df, ax=axs[0])
plt.setp(bars.get_xticklabels(), rotation=90);

bars = sns.barplot(x='education', y='over50k', data=df, ax=axs[1])
plt.setp(bars.get_xticklabels(), rotation=90);

png

fig, axs = plt.subplots(ncols=2, sharex=True, figsize=(12,6))
bars = sns.barplot(x='occupation', y='over50k', order=occ_by_50k, data=df, ax=axs[0])
plt.setp(bars.get_xticklabels(), rotation=90);

bars = sns.barplot(x='occupation', y='education_num', ci="sd", order=occ_by_50k, data=df, ax=axs[1])
plt.setp(bars.get_xticklabels(), rotation=90);

png

> answer:

education looks like a stronger indicator of income, because disparities in income are higher when grouped by education than when grouped by occupation

the highest paying occupations are not the same as the occupations requiring highest education. perhaps this is due to education needs varying wildly within occupations

5. plot `over50k` by `workclass`

class_by_50k = df.groupby('workclass').over50k.mean().sort_values().index.values
bars = sns.barplot(x='workclass', y='over50k', order=class_by_50k, data=df)
plt.setp(bars.get_xticklabels(), rotation=90);

png

6. look at combinations of occupation / workclass

what are the top combinations in terms of earning over50k? how many people in that category?

occupation_class = df.groupby(['occupation', 'workclass']) \
    .over50k.agg(["mean", "count"]) \
    .sort_values(by="mean", ascending=False)

occupation_class.head(10)

		mean	count
occupation	workclass
Prof-specialty	Self-emp-inc	0.756250	160
Tech-support	Self-emp-inc	0.666667	3
Exec-managerial	Self-emp-inc	0.635000	400
Prof-specialty	Self-emp-not-inc	0.563003	373
Sales	Self-emp-inc	0.549828	291
Prof-specialty	Federal-gov	0.542857	175
Exec-managerial	Federal-gov	0.511111	180
Protective-serv	Federal-gov	0.500000	28
Exec-managerial	Private	0.481234	2691
Exec-managerial	Local-gov	0.476636	214

2. how many of these combinations have more than 100 people?

occupation_class = occupation_class[occupation_class['count'] > 100]
print(len(occupation_class))
occupation_class.head(10)

		mean	count
occupation	workclass
Prof-specialty	Self-emp-inc	0.756250	160
Exec-managerial	Self-emp-inc	0.635000	400
Prof-specialty	Self-emp-not-inc	0.563003	373
Sales	Self-emp-inc	0.549828	291
Prof-specialty	Federal-gov	0.542857	175
Exec-managerial	Federal-gov	0.511111	180
	Private	0.481234	2691
	Local-gov	0.476636	214
Protective-serv	Local-gov	0.444079	304
Prof-specialty	Private	0.435798	2313

3. show a heatmap

of the mean over50k of occupation-vs-worklass for combinations with more than 100 people. center the heatmap at the populations mean over50k for increased effect. what conclusions can you draw?

sns.heatmap(
    occupation_class['mean'].unstack(1)
    , center=df.over50k.mean()
    , robust = True
    , linewidths=.75)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce24a29080>

png

> answer:

Prof-speciality and Exec managerial lead the earners across all work classes.

Self-employed-inc (having your own firm incorporated) is great

the private sector often has worse over50k mean than other sectors for the same occupations

Protective-serv (police/firemen) working for local-gov is a sweet deal

7. create a numerical encoding for occupation / workclass pairs

create a new column called “occ_class” that combines the string of the occupation and workclass
use the library category_encoders, here’s an intro how to do it
use the weight of evidence encoder ce.woe.WOEEncoder here’s an article explaining it
add the encoded occ_class as a new column called occ_class_woe to your dataframe

!pip install category_encoders

Collecting category_encoders
[?25l  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
[K     |████████████████████████████████| 102kB 5.5MB/s 
[?25hRequirement already satisfied: patsy>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.5.1)
Requirement already satisfied: pandas>=0.21.1 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.25.2)
Requirement already satisfied: scipy>=0.19.0 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (1.3.1)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.21.3)
Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (1.17.3)
Requirement already satisfied: statsmodels>=0.6.1 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.10.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from patsy>=0.4.1->category_encoders) (1.12.0)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.21.1->category_encoders) (2.6.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.21.1->category_encoders) (2018.9)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20.0->category_encoders) (0.14.0)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0

# pip install category_encoders
import category_encoders as ce

def get_occupation_workclass(person):
    occupation, workclass = person['occupation'], person['workclass']
    if (occupation, workclass) in occupation_class.index:
        return f"{occupation}|{workclass}"
    else:
        return "other"

df['occ_class'] = df.apply(get_occupation_workclass, axis=1)
ce_target = ce.woe.WOEEncoder(cols = ['occ_class'])
ce_target.fit(df["occ_class"], df['over50k'])
df["occ_class_woe" ] = ce_target.transform(df["occ_class"], df['over50k'])

# show the encoding
df.groupby("occ_class")['occ_class_woe'].max().sort_values()

occ_class
Priv-house-serv|Private             -3.162727
Other-service|Private               -2.115720
Other-service|State-gov             -2.038281
Handlers-cleaners|Private           -1.638773
Other-service|Local-gov             -1.490985
Farming-fishing|Private             -1.472380
Other-service|Self-emp-not-inc      -1.386845
?|?                                 -1.000536
Adm-clerical|State-gov              -0.944673
Transport-moving|Local-gov          -0.931369
Adm-clerical|Private                -0.906609
Machine-op-inspct|Private           -0.863285
Adm-clerical|Local-gov              -0.851020
Farming-fishing|Self-emp-not-inc    -0.582902
Protective-serv|Private             -0.499345
Craft-repair|Self-emp-not-inc       -0.367513
Transport-moving|Private            -0.231336
Craft-repair|Private                -0.083898
Sales|Private                       -0.045187
Transport-moving|Self-emp-not-inc    0.005975
Protective-serv|State-gov            0.071933
other                                0.108300
Craft-repair|Local-gov               0.188815
Tech-support|Private                 0.304643
Adm-clerical|Federal-gov             0.393148
Sales|Self-emp-not-inc               0.454925
Prof-specialty|Local-gov             0.575653
Craft-repair|Self-emp-inc            0.577527
Exec-managerial|Self-emp-not-inc     0.607353
Prof-specialty|State-gov             0.798398
Exec-managerial|State-gov            0.863443
Prof-specialty|Private               0.890063
Protective-serv|Local-gov            0.924929
Exec-managerial|Local-gov            1.055413
Exec-managerial|Private              1.073028
Exec-managerial|Federal-gov          1.192035
Prof-specialty|Federal-gov           1.317971
Sales|Self-emp-inc                   1.346675
Prof-specialty|Self-emp-not-inc      1.400064
Exec-managerial|Self-emp-inc         1.698903
Prof-specialty|Self-emp-inc          2.263214
Name: occ_class_woe, dtype: float64

correlations

which features are most important, which correlate?
- compute the correction matrix of features with themselves

corr = df.corr()
corr.head()

	age	fnlwgt	education_num	female	capital_gain	capital_loss	hours_per_week	over50k	capital_change	capital_change_bin	occ_class_woe
age	1.000000	-0.076646	0.036527	-0.088832	0.077674	0.057775	0.068756	0.234037	0.074284	0.099866	0.162277
fnlwgt	-0.076646	1.000000	-0.043195	-0.026858	0.000432	-0.010252	-0.018768	-0.009463	0.000988	0.000334	-0.026165
education_num	0.036527	-0.043195	1.000000	-0.012280	0.122630	0.079923	0.148123	0.335154	0.117891	0.131408	0.442328
female	-0.088832	-0.026858	-0.012280	1.000000	-0.048480	-0.045567	-0.229309	-0.215980	-0.045846	-0.055595	-0.157328
capital_gain	0.077674	0.000432	0.122630	-0.048480	1.000000	-0.031615	0.078409	0.223329	0.998521	0.731770	0.120792

2. draw a clustermap of this correlation

center the cluster at 0
annotate the plot with the correlation values

sns.clustermap(corr, 
               linewidths=.75,
               annot=True,
               fmt=".2f",
               center=0
               )

<seaborn.matrix.ClusterGrid at 0x7fce247eb8d0>

png

3. look at the strongest correlations and draw some conclusions.

> answer:

we obviously ignore the diagonal where variables have 100% correlation with themselves

a strong .44 correlation between education_num and occ_class_woe is especially interesting because we encoded occupation/class toward the over50k variable, but received a stronger correlation with education than with income (at a smaller .36). this shows that education is a super strong determinant of occupation/class, and that occupation/class determines income

at a slighly lower .34, education_num is still very strongly correlated with over50k, showing education is key for higher incomes

at .28 capital_change_bin is strongly correlated with over50k. it is my belief that here causality goes backwards - higher incomes create capital investments, rather than the opposite. perhpas an indication for that is that capital_change_bin is not strongly correlated with anything but over50k

at .25 occ_class_woe is strongly correlated with hours_per_week. stronger than the .15 correlation of education_num with hours_per_week, and we’ve seen analysis that with higher education people work more hours. it appears that higher education enables higher paying jobs, with require more hours.

at .23 hours_per_week is strongly correlated with over50k. this is probably explained by the previous item. better jobs require more hours, rather than the more obvious more hours worked creating higher salary.

at -0.23 hours_per_week is strongly negatively correlated with being female. so women work less hours. this is only slightly more negative than the -0.22 correlation women have with over50k. are women earning less because they are not willing to work as many hours? there is no definitive answer here, but it is interesting to also note the -0.16 negative correlation of occ_class_woe with being female, so they might be choosing lower paying jobs with shorter hours. lastly, it is interesting to note the correlation of female and education_num is a negligble -0.01, that is women have the same levels of education as men.

the strongest correlation is capital_change_bin and capital_gain but that is an artifact of how we built that variable.

get the data

income for adults from the 1994 census

1. load the data

2. look at the head, info, dtype, check for nan values

> answer:

3. what are the value counts of the categorical variables: workclass, education, marital_status, occupation, relationship, race, sex, native_country, income?

4. turn ‘sex’ and ‘income’ into 0/1 fields

5. use .describe() function to get descriptive statistics for most columns

2. explore capital gains / losses

capital_gain

> answer:

2. for people who have capital_gain > 0 - plot the histogram for capital gains

> remark:

3. how many people have capital gains over 25000?

> answer:

4. could the people who had capital_gain==25124 be related?

> answer:

5. does capital_gain over 50k mean income is over 50k?

> answer:

capital_loss

> remark

2. for people who have capital_loss > 0

3. how many people had both capital_gain>0 and capital_loss>0 ?

4. who can afford to lose money on capital investments?

> answer:

combining and binning

2. use the qcut function to quantize/bin/cut capital_change into a new columns called capital_change_bin with 10 bins of equal proportions.

3. how many people have a non-zero capital_change?

> answer:

4. plot capital_change by over50k

> answer:

5. plot over50k by capital_change_bin

> answer:

education

> answer:

> answer

> answer:

2. plot the distribution for education

3. plot over50k by education

> answer:

4. plot hours_per_week by education

> answer:

> answer:

> note for later:

5. plot education_num by occupation

> remark:

6. plot education_num by workclass

7. create a crosstab or a pivot_table of education VS occupation.

> answer:

8. create a crosstab or a pivot_table of education VS workclass.

> answer:

> answer:

9. plot “race” vs “education_num

10. plot “relationship” vs “education_num

occupation / workclass

2. how many levels of worklass?

3. how many combinations? potential? actual?

4. plot over50k by occupation

> answer:

5. plot over50k by workclass

6. look at combinations of occupation / workclass

2. how many of these combinations have more than 100 people?

3. show a heatmap

> answer:

7. create a numerical encoding for occupation / workclass pairs

correlations

2. draw a clustermap of this correlation

3. look at the strongest correlations and draw some conclusions.

> answer:

5. use `.describe()` function to get descriptive statistics for most columns

2. for people who have `capital_gain > 0` - plot the histogram for capital gains

2. for people who have `capital_loss > 0`

4. plot `capital_change` by `over50k`

5. plot `over50k` by `capital_change_bin`

5. plot `over50k` by `workclass`