Commit e3cdf5a (parent 4103f13)

Census_Income.Rmd: 1 file changed, 183 additions, 0 deletions
---
title: "Predict Income Level from Census Data"
output: html_document
---

The data was extracted from the [1994 Census Bureau database](https://www.census.gov/en.html); I downloaded a copy from [Kaggle](https://www.kaggle.com/datasets), which hosts the version from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html).

The data contains 32,561 observations (people) and 15 variables. A high-level summary of the data is below.

```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```

```{r}
income <- read.csv('adult.csv', na.strings = c('', '?'))
str(income)
```

Statistics summary after recoding the missing values ('' and '?') to NA.

```{r}
summary(income)
```

### Data Cleaning Process

Check for NA values and look at how many unique values there are for each variable.

```{r}
sapply(income, function(x) sum(is.na(x)))
```

```{r}
sapply(income, function(x) length(unique(x)))
```

```{r}
library(Amelia)
missmap(income, main = "Missing values vs observed")
table(complete.cases(income))
```

Approximately 7% (2,399/32,561) of the observations have missing values, mainly in the variables 'occupation', 'workclass' and 'native.country'. I decided to remove those rows because I don't think it is a good idea to impute categorical values.

```{r}
income <- income[complete.cases(income),]
```

### Explore Numeric Variables With Income Levels

```{r}
library(ggplot2)
library(gridExtra)
p1 <- ggplot(aes(x=income, y=age), data = income) + geom_boxplot() +
  ggtitle('Age vs. Income Level')
p2 <- ggplot(aes(x=income, y=education.num), data = income) + geom_boxplot() +
  ggtitle('Years of Education vs. Income Level')
p3 <- ggplot(aes(x=income, y=hours.per.week), data = income) + geom_boxplot() +
  ggtitle('Hours Per Week vs. Income Level')
p4 <- ggplot(aes(x=income, y=capital.gain), data=income) + geom_boxplot() +
  ggtitle('Capital Gain vs. Income Level')
p5 <- ggplot(aes(x=income, y=capital.loss), data=income) + geom_boxplot() +
  ggtitle('Capital Loss vs. Income Level')
p6 <- ggplot(aes(x=income, y=fnlwgt), data=income) + geom_boxplot() +
  ggtitle('Final Weight vs. Income Level')
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=3)
```

"Age", "years of education" and "hours per week" all show clear variation with income level, so they will be kept for the regression analysis. "Final weight" shows no variation with income level, so it will be excluded. It is hard to tell from the plots above whether "capital gain" and "capital loss" vary with income level, so I will keep them for now.

```{r}
income$fnlwgt <- NULL
```

### Explore Categorical Variables With Income Levels

```{r}
library(dplyr)
by_workclass <- income %>% group_by(workclass, income) %>% summarise(n=n())
by_education <- income %>% group_by(education, income) %>% summarise(n=n())
by_education$education <- ordered(by_education$education,
  levels = c('Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', 'Some-college', 'Bachelors', 'Masters', 'Doctorate'))
by_marital <- income %>% group_by(marital.status, income) %>% summarise(n=n())
by_occupation <- income %>% group_by(occupation, income) %>% summarise(n=n())
by_relationship <- income %>% group_by(relationship, income) %>% summarise(n=n())
by_race <- income %>% group_by(race, income) %>% summarise(n=n())
by_sex <- income %>% group_by(sex, income) %>% summarise(n=n())
by_country <- income %>% group_by(native.country, income) %>% summarise(n=n())

p7 <- ggplot(aes(x=workclass, y=n, fill=income), data=by_workclass) + geom_bar(stat = 'identity', position = position_dodge()) + ggtitle('Workclass vs. Income Level') + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p8 <- ggplot(aes(x=education, y=n, fill=income), data=by_education) + geom_bar(stat = 'identity', position = position_dodge()) + ggtitle('Education vs. Income Level') + coord_flip()
p9 <- ggplot(aes(x=marital.status, y=n, fill=income), data=by_marital) + geom_bar(stat = 'identity', position=position_dodge()) + ggtitle('Marital Status vs. Income Level') + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p10 <- ggplot(aes(x=occupation, y=n, fill=income), data=by_occupation) + geom_bar(stat = 'identity', position=position_dodge()) + ggtitle('Occupation vs. Income Level') + coord_flip()
p11 <- ggplot(aes(x=relationship, y=n, fill=income), data=by_relationship) + geom_bar(stat = 'identity', position=position_dodge()) + ggtitle('Relationship vs. Income Level') + coord_flip()
p12 <- ggplot(aes(x=race, y=n, fill=income), data=by_race) + geom_bar(stat = 'identity', position = position_dodge()) + ggtitle('Race vs. Income Level') + coord_flip()
p13 <- ggplot(aes(x=sex, y=n, fill=income), data=by_sex) + geom_bar(stat = 'identity', position = position_dodge()) + ggtitle('Sex vs. Income Level')
p14 <- ggplot(aes(x=native.country, y=n, fill=income), data=by_country) + geom_bar(stat = 'identity', position = position_dodge()) + ggtitle('Native Country vs. Income Level') + coord_flip()
grid.arrange(p7, p8, p9, p10, ncol=2)
grid.arrange(p11, p12, p13, p14, ncol=2)
```

Most of the data was collected in the United States, so the variable "native.country" has little effect on the analysis and I will exclude it from the regression model. All of the other categorical variables show reasonable variation, so they will be kept.

```{r}
income$native.country <- NULL
```

Convert the income level to a binary outcome: "<=50K" becomes 0 and ">50K" becomes 1.

```{r}
income$income <- as.factor(ifelse(income$income == '<=50K', 0, 1))
```

### Model Fitting

Split the data into two chunks: a training set and a testing set.

```{r}
train <- income[1:24000,]
test <- income[24001:30162,]
```
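
The sequential split above assumes the rows are already in random order. As a sketch of an alternative (not part of the original analysis), a random 80/20 split with a fixed seed avoids that assumption; the `_random` names are introduced here so the original split is left untouched.

```{r}
# Sketch of an alternative, randomized split (the seed value is arbitrary):
# sample 80% of the row indices for training, keep the rest for testing.
set.seed(123)
train_idx <- sample(nrow(income), size = floor(0.8 * nrow(income)))
train_random <- income[train_idx, ]
test_random  <- income[-train_idx, ]
```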

Fit the logistic regression model.

```{r}
model <- glm(income ~ ., family = binomial(link = 'logit'), data = train)
summary(model)
```

Interpreting the results of the logistic regression model:

1. "Age", "hours per week", "sex", "capital gain" and "capital loss" are the most statistically significant variables; their low p-values suggest a strong association with the probability of earning more than $50K.
2. The significance of "workclass", "education", "marital status", "occupation" and "relationship" varies across their levels, so these variables cannot be eliminated from the model.
3. "Race" is not statistically significant and could be eliminated from the model.
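
Since "race" looks removable, one way to check (a sketch, not run in the original analysis) is to refit the model without it and compare AIC values; a meaningfully lower AIC would favor the smaller model. The name `model_no_race` is hypothetical, introduced here for illustration.

```{r}
# Sketch: refit without 'race' and compare AIC; lower AIC is better.
model_no_race <- update(model, . ~ . - race)
AIC(model, model_no_race)
```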

Run the anova() function on the model to analyze the table of deviance.

```{r}
anova(model, test="Chisq")
```

The difference between the null deviance and the residual deviance indicates how the model is doing against the null model: the bigger the difference, the better. The table above shows the drop in deviance as each variable is added one at a time. Adding age, workclass, education, marital status, occupation, relationship, race, sex, capital gain, capital loss and hours per week each significantly reduces the residual deviance; education.num seems to have no additional effect, most likely because it duplicates the information already carried by education.
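
The same null-versus-residual comparison can be condensed into a single number, McFadden's pseudo-R-squared, computed directly from the deviances stored on the fitted glm object (a sketch; values nearer 1 indicate a better fit).

```{r}
# Sketch: McFadden's pseudo-R^2 = 1 - residual deviance / null deviance.
1 - model$deviance / model$null.deviance
```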

### Apply Model to the Test Set

```{r}
fitted.results <- predict(model, newdata = test, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != test$income)
print(paste('Accuracy', 1 - misClasificError))
```

The 0.84 accuracy on the test set is an encouraging result.
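
Accuracy alone can be misleading when the classes are imbalanced, as they are here. A confusion matrix built from the objects above (a quick sketch) shows how the errors split between the two classes.

```{r}
# Sketch: cross-tabulate actual vs. predicted income level.
table(actual = test$income, predicted = fitted.results)
```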

Finally, plot the ROC curve and calculate the AUC (area under the curve). The closer a model's AUC is to 1, the better its predictive ability.

```{r}
library(ROCR)
p <- predict(model, newdata = test, type = "response")
pr <- prediction(p, test$income)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

The value printed above is the area under the ROC curve (AUC).

As a last step, use the effects package to compute and plot the effect of each variable in the model.

```{r}
library(effects)
plot(allEffects(model))
```

### The End

I have been very cautious about removing variables because I don't want to compromise the data by removing valid information. As a result, I may have kept variables that I should have removed, such as "education.num".

Reference:
[r-bloggers](https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/)
