Learning Data Science: Predicting Income Brackets

As promised in the post Learning Data Science: Modelling Basics we will now go a step further and try to predict income brackets with real world data and different modelling approaches. We will learn a thing or two along the way, e.g. about the so-called Accuracy-Interpretability Trade-Off, so read on…

The data we will use is from here: Marketing data set. The description reads:

This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographics attributes.

The following extra information (or metadata) is provided with the data:

cat(readLines('data/marketing-names.txt'), sep = '\n')
Marketing data set
1: Description.
This dataset contains questions from questionaries that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographics attributes.
2: Type.            Classification
3: Origin.          Real world
4: Instances.       6876 (8993)
5: Features.        14
6: Classes          9
7: Missing values.  Yes
8: Header.
@relation marketing
@attribute Sex integer [1, 2]
@attribute MaritalStatus integer [1, 5]
@attribute Age integer [1, 7]
@attribute Education integer [1, 6]
@attribute Occupation integer [1, 9]
@attribute YearsInSf integer [1, 5]
@attribute DualIncome integer [1, 3]
@attribute HouseholdMembers integer [1, 9]
@attribute Under18 integer [0, 9]
@attribute HouseholdStatus integer [1, 3]
@attribute TypeOfHome integer [1, 5]
@attribute EthnicClass integer [1, 8]
@attribute Language integer [1, 3]
@attribute Income {1, 2, 3, 4, 5, 6, 7, 8, 9}
@inputs Sex, MaritalStatus, Age, Education, Occupation, YearsInSf, DualIncome, HouseholdMembers, Under18, HouseholdStatus, TypeOfHome, EthnicClass, Language
@outputs Income
DATA DICTIONARY
 1    ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
             1. Less than $10,000
             2. $10,000 to $14,999
             3. $15,000 to $19,999
             4. $20,000 to $24,999
             5. $25,000 to $29,999
             6. $30,000 to $39,999
             7. $40,000 to $49,999
             8. $50,000 to $74,999
             9. $75,000 or more
             
 2    SEX
             1. Male
             2. Female
  3    MARITAL STATUS
             1. Married
             2. Living together, not married
             3. Divorced or separated
             4. Widowed
             5. Single, never married
  4    AGE
             1. 14 thru 17
             2. 18 thru 24
             3. 25 thru 34
             4. 35 thru 44
             5. 45 thru 54
             6. 55 thru 64
             7. 65 and Over
  5    EDUCATION
             1. Grade 8 or less
             2. Grades 9 to 11
             3. Graduated high school
             4. 1 to 3 years of college
             5. College graduate
             6. Grad Study
 6    OCCUPATION
             1. Professional/Managerial
             2. Sales Worker
             3. Factory Worker/Laborer/Driver
             4. Clerical/Service Worker
             5. Homemaker
             6. Student, HS or College
             7. Military
             8. Retired
             9. Unemployed
  7    HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
             1. Less than one year
             2. One to three years
             3. Four to six years
             4. Seven to ten years
             5. More than ten years
  8    DUAL INCOMES (IF MARRIED)
             1. Not Married
             2. Yes
             3. No
  9    PERSONS IN YOUR HOUSEHOLD
             1. One
             2. Two
             3. Three
             4. Four
             5. Five
             6. Six
             7. Seven
             8. Eight
             9. Nine or more

 10    PERSONS IN HOUSEHOLD UNDER 18
             0. None
             1. One
             2. Two
             3. Three
             4. Four
             5. Five
             6. Six
             7. Seven
             8. Eight
             9. Nine or more
 11    HOUSEHOLDER STATUS
             1. Own
             2. Rent
             3. Live with Parents/Family
 12    TYPE OF HOME
             1. House
             2. Condominium
             3. Apartment
             4. Mobile Home
             5. Other
 13    ETHNIC CLASSIFICATION
             1. American Indian
             2. Asian
             3. Black
             4. East Indian
             5. Hispanic
             6. Pacific Islander
             7. White
             8. Other
  14    WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
             1. English
             2. Spanish
             3. Other

Our task is to predict the variable “Income”.

So, let us first load the data (you can find the correctly formatted csv-file here: marketing.csv), have a look at some of its characteristics and perform a little bit of additional formatting. After that we divide it into a training (80%) and a test set (20%) to account for potential overfitting (also see Learning Data Science: Modelling Basics):

data <- read.csv("data/marketing.csv") # change path accordingly
dim(data)
## [1] 6876   14

str(data)
## 'data.frame':    6876 obs. of  14 variables:
##  $ Sex             : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ MaritalStatus   : int  1 1 5 5 1 5 3 1 1 5 ...
##  $ Age             : int  5 3 1 1 6 2 3 6 7 2 ...
##  $ Education       : int  5 5 2 2 4 3 4 3 4 4 ...
##  $ Occupation      : int  5 1 6 6 8 9 3 8 8 9 ...
##  $ YearsInSf       : int  5 5 5 3 5 4 5 5 4 5 ...
##  $ DualIncome      : int  3 2 1 1 3 1 1 3 3 1 ...
##  $ HouseholdMembers: int  5 3 4 4 2 3 1 3 2 1 ...
##  $ Under18         : int  2 1 2 2 0 1 0 0 0 0 ...
##  $ HouseholdStatus : int  1 2 3 3 1 2 2 2 2 2 ...
##  $ TypeOfHome      : int  1 3 1 1 1 3 3 3 3 3 ...
##  $ EthnicClass     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Language        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Income          : int  9 9 1 1 8 1 6 2 4 1 ...

data <- data.frame(data[-ncol(data)], Income = factor(data$Income))

set.seed(12)
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

We start with a simple but comprehensible model, OneR (on CRAN), as a benchmark:

library(OneR)
data <- optbin(data_train)
model <- OneR(data, verbose = TRUE)
## 
##     Attribute        Accuracy
## 1 * Age              28.2%   
## 2   MaritalStatus    28.11%  
## 3   Occupation       28.07%  
## 4   HouseholdStatus  27.56%  
## 5   DualIncome       27.04%  
## 6   Education        25.98%  
## 7   HouseholdMembers 22.51%  
## 8   Under18          20.69%  
## 9   TypeOfHome       19.36%  
## 10  EthnicClass      19.29%  
## 11  Sex              18.07%  
## 12  Language         17.82%  
## 13  YearsInSf        17.75%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Age = 1 then Income = 1
## If Age = 2 then Income = 1
## If Age = 3 then Income = 6
## If Age = 4 then Income = 8
## If Age = 5 then Income = 8
## If Age = 6 then Income = 8
## If Age = 7 then Income = 6
## 
## Accuracy:
## 1551 of 5500 instances classified correctly (28.2%)
## 
## Contingency table:
##       Age
## Income     1     2     3     4     5    6    7  Sum
##    1   * 421 * 352    99    43    21   15   25  976
##    2      16   204   107    39    13   22   33  434
##    3       9   147   122    49    12   21   35  395
##    4       5   121   188    71    39   29   42  495
##    5       3    77   179    81    29   23   34  426
##    6      10    93 * 234   156    70   56 * 47  666
##    7      12    92   185   155   107   66   33  650
##    8      12   111   211 * 251 * 160 * 86   44  875
##    9      11    76   114   187   104   69   22  583
##    Sum   499  1273  1439  1032   555  387  315 5500
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 2671.2, df = 48, p-value < 2.2e-16

plot(model)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    232   45   46   32   33   27   19   27   24  485
##        2      0    0    0    0    0    0    0    0    0    0
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     31   30   44   44   41   66   44   57   50  407
##        7      0    0    0    0    0    0    0    0    0    0
##        8     16   20   20   47   27   87   71  110   86  484
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.17 0.03 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.35
##        2   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.02 0.02 0.03 0.03 0.03 0.05 0.03 0.04 0.04 0.30
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.05 0.08 0.06 0.35
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.2965 (408/1376)
## 
## Error rate:
## 0.7035 (968/1376)
## 
## Error rate reduction (vs. base rate):
## 0.1176 (p-value < 2.2e-16)

What can we conclude from this? First, the most important feature is “Age” while “Marital Status”, “Occupation” and “Household Status” perform comparably well. The overall accuracy (i.e. the percentage of correctly predicted instances) is with about 30% not that great, on the other hand not that extraordinarily uncommon when dealing with real-world data. Looking at the model itself, in the form of rules and the diagnostic plot, we can see that we have non-linear relationship between Age and Income: the older one gets the higher the income bracket, except for people who are old enough to retire. This seems plausible.

OneR is basically a one step decision tree, so the natural choice for our next analysis would be to have a full decision tree built (all packages are on CRAN):

library(rpart)
library(rpart.plot)
model <- rpart(Income ~., data = data_train)
rpart.plot(model, type = 3, extra = 0, box.palette = "Grays")

prediction <- predict(model, data_test, type = "class")
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    201   36   22   13   16   12    8   15   12  335
##        2     43   25   32   22   17   12   10   14    6  181
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     18   24   40   50   42   68   32   33   22  329
##        7      0    0    0    0    0    0    0    0    0    0
##        8     17   10   16   38   26   88   84  132  120  531
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.15 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.24
##        2   0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.00 0.13
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.01 0.02 0.03 0.04 0.03 0.05 0.02 0.02 0.02 0.24
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.06 0.10 0.09 0.39
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.3096 (426/1376)
## 
## Error rate:
## 0.6904 (950/1376)
## 
## Error rate reduction (vs. base rate):
## 0.134 (p-value < 2.2e-16)

The new model has an accuracy that is a little bit better but the interpretability is a little bit worse. You have to go through the different branches to see in which income bracket it ends. So, for example, when the age bracket is below $2$ (which means that it is $1$ ) it predicts income bracket $1$ when it is bigger than $2$ and the Household Status bracket is $1$ it predicts income bracket $8$ and so on. The most important variable, as you can see is again Age (which is reassuring that OneR was on the right track) but we also see Household Status and Occupation again.

What is better than one tree? Many trees! So the next natural step is to have many trees built while varying the features and the examples that should be included in each tree. In the end it may be that different trees give different income brackets as their respective predictions, which we solve via voting as in a good democracy. This whole method is fittingly called Random Forests and fortunately there are many good packages available in R. We use the randomForest package (also on CRAN) here:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

set.seed(4543) # for reproducibility
model <- randomForest(Income ~., data = data_train, importance = TRUE)
varImpPlot(model)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    223   35   26   16   19   11    9   18   16  373
##        2     24   15   12   20    7   11    1    4    1   95
##        3      9   10   11   14    9    6    3    4    2   68
##        4      5   15   25   22   10   22    6    9    5  119
##        5      2    2    8    9    6   12    6    3    1   49
##        6      3    5   15   14   19   40   23   17   15  151
##        7      8    4    7   13   14   26   25   24    5  126
##        8      3    8    5   11   13   44   49   87   68  288
##        9      2    1    1    4    4    8   12   28   47  107
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.16 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.27
##        2   0.02 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.07
##        3   0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.05
##        4   0.00 0.01 0.02 0.02 0.01 0.02 0.00 0.01 0.00 0.09
##        5   0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.04
##        6   0.00 0.00 0.01 0.01 0.01 0.03 0.02 0.01 0.01 0.11
##        7   0.01 0.00 0.01 0.01 0.01 0.02 0.02 0.02 0.00 0.09
##        8   0.00 0.01 0.00 0.01 0.01 0.03 0.04 0.06 0.05 0.21
##        9   0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.03 0.08
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.3459 (476/1376)
## 
## Error rate:
## 0.6541 (900/1376)
## 
## Error rate reduction (vs. base rate):
## 0.1796 (p-value < 2.2e-16)

As an aside you can see that the basic modelling workflow stayed the same – no matter what approach (OneR, decision tree or random forest) you chose. This standard is kept for most (modern) packages and one of the great strengths of R! With thousands of packages on CRAN alone there are of course sometimes differences but those are normally well documented (so the help function or the vignette are your friends!)

Now, back to the result of the last analysis: again, the overall accuracy is better but because we have hundreds of trees now the interpretability suffered a lot. This is also known under the name Accuracy-Interpretability Trade-Off. The best we can do in the case of random forests is to find out which features are the most important ones. Again Age, Occupation and Household Status show up, which is consistent with our analyses so far. Additionally, because of the many trees that had to be built, this analysis ran a lot longer than the other two.

The following table shows those trade-offs for OneR, decision trees, random forests and neural networks (deep learning):

Random forests are one of the best methods out of the box, so the accuracy of about 34% tells us that our first model (OneR) wasn’t doing too bad in the first place! Why are able to reach comparatively good results with just one feature. This is true for many real-world data sets. Sometimes simple models are not much worse than very complicated ones – you should keep that in mind!

If you play around with this dataset I would be interested in your results! Please post them in the comments – Thank you and stay tuned!

UPDATE April 16, 2023
I created a video for this post (in German):

17 thoughts on “Learning Data Science: Predicting Income Brackets”

Pingback: Learning Data Science: Modelling Basics – Learning Machines
vlad says:

March 20, 2019 at 8:33 am

Very nice.
Thank you !!!

1. Learning Machines says:
  
  October 25, 2019 at 8:40 am
  
  Thank you, much appreciated!
  
Peter Stephensen says:

March 25, 2019 at 4:26 pm
Hi

Looks interesting.

When I do this:
data <- read.csv("data/marketing.dat")

I get the error:
```
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed
```
Best,
Peter
1. Learning Machines says:
  
  March 25, 2019 at 4:48 pm
  
  Good point, thank you for the heads-up! I uploaded the correctly formatted csv-file to my server: marketing.csv)
  
  You will have to change “dat” to “csv” – hope this helps
  
  (Also edited the post accordingly)
  
Pingback: Understanding AdaBoost – or how to turn Weakness into Strength – Learning Machines
Pingback: Data Science on Rails: Analyzing Customer Churn – Learning Machines
Pingback: Learning Data Science: Sentiment Analysis with Naive Bayes – Learning Machines
Pingback: Explainable AI (XAI)… Explained! Or How to whiten any Black Box with LIME – Learning Machines
Pingback: ZeroR: The Simplest Possible Classifier… or: Why High Accuracy can be Misleading – Learning Machines
tester says:

June 1, 2020 at 3:55 am

Holger – thanks for yet another
clear, excellent post.

quick Q:
——
the RandomForest plot
shows 2 plots side-by-side:
* Accuracy and
* Gini.

Which plot should we use
to determine Var importance?.

In simple terms,
what is the difference btw these 2 plots?.

Thanks / Danke!
Ray
SF

1. Learning Machines says:
  
  June 2, 2020 at 9:33 am
  
  Thank you for your question. You can find an excellent discussion on Stats.SE: How to interpret Mean Decrease in Accuracy and Mean Decrease GINI in Random Forest models.
  
Yun Tian says:

January 24, 2021 at 6:02 pm
Thank you for the helpful blogs.
when I run:
```
library(OneR)
data <- optbin(data_train)
model <- OneR(data, verbose = TRUE)
```
then got the warning:
Warnmeldung:
```
In OneR.data.frame(data, verbose = TRUE) :
  data contains unused factor levels
```
1. Learning Machines says:
  
  January 25, 2021 at 12:57 pm
  
  Thank you for your great feedback, highly appreciated.
  
  As stated in the documentation: “If data contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.”
  
olivier says:

November 20, 2022 at 10:39 am
Hi, thanks for your blogs and all the work, super excited with getting into R programming easily!

just a quick Q:
when I run:
```
data_train <- data[random, ]
data_test <- data[-random, ]
library(OneR)
data <- optbin(data_train)
model  data <- optbin(data_train)
Error in optbin.data.frame(data_train) : 
  number of target levels must be bigger than 1
```
thanks in advance!
best Oli
1. Learning Machines says:
  
  November 20, 2022 at 2:13 pm
  Dear olivier: Thank you for feedback!
  
  I unfortunately cannot reproduce your error. Have you run
```
data <- read.csv("data/marketing.csv") # change path accordingly
dim(data)
str(data)
data <- data.frame(data[-ncol(data)], Income = factor(data$Income))

set.seed(12)
random <- sample(1:nrow(data), 0.8 * nrow(data))
```
  before that and does it give you the same results as in the text (i.e. that the dataset was loaded correctly)?
  
  best,
  h
Pingback: Learning Data Science: Predictive Maintenance with Decision Trees – Learning Machines