casey jenkins Code, Technology & Data Science

Machine Learning with the Random Forest Algorithm

Analysis-Random-Forest.r

Mushroom Data Source archive.ics.uci.edu

Load the needed R libraries

#Load needed libraries
source('prepare_functions.R')
library(randomForest)
library(e1071)
library(caret)
library(ggplot2)

#Set a consistent seed
set.seed(123)

#Major clean and preparation, see prepare_functions.R
data = prepareAndCleanData()

#Check the first couple of rows of data
head(data)
 

Just a few of the 23 variables

     Edible CapShape CapSurface CapColor Bruises    Odor GillAttachment
1 Poisonous   Convex     Smooth    Brown    True Pungent           Free
2    Edible   Convex     Smooth   Yellow    True  Almond           Free
3    Edible     Bell     Smooth    White    True   Anise           Free
4 Poisonous   Convex      Scaly    White    True Pungent           Free
5    Edible   Convex     Smooth     Gray   False    None           Free
6    Edible   Convex      Scaly   Yellow    True  Almond           Free

Create data for training and testing.

This is the backbone of machine learning. We take the data and split it into two groups: one is used to train the model, the other is run through the trained model to test it. The training data will be around ~400 records and the test data will be ~7700 records.

sample.ind = sample(2,
                    nrow(data),
                    replace = TRUE,
                    prob = c(0.05, 0.95))
data.dev = data[sample.ind == 1, ]
data.val = data[sample.ind == 2, ]
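Because `sample()` labels each row independently, the two group sizes are random rather than exact. A minimal sketch of that behaviour, assuming a stand-in row count of 8124 (the sum of the two confusion matrices further down) instead of the real data frame:

```r
# Sketch only: n is a stand-in row count, not the prepared mushroom data.
set.seed(123)
n <- 8124
probs <- c(0.05, 0.95)   # group 1 = training, group 2 = test

# Each row is independently labeled 1 or 2 with the given probabilities
sample.ind <- sample(2, n, replace = TRUE, prob = probs)

expected.sizes <- n * probs                    # average sizes over many draws
actual.sizes <- tabulate(sample.ind, nbins = 2) # sizes of this particular draw

print(expected.sizes)
print(actual.sizes)
```

The actual sizes will sit near the expected ones but will change with the seed. If exact or class-balanced sizes matter, a stratified split is worth considering.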

Next, check that the training and test data are roughly equal in makeup. The data sets look good: the edible vs poisonous proportion is about the same in the full data, the training data and the test data.

table(data$Edible)/nrow(data)

   Edible Poisonous 
0.5179714 0.4820286
table(data.dev$Edible)/nrow(data.dev)

   Edible Poisonous 
   0.5025    0.4975
table(data.val$Edible)/nrow(data.val)

   Edible Poisonous 
0.5187727 0.4812273

Let's run the random forest algorithm with 100 trees. More trees generally give a more stable result, but each one adds training time, so we don't want to overdo it.

rf = randomForest(Edible ~ ., 
                   ntree = 100,
                   data = data.dev)

plot(rf)

rf plot

This plot shows the error rate against the number of trees, which tells us whether the number of trees we picked is enough to give good results. As you can see, there is still a minor spread in the error at 10-30 trees.
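The curves that `plot(rf)` draws come from the `err.rate` matrix stored on the fitted object, so the same numbers can be read off directly. A minimal sketch, using R's built-in `iris` data as a stand-in because the prepared mushroom data frame is not reproduced here:

```r
library(randomForest)

set.seed(123)
# iris is a stand-in data set, not the mushroom data
rf.demo <- randomForest(Species ~ ., data = iris, ntree = 100)

# One row per tree count; the "OOB" column is the out-of-bag error estimate
oob <- rf.demo$err.rate[, "OOB"]

# OOB error after 10, 30 and 100 trees
print(oob[c(10, 30, 100)])
```

On the mushroom model the equivalent would be `rf$err.rate[, "OOB"]`. If the values stop changing well before the last tree, adding more trees is unlikely to help.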

Plot the top 10 variables for prediction of edible vs poisonous.

varImpPlot(rf,
           sort = T,
           n.var=10,
           main="Top 10 - Variable Predictors")

Top 10 Predictors

Looks like Odor is by far the strongest indicator, followed by Spore Print Color.

var.imp = data.frame(importance(rf,
                                 type=2))

#Copy the row names into a column

var.imp$Variables <- row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini,decreasing = T),]


                       MeanDecreaseGini             Variables
Odor                       70.62464550                  Odor
SporePrintColor            34.98215873       SporePrintColor
GillColor                  14.38042458             GillColor
RingType                   11.25159749              RingType
GillSize                   11.12214770              GillSize
Population                  8.39424526            Population
StalkSurfaceAboveRing       6.91877682 StalkSurfaceAboveRing
Habitat                     6.53783046               Habitat
StalkRoot                   5.61593844             StalkRoot
StalkSurfaceBelowRing       5.59337397 StalkSurfaceBelowRing
Bruises                     5.47550779               Bruises
StalkColorBelowRing         5.08844477   StalkColorBelowRing
CapColor                    3.29438185              CapColor
GillSpacing                 2.74912949           GillSpacing
StalkColorAboveRing         2.26185667   StalkColorAboveRing
StalkShape                  1.79067923            StalkShape
CapSurface                  1.10385529            CapSurface
RingNumber                  0.93590714            RingNumber
CapShape                    0.61875375              CapShape
VeilColor                   0.13948555             VeilColor
GillAttachment              0.01928571        GillAttachment
VeilType                    0.00000000              VeilType
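`MeanDecreaseGini` is the total reduction in Gini impurity from the splits that use a variable, averaged over all trees. A rough sketch of the impurity decrease for one split, using made-up class counts purely for illustration (the package's internal scaling may differ, so only the idea carries over, not the magnitudes):

```r
# Gini impurity of a node given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical node: 50 edible, 50 poisonous, split into two children
parent <- c(edible = 50, poisonous = 50)
left   <- c(edible = 45, poisonous = 5)
right  <- c(edible = 5,  poisonous = 45)

# Decrease = parent impurity minus the size-weighted child impurities
n <- sum(parent)
decrease <- gini(parent) -
  (sum(left) / n) * gini(left) -
  (sum(right) / n) * gini(right)

print(decrease)
```

A variable such as VeilType, with a score of 0, never produced a split that reduced impurity at all.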

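Here are the 400 training samples run back through the model. Accuracy is 100%, with a 95% confidence interval of (0.9908, 1). Because these are the same records the model was trained on, this figure is optimistic; the test data results further down are the better guide. See the Confusion Matrix and Statistics below.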

#Predicting response variable

data.dev$predicted.response <- predict(rf , data.dev)

# Create Confusion Matrix for the training data

#Calculates a cross-tabulation of observed and predicted classes with associated statistics.

confusionMatrix(data = data.dev$predicted.response,
                reference = data.dev$Edible,
                positive = 'Edible')
               
## Confusion Matrix and Statistics
## 

##            Reference
## Prediction  Edible Poisonous
##   Edible       201         0
##   Poisonous      0       199
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9908, 1)
##     No Information Rate : 0.5025     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5025     
##          Detection Rate : 0.5025     
##    Detection Prevalence : 0.5025     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Edible     
## 
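The `95% CI` line in the output above is a confidence interval for the accuracy. caret documents it as coming from an exact binomial test, so base R's `binom.test` should reproduce the bounds for 400 correct predictions out of 400. A quick check:

```r
# 400 correct out of 400 training predictions (from the matrix above)
ci <- binom.test(x = 400, n = 400, conf.level = 0.95)$conf.int

print(round(ci, 4))
```

The lower bound reflects only how much 400 records can tell us; it says nothing about how the model behaves on records it has not seen.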

#Predicting response variable

data.val$predicted.response <- predict(rf ,data.val)

#Create Confusion Matrix for the test data
#Calculates a cross-tabulation of observed and predicted classes with associated statistics.

confusionMatrix(data=data.val$predicted.response,
                reference=data.val$Edible,
                positive='Edible')
                
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Edible Poisonous
##   Edible      4007         8
##   Poisonous      0      3709
##                                          
##                Accuracy : 0.999          
##                  95% CI : (0.998, 0.9996)
##     No Information Rate : 0.5188         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.9979         
##  Mcnemar's Test P-Value : 0.01333        
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.9978         
##          Pos Pred Value : 0.9980         
##          Neg Pred Value : 1.0000         
##              Prevalence : 0.5188         
##          Detection Rate : 0.5188         
##    Detection Prevalence : 0.5198         
##       Balanced Accuracy : 0.9989         
##                                          
##        'Positive' Class : Edible         
## 
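The headline statistics can be recomputed by hand from the four counts in the matrix above, which makes clear what each one measures. A small sketch, with 'Edible' as the positive class as in the call to `confusionMatrix`:

```r
# Counts copied from the test-set confusion matrix above
tp <- 4007  # edible, predicted edible
fp <- 8     # poisonous, predicted edible
fn <- 0     # edible, predicted poisonous
tn <- 3709  # poisonous, predicted poisonous

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of edible mushrooms labeled edible
specificity <- tn / (tn + fp)   # share of poisonous mushrooms labeled poisonous
ppv         <- tp / (tp + fp)   # how often an "Edible" prediction is right

print(round(c(accuracy, sensitivity, specificity, ppv), 4))
```

For this problem the 8 false positives are the costly errors, since each one is a poisonous mushroom called edible, so specificity is the number to watch.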

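On the ~7,700 test records the model is 99.9% accurate: it labels only 8 poisonous mushrooms as edible and never labels an edible one as poisonous. Odor and spore print color are the strongest predictors.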

#Let's plot some jitter classification plots to show which odors and spore print colors indicate poisonous mushrooms

#This plot shows that CapShape and CapSurface really don't give a good indication of edible mushrooms
p = ggplot(data,aes(x=CapShape, y=CapSurface, color=Edible))
p + geom_jitter(alpha=0.3) + scale_color_manual(breaks = c('Edible','Poisonous'),values=c('blue','orange'))

ggPlot Jitter Plots

CapShape vs CapSurface

Odor vs Spore Print Color

Odor vs Edible

Spore Print Color vs Edible

Fishy, foul and spicy odors indicate the mushroom is poisonous, so get smelling.
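The same odor-by-class picture can be had as a table instead of a plot. A minimal sketch using only the six rows printed by `head(data)` earlier, retyped by hand; on the full data the call would presumably be `table(data$Odor, data$Edible)`:

```r
# The six rows shown by head(data), retyped by hand (not the full data set)
first.rows <- data.frame(
  Edible = c("Poisonous", "Edible", "Edible", "Poisonous", "Edible", "Edible"),
  Odor   = c("Pungent", "Almond", "Anise", "Pungent", "None", "Almond")
)

# Rows are odors, columns are classes; margin = 1 makes each row sum to 1
odor.tab <- prop.table(table(first.rows$Odor, first.rows$Edible), margin = 1)

print(odor.tab)
```

Six rows are far too few to draw conclusions from; the point is only the shape of the table, where an odor whose row is all in one column separates the classes cleanly.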