The Naive Bayes Algorithm: Step by Step

[mathjax]The naive Bayes algorithm is fundamental to machine learning, combining good predictive power with a compact representation of the predictors. In many cases, naive Bayes is a good initial choice to estimate a baseline model, which can then be refined or compared to more sophisticated models.

In this post, I’m going to step through the naive Bayes algorithm using standard R functions. The goal is not to come up with production-level code (R already has that), but to use R for the “grunt work” needed to implement the algorithm. This lets us focus on how naive Bayes makes predictions, rather than on the mechanics of the probability calculations.

Files Used in This Post

You can download the complete script at the following address.

Files for post “The Naive Bayes Algorithm: Step by Step”

The Idea Behind the Algorithm

We start with a standard machine learning problem: given a data set of features and targets, how can we predict the target of a new query based on its features? Assume we have:

  1. A list of customers
  2. Measures of their engagement with our website (each with the value Low, Moderate, or High)
  3. Whether each customer signed up for Premium service

In classical statistics, we treat the values of the measurements as evidence, and ask whether there are more customers with specific values who signed up for Premium service, or who did not subscribe. For example, assume the measures of engagement are (Low, Low, Moderate), and that 4 customers with those values signed up for Premium service, while 10 did not. We would predict a customer with that level of engagement would not subscribe.

One complication is that relatively few customers sign up compared to those who don’t: given the same measurements, we’re less likely to predict a customer will sign up because our data set has fewer cases where one did. Bayes’ Theorem mitigates this by weighting the measurements with the unconditional probability that the customer did or did not sign up: that is, the probability that a customer subscribed, regardless of their engagement.

This revision leads to better predictions; however, it requires us to estimate the relationships among all the measures, as well as between each measure and the target. In real data sets, this isn’t feasible: instead, we assume the measures are independent of one another given the target, and calculate only the relationship between each measure and the target. This is the naive Bayes model, so called because the assumption of independence is made whether or not it is literally true.
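
Formally, for a query with feature values \(f_1, \dots, f_k\) and a candidate target level \(t\), the naive Bayes estimate is

$$P(t \mid f_1, \dots, f_k) \propto P(t)\prod_{i=1}^{k} P(f_i \mid t)$$

where \(P(t)\) is the prior probability of the target level and \(P(f_i \mid t)\) is the conditional probability of each feature value given that level; we predict the target level with the largest estimate.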

A Step-By-Step Example

For this example, we assume that an accounting firm wants to predict the audit risk of a client based upon the characteristics of their tax return. There are four characteristics that we are interested in, listed below along with their possible values.

  1. Percentage of Business that is Cash-Based: “Less than 25%”, “Between 25% & 50%”, “More than 50%”
  2. Number of Wire Transfers: “Zero”, “Less than 5”, “5 or More”
  3. Home Office: “Yes”, “No”
  4. Amount of Tax Owed: “Less than $1,000”, “Between $1,000 and $5,000”, “More than $5,000”

The target (Audit Risk) can be Low, Moderate, or High. We want to develop a naive Bayesian model that predicts the Audit Risk for a generic client based upon these characteristics.

Preliminaries

We start by removing all objects from the workspace, then creating the data set and assigning factor levels to make it easier to interpret.

# NaiveBayesStepByStep.R
# Illustrates the calculations used by the naive Bayes algorithm to predict a query target level given a training data set
# c. 2017 David Schwab

# Preliminaries

rm(list=ls()) 
options(digits=3)    # This makes the numbers display better; feel free to change it.

# Construct the data set

pct.cash <- factor(c(1,1,1,2,2,1,2,2,1,3,3,3,3,2,3,3,1,1,2))
wire.transfers <- factor(c(1,1,2,1,2,1,2,2,1,3,3,3,3,2,3,3,1,1,2))
home.office <- factor(c(1,1,1,1,2,2,1,2,2,2,2,1,1,1,2,1,2,1,2))
tax.owed <- factor(c(1,1,1,2,2,2,1,1,2,3,3,3,3,1,3,3,2,1,1))
audit.risk <- factor(c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,1,2,3,2,1))

# Add the factor levels and make the data frame

levels(pct.cash) <- c("Less than 25%", "Between 25% & 50%", "More than 50%")
levels(wire.transfers) <- c("Zero", "Less than 5", "5 or More")
levels(home.office) <- c("Yes", "No")
levels(tax.owed) <- c("Less than $1,000", "Between $1,000 and $5,000", "More than $5,000")
levels(audit.risk) <- c("Low", "Moderate", "High")

audit.risk.data <- data.frame(pct.cash,wire.transfers,home.office,tax.owed,audit.risk)
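
Before moving on, a quick optional check (not part of the original script) confirms the frame was assembled as intended:

# Confirm: 19 observations of 5 factor variables
str(audit.risk.data)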

Calculate Conditional Probabilities for Each Feature

Next, we calculate the conditional probability of each feature value given each target level. There are 9+9+6+9=33 separate probabilities, so we will let R do the calculation using table() and apply().

To start, we count the number of times each target level occurs and store it as audit.risk.targets. We use table() to get the counts, then cast them to numeric.

Next, we call apply() with an in-line user-defined function: for each feature, this function creates a contingency table between that feature and the target. We transpose the table and divide it by audit.risk.targets, which R recycles so that each row (one per target level) is divided by the corresponding target count. The result, audit.risk.cond.prob, is a list holding, for each feature, the conditional probability of each feature level given each target level (for housekeeping, we delete the final element, which tabulates audit.risk against itself).

# Next, count the instances of each target level

audit.risk.targets <- as.numeric(table(audit.risk.data$audit.risk))

# Now, calculate the conditional probabilities needed to predict a target with the naive Bayes model

audit.risk.cond.prob <- apply(audit.risk.data, 2, function(x){
  # Rows of the transposed table are target levels; dividing by the target
  # counts (recycled down the columns) turns counts into conditional probabilities
  t(table(x,audit.risk.data$audit.risk)) / audit.risk.targets
})

audit.risk.cond.prob$audit.risk <- NULL       # Remove extraneous data
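
As an optional sanity check (my addition, not in the original script), the rows of each conditional table should sum to 1, since the feature levels partition the cases at each target level:

# Each row sums over all levels of one feature, so the result should be 1
rowSums(audit.risk.cond.prob$pct.cash)    # expect 1 for Low, Moderate, and High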

Calculate Prior Probabilities for Each Target Level

All that’s left is to calculate the prior probabilities for each target level: this is just the count of each level divided by the total number of data points. We also create a separate display variable for easier interpretation.

# Calculate the target priors

audit.risk.priors <- audit.risk.targets / nrow(audit.risk.data)
audit.risk.priors.display <- data.frame(target=c("Low", "Moderate", "High"),priors=audit.risk.priors)
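
A quick optional check: since the priors are relative frequencies, they should sum to 1.

sum(audit.risk.priors)    # expect 1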

Using the Model to Make a Prediction

To make a prediction, we display both the prior and conditional probabilities. Here is how they look; note that we index each feature’s table by the levels defined earlier, so that the columns appear in their natural order.

> audit.risk.priors.display

    target priors
1      Low  0.316
2 Moderate  0.368
3     High  0.316

> audit.risk.cond.prob$pct.cash[,levels(pct.cash)]
          x
           Less than 25% Between 25% & 50% More than 50%
  Low              0.500             0.333         0.167
  Moderate         0.429             0.429         0.143
  High             0.167             0.167         0.667

> audit.risk.cond.prob$wire.transfers[,levels(wire.transfers)]
          x
            Zero Less than 5 5 or More
  Low      0.500       0.333     0.167
  Moderate 0.429       0.429     0.143
  High     0.167       0.167     0.667

> audit.risk.cond.prob$home.office[,levels(home.office)]
          x
             Yes    No
  Low      0.667 0.333
  Moderate 0.429 0.571
  High     0.500 0.500

> audit.risk.cond.prob$tax.owed[,levels(tax.owed)]
          x
           Less than $1,000 Between $1,000 and $5,000 More than $5,000
  Low                 0.667                     0.167            0.167
  Moderate            0.429                     0.429            0.143
  High                0.167                     0.167            0.667

To make a prediction, consider the query q=(Between 25% & 50%, Less than 5, No, Less than $1,000). We need to calculate the naive Bayes estimate for each target level; the prediction is the target level with the greatest value.

By inspection, we can see that when the target level is Low, the conditional probabilities are 0.333, 0.333, 0.333, and 0.667: the joint conditional probability is their product, which is 0.025. The prior probability of Low is 0.316, so we multiply the product by the prior to get the naive Bayes estimate of 0.008. Similar estimates for Moderate and High are 0.017 and 0.001, respectively. The level Moderate has the greatest value, so that is the naive Bayes prediction for this query.
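
Rather than pulling the numbers out by inspection, we can let R do the multiplication. The sketch below is my addition, not part of the original script; query and scores are hypothetical helper names.

# Look up the query's column in each conditional table and multiply into the priors
query <- c(pct.cash       = "Between 25% & 50%",
           wire.transfers = "Less than 5",
           home.office    = "No",
           tax.owed       = "Less than $1,000")

scores <- audit.risk.priors
names(scores) <- levels(audit.risk)
for (f in names(query)) {
  scores <- scores * audit.risk.cond.prob[[f]][, query[f]]
}

scores                      # approximately 0.0078 0.0166 0.0007
names(which.max(scores))    # "Moderate"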

NOTE: In this example, the naive Bayes estimates are not the actual posterior probabilities of each target level, because we have not divided by the probability of the query itself; however, their relative ranking matches that of the actual posteriors, so we can be confident that Moderate is the correct prediction.
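
Using the scores vector from the sketch above, dividing by the sum (the probability of the query) recovers the actual posteriors, and the ranking is unchanged:

# Normalize the estimates to get the true posterior probabilities
scores / sum(scores)    # approximately 0.31 (Low), 0.66 (Moderate), 0.03 (High)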

Conclusion

As we can see, the naive Bayes algorithm allows a complex data set to be represented by a relatively small set of prior and conditional probabilities. It also performs well in many different applications, and has the intuitive appeal of predicting the most probable target level based on the relative frequency of the different target levels, as well as the set of features. For production work, the e1071 package provides the naiveBayes() function: you can find a good overview of its use (as well as more detail on the theory behind the algorithm) here.
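
For comparison, here is a minimal sketch of the production route through e1071 (assuming the package is installed); its prediction should agree with the hand calculation above.

# install.packages("e1071")    # uncomment if the package is not installed
library(e1071)

# Fit the model with the formula interface, then predict the same query
nb.model <- naiveBayes(audit.risk ~ ., data = audit.risk.data)

new.client <- data.frame(pct.cash       = "Between 25% & 50%",
                         wire.transfers = "Less than 5",
                         home.office    = "No",
                         tax.owed       = "Less than $1,000")

predict(nb.model, new.client)                  # expect "Moderate"
predict(nb.model, new.client, type = "raw")    # the posterior probabilities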

Troubleshooting at a standstill? Feel free to reach out!