Naive Bayes Classifiers

Hey friends! 👋 It's me, Miss Neura, here today to unpack the Naive Bayes classifier.

Now I know "naive" doesn't sound very flattering in the name. 😅 But don't let that fool you!

Naive Bayes is actually a super simple yet powerful algorithm for classification tasks like spam detection and sentiment analysis. 🎉

It works by calculating conditional probabilities based on Bayes' theorem and an assumption of independence between features.

I know that sounds a little math-y, but stick with me! 🤓 I'll break down Bayes and the "naive" assumption piece by piece in easy-to-understand terms.

By the end, you'll have a clear understanding of how Naive Bayes ingests data to make predictions for categorical variables. 📈 The key is maximizing those probabilities!

Let's start with a quick history lesson to see where Naive Bayes originated before we dive into the nitty gritty details. ⏳

Bayes History

The original Bayes' theorem dates back to the 1700s when Thomas Bayes first described it.

The theorem provided a way to calculate conditional probabilities.

It laid the foundation for understanding evidence-based statistics and probabilistic reasoning.

Over the years, Bayes' theorem became an important tool across fields like economics, medicine, and computing.

Fast forward to the 1960s - researchers started extending Bayes for classifying data in machine learning.

But it took high levels of computation to estimate the probabilities needed.

Then in the 1990s, Naive Bayes took off for text classification tasks like spam filtering, where the "naive" conditional independence assumption kept the calculations simple and fast. 💡

That's how we ended up with the Naive Bayes classifier we know and love today! 🥰

Now let's dive into exactly how Naive Bayes works its probabilistic magic! 🎩 ✨

How Naive Bayes Works

The "naive" in Naive Bayes comes from an assumption - all the โ€œfeaturesโ€ we use are totally independent from each other! โœ‹๐Ÿคš

For example, say we're building a spam filter using words in the email as features. ๐Ÿ“ง

The naive assumption means the word "free" appearing has nothing to do with the word "money" appearing. ๐Ÿ’ฐ

In the real world, this is often false - spam emails tend to have multiple sketchy words together. ๐Ÿ˜ฌ

But it makes the math so much easier! ๐Ÿ˜… We just calculate the probability of each word on its own.
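Spelled out as a formula (just to make the assumption concrete with our two example words), independence lets us split the joint likelihood into a simple product: P("free", "money" | spam) = P("free" | spam) x P("money" | spam). No interactions between words to worry about!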

To classify an email as spam or not spam, we:

1️⃣ Find the base rate of spam emails (the prior probability of spam)

2️⃣ Calculate the probability of each word appearing in spam emails and not spam emails (the likelihoods)

3️⃣ Use Bayes' theorem to multiply these together and get the posterior probability that the email is spam

Posterior ∝ Prior x Likelihood1 x Likelihood2 x Likelihood3... 🧮

(That "∝" means "proportional to" - we skip dividing by the overall probability of the words, since it's the same for both classes and doesn't change which one wins.)

4️⃣ Compare the posterior probability of spam vs not spam

Whichever posterior is higher tells us how to classify the email! 💌

So in a nutshell:

  • Assume feature independence to make the math easy 💪
  • Calculate prior and likelihoods across features 📈
  • Multiply to find posterior probabilities 🧮
  • Classify based on the highest posterior! 🏆 (see the little code sketch below)
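Here's a tiny from-scratch sketch of those four steps in Python - the emails, words, and counts are all made up just for illustration, and I've added simple "+1" (Laplace) smoothing so unseen words don't zero everything out:

```python
from collections import Counter

# Made-up training emails, each represented by the words it contains
spam_emails = [["free", "money", "now"], ["free", "prize"], ["money", "offer"]]
ham_emails  = [["meeting", "tomorrow"], ["project", "update"], ["lunch", "money"]]

def count_words(emails):
    """Count how many emails in a class contain each word (once per email)."""
    counts = Counter()
    for words in emails:
        counts.update(set(words))
    return counts

spam_counts, ham_counts = count_words(spam_emails), count_words(ham_emails)
n_spam, n_ham = len(spam_emails), len(ham_emails)

# Step 1: priors (base rates)
prior_spam = n_spam / (n_spam + n_ham)
prior_ham  = n_ham / (n_spam + n_ham)

def score(words, counts, n_class, prior):
    """Steps 2-3: prior times each word's likelihood (the naive independence step).
    The +1 / +2 is Laplace smoothing so an unseen word doesn't give a zero."""
    posterior = prior
    for word in set(words):
        posterior *= (counts[word] + 1) / (n_class + 2)
    return posterior

def classify(words):
    # Step 4: pick whichever class has the higher (unnormalized) posterior
    spam_score = score(words, spam_counts, n_spam, prior_spam)
    ham_score = score(words, ham_counts, n_ham, prior_ham)
    return "spam" if spam_score > ham_score else "not spam"

print(classify(["free", "money"]))  # -> "spam" with this toy data
```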

The Algorithm

Let's walk through the key steps of the Naive Bayes algorithm to see the math in action.

We'll use a simple example trying to classify emails as spam or not spam based on 2 keyword features: contains "free" and contains "money".

1️⃣ Gather your training data

We need a training set with emails labeled as spam or not spam to start. Let's say we have 100 emails:

  • 20 are spam
  • 80 are not spam

2️⃣ Calculate the prior probabilities

The prior probability of an email being spam P(spam) is 20/100 or 0.2

The prior probability of not spam P(not spam) is 80/100 or 0.8

These are our base rates before seeing any email features.

3️⃣ Calculate the likelihood probabilities

Let's say in the training data:

  • 15 of the 20 spam emails contain the word "free"
  • 5 of the 80 NOT spam emails contain "free"

So the likelihood P("free"|spam) is 15/20 = 0.75

And P("free"|not spam) is 5/80 = 0.0625

We then do the same for the "money" feature.
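If you'd like to see that counting in code, here's the likelihood step as a few lines of Python - the "money" counts are hypothetical, since we only tallied "free" above:

```python
# Likelihoods for "free", from the counts above
p_free_given_spam = 15 / 20   # 0.75
p_free_given_ham = 5 / 80     # 0.0625

# Hypothetical counts for "money", just to show the same calculation
p_money_given_spam = 12 / 20  # 0.6
p_money_given_ham = 4 / 80    # 0.05
```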

4️⃣ Multiply likelihoods and prior to get posteriors

For an email containing both "free" and "money", the posterior probabilities are proportional to:

P(spam|"free","money") ∝ P(spam) x P("free"|spam) x P("money"|spam)

P(not spam|"free","money") ∝ P(not spam) x P("free"|not spam) x P("money"|not spam)

(As before, we can drop the shared denominator P("free","money") because it's the same for both classes.)

5️⃣ Classify based on highest posterior

If P(spam|"free","money") is higher, we classify the email as spam!
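Plugging in the numbers from this walkthrough (and reusing the hypothetical "money" likelihoods from the sketch above), the comparison works out like this in Python:

```python
# Priors and likelihoods from the walkthrough
p_spam, p_ham = 0.2, 0.8
p_free_spam, p_free_ham = 0.75, 0.0625
p_money_spam, p_money_ham = 0.6, 0.05  # hypothetical "money" likelihoods

# Unnormalized posteriors: prior x likelihoods
score_spam = p_spam * p_free_spam * p_money_spam  # 0.2 * 0.75 * 0.6 = 0.09
score_ham = p_ham * p_free_ham * p_money_ham      # 0.8 * 0.0625 * 0.05 = 0.0025

print("spam" if score_spam > score_ham else "not spam")  # -> "spam"
```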

The Advantages

Fast and simple ⚡️

  • The naive assumption dramatically reduces computation time compared to other algorithms.
  • Training is much quicker than for neural networks or SVMs.

Performs well with small data 📊

  • Unlike other algorithms, NB doesn't need tons of training data to estimate robust probabilities.
  • Can learn from fewer examples and still make decent predictions.

Easy to implement 💻

  • The math equations are pretty simple to code up.
  • Much less programming complexity compared to sophisticated techniques.

Interpretable 🕵️‍♀️

  • Since NB relies on conditional probabilities, we can inspect which features contribute most strongly to each class.
  • More transparent than black box models.

Resilient to irrelevant features 💪

  • Adding unnecessary inputs doesn't affect the model too much.
  • Because each feature's probability is estimated on its own, irrelevant features tend to score similarly for every class and mostly cancel out.

Disadvantages of Naive Bayes

Naive assumption 🤨

  • Features are usually not completely independent in real-world data.
  • When the independence assumption is violated, the estimated probabilities become inaccurate, even if the final label is often still reasonable.

Presumes dataset distributions 📈

  • For continuous features, variants like Gaussian Naive Bayes assume the data fits standard distribution shapes like the bell curve.
  • Real-world data may not match these expected distributions.

Prone to overfitting 🤪

  • With lots of features, it's easy to overfit to the training data.
  • The result is poor generalization to new data - piling on too many inputs over-specifies the model.

Probabilities hard to trust 📉

  • The independence assumption tends to make the predicted probabilities overconfident and poorly calibrated.
  • The class labels are usually fine, but treat the raw posterior values with caution (or calibrate them separately).

Not suitable for complex data 🔮

  • Correlated and nonlinear feature relationships violate the independence assumption.
  • Struggles with data like images, audio, and video.

Applications

Spam filtering 📧

  • Classify emails as spam or not spam based on content features.
  • The naive assumption performs well enough here.

Sentiment analysis 😀😡

  • Determine positive or negative sentiment of texts like reviews.
  • Independent word probabilities work well.

Recommender systems 🛍️

  • Recommend products based on past likes/dislikes and product features.
  • Probabilities help identify preferences.

Text classification 📑

  • Categorize texts into topics based on word probabilities.
  • Useful for topic modeling and document organizing.

Disease prediction 🩺

  • Predict presence of disease given diagnostic test outcomes.
  • Test results can be used as independent features.  

TL;DR

  • Naive Bayes is a fast, simple classification algorithm that calculates probabilities based on Bayes' theorem and an independence assumption.
  • It performs well on small data where its simplicity is an advantage over more complex methods.
  • Best suited for problems like spam, sentiment, and recommendations where some independence between features exists.
  • Not appropriate for complex, correlated data like images, audio, or video.
  • Overall, Naive Bayes provides a useful balance of simplicity, speed, and performance!

Vocab List

Bayes' theorem - Defines conditional probability P(A|B) as P(B|A)P(A)/P(B).

Likelihood - Probability of data given a hypothesis, P(D|H).

Prior probability - Initial probability before new evidence, P(H).

Posterior probability - Updated probability after new evidence, P(H|D).

Conditional independence - Assumption that features are unrelated to one another once the class is known.

Gaussian distribution - Normal distribution shaped like a bell curve.
