naive-bayes-spam-filter

WHITE PAPER NOTES


[ reference ]
-------------
Adaptive Spam Filtering
Using Only Naive Bayes Text Classifiers (CEAS 2008 PDF)


[ abstract ] 
------------ 
Core claim: plain Naive Bayes text classifiers are still competitive for spam
filtering when they are engineered carefully. The paper is not trying to sell a
fancy model; it is showing how far simple probabilistic text classification can
go with good feature selection and adaptive updates.

Two variants are emphasized:
- multinomial NB with transformed term-frequency weights (tf-nb)
- multinomial NB with Boolean features (bool-nb)

The practical angle is important: computationally cheap methods that can run
online and adapt during live traffic.


[ intro ]
---------
Reading note: the authors frame this as a realism test, not a novelty contest.
They compare text-only NB filters against more complex systems likely to appear
in challenge settings. The underlying question is:

```text
How much performance comes from a disciplined text classifier alone,
before adding all the extra production heuristics?
```

The paper positions NB as useful because it is:
- fast at inference
- easy to retrain
- cheap enough for frequent updates


[ feature selection ]
---------------------
This section does most of the heavy lifting.

Pipeline used in the paper:
1) Use only subject + body text.
2) Remove HTML tags.
3) No stemming, no stop-word removal.
4) Drop tokens that appear in fewer than 5 training messages.
5) Compute Information Gain on Boolean token presence.
6) Keep top 3000 tokens.

Notebook interpretation:
- Step 4 removes brittle noise.
- Steps 5-6 force a fixed, high-signal feature budget.
- The model quality depends as much on this selection policy as on NB itself.

ASCII intuition for Information Gain:

```text
IG(token) ~ how much uncertainty about class drops
            when we know whether token is present

high IG: token presence strongly shifts spam/ham odds
low  IG: token presence tells us almost nothing
```
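The IG ranking in steps 5-6 can be sketched directly from Boolean presence counts. A minimal sketch (helper names `entropy` and `information_gain` are mine, not the paper's):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(n_spam_with, n_ham_with, n_spam, n_ham):
    """IG of a Boolean 'token present' feature for the spam/ham split.

    n_spam_with / n_ham_with: class messages containing the token.
    n_spam / n_ham: total training messages per class.
    """
    total = n_spam + n_ham
    h_class = entropy([n_spam, n_ham])          # uncertainty before the split
    n_with = n_spam_with + n_ham_with
    n_without = total - n_with
    h_with = entropy([n_spam_with, n_ham_with]) if n_with else 0.0
    h_without = entropy([n_spam - n_spam_with, n_ham - n_ham_with]) if n_without else 0.0
    # expected remaining uncertainty once token presence is known
    return h_class - (n_with / total) * h_with - (n_without / total) * h_without
```

A token found in 50 of 50 spam messages and 0 of 50 ham messages scores IG = 1 bit; one split evenly across both classes scores 0. Ranking all surviving tokens by this value and keeping the top 3000 reproduces steps 5-6.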


[ naive bayes variants used ]
-----------------------------
The scoring target is the spam posterior:

```text
score = P(spam | x)
      = [P(spam) * P(x | spam)]
        / ([P(spam)*P(x|spam)] + [P(ham)*P(x|ham)])
```

Decision rule in the paper:

```text
if score > 0.5 => spam
else           => ham
```

In implementation, work in log space:

```text
log P(x|c) becomes sum(...) instead of product(...)
```
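Turning the log-space sums back into the posterior score can be done stably with the log-sum-exp trick. A minimal sketch (function name `spam_posterior` is mine; the paper only specifies the posterior formula above):

```python
import math

def spam_posterior(log_prior_spam, log_lik_spam, log_prior_ham, log_lik_ham):
    """P(spam | x) from log-space scores, avoiding underflow."""
    a = log_prior_spam + log_lik_spam   # log [P(spam) * P(x|spam)]
    b = log_prior_ham + log_lik_ham     # log [P(ham)  * P(x|ham)]
    m = max(a, b)                       # log-sum-exp: factor out the max
    log_denom = m + math.log(math.exp(a - m) + math.exp(b - m))
    return math.exp(a - log_denom)
```

With equal priors and equal likelihoods the score is exactly 0.5, which is the paper's decision boundary; the exponentials never see the raw (tiny) likelihood products.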

---
Variant A: bool-nb
---
Feature value is binary:

```text
x_i = 1 if token i appears in message, else 0
```

Token likelihood estimate in class c:

```text
p(token|c) = (1 + M_token,c) / (2 + M_c)
```

where:
- `M_token,c` = # class-c training messages containing token
- `M_c`       = total # class-c training messages

Reading note: this is very stable under noisy frequency spikes because repeated
occurrences in one email do not inflate evidence.
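The bool-nb likelihood can be sketched in a few lines. Note that with Boolean features, *absent* tokens also contribute evidence via `1 - p`. A minimal sketch (helper names are mine, assuming the smoothed estimate above):

```python
import math

def bool_token_prob(m_token_c, m_c):
    """p(token|c) = (1 + M_token,c) / (2 + M_c), the smoothed Boolean estimate."""
    return (1 + m_token_c) / (2 + m_c)

def bool_nb_loglik(present_tokens, vocab, m_token, m_c):
    """log P(x|c) over the selected vocabulary for Boolean features.

    present_tokens: set of vocab tokens appearing in this message.
    m_token: token -> # class-c training messages containing it.
    m_c: total # class-c training messages.
    """
    loglik = 0.0
    for token in vocab:
        p = bool_token_prob(m_token.get(token, 0), m_c)
        # presence and absence are both informative for Boolean features
        loglik += math.log(p) if token in present_tokens else math.log(1.0 - p)
    return loglik
```

The `(1 + ...) / (2 + ...)` smoothing keeps every estimate strictly inside (0, 1), so the log terms are always finite even for tokens never seen in a class.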

---
Variant B: tf-nb
---
Starts from raw token counts, then applies three transforms:

```text
a) tf transform:          x <- log(tf + 1)
b) idf transform:         x <- x * log(total_docs / docs_with_token)
c) length normalization:  x <- x / ||x||
```

Then multinomial-style likelihood with smoothing:

```text
p(token|c) = (1 + N_token,c) / (m + N_c)
```

where:
- `N_token,c` = total occurrences of token in class-c messages
- `N_c`       = total token occurrences in class c (across selected features)
- `m`         = number of selected features (3000)

Reading note: tf-nb usually wins in their prior experiments, likely because it
preserves graded evidence while still damping raw count explosions.
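The three tf-nb transforms compose into one per-message weighting step. A minimal sketch (function name `tf_idf_normalize` is mine; it assumes the document frequencies come from the training set):

```python
import math

def tf_idf_normalize(tf_counts, total_docs, docs_with_token):
    """Apply the three tf-nb transforms to one message's raw token counts.

    tf_counts: token -> raw count in this message (selected features only).
    docs_with_token: token -> # training docs containing the token.
    """
    # a) tf transform: log-damp raw count explosions
    x = {t: math.log(tf + 1) for t, tf in tf_counts.items()}
    # b) idf transform: down-weight tokens common across the corpus
    x = {t: v * math.log(total_docs / docs_with_token[t]) for t, v in x.items()}
    # c) length normalization: scale to unit L2 norm
    norm = math.sqrt(sum(v * v for v in x.values()))
    return {t: v / norm for t, v in x.items()} if norm > 0 else x
```

After step (c) every message contributes a unit-length weight vector, so long emails cannot dominate the likelihood just by being long.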


[ active learning ]
-------------------
Operational setup: labels for incoming mail are limited. The filter must decide
which messages are worth spending label budget on.

Policy used:
- uncertain messages (scores near 0.5) are more likely to be selected
- very confident messages are less likely
- selection is probabilistic, not deterministic

Why probabilistic selection matters:

```text
pure "pick-most-uncertain-only" can over-focus on one region
randomized weighting around 0.5 keeps exploration alive
```

They tune the uncertainty spread using the allowed label ratio:

```text
K = training_requests_allowed / incoming_messages_to_classify
```

Then they keep re-estimating spread during the stream, so the selection policy
tracks the score distribution as the environment shifts.
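The selection policy can be sketched as a randomized acceptance test peaked at score 0.5. This is an illustrative weighting under the stated constraints (budget ratio K, tunable spread), not the paper's exact formula:

```python
import random

def select_probability(score, spread, k):
    """Probability of requesting a label for a message with this score.

    Peaks at score = 0.5, decays linearly to zero at distance `spread`,
    and is scaled by the label budget ratio k. Illustrative only.
    """
    weight = max(0.0, 1.0 - abs(score - 0.5) / spread)
    return min(1.0, k * weight)

def maybe_request_label(score, spread, k, rng=random):
    """Randomized acceptance: uncertain messages are favored but not
    guaranteed, so exploration outside the 0.5 band stays alive."""
    return rng.random() < select_probability(score, spread, k)
```

Re-estimating `spread` from the recent score distribution (as the notes describe) keeps the expected request rate near K even as the stream drifts.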


[ conclusion and future work ]
------------------------------
Main takeaway from these notes:

```text
Simple classifier + disciplined feature pipeline + adaptive labeling
beats many "looks-advanced" setups in constrained, online settings.
```

For future transfer, a tactical non-cyber domain:

Dynamic quality triage on a manufacturing line.

Mapping:

```text
Email spam filtering                Manufacturing triage
--------------------                --------------------
message                             produced unit / part
spam vs ham                         fail-risk vs pass-risk
tokens/features                     sensor/event features
score near 0.5                      uncertain quality verdict
label budget                        limited manual inspections
active learning                     inspect uncertain units first
```

Why this transfer is plausible:
- same structure: high-volume stream + limited human verification
- same objective: maximize value of each manual label
- same method: keep a lightweight probabilistic model that updates often

[ P.S. ]
--------
A code implementation can be found here.