Working with Bayesian Categorizers
by Jon Udell
November 19, 2003
Every once in a while, but never as often as you'd wish, a technology
comes along that profoundly improves your life. For me the most recent
example of such a thing is SpamBayes. More specifically, it's the
combination of the SpamBayes engine, an open source email categorizer
with some innovative twists on Paul Graham's original plan, and Mark
Hammond's Outlook add-in, which smoothly integrates training with normal
use of the email client.
Months ago I
wrote about how SpamBayes has solved my spam problem more effectively
than I thought a pure content-based filter could. Time was the ultimate
test, though. Would this razor lose its edge? It hasn't. Every day I
sharpen it. A frightening majority of my messages -- on the order of
several hundred per day -- are clearly spam. A depressing minority --
maybe fifteen or twenty per day -- are clearly ham (i.e., not-spam). And
then there's a scattering of in-betweens, messages that SpamBayes can't
confidently classify one way or the other. It routes these to a MaybeSpam
folder and offers buttons for two actions: Delete as Spam, Recover from
Spam.
This arrangement is a wonderful example of the kind of synergy that's
possible between an automated assistant and a human overseer. Although I
still review my Spam folder for false positives, it's devolved into a
routine exercise that never requires thought and only very rarely requires
action. Reviewing the MaybeSpam folder, on the other hand, always requires
thought and action. In theory that should be a dead-end trip down the path
of least resistance. In practice it isn't, for two reasons. First, it's
not much effort to click one or the other of the buttons a handful of
times a day. Second, and more profoundly, the classification puzzles that
SpamBayes presents in my MaybeSpam folder are interesting. Only sometimes
do I think: "Why was it confused about that?" More often I think: "Yup, I
can see it both ways."
If it were only a matter of the typical body-part-enlargement and
Nigerian-scam messages that everyone gets, there wouldn't be so much grey
area. But as a high-tech journalist I'm the target of lots of
tech-oriented promotional email too. I want (actually, need) to prioritize
the tech-oriented messages that are interesting to me and suppress the
ones that aren't. It's a subtle discrimination because on a given day
messages in both categories might arrive from the same legitimate sender,
with the same trail of SMTP headers. By making a choice between Delete as
Spam and Recover from Spam I teach SpamBayes about my interests and
non-interests in a way that would be quite difficult to articulate,
particularly since my interests and non-interests change over time.
Categorizing blog content
There's been some discussion in the blog world about using a Bayesian
categorizer to enable a person to discriminate along various
interest/non-interest axes. I took a run at this recently and, although my
experiments haven't been wildly successful, I want to report them because
I think the idea may have merit.
For starters, I looked for tools that would enable me to train and
test a categorizer. I found two that were pretty easy to work with. The
first is called Bow
(for Bag of Words), a code library for statistical language modeling
written by Carnegie Mellon's Andrew McCallum. Rainbow, a
program based on the Bow library, can train and test Bayesian (or other)
classifiers. This software is widely available; I used fink to install Bow
and Rainbow on Mac OS X.
The second tool was Ken Williams' Perl CPAN module, AI::Categorizer,
which relies on Williams' Algorithm::NaiveBayes and also on Benjamin
Franz and Jim Richardson's Lingua::Stem, a framework for word stemming
that's localized for a few
different languages. With all these dependencies the formidable CPAN
installer had to chug for a while, but in the end it succeeded.
In order to test both tools on the same dataset, I decided to let
Rainbow take the lead and adapt AI::Categorizer to Rainbow's
directory-oriented style. So I started with a directory called
$HOME/train, and created per-category subdirectories under it.
For test data I started with a set of items from my weblog content. I
keep a single
XML file containing the entries written since I began enforcing a
strict XHTML discipline -- about 150 so far. I use the file for XPath-based
search, but it's handy for other things too. For this experiment, I
wrote a small Python script to break out the entries into individual XHTML
files, using the titles of the entries (which are long and descriptive) as
filenames. This arrangement made it pretty easy to review entries in the
Finder, read them in a browser when the titles weren't sufficiently
descriptive, and copy them into appropriate subdirectories under the
training directory.
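The splitting script itself isn't shown here. A minimal sketch of the idea in Python might look like the following. It assumes, purely for illustration, that each entry is an item element with a title child and a body child holding the XHTML; those element names, the function names, and the filename rule are hypothetical, not taken from the actual script.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def safe_filename(title):
    """Turn a descriptive entry title into a filesystem-friendly name."""
    # Collapse every run of characters other than letters, digits, '-' or '_'.
    name = re.sub(r"[^A-Za-z0-9_-]+", "_", title.strip())
    return name.strip("_") + ".html"


def split_entries(xml_text, out_dir):
    """Write one XHTML file per entry and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for item in ET.fromstring(xml_text).iter("item"):  # hypothetical element name
        title = item.findtext("title", default="untitled")
        body = item.find("body")                        # hypothetical element name
        html = ET.tostring(body, encoding="unicode") if body is not None else ""
        path = out / safe_filename(title)
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written
```

Because the titles become the filenames, a directory listing doubles as a table of contents, which is what makes sorting entries into category subdirectories by hand tolerable.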
After classifying a first batch of entries, I initialized Rainbow's
training database like so:
bash$ rainbow -H -i $HOME/train/*
Class `blogging'
Gathering stats... files : unique-words :: 12 : 1743
Class `books'
Gathering stats... files : unique-words :: 4 : 1945
...
The -H argument tells Rainbow to skip HTML tokens, and -i tells it to
index the specified set of subdirectories. Subsequently, I followed this
procedure:
1. Copy a new entry to the category (i.e. subdirectory) where I thought it belonged.
2. Test to see if Rainbow predicted that category.
3. Retrain.
For example, I fed Rainbow the contents of this column as I was
writing it. The category I would have picked for it is
data_management. After naming my working draft of the
column-in-progress 'autoCategorize.html' -- and copying it to
$HOME/train/data_management/autoCategorize.html -- I ran this command:
rainbow -x $HOME/train | grep autoCategorize.html
That produced a classification for each of the files under the
training directory; grepping for autoCategorize.html isolated just the
classification of that file:
/Users/jon/train/data_management/autoCategorize.html
data_management rss:0.9999999395 data_management:6.043669638e-08
email:1.076055135e-11 swdev:4.499116623e-12 blogging:7.956561148e-18
markup:1.973268953e-26 services:8.758472658e-34 identity:0 calendaring:0
browser:0 opensource:0 voice_video_communication:0 collaboration:0
security:0 networking:0 os:0 people:0 hci:0 books:0 vm:0 policy:0 zope:0
location:0
Since the system wasn't yet trained on the file, this result was a
prediction. It says that the three most likely categories are rss,
data_management, and email. The first, rss, is a poor result. My choice,
data_management, came second. Given the discussion of SpamBayes in this
column, it seems reasonable that the email category came third. As for the
remaining categories seen as having some relationship to this column, the
connections are weak but plausible. Conversely the categories scoring zero
are plausibly unrelated to this column, with the exception of the
opensource category which, ideally, SpamBayes would have triggered. Given
that there were only 150 documents in the training set at the time, spread
across 23 categories, I'm sure it's unreasonable to expect better
precision. My SpamBayes database, by contrast, has only two categories to
worry about and has thousands of samples in each category.
Next I retrained the system to incorporate the new file (rainbow
-H -i $HOME/train/*) and reran the classification dump. Now the
results for autoCategorize.html looked like this:
/Users/jon/train/data_management/autoCategorize.html
data_management data_management:1 rss:0 email:0 services:0 swdev:0
blogging:0 markup:0 identity:0 calendaring:0 browser:0 collaboration:0
opensource:0 voice_video_communication:0 networking:0 os:0 security:0
people:0 hci:0 books:0 policy:0 vm:0 zope:0 location:0
In other words, the system now knows unambiguously how I would prefer
to classify this column. The frequencies of words appearing in this column
will influence future classification.
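To make concrete how word frequencies drive a prediction, here is a minimal multinomial naive Bayes sketch in Python. It is not Rainbow's implementation, which offers several methods and its own smoothing. The function names, the add-one smoothing, and the whitespace tokenizer are assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (category, text). Returns (word counts, document counts)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in docs:
        word_counts[category].update(text.lower().split())
        doc_counts[category] += 1
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return {category: posterior probability} using add-one (Laplace) smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # category prior
        for word in text.lower().split():
            # Smoothed estimate of P(word | category).
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[category] = score
    # Normalize in log space so that long documents don't underflow.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}
```

Every word in a document multiplies in another per-category ratio, so with a long document the winning category's share is pushed toward 1 and the others toward 0. That is consistent with the lopsided scores shown above.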
As you test and classify files one at a time, you get a general sense
of how you're doing, but Rainbow can provide a more explicit
scoreboard. For example, this command:
rainbow --test-percentage 50 --test 1 | rainbow_stats
asks Rainbow to run a single iteration (--test 1) of a
test that randomly chooses half the categorized files
(--test-percentage 50). The report looks like this:
Correct: 24 out of 73 (32.88 percent accuracy)
- Confusion details, row is actual, column is predicted
classname 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 :total
0 blogging 3 . . . . 2 1 . . . 1 . . . . . . . . . . . . : 7 42.86%
1 books . . . . . 1 . . . . . . . . . . . . . 2 . . . : 3 0.00%
2 browser . . . . . 4 . . . . . . . . . . . . . . . . . : 4 0.00%
3 calendaring . . . 1 . 1 . . . . . . . . . . . . . . . . . : 2 50.00%
4 collaboration . . . . 1 1 . . . . . . . . . . . . . . . . . : 2 50.00%
5 data_management . . . . . 3 . . 1 . 2 . . . . . 2 . . . . . . : 8 37.50%
6 email . . . . . 1 1 . . . . . . . . . 1 . . . . . . : 3 33.33%
7 hci . . . . . . . . . . . . . . . . . . . 2 . . . : 2 0.00%
8 identity . . . . . . . . 1 . . . . . . . 1 . 1 . . . . : 3 33.33%
9 location . . . . . . . . . . . . . . . . . . . . . . . : .
10 markup . . . . . 3 . . . . 1 . . . . . . . . . . . . : 4 25.00%
11 networking 1 . . . . 1 . . . . . . . . . . . . . . . . . : 2 0.00%
12 opensource . . . . . . . . . . . . 1 . . . . . . 1 . . . : 2 50.00%
13 os . . . . . 1 . . . . . . . . . . . . . 1 . . . : 2 0.00%
14 people 1 . . . . . . . . . . . . . . . . . . . . . . : 1 0.00%
15 policy . . . . . . . . . . . . . . . . 1 . . . . . . : 1 0.00%
16 rss . . . . . . . . . . . . . . . . 6 . . . . . . : 6 100.00%
17 security . . . . . . . . . . . . . . . . . 1 . 1 . . . : 2 50.00%
18 services . . . . . 1 . . . . . . . . . . 1 1 . 4 . . . : 7 0.00%
19 swdev . . . . . 1 . . . . 2 . 1 . . . . . . 4 . . . : 8 50.00%
20 vm . . . . . . . . . . . . . . . . . . . 1 . . . : 1 0.00%
21 voice_video_communication 1 . . . . . . . . . . . . . . . . . . . . 1 . : 2 50.00%
22 zope . . . . . . . . . . . . . . . . . . . 1 . . . : 1 0.00%
At this point in the training, the picture varied a lot from one run
to the next because of the small sample size. But it was already possible
to see which categories were well-defined and which were blurry. In this
run, 7 documents from the blogging category were tested. Three were
correctly predicted to belong in category 0 (blogging), two incorrectly
but plausibly in category 5 (data_management), and one each incorrectly but
not implausibly in categories 6 (email) and 10 (markup). By contrast, category
16 (rss) was consistently well-defined across runs.
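The headline figure and the per-row percentages in the report can be recomputed from the diagonal and the ":total" column. The following Python sketch does that. The (correct, tested) pairs are read off the report above; the names RESULTS, overall_accuracy, and per_class_accuracy are mine, not part of Rainbow.

```python
# (correctly predicted, tested) per category, taken from the diagonal
# and the ":total" column of the confusion matrix above.
RESULTS = {
    "blogging": (3, 7), "books": (0, 3), "browser": (0, 4),
    "calendaring": (1, 2), "collaboration": (1, 2),
    "data_management": (3, 8), "email": (1, 3), "hci": (0, 2),
    "identity": (1, 3), "location": (0, 0), "markup": (1, 4),
    "networking": (0, 2), "opensource": (1, 2), "os": (0, 2),
    "people": (0, 1), "policy": (0, 1), "rss": (6, 6),
    "security": (1, 2), "services": (0, 7), "swdev": (4, 8),
    "vm": (0, 1), "voice_video_communication": (1, 2), "zope": (0, 1),
}


def overall_accuracy(results):
    """Return (correct, tested, percent correct) across all categories."""
    correct = sum(c for c, _ in results.values())
    tested = sum(t for _, t in results.values())
    return correct, tested, 100.0 * correct / tested


def per_class_accuracy(results):
    """Percent of each category's test documents predicted correctly.

    Categories with no test documents (location, in this run) are skipped.
    """
    return {name: 100.0 * c / t for name, (c, t) in results.items() if t}


correct, tested, pct = overall_accuracy(RESULTS)
print(f"Correct: {correct} out of {tested} ({pct:.2f} percent accuracy)")
```

Note that each row percentage only says how often documents of that category were recognized. It says nothing about how often other documents were wrongly assigned to it, which is what the columns show.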
You'd expect that the acronym RSS, appearing most frequently in the
articles I've assigned to that category, would carry a lot of weight. This
command, which shows which words most influence the category, confirms that
it does:
bash$ rainbow --print-word-weights rss | sort -r
0.0016115852 rss
0.0003317731 feed
-0.0069265488 xml
-0.0032424503 http
...
However, the acronym RSS also shows up frequently in documents
assigned to other categories:
bash-2.05a$ rainbow --print-word-counts rss | sort -r
96 / 2706 ( 0.03548) rss
16 / 3104 ( 0.00515) blogging
12 / 3436 ( 0.00349) data_management
7 / 1582 ( 0.00442) markup
6 / 1131 ( 0.00531) calendaring
4 / 1989 ( 0.00201) email
2 / 933 ( 0.00214) collaboration
2 / 775 ( 0.00258) browser
1 / 3613 ( 0.00028) swdev
1 / 2855 ( 0.00035) services
1 / 1969 ( 0.00051) security
1 / 1082 ( 0.00092) voice_video_communication
1 / 691 ( 0.00145) networking
1 / 509 ( 0.00196) books
And sure enough, if we check which words most influence the blogging
and data_management categories, rss ranks highly among them:
bash$ rainbow --print-word-weights blogging | sort -r | more
-0.0036468446 rss
-0.0033331176 time
-0.0032126101 people
-0.0029204316 blog
-0.0027708460 web
...
bash$ rainbow --print-word-weights data_management | sort -r | more
-0.0069353367 xml
-0.0034907475 web
-0.0031235600 data
-0.0030926130 rss
-0.0029832243 time
...
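The parenthesized figures in the word-count listing are simply occurrences divided by the total words in each category. A small Python sketch, using five rows copied from that listing, shows how much more concentrated the word is in the rss category than elsewhere. The function names rates and lift are illustrative only.

```python
# (occurrences of "rss", total words) for five categories, copied from
# the --print-word-counts output above.
COUNTS = {
    "rss": (96, 2706),
    "blogging": (16, 3104),
    "data_management": (12, 3436),
    "markup": (7, 1582),
    "calendaring": (6, 1131),
}


def rates(counts):
    """Relative frequency of the word within each category."""
    return {cat: hits / total for cat, (hits, total) in counts.items()}


def lift(counts, target):
    """How many times more frequent the word is in `target` than in each other category."""
    r = rates(counts)
    return {cat: r[target] / rate for cat, rate in r.items() if cat != target}


for cat, ratio in sorted(lift(COUNTS, "rss").items(), key=lambda kv: kv[1]):
    print(f"{cat:16s} {ratio:5.1f}x")
```

For these five rows, the rate in the rss category is more than six times the rate in any of the others. The word does turn up across categories, but nowhere near as densely.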
With my 150-document database, the overall accuracy scores ranged from
about 20% to about 40%. That seems terrible compared to SpamBayes. Of
course, the judgment as to whether the best category for an item is
blogging or data_management is much fuzzier than the spam/not-spam
determination. Even with many more samples, I'm not sure the accuracy
score would improve by much. Still, there's some correlation going on
here, and I suspect that it could provide some benefits. For an author who
manually categorizes items, automatic categorization -- even if overridden
-- can help clarify the boundaries of emergent categories. For a reader
it could be a way to overlay a personal taxonomy on inbound items. But
without an application that makes training and testing a seamless part of
the blogging experience, it's hard to say whether or not these scenarios
will really pan out.
Using AI::Categorizer
A different experiment reinforces that conclusion. Here's a Perl
script that uses AI::Categorizer to classify the columns I've written for
the O'Reilly Network, using the same set of training files as before.
#! perl -w
use strict;
use AI::Categorizer;
use AI::Categorizer::KnowledgeSet;
use LWP::Simple;
my $traindir = '/Users/jon/train';
my @oracols = (
'http://www.oreillynet.com/lpt/a/52',
'http://www.openp2p.com/lpt/a/1351',
'http://www.xml.com/lpt/a/ws/2002/01/01/topic_map.html',
'http://www.xml.com/lpt/a/ws/2002/03/01/udell.html',
'http://www.xml.com/lpt/a/ws/2002/04/01/outlining.html',
'http://www.xml.com/lpt/a/ws/2002/05/03/udell.html',
'http://www.xml.com/lpt/a/ws/2002/06/04/udell.html',
'http://www.xml.com/lpt/a/ws/2002/07/09/udell.html',
'http://www.xml.com/lpt/a/ws/2002/08/02/flashcomm.html',
'http://www.xml.com/lpt/a/ws/2002/09/03/udell.html',
'http://www.oreillynet.com/lpt/a/2767',
'http://www.oreillynet.com/lpt/a/2889',
'http://www.xml.com/lpt/a/ws/2002/12/09/udell.html',
'http://www.xml.com/lpt/a/ws/2003/01/13/udell.html',
'http://www.xml.com/lpt/a/ws/2003/02/11/udell.html',
'http://www.xml.com/lpt/a/ws/2003/03/04/spring.html',
'http://www.xml.com/lpt/a/ws/2003/04/15/semanticblog.html',
'http://www.xml.com/lpt/a/ws/2003/05/13/email.html',
'http://www.xml.com/lpt/a/ws/2003/06/10/xpathsearch.html',
'http://www.xml.com/lpt/a/2003/07/09/udell.html',
'http://www.xml.com/lpt/a/2003/08/13/udell.html',
'http://www.xml.com/lpt/a/2003/09/17/udell.html',
'http://www.xml.com/lpt/a/2003/10/08/udell.html',
);
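# Read every html file under $traindir/<category>/ and return a hash ref
# mapping file name to { categories => [...], content => tag-stripped text }.
sub training_docs
{
    opendir (D, $traindir) or die "can't open $traindir: $!";
    my @l = grep (! /^\./, readdir(D));
    closedir (D);
    my $ret = {};
    foreach my $cat (@l)
    {
        my $d = "$traindir/$cat";
        opendir (D, $d) or next;    # skip anything that isn't a directory
        foreach my $f (grep (/html/,readdir(D)))
        {
            open (F, "$d/$f") or die "can't read $d/$f: $!";
            my $content = join('',<F>);
            $content =~ s/<[^>]+>//g;    # crude tag stripping
            close F;
            $ret->{$f} = {
                categories => [$cat],
                content => $content
            };
        }
        closedir (D);
    }
    return $ret;
}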
my $docs = training_docs();
my $c = new AI::Categorizer(collection_weighting => 'f');
# Add each training document to the knowledge set, then train the learner.
while (my ($name, $data) = each %$docs)
    { $c->knowledge_set->make_document(name => $name, %$data) }
$c->learner->train( knowledge_set => $c->knowledge_set );
# Fetch each column, strip its markup, and print the best-scoring category.
foreach my $d (@oracols)
{
    my $content = get $d;
    next unless defined $content;    # skip columns that fail to download
    my ($title) = $content =~ m#<title>\s*([^<]+)\s*</title>#;
    $title = $d unless defined $title;    # fall back to the URL
    $title =~ s/ /_/g;
    $content =~ s/<[^>]+>//g;
    my $doc = AI::Categorizer::Document->new ( content => $content );
    my $h = $c->learner->categorize( $doc );
    printf ( qq(%20s | <a href="%s">%s</a>\n), $h->best_category, $d, $title);
}
In the output, I've boldfaced the categories that I would have chosen:
blogging | Peer and Web Services are Technologies of Connection and Coordination
data_management | Googling Your Email
data_management | Speakable Web Services
data_management | Three Faces of XML in Zope
rss | The Document is the Database
rss | XSLT Recipes for Interacting with XML Data
data_management | Language Instincts
markup | Interactive Microcontent
data_management | Quick and Dirty Topic Mapping
blogging | Jon Udell: Radio UserLand 8.0 Is a Lab for Group-Forming
data_management | Jon Udell: Instant Outlining, Instant Gratification
blogging | Blogspace Under the Microscope
blogging | Seeing and Tuning Social Networks
identity | Control Your Identity or Microsoft and Intel Will
data_management | Scripting Collaborative Applications with Flash Communication Server MX
blogging | Interaction Design and Agile Methods
data_management | Scripting Groove Web Services
services | Services and Links
blogging | Applied Network Theory
services | Think Spring
data_management | The Semantic Blog
email | Using Python, Jython, and Lucene to Search Outlook Email
rss | Structured Writing, Structured Search
Not bad, but not great either. Subtracting the hits, here's how I'd
have classified the misses:
collaboration |
email |
services |
zope |
data_management |
swdev |
markup |
collaboration |
collaboration |
swdev |
services |
collaboration |
markup |
markup |
It wasn't hard to train the system on these choices. I tweaked the
script to save the files locally, again using the HTML doctitles to create
descriptive names. Then it took just a minute's worth of shuffling between
two Finder windows to do the training. Clearly, though, this awkward
procedure fails the test of normal use.
We know that autocategorization succeeds in the narrow domain of spam
filtering. Whether it can succeed more generally -- for example, by
helping blog authors and readers manage flows of items -- is not yet
clear. The raw tools are available, but until they're well integrated
into authoring and reading software, it will be hard to get a good sense
of what's possible.