The Naive Bayes classifier is a workhorse; it does a lot of the classification work that is so ubiquitous in our lives these days. It is easy to understand, easy to code, and easy to scale. Many people have data for classification sitting in some sort of RDBMS, and when you have really big data there is a lot of talk about bringing the analytics to the data, not the data to the analytics; you just don't want to move big data around too much. So in that spirit, here is a multi-class Naive Bayes classifier for discrete features written in SQL. Some people say it can be faster than MapReduce. Other people have tried to patent the idea, but it seems bizarre to patent doing some arithmetic and taking a few logs in a language built for exactly that sort of task.
Suppose we have some data that is roughly organized like this:
#data:
uid, feature, value
1,ax,2
1,gqj,6
1,fd,1
2,tyf,4
3,tyf,3
3,fty,1
...
#classes:
uid, class
1,1
2,2
3,NULL
4,2
...
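If you want to follow along, the samples above suggest temp tables shaped roughly like this (the column names come from the samples; the types are my guesses, not anything dictated by the post):

create table #data(uid bigint, feature varchar(64), value int);
create table #classes(uid bigint, class int);
insert #data values (1,'ax',2), (1,'gqj',6), (1,'fd',1), (2,'tyf',4), (3,'tyf',3), (3,'fty',1);
insert #classes values (1,1), (2,2), (3,NULL), (4,2);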
We have some features, some identifiers (uid), a value for each feature (maybe a word count in a document), and a class that takes several values and is sometimes null. Suppose we also have multiple #data tables, corresponding to data collected over time. The first thing we do is pull all of this together.
We start by creating an empty #pre_agg table with the right columns; the where clause below can never be true, so the select returns zero rows but still defines the schema:
select a.uid, a.feature, a.value, b.class into #pre_agg from #data a inner join #classes b on 1=1 where 'pigs'='flying';
Then we insert the data like this (dynamic SQL comes in handy here if there are many of these tables; see the sketch after these inserts):
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data a inner join #classes b on a.uid=b.uid;
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data2 a inner join #classes b on a.uid=b.uid;
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data3 a inner join #classes b on a.uid=b.uid;
...
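Since the post leans on dynamic SQL for this step, here is one hedged sketch of what that loop could look like, assuming the tables really are named #data2, #data3, and so on (the upper bound of 12 is made up):

declare @i int, @sql nvarchar(max);
set @i = 2;
while @i <= 12
begin
    set @sql = N'insert #pre_agg
        select a.uid, a.feature, a.value, b.class
        from #data' + cast(@i as nvarchar(10)) + N' a
        inner join #classes b on a.uid = b.uid;';
    exec (@sql); -- local temp tables from this session are visible inside the dynamic SQL
    set @i = @i + 1;
end;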
Then we aggregate this table up:
select uid, feature, sum(value) as value, class into #agg from #pre_agg group by uid, feature, class;
Then we aggregate again, this time without the uid, dropping the null class, and counting presence (a 1 for each uid that has the feature) instead of summing the actual value:
select feature, sum(1) as value, class into #agg2 from #agg where class is not null group by feature, class;
Let's also get a list of our classes and features:
select class into #d_classes from #agg2 group by class;
select feature into #d_features from #agg2 group by feature;
Then we need a table with a row for every feature and class combination, so the zero counts that #agg2 leaves out are represented too:
create table #dense_agg(feature varchar(64), class int, value float);
Let's assume a symmetric Dirichlet(0.5) prior, which amounts to seeding every feature and class combination with a pseudo-count of 0.5 (the left join on 1=1 below is just a cross join of features with classes):
insert #dense_agg select a.feature, b.class, value = 0.5 from #d_features a left join #d_classes b on 1=1;
Now let's add the prior to the values we observed in #agg2:
update a
set a.value = a.value + b.value
from #dense_agg a
inner join #agg2 b on a.feature = b.feature and a.class = b.class;
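At this point every (feature, class) pair in #dense_agg holds its observed count plus the 0.5 pseudo-count, and pairs that were never observed are still sitting at 0.5. If you want to eyeball it, an optional spot check looks like:

select top 10 feature, class, value from #dense_agg order by value desc;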
Let's also get the class sizes, along with a column for the grand total:
select class, sum(value) as classtot, b.tot into #class_sizes
from #dense_agg a inner join (select sum(value) as tot from #dense_agg) b on 1=1
group by class, b.tot;
Now we can calculate a feature score for each class:
select a.*, coefficient = log(a.value/(b.value - a.value))-log(classtot/(tot-classtot))
into #coefficients from #dense_agg a
inner join (select feature, sum(value) as value from #dense_agg c group by feature) b on a.feature=b.feature
inner join #class_sizes d on a.class=d.class;
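To spell out what that coefficient is, using the names in the query: a.value is the smoothed count of uids in class c that have feature f, b.value is that count summed over all classes, classtot is the sum of the #dense_agg values for class c, and tot is the grand total, so

coefficient = log( a.value / (b.value - a.value) ) - log( classtot / (tot - classtot) )

The first term is the log odds of class c among the uids that have the feature; the second is a baseline log odds for class c (computed from the feature totals rather than raw class counts), so the coefficient is a centered log odds ratio.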
Now we can score every uid against each class:
select a.uid, a.class as actual, b.class as prediction, score = sum(coefficient)
into #scores from #agg a
inner join #coefficients b on a.feature=b.feature
group by a.uid, a.class, b.class;
And we can choose the winner as our prediction (note that the uids with a NULL class get scored and predicted right along with the labeled ones, which is presumably the point):
select a.uid, actual, prediction, score
from #scores a
inner join (select c.uid, maxscore = max(score) from #scores c group by c.uid) b on a.uid=b.uid and a.score = maxscore;
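One small wrinkle: if a uid ties on the maximum score across two or more classes, the query above returns a row for each of them. If you need exactly one prediction per uid, here is one arbitrary but reproducible tie-break (taking the smallest class id; just a sketch, not the only choice):

select a.uid, a.actual, prediction = min(a.prediction), a.score
from #scores a
inner join (select c.uid, maxscore = max(score) from #scores c group by c.uid) b on a.uid = b.uid and a.score = b.maxscore
group by a.uid, a.actual, a.score;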
There are a lot of choices one can make here. For example, in the #scores table we aren't adjusting for class sizes. We are also using the presence of a feature rather than its value, we aren't doing any feature selection, and we aren't doing reasonable smoothing when the class sizes are drastically different. There is also a reasonable argument to be made about whether the classtot variable in the odds calculation should come from the data with the smoothing done (#dense_agg) or from the observed data (#agg2). There are a lot of improvements that can be made to this algorithm (many of them documented by Jason Rennie in his work as a graduate student), but many of them are peculiar to your data set. I hope to get to some examples in a future post (a quick sketch of the class-size adjustment is at the bottom of this post), but if not, I hope I have gotten someone started. If you found this interesting or helpful, please let me know in the comments.
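Here is that class-size sketch. Since #class_sizes already holds the baseline odds that the coefficients subtracted out, one simple adjustment is to add each class's baseline log odds back into its score once before picking the winner (the #scores_adj name is just something I made up for the sketch, and this is one reasonable choice rather than the definitive fix):

select a.uid, a.actual, a.prediction, score = a.score + log(b.classtot / (b.tot - b.classtot))
into #scores_adj
from #scores a
inner join #class_sizes b on a.prediction = b.class;

Then pick the winner from #scores_adj exactly the same way as above.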