Suppose we have some data that is roughly organized like this:
#data:
uid, feature, value
1,ax,2
1,gqj,6
1,fd,1
2,tyf,4
3,tyf,3
3,fty,1
...
#classes:
uid, class
1,1
2,2
3,NULL
4,2
...
We have some features, some identifiers, a value for each feature (maybe a word count in a document), and a class that takes several values and is sometimes NULL. Suppose we also have multiple #data tables, corresponding to data collected over time. The first thing to do is get all of this together.
We create an empty table with the right columns like this:
select a.uid, a.feature, a.value, b.class into #pre_agg from #data a inner join #classes b on 1=1 where 'pigs'='flying';
Insert the data like this (dynamic SQL comes in handy here; see the sketch after these inserts):
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data a inner join #classes b on a.uid=b.uid;
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data2 a inner join #classes b on a.uid=b.uid;
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data3 a inner join #classes b on a.uid=b.uid;
...
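For what it's worth, a minimal sketch of that dynamic SQL might look like the following, keeping the first insert as written and assuming the remaining snapshots are named #data2 through #data12 (the upper bound of 12 is purely illustrative):
declare @i int, @sql varchar(1000);
select @i = 2;
while @i <= 12
begin
    -- build and run the same insert for each snapshot table
    select @sql = 'insert #pre_agg select a.uid, a.feature, a.value, b.class '
                + 'from #data' + cast(@i as varchar(10)) + ' a '
                + 'inner join #classes b on a.uid = b.uid';
    exec(@sql);
    select @i = @i + 1;
end;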
Then we aggregate this table up:
select uid, feature, sum(value) as value, class into #agg from #pre_agg group by uid, feature, class;
Then we aggregate again without uids, drop the NULL class, and count 1s (feature presence) instead of summing the actual value:
select feature, sum(1) as value, class into #agg2 from #agg where class is not null group by feature, class;
Let's also get a list of our classes and features:
select class into #d_classes from #agg2 group by class;
select feature into #d_features from #agg2 group by feature;
Then we need a table with every (feature, class) pair and a starting value, so we have a denser table than #agg2, one that includes the zeros:
create table #dense_agg(feature varchar(100), class int, value float);
Let's assume a Dirichlet prior with every parameter equal to .5:
insert #dense_agg select a.feature, b.class, value = 0.5 from #d_features a left join #d_classes b on 1=1;
Now let's add the prior to the values we observed in #agg2:
update #dense_agg
set value = a.value + b.value
from #dense_agg a
inner join #agg2 b on a.feature = b.feature and a.class=b.class;
Let's also get the class sizes and a column for total:
select class, sum(value) as classtot, b.tot into #class_sizes
from #dense_agg a inner join (select sum(value) as tot from #dense_agg) b on 1=1
group by class, b.tot;
Now we can calculate a coefficient for each feature and class: the log odds of the class among the smoothed counts for that feature, minus the log odds of the class overall:
select a.*, coefficient = log(a.value/(b.value - a.value))-log(classtot/(tot-classtot))
into #coefficients from #dense_agg a
inner join (select feature, sum(value) as value from #dense_agg c group by feature) b on a.feature=b.feature
inner join #class_sizes d on a.class=d.class;
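To make the coefficient concrete with some made-up numbers: if a feature has a smoothed count of 30 in class 1 and 70 across the other classes, while class 1 accounts for 400 of the 1000 total counts, its coefficient for class 1 is log(30/70) - log(400/600), or about -0.85 + 0.41 = -0.44, so having that feature counts against class 1.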
Now we can score every uid against each class by summing the coefficients of the features it has:
select a.uid, a.class as actual, b.class as prediction, score = sum(coefficient)
into #scores from #agg a
inner join #coefficients b on a.feature=b.feature
group by a.uid, a.class, b.class;
And we can choose the winner as our prediction:
select a.uid, actual, prediction, score
from #scores a
inner join (select c.uid, maxscore = max(score) from #scores c group by c.uid) b on a.uid=b.uid and a.score = maxscore;
There are a lot of choices one can make here. For example, in the #scores table we aren't adjusting for class sizes. We are using the presence of a feature rather than its value. We aren't doing any feature selection, and we aren't doing sensible smoothing when the class sizes are drastically different. There is also a reasonable argument about whether the classtot variable in the odds calculation should come from the smoothed data (#dense_agg) or the observed data (#agg2). Many improvements can be made to this algorithm (a number of them documented by Jason Rennie in his work as a graduate student), but many are peculiar to your data set. I hope to get to some examples in a future post, but if not, I hope I have gotten someone started. If you found this interesting or helpful, please let me know in the comments.
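As a rough sketch of the first of those points (the #adjusted_scores name is just made up here), one could add each class's log prior odds, taken from #class_sizes, to the summed coefficients before picking the winner:
select a.uid, a.actual, a.prediction,
adjusted_score = a.score + log(b.classtot/(b.tot - b.classtot)) -- log prior odds of the predicted class
into #adjusted_scores
from #scores a
inner join #class_sizes b on a.prediction = b.class;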
Hi. Very nice code. A question, though: you set the prior value to 0.5 in #dense_agg, then add the summed counts from #agg2, so you end up adding a probability to some much bigger integers. I can't understand the meaning of this; it looks like the prior probability is not properly taken into account here.
Thanks for your comments! The .5 doesn't correspond to a probability, but to the Beta (or Dirichlet, if there are more than two categories) prior. A Beta(.5, .5) is the reference prior for the binomial distribution, so it is a good choice. However, I prefer to choose a prior that is related to the class probabilities, so if a class occurs 10% of the time, I might add .1 to those class counts and .9 to the rest. If I want more smoothing, I would add perhaps 10 and 90 in that step.
For events that happen frequently, these choices don't matter very much: 1000.1/2001 is about the same as 1000/2000. But when only 1 out of 2 examples is in the class, 1.1/3 is much closer to the underlying rate of .1 than 1/2 is. This is the same intuition as putting a Beta prior on a binomial distribution. I would be happy to discuss further if it isn't clear.
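To make that concrete, here is a rough sketch (using the tables from the post; the aliases p, tot, and c are made up here) of swapping that class-proportional prior in for the flat .5 at the #dense_agg step, with the proportions taken from the labeled rows of #classes:
insert #dense_agg
select a.feature, b.class, value = b.p -- each cell starts at its class's share of the labeled uids
from #d_features a
inner join (select class, p = count(*)*1.0/c.tot
            from #classes
            inner join (select tot = count(*) from #classes where class is not null) c on 1=1
            where class is not null
            group by class, c.tot) b on 1=1;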