Some ramblings about machine learning and econometrics

Sunday, July 17, 2011

Blog Endorsement and Ethics Rant

From what I can tell, Annie Pettit loves her job in market research methodology in the online space, just like me. She is a former VP at IPSOS and now works for a firm called Research Now. Judging from a recent post of hers, she thinks we all should love ours too. She has some key insights about market research and stats in the online space; I rarely see something I disagree with. Her posts aren't the technical, cutting-edge sort of thing I usually read, like Healthy Algorithms or the Gelman blog. They are good, plain-spoken highlights of industry fundamentals, and they have become a motivational tool for me and others.

She seems to be passionate about ethics, as she blogs about it occasionally (e.g. http://lovestats.wordpress.com/2011/02/04/top-9-lame-excuses-to-behave-unethically-mrx/). I am big on this topic too, and we need it badly in market research. I have seen someone advertise that they can fit a model to "insure [sic] the test group outperforms the control group." For me, that is exhibit A. Our role in methodology, insights, analytics, statistics, or whatever you call it is not to pull levers until we get something that agrees with what was already decided before all the math and coding began. This happens all too often at my organization. I see a lot of people tending toward "story time," as described by Andrew Gelman and Kaiser Fung, in which you run some stats procedure, maybe find an unexpected correlation or a step change, and just start weaving explanations from thin air as to the cause and call it a day. Those explanations have a special name in this field: hypotheses. At that point you have only begun your analysis; those new hypotheses need methodologies constructed to test them as well. Don't leave out the highly relevant interaction term because it is hard to explain; don't leave that covariate out of your matching analysis because it makes the treatment effect look smaller; don't commission market research just to pat yourself on the back when your campaigns or product ideas look good, then bury it when it suggests your efforts aren't as effective as you'd hoped. Without an ethical approach to analytics, it is just black boxes and wizardry to observers and will always have a reputation as such.


As discussed in many other places, the Scottish author Andrew Lang once said of an unsophisticated forecaster, "He uses statistics as a drunken man uses lamp-posts... for support rather than illumination." Don't be that guy.

Thanks for your blog, Annie; I hope we cross paths professionally one day. We need more evangelists for ethics and methodology in our space.

Friday, July 15, 2011

Naive Bayes in SQL

The naive Bayes classifier is a workhorse; it does a lot of the classification work so ubiquitous in our lives these days. It is easy to understand, easy to code, and easy to scale. Many people have data for classification sitting in some sort of RDBMS, and when you have really big data there is a lot of talk about bringing the analytics to the data rather than the data to the analytics: you just don't want to move big data around too much. So in that spirit, here is a version of a multi-class naive Bayes classifier for discrete features, written in SQL. Some people say this approach can be faster than map-reduce. Other people have tried to patent the idea, but it seems bizarre to patent doing some arithmetic and taking a few logs in a programming language built for that sort of task.
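For reference, the textbook decision rule behind all of this (standard notation, nothing specific to the SQL below) is: pick the class c that maximizes

log P(c) + sum over features f of n_f * log P(f|c)

where n_f is the count of feature f. The SQL below works with log-odds versions of these quantities, and uses feature presence rather than counts, but it is the same additive scoring idea.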



Suppose we have some data that is roughly organized like this:

#data:
uid, feature, value
1,ax,2
1,gqj,6
1,fd,1
2,tyf,4
3,tyf,3
3,fty,1
...

#classes
uid, class
1,1
2,2
3,NULL
4,2
...

We have some features, some identifiers, a value for each feature (maybe a word count in a document), and the class, which takes several values and is sometimes null. Suppose we also have multiple #data tables, corresponding to data collected over time. The first thing we do is pull this stuff together:

We create an empty table with the right schema like this (the impossible where clause guarantees no rows are copied over):
select a.uid, a.feature, a.value, b.class into #pre_agg from #data a cross join #classes b where 'pigs'='flying';

Then we insert the data like this (dynamic SQL comes in handy here; see the sketch after these statements):
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data a inner join #classes b on a.uid=b.uid;
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data2 a inner join #classes b on a.uid=b.uid;
insert #pre_agg select a.uid, a.feature, a.value, b.class from #data3 a inner join #classes b on a.uid=b.uid;
...
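Here is a hedged sketch of the dynamic SQL idea, assuming the tables really are named #data2, #data3, ... up to some known count (temp tables created in the session are visible inside sp_executesql):

declare @i int, @sql nvarchar(max);
set @i = 2;
while @i <= 12  -- assume we know there are 12 #data tables
begin
  set @sql = 'insert #pre_agg '
           + 'select a.uid, a.feature, a.value, b.class '
           + 'from #data' + cast(@i as varchar(10)) + ' a '
           + 'inner join #classes b on a.uid = b.uid;';
  exec sp_executesql @sql;
  set @i = @i + 1;
end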

Then we aggregate this table up:
select uid, feature, sum(value) as value, class into #agg from #pre_agg group by uid, feature, class;



Then we aggregate this table up again without uids, dropping the null (unlabeled) class and counting 1's (feature presence) instead of summing the actual values:
select feature, sum(1) as value, class into #agg2 from #agg where class is not null group by feature, class;


Let's also get a list of our classes and features:
select class into #d_classes from #agg2 group by class;
select feature into #d_features from #agg2 group by feature;



Then we need a table with every (feature, class) pair and a starting value, so that, unlike #agg2, it is dense: it has rows (the zeros) even for pairs that were never observed together:

create table #dense_agg(feature varchar(255), class int, value float); -- feature is a string like 'ax', not a bigint; T-SQL spells double as float; 255 is an arbitrary width

Let's assume a symmetric Dirichlet(0.5) prior:
insert #dense_agg select a.feature, b.class, value = 0.5 from #d_features a cross join #d_classes b;
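In effect this is just additive smoothing: every (feature, class) cell starts at 0.5 and the observed counts get added on top, so each smoothed count is n_fc + 1/2, the pseudo-count a symmetric Dirichlet(1/2) prior gives you. It also keeps the logs below from blowing up on feature-class pairs that were never observed together.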

Now let's add the prior to the values we observed in #agg2:
update a
 set value = a.value + b.value
 from #dense_agg a
inner join #agg2 b on a.feature = b.feature and a.class=b.class;


Let's also get the class sizes and a column for total:
select class, sum(value) as classtot, b.tot into #class_sizes
from #dense_agg a cross join (select sum(value) as tot from #dense_agg) b
group by class, b.tot;



Now we can calculate a feature score for each class:
select a.*, coefficient = log(a.value/(b.value - a.value))-log(classtot/(tot-classtot))
into #coefficients from #dense_agg a
inner join (select feature, sum(value) as value from #dense_agg c group by feature) b on a.feature=b.feature
inner join #class_sizes d on a.class=d.class;
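In plain-text math, each row of #coefficients is coefficient(f, c) = log[n_fc / (n_f - n_fc)] - log[N_c / (N - N_c)], where n_fc is the smoothed count of feature f in class c, n_f is its smoothed total across classes, N_c is the class total, and N is the grand total. That is the log odds that an occurrence of f belongs to class c minus the baseline log odds of class c, so positive coefficients mean the feature is evidence for the class.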


Now we can score out all the persons for each class:

select a.uid, a.class as actual, b.class as prediction, score = sum(coefficient)
into #scores from #agg a
inner join #coefficients b on a.feature=b.feature
group by a.uid, a.class, b.class;

And we can choose the winner as our prediction:

select a.uid, actual, prediction, score
from #scores a
inner join (select c.uid, maxscore = max(score) from #scores c group by c.uid) b on a.uid=b.uid and a.score = maxscore;
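One caveat: the max-score join returns multiple rows for a uid when two classes tie exactly. A hedged alternative using row_number() (SQL Server 2005+), breaking ties arbitrarily:

select uid, actual, prediction, score
from (select s.*, rn = row_number() over (partition by uid order by score desc) -- rn = 1 marks the top class per uid
      from #scores s) a
where rn = 1;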


There are a lot of choices one can make here. For example, in the #scores table we aren't adjusting for class sizes (see the sketch below). We are also using the presence of a feature rather than its value, we aren't doing any feature selection, and we aren't doing reasonable smoothing when the class sizes are drastically different. There is a reasonable argument about whether the classtot variable in the odds calculation should come from the smoothed data (#dense_agg) or the observed data (#agg2). There are many improvements that can be made to this algorithm (several of them documented by Jason Rennie in his work as a graduate student), but many are peculiar to your data set. I hope to get to some examples in a future post, but if not, I hope I have gotten someone started. If you found this interesting or helpful, please let me know in the comments.
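As a hedged sketch of that first improvement, here is one way to add each class's log prior odds back into the score before picking the winner, reusing #class_sizes from earlier (#scores_adj is just an illustrative name):

select a.uid, a.actual, a.prediction,
       score = a.score + log(b.classtot/(b.tot - b.classtot)) -- add the baseline log odds of the predicted class
into #scores_adj
from #scores a
inner join #class_sizes b on a.prediction = b.class;

You would then pick the winner from #scores_adj exactly as above.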

Inspiration

Nathan Yau makes a great point over on FlowingData: learn to code about data, and you can only get better at analytics. This wasn't some earth-shattering insight; I already felt that way. It was, however, the final impetus for this blog. I have been learning to code about data for a few years now, and I want to share some of the things I have learned and get some feedback. I mostly code in R and Python, and I am interested in econometrics, Bayesian statistics, big data, and machine learning. You can see some of the stuff I read over in the blog roll.