Twitter has been causing me some problems of late, quite apart from the fact that it’s been losing tweets all weekend. The signal-to-noise ratio means it scales poorly for me: I’d like to follow more people, but I need some way to cut down on the amount of noise I have to wade through to get to the interesting “stuff”. That would let me follow some of those rather high-volume characters while ignoring a lot of their less useful posts. I can remember the days when I had low volumes of email (and no spam); these days my inbox is choked, and I expect my Twitter feed will go the same way with time.
I figured one tack would be to use the same approach as email spam filters. Most of these are based on an idea Paul Graham suggested a long while back: use a Bayesian filter to identify “good” and “bad” email based on the frequency of words in each.
The first step was to build a platform to experiment with…
I needed a C#-based Twitter client with source code available (Witty fitted the bill) and a library for Bayesian filtering. Thankfully both were easy to find…
Next I needed a naive Bayesian spam filter, the basic approach most email spam filters use. Jason Kester’s article on The Code Project, “A Naive Bayesian Spam Filter for C#”, provided an implementation of the approach Paul Graham described. Judging by the comment threads I need to look fairly closely at the code, as there may be at least one bug in the algorithm, but it’ll do for prototyping and I can fix it later.
I’m not going to go into the details of how Bayesian filters work, but essentially a spam filter maintains a record of the frequency of every word found in “bad” emails and of every word found in “good” emails. It then categorizes each new email based on the frequencies of the words found in it.
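The bookkeeping is simple enough to sketch in a few lines. The prototype itself is C#, but here’s a minimal Python sketch of Graham-style scoring; the class and method names are mine, and the constants (clamping to 0.01/0.99, and 0.4 for words never seen before) follow Graham’s essay:

```python
from collections import Counter

class NaiveBayesFilter:
    def __init__(self):
        self.good = Counter()  # word -> count across "good" messages
        self.bad = Counter()   # word -> count across "bad" messages

    def train(self, text, is_bad):
        """Add a message's words to the good or bad frequency table."""
        target = self.bad if is_bad else self.good
        target.update(text.lower().split())

    def word_prob(self, word):
        """Per-word probability that a message containing it is 'bad'."""
        g, b = self.good[word], self.bad[word]
        if g + b == 0:
            return 0.4                       # Graham's default for unseen words
        return min(0.99, max(0.01, b / (b + g)))  # clamp away from 0 and 1

    def score(self, text):
        """Combine per-word probabilities: p = Πp / (Πp + Π(1-p))."""
        prod_p = prod_q = 1.0
        for word in set(text.lower().split()):
            p = self.word_prob(word)
            prod_p *= p
            prod_q *= 1.0 - p
        return prod_p / (prod_p + prod_q)
```

A score close to 1.0 means the message looks like the “bad” training examples; close to 0.0 means it looks “good”.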
I got the whole thing working on Saturday in between painting my living room. As it turns out, most of yesterday’s fun and games were spent learning just enough WPF to get the UI working; hooking up the filtering algorithm was pretty trivial.
Here’s the prototype. It’s the Witty UI with three important additions (red circles, top to bottom). Firstly, each tweet has a filter score associated with it; the closer this is to 1.0, the more likely the tweet isn’t interesting. Secondly, the green and red buttons let the user vote on each tweet, classifying it as “good” or “bad”. Lastly, tweets that either have a high filter score or have been classified “bad” have their opacity reduced, fading them out. Eventually, once I have confidence in the filter, I’ll try collapsing some tweets completely unless the user explicitly expands them.
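The fade decision itself is just a threshold check. A sketch of the logic, assuming a hypothetical cutoff of 0.8 and treating any explicit user vote as overriding the score:

```python
def tweet_opacity(score, user_vote=None, threshold=0.8):
    """Return the opacity to render a tweet at.

    user_vote is "good", "bad", or None; threshold is a guess at a
    sensible cutoff, to be tuned once there's real training data.
    """
    if user_vote == "bad":
        return 0.3   # explicitly voted down: fade it
    if user_vote is None and score >= threshold:
        return 0.3   # high filter score: probably noise, fade it
    return 1.0       # show normally
```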
Hopefully in the next couple of weeks I can gather enough good and bad exemplar data to get an idea of whether it works. Some questions that spring to mind…
- Can Bayesian filtering work well with really short messages? Emails are usually a lot longer than the 140 characters available in a tweet.
- Can the algorithm be improved by using word pairs or by associating the sender with words? With junk email the sender is never the same, but with Twitter I could score user/word pairs rather than just words.
- How long will it take to train? I’ve signed up to follow some very noisy Twitter users to speed things up a bit.
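The second question could be explored just by changing how tweets are tokenized before they hit the filter. A sketch of the idea, with hypothetical names; emitting sender-prefixed tokens and adjacent word pairs alongside the plain words lets the same frequency tables learn all three signals at once:

```python
def tokenize(sender, text, use_pairs=True):
    """Turn a tweet into tokens for the Bayesian filter."""
    words = text.lower().split()
    tokens = list(words)                         # plain words
    tokens += [f"{sender}:{w}" for w in words]   # user/word pairs
    if use_pairs:
        # adjacent word pairs ("bigrams")
        tokens += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return tokens
```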
Of course, filtering in this way would make Twitter hide some tweets. Will that make for a better or worse experience? Given that I don’t read every single tweet sent to me anyway, I’m probably not going to care if I lose a few good tweets to false positives. If it improves the likelihood of my reading an interesting tweet then that’s good enough.
I need to go back, fully understand the algorithm, and see whether it can be tweaked. Right now I’m pretty much using it as an off-the-shelf black box, in the spirit of “do the simplest thing possible that works”.
Now it’s time to evaluate and iterate.
If anyone has any suggestions or comments I’d be pleased to hear them!