Improving Twitter with a Bayesian filtering client
Monday, April 21, 2008 – 6:03 PM

Twitter has been causing me some problems of late, quite apart from the fact that it’s been losing tweets all weekend. The signal-to-noise ratio is making it scale poorly for me. I’d like to follow more people, but I need some way to cut down on the amount of noise I have to wade through to get to the interesting “stuff”. This would let me follow some of those rather high-volume characters while ignoring a lot of their less useful posts. I can remember the days when I had low volumes of email (and no spam). These days my inbox is choked, and I expect my Twitter feed will go the same way with time.
I figured one tack would be to use the same approach as email spam filters. These are based on an idea suggested by Paul Graham a long while back: use Bayesian filters to identify “good” and “bad” email based on the frequency of words in each.
The first step was to build a platform to experiment with…
I needed a C#-based Twitter client with available source code, plus a library for doing Bayesian filtering. Thankfully both were easy to find…
I took the code for Witty, a fairly full-featured Twitter client with a nice UI implemented in WPF. Witty doesn’t have quite as many features as Twhirl, my usual client, but it’s good enough to test ideas on, and I didn’t fancy learning the Adobe AIR platform used by Twhirl and implementing the Bayesian filter in JavaScript.
Next I needed a naive Bayesian spam filter, the basic approach most email spam filters use. Jason Kester’s article on The Code Project, “A Naive Bayesian Spam Filter for C#”, provided an implementation of the approach described by Paul Graham. Judging by the comment thread I’ll need to look fairly closely at the code, as there may be at least one bug in the algorithm, but it’ll do for prototyping and I can fix it later.
I’m not going to go into the details of how Bayesian filters work, but essentially a spam filter maintains a record of the frequency of all the words found in “bad” emails and of all the words found in “good” emails. It then categorizes each new email based on the frequencies of the words found in it.
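For anyone curious about the mechanics, here’s a toy sketch of Graham-style scoring. It’s in Python rather than the C# the prototype actually uses, and it’s my own simplified illustration, not the Code Project library’s code:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class BayesianFilter:
    """Toy Graham-style filter: 'bad' = uninteresting, 'good' = interesting."""

    def __init__(self):
        self.good = Counter()   # word -> count of good messages containing it
        self.bad = Counter()    # word -> count of bad messages containing it
        self.ngood = 0          # number of good messages trained on
        self.nbad = 0           # number of bad messages trained on

    def train(self, text, is_bad):
        for word in set(tokenize(text)):
            (self.bad if is_bad else self.good)[word] += 1
        if is_bad:
            self.nbad += 1
        else:
            self.ngood += 1

    def word_prob(self, word):
        """P(bad | word), clamped so no single word is ever decisive."""
        g = 2 * self.good[word]          # Graham doubles good counts to bias against false positives
        b = self.bad[word]
        if g + b < 1:
            return 0.4                   # words we've never seen lean slightly good
        p = (b / max(self.nbad, 1)) / (b / max(self.nbad, 1) + g / max(self.ngood, 1))
        return min(0.99, max(0.01, p))

    def score(self, text):
        """Combine the most extreme word probabilities with Bayes' rule.

        Returns a value in (0, 1); closer to 1.0 means more likely 'bad'.
        """
        probs = sorted((self.word_prob(w) for w in set(tokenize(text))),
                       key=lambda p: abs(p - 0.5), reverse=True)[:15]
        if not probs:
            return 0.5
        prod = math.prod(probs)
        inv = math.prod(1 - p for p in probs)
        return prod / (prod + inv)
```

After training on a handful of “good” and “bad” examples, `score()` returns the kind of 0-to-1 value the prototype displays next to each tweet.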
I got the whole thing working Saturday in between a painting project in my living room. As it turns out most of yesterday’s fun and games was focused on learning just enough WPF to get the UI working. Getting the filtering algorithm hooked up was pretty trivial.
Here’s the prototype. It’s the Witty UI with three important additions (red circles, from top to bottom). First, each tweet has a filter score associated with it; the closer this is to 1.0, the more likely the tweet isn’t interesting. Second, the green and red buttons allow the user to vote on each tweet and classify it as “good” or “bad”. Lastly, tweets that either have a high filter score or have been classified “bad” are faded out by reducing their opacity. Eventually, once I have confidence in the filter, I’ll try collapsing some tweets completely unless the user explicitly expands them.
Hopefully in the next couple of weeks I can gather enough good and bad exemplar data to have an idea whether it works. Some questions that spring to mind…
- Can Bayesian filtering work well with really short messages? Emails are usually a lot longer than the 140 characters available in a tweet.
- Can the algorithm be improved by using word pairs or associating the sender with words? With junk email the sender is never the same but with Twitter I could score user/word pairs rather than just words.
- How long will it take to train? I’ve signed up to follow some very noisy Twitter users to speed things up a bit.
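The user/word pairing from the second question could be prototyped with nothing more than a change to tokenization: emit each word both on its own and prefixed with the sender’s name, so the filter learns per-user vocabularies. A hypothetical sketch (again in Python, not the prototype’s C#):

```python
import re

def tokenize_with_sender(sender, text):
    """Emit plain word tokens plus sender-qualified tokens, so the filter can
    learn that a word is noise from one user but interesting from another."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = list(words)                                  # plain words
    tokens += [f"{sender.lower()}/{w}" for w in words]    # user/word pairs
    tokens.append(f"sender:{sender.lower()}")             # the sender alone is a feature too
    return tokens
```

Feeding these tokens into the same filter means a high-volume user’s pet topics can be scored down without penalizing the same words from everyone else.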
Of course, filtering in this way would make Twitter hide some tweets. Will this make for a better or worse experience? Given that I don’t read every single tweet sent to me, I’m probably not going to care if I lose a few tweets to false positives. If it enhances the likelihood of my reading an interesting tweet then that’s good enough.
I need to go back, fully understand the algorithm, and see if it can be tweaked. Right now I’m pretty much using it as an off-the-shelf black box, in the spirit of “do the simplest thing possible that works”.
Now it’s time to evaluate and iterate.
If anyone has any suggestions or comments I’d be pleased to hear them!
13 Responses to “Improving Twitter with a Bayesian filtering client”
I had this idea today when I discovered ‘track’. Since tracking only works with SMS or IM, I wanted to code the filtering into a Jabber client so you can track terms and cut out the noise.
I’d like to know how things went after you played with it more.
By Nick on Apr 29, 2008
That’s awesome!
Any thoughts on how this could be rolled into Witty? Maybe we could provide some hooks and this could be implemented as a plugin? Or maybe it should be baked in to the product?
Would you be interested in contributing your work to Witty?
By Jon Galloway on May 8, 2008
This is assuming there is actually an “interesting” tweet in there somewhere. This could be a classic example of the computer eventually starting to mark everything as spam (twitter+spam=spatt??) while the coder wonders why his algorithm is broken… when really Twitter just is 99% noise.
By Anthony on Apr 28, 2009
Anthony,
Yes. When I experimented with this it was hard to train the filter. With email spam it’s pretty clear what is spam and what isn’t. With Tweets you’re really trying to decide what’s interesting (to me now) and what’s not.
I’m wondering if a crowdsourcing approach might be worth considering? Having said that, I did sort of train the filter to ignore pretty much all of the Scobleizer’s tweets after a while :).
Ade
By Ade Miller on Apr 28, 2009
Thanks for the great idea!
I implemented (in a most primitive way) a command-line Facebook status viewer, incorporating Bayesian filtering.
Along the way, I found a tentative answer to your question about Bayesian filtering on word pairs: According to the documentation, the risk of adding more filtering features is that you can overfit to your data, degrading your filter’s ability to classify new things. But associating the user with their words may be a good compromise. My suggestion would be to implement categorization both with and without user-word pairs, and then compare how well they categorize future messages.
By Surly on May 4, 2009
Very good point. A Bayesian approach in general helps with “mobsourced” information, but Twitter’s unstructured/uncategorized input doesn’t give it much to work with, so I don’t have great confidence that the approach will be effective.
We at Vanno took a similar approach vis a vis Bayes, but made sure our input was much more structured – to enhance the “signal” from the start.
Check us out – http://www.vanno.com/ and http://blog.vanno.com/
By Nick DiGiacomo on May 14, 2009
Great to see people are thinking about this. I wish it existed as a standard client.
To really improve it, I think a combination of the below two approaches would work well:
http://startupidea.wordpress.com/2009/05/28/twitter-streams-by-keyword-analysis/
http://startupidea.wordpress.com/2009/05/26/bayesian-include-engine-for-twitter/
By Justin Vincent on May 29, 2009
Just to make things more interesting, how about scoring things by conversation rather than by individual tweet? i.e. treat an entire chain of @replies as a single entity when assigning a score.
By Tim Hall on Feb 15, 2010
I did the same thing with my own Twitter client (which is integrated into my home automation system – a single feed for the house, Twitter, Facebook, …) See http://blog.abodit.com.
The one improvement I made was to add a “URL lengthener” that traces all redirects until it gets to the actual site and then it splits these actual URLs into words. Now when you approve or disapprove of a Tweet you are also indicating if the linked site itself was interesting or not.
By Ian Mercer on Feb 23, 2010
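[Ed: Ian’s URL-resolution trick could be sketched like this — a hypothetical Python illustration, not his actual code. `urlopen` follows HTTP redirects itself, so the resolved URL is just whatever the response reports:]

```python
import re
import urllib.request

def url_words(url):
    """Split a URL into lowercase word tokens (host parts, path segments)."""
    return [w for w in re.split(r"[^a-z0-9]+", url.lower()) if len(w) > 2]

def resolve_and_tokenize(short_url):
    """Follow redirects from a shortened URL to the final page, then split
    the resolved URL into word tokens to feed the filter. (Makes a live
    network request, so errors and timeouts need handling in real use.)"""
    with urllib.request.urlopen(short_url, timeout=10) as resp:
        final_url = resp.geturl()   # urlopen follows HTTP redirects for us
    return url_words(final_url)
```

[Ed: the tokens from the resolved URL would then be trained and scored just like the words of the tweet itself.]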
Ian,
Very interesting idea. I wonder if you could extend this to filter the content of the finally resolved page as well? Or is this what your application does?
Ade
By Ade Miller on May 9, 2010