How The Archivist Polls Twitter
Jul 7, 2010 In Development By Karsten Januszewski
You may be wondering how frequently The Archivist updates archives. Well, the answer is more complicated than it may first appear. Let’s dig in.
The Archivist interacts with Twitter using the Twitter Search API, which it polls at variable intervals based on the frequency with which a particular archive is updated. We call this the elastic degrading polling function. This algorithm helps The Archivist be a good Twitter citizen, allowing us to poll Twitter conservatively while at the same time maintaining archives with the latest tweets.
Here’s how the algorithm works: When a user makes an archive ‘active’, the polling process begins. Every archive is inspected once an hour to determine how ‘hot’ it is. We determine how hot an archive is by recording how many results we get back each time we poll Twitter (the maximum we can pull at any one time is 1500). We use this number to determine how frequently to poll Twitter for that archive. Depending on the number, we either hold off on polling for a given interval or query again, based on the following buckets:
So, let’s look at an example. Say we have an archive going for the term ‘Wittgenstein’. When The Archivist checks on this archive at 10 AM, it discovers that the last query for Wittgenstein only returned 10 tweets. It also discovers that this archive was last updated at 9 AM. The Archivist won’t poll Twitter for this archive, because the tweet count isn’t high enough and the archive was queried within the last 24 hours. Since the archive is in the 24-hour bucket, the same thing will happen when The Archivist checks on this archive each hour.
Once 9 AM rolls around on the next day, since 24 hours have passed, Twitter will be polled for the Wittgenstein archive.
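The hourly check described above can be sketched in a few lines of Python. This is only an illustration, not The Archivist's actual code: the function name `is_due` and the calendar date are made up for the example, and the only assumption is that an archive gets polled once its bucket's interval has elapsed since the last poll.

```python
from datetime import datetime, timedelta


def is_due(last_polled: datetime, bucket_hours: int, now: datetime) -> bool:
    """Return True once at least bucket_hours have passed since the last poll."""
    return now - last_polled >= timedelta(hours=bucket_hours)


# Hypothetical date; the archive was last updated at 9 AM and sits in the 24-hour bucket.
last_polled = datetime(2010, 7, 7, 9, 0)

# The 10 AM check on the same day skips the archive...
same_day_check = is_due(last_polled, 24, datetime(2010, 7, 7, 10, 0))

# ...and the 9 AM check on the next day polls Twitter again.
next_day_check = is_due(last_polled, 24, datetime(2010, 7, 8, 9, 0))

print(same_day_check, next_day_check)
```

Every hourly inspection in between gives the same answer as the 10 AM check, which is why a quiet archive costs only one Twitter query per day.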
Now, let’s say for some reason there’s a flurry of tweets about Wittgenstein: when that archive was updated at 9 AM, it pulled 600 tweets. In this case, the archive adjusts because it has become hot. It is now in the 1-hour bucket instead of the 24-hour bucket. So, when 10 AM rolls around, the Wittgenstein archive gets updated again.
But let’s say at 10 AM it pulls only 250 tweets. Well, now the archive moves to the 8-hour bucket. So, the Wittgenstein archive will not be polled again until 6 PM. Let’s say that at 6 PM it pulls 1000 tweets. Well, it goes back to the 1-hour bucket, since it appears to be hot. At 7 PM the term is checked again. This time, the response is only 10 tweets. It seems to have cooled off quickly, so we’ll move it back to the 24-hour bucket.
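To make the bucket hopping concrete, here is a rough Python sketch that replays the Wittgenstein example. Be careful with the numbers: the cutoffs `HOT_THRESHOLD` and `WARM_THRESHOLD` are assumptions chosen only so that they agree with the tweet counts in the example (10, 250, 600 and 1000), and they are not The Archivist's real bucket boundaries. The function names and the date are likewise hypothetical.

```python
from datetime import datetime, timedelta

# ASSUMED cutoffs, picked to fit the example above; the real boundaries may differ.
HOT_THRESHOLD = 500   # at or above this many results: 1-hour bucket
WARM_THRESHOLD = 100  # at or above this many results: 8-hour bucket


def bucket_hours(tweet_count: int) -> int:
    """Map the result count of the last poll to a polling interval in hours."""
    if tweet_count >= HOT_THRESHOLD:
        return 1
    if tweet_count >= WARM_THRESHOLD:
        return 8
    return 24


def next_poll(polled_at: datetime, tweet_count: int) -> datetime:
    """Earliest time at which the archive should be polled again."""
    return polled_at + timedelta(hours=bucket_hours(tweet_count))


# Replay the example: result counts seen at each successive poll.
polled_at = datetime(2010, 7, 7, 9, 0)  # hypothetical date, 9 AM
schedule = []
for count in [600, 250, 1000, 10]:
    schedule.append((polled_at.strftime("%H:%M"), count, bucket_hours(count)))
    polled_at = next_poll(polled_at, count)

for time_of_poll, count, hours in schedule:
    print(f"{time_of_poll}: {count} tweets -> {hours}-hour bucket")
```

Under these assumed cutoffs the polls land at 9 AM, 10 AM, 6 PM and 7 PM, matching the walkthrough, and the archive ends the day back in the 24-hour bucket.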
Some of you may notice that The Archivist could miss tweets when a term becomes hot. This is a reality of our architecture and is justified by the following: First, once a term gets hot, the amount of data can grow quickly. Ultimately, in that scenario, The Archivist becomes a statistical sample as opposed to a true historical record. Second, Twitter itself doesn’t guarantee that all tweets will be returned for a given search. See http://help.twitter.com/entries/66018-my-tweets-or-hashtags-are-missing-from-search and http://dev.twitter.com/doc/get/search for more on this. Consequently, there is no way that The Archivist can ever claim to be a true historical record. Third, The Archivist is optimized for following non-trending topics over a long period of time, as opposed to trending topics over a short time. For a tool optimized for the latter scenario, see Archivist Desktop. Another option would be to run your own instance of The Archivist Web and tweak the polling algorithm, which would be trivial to do. Contact me if you are interested in doing so.