Architecture Of The Archivist

Aug 16, 2010 | In Development | By Karsten Januszewski
Building The Archivist introduced several architectural difficulties, which were solved over a number of iterations and plenty of trial and error.
Implementing the three core features of The Archivist (archiving, analyzing and exporting) in a scalable and responsive way proved to be a challenge. While I’m tempted not to write this article, since some readers will see my mistakes coming, I’m going to suck it up for the sake of maybe helping someone else out there.
With this in mind, I’d like to go over the journey of The Archivist architecture with the hope that others may benefit from my experience.
If you don’t know what The Archivist is, you can read about it here.
Architecture I: Prototype
My first architecture was a prototype meant to prove the feasibility of the application. Like many prototypes, I wrote this one quickly to prove the basic idea.
I started out with an instance of SQL Express and created four tables:
- A User table. Because The Archivist allowed different users to create archives of tweets, I needed a user table.
- A Tweets table. Of course I needed a table to store tweets.
- A Search table. In this table, I stored the searches: search term, last updated, is active, etc. It was joined to the tweets table.
- An Archive table. This table was basically a lookup table between searches and users. Because two users might start the same search, there was no point in duplicating that search, thus the creation of the archive table.
In fact, here’s exactly what it looked like:
I whipped up some data access objects using Linq To Sql and I was off and running. Tweets were pulled from Twitter, serialized to my data access objects and inserted into the database. To do the analysis, I pulled tweets out of SQL back into CLR objects and ran LINQ queries on demand at runtime to create the various aggregate graphs of data. To export the tweets to Excel, the graph of CLR objects was transformed to a tab delimited text file on the fly.
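Conceptually, the on-demand analysis and export looked something like this (a minimal Python sketch of the idea, not the actual C# LINQ code; the tweet fields and the per-day aggregate are illustrative):

```python
from collections import Counter
from datetime import datetime

def tweets_per_day(tweets):
    """One runtime aggregate: group tweet dicts by calendar day and count them."""
    return Counter(t["created_at"].date() for t in tweets)

def to_tab_delimited(tweets, fields=("id", "user", "text")):
    """Flatten tweets into a tab-delimited export, one row per tweet."""
    lines = ["\t".join(fields)]
    for t in tweets:
        lines.append("\t".join(str(t[f]) for f in fields))
    return "\n".join(lines)

tweets = [
    {"id": 1, "user": "a", "text": "hi", "created_at": datetime(2010, 8, 1, 9)},
    {"id": 2, "user": "b", "text": "yo", "created_at": datetime(2010, 8, 1, 17)},
    {"id": 3, "user": "a", "text": "ok", "created_at": datetime(2010, 8, 2, 8)},
]
print(tweets_per_day(tweets))
```

Doing both of these at request time is fine for a hundred tweets; as the next section shows, it does not survive contact with real data volumes.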
It was working great in my development environment with one user, one search and about 100 tweets.
I got this basic architecture working with the Azure development environment and then pushed it up to Azure. I had a Web Role for the rudimentary UI and a worker role for polling Twitter.
I then started doing some very basic performance and load tests. Guess what happened? Performance ground to a halt. Who can guess where the bottlenecks were? First, the on-demand LINQ queries were painfully slow. So was the export to Excel. And inserting tweets into SQL was a surprise bottleneck I hadn’t seen in my development environment.
Performance was a problem, but my database was filling up too… and fast. I had created the default 1 GB SQL Azure instance, but three archives of 500,000 tweets each could fill that up. I was going to fill up databases quickly.
Back to the drawing board.
Architecture II: SQL-centric
I had always been concerned about the on-demand LINQ aggregate queries and expected they wouldn’t scale. As a result, I looked into moving them to SQL Server as stored procedures. While that helped performance, it still didn’t meet the speed benchmarks required to show all six visualizations in the dashboard. I then got turned on to indexed views. Ah ha, I thought, the solution to my problems! By converting those stored procedures to indexed views, I thought I had it solved.
But I had another problem on my hands: the size of the SQL databases. My first thought was to have multiple databases and come up with a sharding structure. But the more I looked at the scalability needs and growth of The Archivist, the more skeptical I became of this architecture. It seemed complex and error-prone. And I could end up having to manage 100 SQL Servers? In the cloud? Hmm…
I also had performance problems with doing inserts into SQL Azure. In part, this had to do with the SQL Azure replication model, which does three inserts for each insert. Not an issue with a single insert, but it wasn’t looking good for bulk inserts of tweets.
So, SQL could solve my analysis problems but didn’t seem like a good fit for my storage problems.
Architecture III: No SQL (Well, Less SQL)
The more astute among you probably saw this coming: We dumped SQL for storing tweets and switched to blob storage. So, instead of serializing tweets to CLR objects and then persisting them via LINQ to SQL, I simply wrote the JSON result that came back from the Twitter Search API right to blob storage. The tweet table was no more.
At first, I considered moving the rest of the tables to table storage, but that code was working and not causing any problems. In addition, the data in the other three tables is minimal and the joins are all on primary keys. So I didn’t move them, and never have. As a result, The Archivist has a hybrid storage solution: half blob storage and half SQL Server.
You might be wondering how I correlate the two stores. It’s pretty simple: Each search has a unique search id in SQL. I use that unique search id as a container name in Blob Storage. Then, I store all the tweets related to that search inside that container.
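In other words, the correlation is just a naming convention. A sketch of the idea in Python (the `search-` prefix is my own illustration, not necessarily The Archivist’s actual scheme):

```python
def container_for_search(search_id: int) -> str:
    """Map a SQL search id to an Azure blob container name.

    Azure container names must be 3-63 characters of lowercase
    letters, digits, and hyphens, so a simple prefixed numeric id
    satisfies the rules. All tweets for the search live as blobs
    inside this one container.
    """
    name = f"search-{search_id}"
    if not (3 <= len(name) <= 63):
        raise ValueError("container name out of range")
    return name
```

Because the container name is derived deterministically from the primary key, no extra mapping table is needed to join the two stores.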
Moving from SQL to Blob Storage immediately solved my two big archival problems:
First, I no longer had to worry about a sharding structure or managing all those SQL Servers. Second, my insert performance problems disappeared.
But I still had problems with both analysis and export. Both of these operations were simply too taxing on large datasets to do at runtime with LINQ.
Time for another meeting with Joshua and David. Enter additional Azure worker roles and Azure queues.
Architecture IV: Queues
To solve both of these problems, I introduced additional Azure worker roles that were linked by queues. I then moved the processing of each archive (running aggregate queries as well as generating the file to export) to worker roles.
So in addition to my search worker role, I introduced two new roles, one for appending and one for aggregating. It worked as follows:
The Search role queried the search table for any searches that hadn’t been updated in the last half hour. Upon completion of a search that returned new tweets, it put a message in the queue for the Appender.
The Appender role picked up the message and appended the new tweets to the master archive. The master archive was stored as a tab delimited text file. Perfect! This solved my performance problem getting the tweets into that format, as it was now stored natively as that format and available for download from storage whenever necessary.
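Because the archive is stored natively in its export format, the Appender’s job reduces to a cheap append. A minimal Python sketch of that idea (in production the master file lives in blob storage; here it is just a string):

```python
def append_to_archive(master: str, new_rows: list[str]) -> str:
    """Append newly fetched tweets (already formatted as tab-delimited
    rows) to the master archive text. No per-row SQL inserts and no
    re-serialization at export time: the stored form IS the download.
    """
    if not master:
        return "\n".join(new_rows)
    return master + "\n" + "\n".join(new_rows)
```

The design choice here is to pay the formatting cost once at write time instead of on every export request.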
Upon completing appending, a message was added to the queue for the Aggregator role. That role deserialized the archive to objects via my own deserialization engine.
Once I had the graph of objects, the LINQ queries were run. The result of each aggregate query was then persisted to blob storage as JSON. Again, perfect! Now all six aggregate queries, which sometimes could take up to five minutes to process, were available instantaneously to the UI. (This is why performance is so snappy on the charts for archives of massive size.) In addition, I could also provide these JSON objects to developers as an API.
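The compute-once, read-many pattern can be sketched as follows (a hedged Python illustration; the two aggregates and the blob names are invented examples, and `store` stands in for blob storage):

```python
import json
from collections import Counter

def aggregate_and_persist(tweets, store):
    """Run the aggregate queries once over the full archive, then
    persist each result as JSON so the UI (and the developer API)
    can fetch it instantly instead of recomputing it per request."""
    aggregates = {
        "tweets-per-user": Counter(t["user"] for t in tweets),
        "top-words": Counter(w for t in tweets for w in t["text"].split()),
    }
    for name, counts in aggregates.items():
        store[name] = json.dumps(counts.most_common(10))
    return store
```

Serving a chart then costs one blob read, regardless of whether the archive holds a hundred tweets or half a million.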
This architecture worked for quite a while and seemed to be our candidate architecture.
Until a new problem arose.
The issue was that the aggregate queues were getting backed up. With archives growing to 500,000 tweets, running the aggregate queries was bottlenecking. A single archive could take up to five minutes to process. With archives being updated every half hour and the number of archives growing, the time it took to run the aggregate queries on each archive was adding up. Throwing more instances at it was a possible solution, but I stepped back from the application and really looked at what we were doing. I began to question the need for all that aggregate processing, which was taking up not only time but also CPU cycles. With Azure charging by the CPU hour, that was a consideration as well.
As it was, I was re-running the aggregate queries every half hour on every archive that had new tweets. But really, the whole point of the aggregate queries was to get a sense of what was happening to the data over time. Did the results of those queries change that much between an archive of 300,000 tweets and 301,000 tweets? No. In addition, I was doing all that processing for an archive that may only be viewed once a day – if that. The whole point of The Archivist is to spot trends over time. The Archivist is about analysis. It is not about real time search.
It dawned on me: do I really need to be running those queries every half-hour?
Architecture V: No Queues
We made the call (pretty late in the cycle) to run the aggregate queries only once every 24 hours. Rather than using queues, I changed the Aggregator role to run on a timer, similar to how the Searcher role worked. I then partitioned the work done by the Appender and moved its logic into either the Searcher or the Aggregator.
One thing I like about this architecture is that it gets rid of the queues entirely. In the end, no queues means the whole application is simpler and, in software, the fewer moving parts, the better. So, the final architecture ends up looking like this:
Both the Searcher and the Aggregator are encapsulated in dlls, such that they can be called directly by the web role (when the user submits an initial search) or by a worker role (when an archive is updated). So each worker role runs on a timer.
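Since each role is just a dll invoked on a schedule, the worker-role shell reduces to a timer loop. A simplified Python sketch (the real roles run inside Azure’s worker role entry point, not a Python loop, and loop indefinitely rather than a fixed number of times):

```python
import time

def run_on_timer(do_work, interval_seconds, iterations):
    """Minimal stand-in for a worker role's run loop: invoke the
    role's work (the Searcher or Aggregator logic), then sleep
    until the next cycle. `iterations` bounds the loop only so
    this sketch terminates."""
    for _ in range(iterations):
        do_work()
        time.sleep(interval_seconds)
```

With both roles on timers, there is no queue to drain and no message-handling code to maintain.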
We had to make another change to how the Searcher worked pretty late in the game. The Searcher was polling Twitter every half hour for every single archive. This seemed like overkill, especially for archives that didn’t get a lot of tweets: it meant a lot of unnecessary polling of Twitter. With a finite number of queries allowed per hour to the Search API, this needed revising.
The solution was an elastic degrading polling algorithm. Basically, it means that we try to determine how ‘hot’ an archive is depending on how many tweets it returns. This algorithm is discussed in detail here if you’d like to read more.
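One plausible shape for such an algorithm, sketched in Python (the thresholds, doubling/halving, and bounds here are my own illustration; the actual algorithm is described in the post linked above):

```python
def next_poll_interval(current_minutes, new_tweets,
                       min_minutes=30, max_minutes=24 * 60):
    """Elastic degrading polling, illustrated: back off on quiet
    archives, poll hot ones sooner, and clamp the interval so every
    archive is checked at least daily and at most every half hour."""
    if new_tweets == 0:
        current_minutes *= 2      # archive is cooling off: back off
    elif new_tweets > 100:
        current_minutes //= 2     # archive is hot: poll sooner
    return max(min_minutes, min(current_minutes, max_minutes))
```

The effect is that the finite Search API quota is spent where tweets are actually arriving.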
If you’d like to learn more about exactly how the architecture works, check out this Channel9 interview at about 10:00 in, where I walk through the whole thing.
The Archivist has been an exciting and challenging project to work on. This post didn’t even get into other issues we faced during development, which included getting the ASP.NET Charts to work in Azure; dealing with deduplication and the Twitter Search API; writing and reading from blob storage; nastiness with JSON deserialization; differences in default date time between SQL Server and .NET; struggles trying to debug with the Azure development environment; and lots more. But we shipped!
And, perhaps more interestingly, we have made all the source code available for anyone to run his or her own instance of The Archivist.