The basic problem in this scenario is how to persist data to the database without slowing down the web application, while keeping the risk of losing data (e.g. if a web server crashes) to a minimum. As for reading the data back in an efficient and timely manner, that is a perfect candidate for micro-caching, which I described in a previous article. When approaching this problem, one can come up with a variety of options for how to log each unit of work:
- Option A: Direct to the database – INSERT into a log table – periodically push from the log table to a summary table.
- Option B: Direct to the database – INSERT the item into the summary table if needed – UPDATE to increment the summary table counter thereafter.
- Option C: Increment a counter in memory – push to the database periodically.
- Option D: Increment a counter in the file system – push to the database periodically.
- Option E: Use MSMQ to queue updates.
The benefit of Option A is that each insert will be very fast, and there shouldn’t be any locking on the table except while the periodic move from the log table to the summary table runs. With Option B, there will be some lock contention while updates are performed, but only on a per-item (row) basis. The whole table is (typically) never locked*, so locks will only be in contention if a particular row is seeing a very large amount of activity. Both options share the advantage of being reliable, in that data is immediately persisted to the database. However, they also share a related disadvantage, in that they impose a steady load on the database.
*It is possible for locks to be escalated to table locks, as this article describes. However, one can avoid this possibility by using techniques described in the article, or by adding a WITH (ROWLOCK) hint to the INSERT/UPDATE statement. Generally, though, it is best not to do this, as SQL Server will usually choose the right kind of locking without our interference (test your system with and without the hint and see if it makes a difference – if not, leave it off).
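To make Option B’s write pattern concrete, here is a minimal sketch using Python’s sqlite3 as a stand-in for SQL Server. The table and column names (hit_summary, item_id, hits) are made up for illustration: the idea is to try the UPDATE first and fall back to an INSERT only when the summary row doesn’t exist yet.

```python
import sqlite3

# Stand-in for the real database; in production this would be SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hit_summary (item_id INTEGER PRIMARY KEY, hits INTEGER NOT NULL)"
)

def log_hit(item_id):
    # UPDATE first; if no row was affected, INSERT the initial counter row.
    cur = conn.execute(
        "UPDATE hit_summary SET hits = hits + 1 WHERE item_id = ?", (item_id,)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO hit_summary (item_id, hits) VALUES (?, 1)", (item_id,)
        )
    conn.commit()

for _ in range(3):
    log_hit(42)

print(conn.execute(
    "SELECT hits FROM hit_summary WHERE item_id = 42"
).fetchone()[0])  # 3
```

Note that under concurrent writers the UPDATE-then-INSERT pair can race; on SQL Server you would wrap the two statements in a transaction (or use a single MERGE statement) to close that window.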
Option C, combined with data persistence using either option A or B’s technique, removes some of the database load but at the expense of some data volatility. The two are inversely related: the more I reduce the load on the database, the more data I risk losing in the event my application crashes or restarts. Since server restarts are somewhat rare but certainly not unheard of in my experience (individual web applications might restart daily or once a week, depending on how often they’re being updated, or if the application pool is set up to automatically restart), it is best to keep the cache window fairly small. Since by definition these high-volume applications are writing to the database many times per second, a one-second flush interval is a good starting point (and further increases tend to have greatly diminished returns).
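The in-memory counter in Option C can be sketched as follows. This is an illustrative Python sketch, not production code: HitCounter and flush_to_db are names I’ve invented here, the flush callback stands in for the INSERT/UPDATE logic of option A or B, and the default one-second interval matches the starting point suggested above.

```python
import threading
from collections import defaultdict

class HitCounter:
    def __init__(self, flush_to_db, interval=1.0):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()
        self._flush_to_db = flush_to_db  # callable taking {item_id: count}
        self._interval = interval

    def increment(self, item_id):
        # Called on every web request; just bumps an in-memory counter.
        with self._lock:
            self._counts[item_id] += 1

    def flush(self):
        # Swap the dictionary out under the lock so requests are never
        # blocked by the database write itself.
        with self._lock:
            counts, self._counts = self._counts, defaultdict(int)
        if counts:
            self._flush_to_db(dict(counts))

    def start(self):
        # Flush on a repeating background timer.
        def loop():
            self.flush()
            timer = threading.Timer(self._interval, loop)
            timer.daemon = True
            timer.start()
        loop()

# Demonstration with a list standing in for the database:
flushed = []
counter = HitCounter(flushed.append)
for _ in range(5):
    counter.increment(42)
counter.flush()
print(flushed)  # [{42: 5}]
```

The swap-under-lock in flush() is the key design point: the window during which data can be lost is bounded by the flush interval, exactly the volatility trade-off described above.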
Option D I added just for completeness. I would not recommend this system. For one thing, in my scenario, my web farm uses a shared file system, so there would be contention for the file(s) used for logging. For another, this would simply trade the bottleneck of database access for the bottleneck of file system access. It would have the benefit of persisting data between app restarts, but I don’t think this advantage makes up for the disadvantages. As with Option C, the data in the file(s) would be persisted to the database periodically using either of the methods described in options A and B.
Option E would probably work best in an enterprise scenario where it is critical that no data be lost. I did not attempt to implement this scenario because I didn’t have any prior knowledge of MSMQ. But I did find this article, which would appear to offer a good start for anyone wishing to go down this path. AppLogger Using MSMQ also looks like it would be worth checking out. Perhaps I’ll examine this option in a future article.
For now, let’s revisit options A and B. Why choose one versus the other? I’m assuming in the way I’ve described these tables that I’m going to need a summary table. Basically, my application will log several million hits per day, so it makes sense to me to summarize these rather than add several million rows of data per day. That still leaves the question of when to summarize. In the case of option A, the writes are all INSERTs, and the summary table is only updated from the log table periodically. Option B uses a combination of INSERT and UPDATE statements to eliminate the need for the log table by always writing directly to the summary table. I haven’t done any benchmarking of option A versus option B, but my gut tells me that option A will be faster, because inserting at the end of a table is pretty much always going to beat updating a row somewhere in the middle of one (not to mention that UPDATEs are, generally, more expensive than INSERTs).
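For comparison, here is what option A’s two-step pattern looks like, again as a hedged sqlite3 sketch with invented table names: hits are appended to a log table with cheap INSERTs, and a scheduled job rolls them up into the summary table and clears the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hit_log (item_id INTEGER NOT NULL);
    CREATE TABLE hit_summary (item_id INTEGER PRIMARY KEY, hits INTEGER NOT NULL);
""")

def log_hit(item_id):
    # Fast path: append-only INSERT, no contention on the summary table.
    conn.execute("INSERT INTO hit_log (item_id) VALUES (?)", (item_id,))

def summarize():
    # Periodic job: aggregate the log, merge into the summary, then clear it.
    rows = conn.execute(
        "SELECT item_id, COUNT(*) FROM hit_log GROUP BY item_id"
    ).fetchall()
    for item_id, n in rows:
        cur = conn.execute(
            "UPDATE hit_summary SET hits = hits + ? WHERE item_id = ?",
            (n, item_id),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO hit_summary (item_id, hits) VALUES (?, ?)",
                (item_id, n),
            )
    conn.execute("DELETE FROM hit_log")
    conn.commit()

for _ in range(4):
    log_hit(7)
summarize()
print(conn.execute(
    "SELECT hits FROM hit_summary WHERE item_id = 7"
).fetchone()[0])  # 4
```

In a real SQL Server deployment the summarize step would be a scheduled job (and the DELETE would need to be careful not to drop rows inserted after the SELECT, e.g. by running both inside one transaction or deleting by a captured key range).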
However, I like the simplicity of option B, as it requires fewer tables and does not introduce the need to periodically clean out the log table. Also, the data in option B’s summary table is always current (or at least as current as the last update from the web application). Option A’s summary table will never show real-time data, since it depends on infrequent updates from the log table. This may not be a problem in many applications, but in my case I’m already introducing some lag time with my caching on the web server, so I don’t need any further delays on the database.
That said, my preference is clearly a combination of options C and B. My database structure is going to consist of a summary table to which I’ll log activity using a combination of INSERTs and UPDATEs, and I’ll cut down on how often I need to hit the database by implementing some in-memory counters on the web server. At this point, I just have a ‘gut feeling’ that option C is going to help compared to simply hitting the database directly (option B). Let’s run some tests.