Use Write Caching to Optimize High Volume Data Driven Applications
by Steven Smith

The Problem

The basic problem in this scenario is how to persist data to the database without slowing down the web application, but with minimal risk of losing the data (e.g. if a web server crashes).  Reading the data the application requires in an efficient and timely manner is a perfect candidate for micro-caching, which I described in a previous article.  When approaching this problem, one can come up with a variety of options for how to log each unit of work:

  A. Direct to the database – INSERT to a log table – periodically push from the log table to a summary table.
  B. Direct to the database – INSERT item into summary table if needed – UPDATE to increment summary table counter thereafter.
  C. Increment counter in memory – push to database periodically.
  D. Increment counter in file system – push to database periodically.
  E. Use MSMQ to queue updates.

The benefit of Option A is that each insert will be very fast, and there shouldn’t be any locking on the table except when the periodic move from the log table to the summary table is done.  With Option B, there will be some lock contention while updates are performed, but only on a per-item (row) basis.  The whole table is (typically) never locked*, so the locks will only be in contention if a particular row is seeing a very large amount of activity.  Both options share the advantage of being reliable, in that data is immediately persisted to the database.  However, they also share a related disadvantage, in that they impose a steady load on the database.

*It is possible for locks to be escalated to table locks, as this article describes.  However, one can avoid this possibility by using techniques described in the article, or by adding a WITH (ROWLOCK) hint to the INSERT/UPDATE statement.  Generally, though, it is best not to do this, as SQL Server will usually choose the right kind of locking without our interference (test your system with and without the hint and see if it makes a difference – if not, leave WITH (ROWLOCK) off).
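To make Option B a bit more concrete, here is a minimal sketch of the kind of write it implies, using ADO.NET.  The PageViewSummary table, its ItemId and ViewCount columns, the connection string, and the SummaryWriter class are all assumptions for illustration, not the actual schema or code from this article.  In production the UPDATE-then-INSERT batch would also want a transaction or a stored procedure so that two simultaneous "first hits" for an item cannot both INSERT.

using System.Data.SqlClient;

public static class SummaryWriter
{
    // Connection string and schema are assumptions for this sketch.
    private const string ConnectionString =
        "Server=(local);Database=HitTracking;Integrated Security=SSPI;";

    // Option B: write hits straight to the summary table. Under pure Option B,
    // each page request would call this with a hitCount of 1.
    public static void RecordHits(int itemId, int hitCount)
    {
        const string sql = @"
            -- a WITH (ROWLOCK) hint could be added to this UPDATE, though as
            -- noted above SQL Server usually picks the right locks on its own
            UPDATE PageViewSummary
               SET ViewCount = ViewCount + @HitCount
             WHERE ItemId = @ItemId;
            IF @@ROWCOUNT = 0
                INSERT INTO PageViewSummary (ItemId, ViewCount)
                VALUES (@ItemId, @HitCount);";

        using (SqlConnection conn = new SqlConnection(ConnectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@ItemId", itemId);
            cmd.Parameters.AddWithValue("@HitCount", hitCount);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}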

Option C, combined with data persistence using either Option A or B’s technique, removes some of the database load but at the expense of some data volatility.  The two are inversely related: the more I reduce the load on the database, the more data I risk in the event my application crashes or restarts.  Since server restarts are somewhat rare but certainly not unheard of in my experience (individual web applications might restart daily or once a week, depending on how often they’re being updated, or whether the application pool is set up to restart automatically), it is best to keep the cache window fairly small.  Since by definition these high-volume applications are writing to the database many times per second, a one-second period is a good starting point (and longer windows tend to yield greatly diminished returns).
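Here is a rough sketch of what Option C’s in-memory counters might look like, using a locked Hashtable and a one-second timer for the cache window described above.  The HitCounter class and its members are illustrative names of my own, not code from this article, and the flush reuses the hypothetical SummaryWriter from the Option B sketch above.

using System;
using System.Collections;
using System.Threading;

public class HitCounter
{
    private static readonly Hashtable counts = new Hashtable();
    private static readonly object syncRoot = new object();

    // One-second cache window: the longer this is, the fewer database hits,
    // but the more data is at risk if the application restarts.
    private static readonly Timer flushTimer =
        new Timer(new TimerCallback(Flush), null, 1000, 1000);

    // Called on every request; no database work happens here.
    public static void Increment(int itemId)
    {
        lock (syncRoot)
        {
            object current = counts[itemId];
            counts[itemId] = (current == null) ? 1 : (int)current + 1;
        }
    }

    // Runs once per second on a thread-pool thread and pushes the accumulated
    // counts to the database using Option B's technique (see the sketch above).
    private static void Flush(object state)
    {
        Hashtable snapshot;
        lock (syncRoot)
        {
            if (counts.Count == 0) return;
            snapshot = (Hashtable)counts.Clone();
            counts.Clear();
        }

        foreach (DictionaryEntry entry in snapshot)
        {
            SummaryWriter.RecordHits((int)entry.Key, (int)entry.Value);
        }
    }
}

Cloning the table inside the lock and doing the database work outside of it keeps incoming requests from blocking behind the flush.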

Option D I added just for completeness; I would not recommend this approach.  For one thing, in my scenario, my web farm uses a shared file system, so there would be contention for the file(s) used for logging.  For another, this would simply trade the bottleneck of database access for the bottleneck of file system access.  It would have the benefit of persisting data between application restarts, but I don’t think this advantage makes up for the disadvantages.  As with Option C, the data in the file(s) would be persisted to the database periodically using either of the methods described in Options A and B.

Option E would probably work best in an enterprise scenario where it is critical that no data be lost.  I did not attempt to implement this scenario because I didn’t have any prior knowledge of MSMQ, but I did find this article, which appears to offer a good start for anyone wishing to go down this path.  AppLogger Using MSMQ also looks like it would be worth checking out.  Perhaps I’ll examine this option in a future article.

For now, let’s revisit Options A and B.  Why choose one over the other?  In the way I’ve described these options, I’m assuming that I’m going to need a summary table.  Basically, my application will log several million hits per day, so it makes sense to summarize these rather than add several million rows of data per day.  That still leaves the question of when to summarize.  In the case of Option A, the writes are all INSERTs, and the summary table is only updated from the log table periodically.  Option B uses a combination of INSERT and UPDATE statements to eliminate the need for the log table by always writing directly to the summary table.  I haven’t done any benchmarking of Option A versus Option B, but my gut tells me that Option A will be faster, because inserting at the end of a table is pretty much always going to beat doing an update somewhere in the middle of a table (not to mention that UPDATEs are, generally, more expensive than INSERTs).
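For comparison, Option A’s periodic summarization step might look something like the following sketch.  The HitLog table (with an identity column LogId and an ItemId column), the PageViewSummary table, and the LogSummarizer class are assumptions for illustration; capturing the maximum LogId first keeps hits logged while the roll-up is running from being deleted before they are counted.

using System.Data.SqlClient;

public static class LogSummarizer
{
    // Option A: roll the raw log rows up into the summary table, then clear
    // the rows that were summarized. Intended to run periodically (e.g. from
    // a scheduled job), not on the request path.
    public static void Summarize(string connectionString)
    {
        const string sql = @"
            DECLARE @MaxId int;
            SELECT @MaxId = MAX(LogId) FROM HitLog;  -- only touch rows logged so far

            BEGIN TRANSACTION;

            UPDATE s
               SET s.ViewCount = s.ViewCount + l.Hits
              FROM PageViewSummary AS s
              JOIN (SELECT ItemId, COUNT(*) AS Hits
                      FROM HitLog
                     WHERE LogId <= @MaxId
                     GROUP BY ItemId) AS l
                ON s.ItemId = l.ItemId;

            INSERT INTO PageViewSummary (ItemId, ViewCount)
            SELECT ItemId, COUNT(*)
              FROM HitLog
             WHERE LogId <= @MaxId
               AND NOT EXISTS (SELECT 1 FROM PageViewSummary s
                                WHERE s.ItemId = HitLog.ItemId)
             GROUP BY ItemId;

            DELETE FROM HitLog WHERE LogId <= @MaxId;

            COMMIT TRANSACTION;";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}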

However, I like the simplicity of Option B, as it requires fewer tables and does not introduce the need to periodically clean out the log table.  Also, the data in Option B’s summary table is always current (or at least as current as the last update from the web application).  Option A’s summary table will never show real-time data, since it depends on infrequent updates from the log table.  This may not be a problem in many applications, but in my case I’m already introducing some lag time with my caching on the web server, so I don’t need any further delays on the database.

That said, my preference is clearly C/B.  My database structure is going to consist of a summary table to which I’ll log activity using a combination of INSERTs and UPDATEs, and I’ll cut down on how often I need to hit the database by implementing some in-memory counters on the web server.  At this point, I just have a ‘gut feeling’ that Option C is going to help compared to simply hitting the database directly (Option B).  Let’s run some tests.
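As a final illustration of the C/B combination, and still assuming the hypothetical HitCounter class sketched earlier, the per-request work reduces to a single in-memory increment; the once-per-second flush is the only place the database is touched.  The page class and GetCurrentItemId helper here are invented for the example.

using System;
using System.Web.UI;

public partial class ArticlePage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        int itemId = GetCurrentItemId();

        // No database call on the request path; HitCounter (sketched above)
        // pushes the accumulated counts to the summary table once per second.
        HitCounter.Increment(itemId);
    }

    private int GetCurrentItemId()
    {
        // Assumed for this example: the item being viewed is identified by a
        // query-string value.
        return int.Parse(Request.QueryString["id"]);
    }
}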



User Comments

Title: Modern SQL XML Support   
Name: Steve Smith
Date: 2008-08-08 2:33:41 PM
Comment:
Jason Follas posted recently on how to do the SQL XML work more efficiently in SQL 2005+:
http://jasonfollas.com/blog/archive/2008/06/19/coding-in-sql-server-an-evolution.aspx
Title: MSMQ   
Name: Derek
Date: 2008-01-14 9:37:07 AM
Comment:
Thanks for the article. Having some experience with MSMQ I can tell you it will probably perform well, but we've had a lot of issues with using it in a clustered environment. This is a well-known shortcoming of MSMQ.

We've gone to some 3rd party pub-sub solutions that work much more efficiently and work well in a clustered environment.
Title: Using a synchronized HashTable   
Name: Tim
Date: 2005-08-22 1:41:25 PM
Comment:
Also, how would this be affected by using Hashtable.Synchronized(...) to create the hashtable? It appears that this automatically protects the hashtable's Add method and the indexing property's Set accessor with a lock, which would protect you during writes.
Title: What about the code for updating the item in Cache?   
Name: Tim
Date: 2005-08-22 12:08:14 PM
Comment:
Can you show sample code for updating the item in Cache? For example, if an item is added to the cache with a certain expiration, etc., when you update it, do you have to add it back with the same parameters, or do you just reference it like Cache[myitem] = updatedValue, and will it retain all previously set values?

Thanks
Title: Programmer   
Name: Microfolx
Date: 2005-05-21 10:33:19 AM
Comment:
This is a very nice site. Keep it up.
Title: Re: Consistency   
Name: Brett
Date: 2005-05-20 5:15:54 PM
Comment:
Nice article... I might consider trying normal XML writers rather than StringBuilder. Or even XML serialization. Probably slightly slower, but reduces errors.

On the point regarding charging for page views, I have two suggestions. One is to periodically reload the current click count from the database and not worry about being exact. Hey, if they get 1 million or 1,005,000, does it really matter? This isn't the space shuttle. The second would be to switch to a non-caching code path when you get close to whatever your maximum value is. So when you hit, say, 990,000 views (based on your hit rate, etc.), switch to the non-caching version for that customer.
Title: Consistency   
Name: Brian
Date: 2005-05-20 2:03:29 PM
Comment:
I had to accomplish this exact same scenario. The problem I ran into was the caching. Let's say that one of your advertisers pays you to show their ad 1 million times. How can you ensure that it doesn't go over? If you are holding the most current counters in cache, won't each process be looking at its own cache (not to mention each server in the web farm, and all of its processes, looking at their own caches)?

Another general Cache question. When I write something to cache, isn't it only available for that particular process? So if you set up your application to have 5 worker processes, won't that result in 5 separate cache entries?
Title: web farm   
Name: parker
Date: 2005-04-24 3:19:08 AM
Comment:
How do you account for disparate sessions in a web farm scenario? Is the cache shared among the servers?
Title: Too Good   
Name: Zulfiqar Damani
Date: 2005-04-18 12:46:36 AM
Comment:
It was presented very well, and it was interesting too.
Title: Thanks   
Name: Happel
Date: 2005-04-14 8:23:23 AM
Comment:
Thanks for sharing this article, very interesting.





