Mythbuster: Normalizing with Integers is Always Best
by J. Ambrose Little

In the Beginning

[Download Sample]

For as long as I’ve been in development (which is, admittedly, not as long as many), it has been an oft-repeated adage that you should not use variable character (VARCHAR) columns as primary keys in database tables.  No, rather, you should use integers, they say.  And most databases that I’ve run into in my career follow this advice, having almost exclusively integer (INT) auto-increment (IDENTITY) artificial primary key columns—the good ol’ “Id” (or “ID,” if you prefer) column.  There are exceptions, of course, but by and large, this has been the design approach that appears to be most common, at least in my experience.

The problem with this approach, barring disputes about natural versus artificial primary keys, is that you start creating a multitude of lookup tables that house a single value just for the sake of consistency and, theoretically, speed.  For instance, let’s say you have a table called Orders that has a column that should store the country to which you need to ship the items in the order, and you need to constrain entry to a list of real countries for the sake of consistency.

Adhering to the aforementioned adage, you would likely then dutifully create a Countries table with Id and Name columns (and possibly one or more abbreviation columns).  In your Orders table, you might create a CountryId column to store a key reference to the appropriate row in the Countries table.  You would then proceed to build a stored procedure to retrieve a list of countries to use in, for example, an ASP.NET DropDownList control, using the Name for the DataTextField and the Id for the DataValueField.
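
To make the scenario concrete, here is a minimal sketch of that conventional design; the table, column, and procedure names are illustrative rather than taken from any particular schema.

Conventional Lookup Design (sketch)

-- Hypothetical lookup table keyed by an artificial identity
CREATE TABLE Countries (
 [Id] [int] IDENTITY (1, 1) NOT NULL,
 [Name] [varchar](50) NOT NULL,
 CONSTRAINT [PK_Countries] PRIMARY KEY CLUSTERED ([Id])
)
GO

-- Procedure the UI calls to bind a DropDownList
-- (Name as the DataTextField, Id as the DataValueField)
CREATE PROCEDURE GetCountries
AS
 SELECT [Id], [Name]
 FROM Countries
 ORDER BY [Name]
GO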

For experienced data-oriented developers, this scenario will have a soporific effect because you’ve done it so many times.  I would also wager that most of us would agree that it is the best approach.  But is it?  Let’s reason together about it.

Common Sense

Common sense would have it that it is always better to use an integer column as a primary key.  But just what are the pros and cons of using the lookup table scenario with integer keys?  Let’s start with the benefits.

  1. Centralized name management – storing the artificial Id key in related tables will make it easier to keep data consistent in the event of changes in the names.  This is part and parcel of the second normal form, if you care about living up to such pedantry.
  2. Inherent uniqueness – artificial integer identity keys are inherently unique, so you don’t have to go to any trouble to ensure key uniqueness.
  3. Reduced storage space – standard integer columns are four bytes long.  A country name, for example, could be up to 50 bytes long.  This is another benefit of normalization in general.
  4. Increased performance – the adage is that performance will be better for integer primary keys, mostly when dealing with foreign keys that reference the primary key in JOINs.

Wow, that’s a pretty hefty list of benefits.  What about the cons?

  1. Values are not human-readable – since the keys are artificial and have no semantic relationship with the entity being described, humans cannot read such values in the table without extra work; hence the “lookup” terminology: we have to look up the values in a reference table to see what they stand for.
  2. Increase in total database entities – because we are forced to create lookups (where they might otherwise not be necessary) and create new, fake keys, we increase the total number of entities that we have to create and manage.
  3. Added complexity – because the values in the rows have no meaning without looking up the “real” values, every operation that involves using that data is more complicated.
  4. Replication trouble – when you replicate rows in a table that has such artificial lookup values, you must ensure that there is a corresponding lookup table in the destination that is also replicated and has exactly the same artificial identity values. 

As with all lists that try to be comprehensive, I will conclude with the common caveat that there may indeed be considerations left out, but I trust you will bear with me in limiting consideration to these for the purposes of the article. 

Centralized name management can be a significant reason to use a lookup table, but that should not necessarily be the deciding factor.  If your alternative named key value is not volatile, as is the case with country names and many other simple lookup tables, then this consideration may not weigh heavily in your decision.  However, if you have no meaningful keys that are unique and non-volatile, you may certainly want to opt for the artificial key and lookup scenario.

Reduced storage space will only be important at the extreme high-end of database requirements.  If you are looking at a potential difference of, on average, 15-20 bytes between an integer column and a variable character column, even when dealing with a million rows, you are only looking at a difference of about 15-20 megabytes, which is nothing in today’s world of high-capacity storage.

On the other hand, considering the added complexity of making the data usable when artificial integer keys are inherently meaningless, you may want to seriously consider other alternatives before jumping on the lookup table bandwagon.  Consider that every time you want to present the data in a user interface or export it to other systems, you will need to do extra work to make the data meaningful. 

To get the pros and cons of using a meaningful, human-readable key, you can, for the most part, invert the lists above: the cons are addressed, while a few of the benefits no longer hold true and can become sources of trouble.

Performance Obsession

Strangely enough, many developers are still obsessed with performance, oftentimes at the expense of more important considerations, such as usability, developer time (both in initial development and subsequent support), scalability, and maintainability.  I say this because I’ve often run across situations where the difference in performance between two approaches is negligible, if not non-existent, as far as usability is concerned, and usability should ultimately be the key concern when we speak of performance.  Your users will not care if a particular operation takes 100 milliseconds longer to complete than it might have otherwise.

And this is the case with the difference between using a variable character field (a semantically relevant named key) and an integer field (a semantically bereft artificial key).  The tests, outlined below, show that even when joining 1.7 million rows, the difference in performance between the two approaches is not terribly significant.

Further, when dealing with a simple, single-value lookup table, as we do in the test to list country names for shipping purposes, the difference between having to do the lookup and not doing the lookup is greater than the difference between using an integer key versus a variable character key for the lookup.

Theories in Action

For the test, I first prepared two separate country tables with the purpose of benchmarking the difference between doing integer lookups versus variable character lookups.  The create scripts for both are listed below.

Countries Tables

if exists (select * from dbo.sysobjects 
 where id = object_id(N'[CountriesById]') 
  and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [CountriesById]
GO


CREATE TABLE [CountriesById] (
 [Id] [int] IDENTITY (1, 1) NOT NULL,
 [Name] [varchar](50) NOT NULL,
 [PostalAbbreviation] [char](2) NOT NULL,
 CONSTRAINT [PK_CountryId] PRIMARY KEY CLUSTERED 
 (
  [Id]
 ) ON [PRIMARY] 
) ON [PRIMARY]
GO



if exists (select * from dbo.sysobjects 
 where id = object_id(N'[CountriesByName]') 
  and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [CountriesByName]
GO


CREATE TABLE [CountriesByName] (
 [Name] [varchar](50) NOT NULL,
 [PostalAbbreviation] [char](2) NOT NULL,
 CONSTRAINT [PK_CountryName] PRIMARY KEY  CLUSTERED 
 (
  [Name]
 )  ON [PRIMARY] 
) ON [PRIMARY]
GO

As you can see, the difference is in the primary key.  In the first case, an artificial numeric identity column is used; in the second, the actual country name.  Both tables have a PostalAbbreviation column that stores a two-character abbreviation, included solely so that the lookup table still serves a purpose in the second case.  I loaded up the tables from an existing countries table in another database, so I ended up with 240 entries.  (I’ve exported the CountriesByName table to a CSV, included in the sample download, for anyone who wants to run these tests themselves.)
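
If you want to load that exported CSV yourself, something along these lines should work; the file path is just a placeholder for wherever you extract the sample download.

Load the Countries Data (sketch)

-- Load the exported CSV into the name-keyed table
-- (the path is a placeholder; point it at the extracted sample file)
BULK INSERT dbo.CountriesByName
FROM 'C:\Samples\CountriesByName.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
GO

-- Populate the identity-keyed table from the same data
INSERT INTO dbo.CountriesById ([Name], [PostalAbbreviation])
SELECT [Name], [PostalAbbreviation]
FROM dbo.CountriesByName
GO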

The next step was to open up Query Analyzer and modify the Orders table in the Northwind database to add an Entropy column and a CountryId column as seen in the following listing.

Add Orders Columns

USE Northwind
GO 
ALTER TABLE dbo.Orders ADD
 Entropy float(53) NULL,
 CountryId INT NULL
GO

I added the Entropy column to add a bit of flavor to the row values because I intended to significantly increase the size of the Orders table by copying existing rows.  CountryId was added to serve as a foreign key to the Id column in the CountriesById table. 
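
The listing above adds CountryId as a plain INT column; if you want the relationship actually enforced by the database, a constraint along these lines would do it (the constraint name is illustrative):

-- Enforce the relationship between Orders.CountryId and CountriesById.Id
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_CountriesById
 FOREIGN KEY (CountryId) REFERENCES dbo.CountriesById ([Id])
GO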

The next step was to change the type of the ShipCountry column to match the type of the Name column in the countries tables.  The reason is that ShipCountry is by default (in SQL Server 2000, at least) an NVARCHAR column, and joining it to the VARCHAR key in the CountriesByName table forces an implicit type conversion, adding noticeable overhead and skewing the results.  I note this because I initially did not make the following change, and it increased the execution time by a notable amount.

Change ShipCountry

ALTER TABLE dbo.Orders 
 ALTER COLUMN ShipCountry VARCHAR(50)
GO

The next step was to populate the CountryId values for the rows in the Orders table.  Luckily, after changing UK to United Kingdom and USA to United States, all of the values in ShipCountry matched the values in my countries tables, so I was able to massage the data into shape in a few easy statements.

Populate CountryId

-- Update orders to have matching country names and IDs 
UPDATE Orders SET 
 ShipCountry = 'United States' 
WHERE ShipCountry = 'USA'


UPDATE Orders SET 
 ShipCountry = 'United Kingdom' 
WHERE ShipCountry = 'UK'


UPDATE Orders SET 
 CountryId = (SELECT C.[Id] 
   FROM dbo.CountriesById C 
   WHERE C.[Name] = ShipCountry)

Now that the initial 830 rows were prepped, I felt it would be best to increase the size of the Orders table to something that most applications don’t see, in order to get a better feel for the comparative speed of the two approaches.  Beware!  If you run the following statements, you may as well go make yourself a cup of tea or a pot of coffee.  It took my fairly beefy box about five to six minutes to run (and be sure you have a couple gigabytes free on your data drive).

Bloat Orders Table

UPDATE Orders SET Entropy = ROUND(RAND(OrderID)*1000, 2)
GO
SET NOCOUNT ON
GO
DECLARE @counter smallint
SET @counter = 10
WHILE @counter < 21 -- 11 doublings of the original 830 rows: 830 * 2^11 = 1,699,840
   BEGIN
 INSERT INTO Orders ( 
  CustomerId, 
  EmployeeId, 
  OrderDate, 
  RequiredDate, 
  ShippedDate, 
  ShipVia,
  Freight, 
  ShipName, 
  ShipAddress, 
  ShipCity, 
  ShipRegion, 
  ShipPostalCode, 
  ShipCountry, 
  Entropy, 
  CountryId)
 SELECT 
  CustomerId, 
  EmployeeId, 
  OrderDate, 
  RequiredDate, 
  ShippedDate, 
  ShipVia,
  Freight, 
  ShipName, 
  ShipAddress, 
  ShipCity, 
  ShipRegion, 
  ShipPostalCode, 
  ShipCountry, 
  RAND(@counter*Entropy)*10000, 
  CountryId
 FROM Orders WITH (NOLOCK) -- a bare nolock here would be parsed as a table alias, not a locking hint
      SET @counter = @counter + 1
   END
GO
SET NOCOUNT OFF
GO

After I had sipped my cup of tea, the Orders table was successfully bloated to 1,699,840 rows, which I felt was sufficient for the test.  Switching back over to Query Analyzer, I decided to run three tests.  First, I would do a query that grouped by Orders.CountryId and Countries.Name, joining on the Id for the CountriesById table and on the Name for the CountriesByName table, as below.

First Test – Joined Grouping

SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId
GROUP BY O.CountryId, C.[Name]


SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesByName C ON C.[Name] = O.ShipCountry
GROUP BY O.CountryId, C.[Name]

After warming up the engine by running the queries a couple of times, I compared the execution times using SQL Profiler.  Joining on the integer did have an advantage; it took about one second to run the query, with 43,666 reads and 0 writes.  Joining on Name took 1.7 seconds to run, incurring just a few more reads (43,733) and still 0 writes. 
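
If you would rather not set up a Profiler trace, roughly comparable duration and read counts can be had directly in Query Analyzer with SET STATISTICS; the figures quoted here, though, come from Profiler.

-- Report elapsed/CPU time and logical reads for each statement in the batch
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO

SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId
GROUP BY O.CountryId, C.[Name]
GO

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO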

Clearly, the winner is joining on the integer key; however, I’d like to point out that this advantage only gains you fractions of a second even when querying 1.7 million rows.  The difference, in terms of ratio, remained fairly constant when comparing on fewer rows.  For example, the difference on about 150k rows was about 40 milliseconds. 

The next test was to compare retrieving the complete row set with the same joins.

Second Test – Joined Without Grouping

SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId


SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesByName C ON C.[Name] = O.ShipCountry

With the above queries, the difference was less significant in terms of ratio, but a bit larger in overall response time.  Joining on the integer field took about 6.6 seconds to complete, while joining on the variable character field took about 8.2 seconds.  The difference is still negligible, but you can see how, when dealing with row sets in the millions, you would want to weigh this efficiency gain against the other factors.

Last Test – No Joining

SELECT O.[ShipCountry] 
FROM dbo.Orders O

Now, you may be wondering what this last test is all about.  It serves to illustrate a key point, a consideration that you should definitely keep in mind when thinking about whether or not to use a semantically meaningful variable character value instead of an artificial key and lookup table, even if it goes against the second normal form. 

Because I’m using such a value, it is unnecessary for me to even make a join to a lookup table to retrieve the same, meaningful data.  And guess what, this query beats out the integer join by 2.2 seconds (and by 2.7 seconds when adding a simple index on the ShipCountry column); it came in at 4.3 seconds on average without an index and 3.9 seconds with the index.
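
The “simple index” mentioned above is not part of the listings; creating it looks like this (the index name is illustrative):

-- Nonclustered index to speed up reading ShipCountry directly
CREATE INDEX IX_Orders_ShipCountry ON dbo.Orders (ShipCountry)
GO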

Doggie Bag

So what can you take home with you?  First, you can take home that some folks can talk about just about anything for a long time.  Second, you can take home that, while this may not be busting a myth, it is putting an adage in perspective.  What I mean is that while the adage that integer joins are faster than variable character joins is true, that does not necessarily mean you should always opt for creating a lookup table, regardless of what academic rules might dictate.

I would suggest that if the lookup table would hold only one meaningful value, that of a semantically pertinent and (de facto) unique value such as the name of a country, you should seriously consider just duplicating that character data for the sake of simplicity, usability (from a developer and report writer perspective), and performance (remember the last test: if you don’t join, the query is notably faster). 

You can still constrain entry to a list.  In fact, if you do go ahead and create a separate table to store the list values, you can more easily keep those values in sync even if they change by taking advantage of the cascading update option available in SQL Server.  But you don’t have to refer back to the list every time you want to display meaningful data for other tables that use data from your list.
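
For example, with the name-keyed design from the test, you could constrain Orders.ShipCountry to the list and let SQL Server cascade any renames automatically; a sketch, assuming the CountriesByName table above (the constraint name is illustrative):

-- Constrain ShipCountry to the list and propagate renames automatically
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_CountriesByName
 FOREIGN KEY (ShipCountry) REFERENCES dbo.CountriesByName ([Name])
 ON UPDATE CASCADE
GO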

Also keep in mind that if you do need to do a lookup in such cases, it is not all that bad, really, especially if you’re not dealing with many, many rows.  If you find that you think the benefits of a semantically meaningful key are worth sacrificing a few milliseconds here and there, go for it.

However, if you do not have such a key for an entity, do not want to rely on users to create one (if that’s an option), or will be dealing with multiple millions of records, you should probably stick with the integer key approach.  If you will be replicating a lot, though, you may want to consider using the UNIQUEIDENTIFIER column type for your artificial keys, as that can simplify situations, though it too has drawbacks that we won’t cover here.
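
For the curious, a minimal sketch of what such a key might look like (the table and column names are illustrative):

-- Artificial key that stays unique across replicated databases
CREATE TABLE SomeEntity (
 [Id] UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL DEFAULT NEWID(),
 [Name] [varchar](100) NOT NULL,
 CONSTRAINT [PK_SomeEntity] PRIMARY KEY NONCLUSTERED ([Id])
)
GO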


