Mythbuster: Normalizing with Integers is Always Best
by J. Ambrose Little

Theories in Action

[Download Sample]

For the test, I first prepared two separate country tables to benchmark the difference between integer lookups and variable-character lookups.  The create scripts for both are listed below.

Countries Tables

if exists (select * from dbo.sysobjects 
 where id = object_id(N'[CountriesById]') 
  and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [CountriesById]
GO

CREATE TABLE [CountriesById] (
 [Id] [int] IDENTITY (1, 1) NOT NULL,
 [Name] [varchar](50) NOT NULL,
 [PostalAbbreviation] [char](2) NOT NULL,
 CONSTRAINT [PK_CountryId] PRIMARY KEY CLUSTERED 
 (
  [Id]
 ) ON [PRIMARY] 
) ON [PRIMARY]
GO



if exists (select * from dbo.sysobjects 
 where id = object_id(N'[CountriesByName]') 
  and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [CountriesByName]
GO

CREATE TABLE [CountriesByName] (
 [Name] [varchar](50) NOT NULL,
 [PostalAbbreviation] [char](2) NOT NULL,
 CONSTRAINT [PK_CountryName] PRIMARY KEY CLUSTERED 
 (
  [Name]
 ) ON [PRIMARY] 
) ON [PRIMARY]
GO

As you can see, the difference is in the primary key.  In the first case, an artificial numeric identity column is used; in the second, the actual country name.  Both tables also store a two-character abbreviation, which gives the lookup table in the second case something meaningful to look up.  I loaded the tables from an existing countries table in another database, so I ended up with 240 entries.  (I’ve exported the CountriesByName table to a CSV, included in the sample download, for anyone who wants to run these tests themselves.)
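If you want to load the CSV from the sample download yourself, something like the following should work; note that the file path and terminators here are my assumptions, not part of the original scripts.

```sql
-- Load the exported CSV into CountriesByName (path and terminators are assumptions)
BULK INSERT dbo.CountriesByName
FROM 'C:\data\CountriesByName.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
GO

-- CountriesById can then be filled from it; the IDENTITY column generates the Ids
INSERT INTO dbo.CountriesById ([Name], [PostalAbbreviation])
SELECT [Name], [PostalAbbreviation]
FROM dbo.CountriesByName
GO
```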

The next step was to open up Query Analyzer and modify the Orders table in the Northwind database to add an Entropy column and a CountryId column as seen in the following listing.

Add Orders Columns

USE Northwind
GO 
ALTER TABLE dbo.Orders ADD
 Entropy float(53) NULL,
 CountryId INT NULL
GO

I added the Entropy column to give the row values a bit of variety, because I intended to significantly increase the size of the Orders table by copying existing rows.  CountryId was added as a foreign key referencing the Id column in the CountriesById table. 
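The article doesn’t show the foreign key declaration itself; one way to write it (the constraint name here is my assumption) would be:

```sql
-- Declare CountryId as a foreign key to CountriesById.Id
-- (constraint name is an assumption; the article does not show it)
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_CountriesById
 FOREIGN KEY (CountryId) REFERENCES dbo.CountriesById ([Id])
GO
```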

The next step was to change the type of the ShipCountry column to match the type of the Name column in the countries tables.  By default (in SQL Server 2000, at least), ShipCountry is an NVARCHAR column, and the implicit conversion when joining it to the VARCHAR key of the CountriesByName table adds noticeable overhead, skewing the results.  I note this because I initially did not make the following change, and it increased the execution time by a notable amount.

Change ShipCountry

ALTER TABLE dbo.Orders 
 ALTER COLUMN ShipCountry VARCHAR(50)
GO

The next step was to populate the CountryId values for the rows in the Orders table.  Luckily, after changing UK to United Kingdom and USA to United States, all of the values in ShipCountry matched the values in my countries tables, so I was able to massage the data into shape in a few easy statements.

Populate CountryId

-- Update orders to have matching country names and IDs 
UPDATE Orders SET 
 ShipCountry = 'United States' 
WHERE ShipCountry = 'USA'


UPDATE Orders SET 
 ShipCountry = 'United Kingdom' 
WHERE ShipCountry = 'UK'


UPDATE Orders SET 
 CountryId = (SELECT C.[Id] 
   FROM dbo.CountriesById C 
   WHERE C.[Name] = ShipCountry)

Now that the initial 830 rows were prepped, I felt it would be best to increase the size of the Orders table to something that most applications don’t see, in order to get a better feel for the comparative speed of the two approaches.  Beware!  If you run the following statements, you may as well go make yourself a cup of tea or a pot of coffee.  It took my fairly beefy box about five to six minutes to run (and be sure you have a couple gigabytes free on your data drive).

Bloat Orders Table

UPDATE Orders SET Entropy = ROUND(RAND(OrderID)*1000, 2)
GO
SET NOCOUNT ON
GO
DECLARE @counter smallint
SET @counter = 10
WHILE @counter < 22
   BEGIN
 INSERT INTO Orders ( 
  CustomerId, 
  EmployeeId, 
  OrderDate, 
  RequiredDate, 
  ShippedDate, 
  ShipVia,
  Freight, 
  ShipName, 
  ShipAddress, 
  ShipCity, 
  ShipRegion, 
  ShipPostalCode, 
  ShipCountry, 
  Entropy, 
  CountryId)
 SELECT 
  CustomerId, 
  EmployeeId, 
  OrderDate, 
  RequiredDate, 
  ShippedDate, 
  ShipVia,
  Freight, 
  ShipName, 
  ShipAddress, 
  ShipCity, 
  ShipRegion, 
  ShipPostalCode, 
  ShipCountry, 
  RAND(@counter*Entropy)*10000, 
  CountryId
 FROM Orders WITH (NOLOCK) -- note: "FROM Orders nolock" would merely alias the table, not apply the hint
 SET @counter = @counter + 1
   END
GO
SET NOCOUNT OFF
GO

After sipping on my cup of tea, the Orders table was successfully bloated to 1,699,840 rows, which I felt was sufficient for the test.  Switching back over to Query Analyzer, I decided to run three tests.  First, I would do a query that grouped by Orders.CountryId and Countries.Name, joining on the Id for the CountriesById table and on the Name for the CountriesByName table, as below.
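The 1,699,840 figure follows from doubling the original 830 rows on each of the loop’s 12 iterations minus one initial copy, i.e. 830 × 2^11 = 1,699,840; a quick count confirms it:

```sql
-- Sanity check after the bloat: 830 * 2^11 = 1,699,840
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders
```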

First Test – Joined Grouping

SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId
GROUP BY O.CountryId, C.[Name]


SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesByName C ON C.[Name] = O.ShipCountry
GROUP BY O.CountryId, C.[Name]

After warming up the engine by running the queries a couple of times, I compared the execution times using SQL Profiler.  Joining on the integer did have an advantage; it took about one second to run the query, with 43,666 reads and 0 writes.  Joining on Name took 1.7 seconds to run, incurring just a few more reads (43,733) and still 0 writes. 
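If you don’t have SQL Profiler handy, comparable duration and read figures can be gathered per query right in Query Analyzer; this is my suggestion, not the method the article used:

```sql
-- Report elapsed/CPU time and logical reads for each statement
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO

SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId
GROUP BY O.CountryId, C.[Name]
GO

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
```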

Clearly, the winner is joining on the integer key; however, I’d like to point out that this advantage gains you only fractions of a second even when querying 1.7 million rows.  The ratio between the two remained fairly constant at smaller sizes; for example, the difference on about 150k rows was about 40 milliseconds. 

The next test was to compare retrieving the complete row set with the same joins.

Second Test – Joined Without Grouping

SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId


SELECT O.CountryId, C.[Name] 
FROM dbo.Orders O
INNER JOIN dbo.CountriesByName C ON C.[Name] = O.ShipCountry

With the above queries, the difference was smaller as a ratio but larger in absolute response time.  Joining on the integer field took about 6.6 seconds to complete, while joining on the variable character took about 8.2 seconds.  The difference is still negligible, but you can see how, when dealing with row sets in the millions, you would want to weigh this efficiency gain against the other factors.

Last Test – No Joining

SELECT O.[ShipCountry] 
FROM dbo.Orders O

Now, you may be wondering what this last test is all about.  It illustrates a key consideration to keep in mind when deciding whether to use a semantically meaningful variable character value instead of an artificial key and lookup table, even if doing so goes against the second normal form. 

Because I’m storing such a value, I don’t even need a join to a lookup table to retrieve the same meaningful data.  And guess what: this query beats the integer join by 2.2 seconds (and by 2.7 seconds when adding a simple index on the ShipCountry column); it came in at 4.3 seconds on average without an index and 3.9 seconds with the index.
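The "simple index" mentioned above might look like the following (the index name is my assumption; the article doesn’t show the statement):

```sql
-- Nonclustered index on the natural-value column used by the last test
CREATE NONCLUSTERED INDEX IX_Orders_ShipCountry
 ON dbo.Orders (ShipCountry)
GO
```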



User Comments

Title: I am not Against Normalization   
Name: J. Ambrose Little
Date: 2005-09-16 9:11:49 PM
Comment:
Give it a rest, folks. If you disagree, you disagree. The point is that it is not an unbendable rule to use artificial integer keys for lookup tables...
Title: Internationalization   
Name: ThinCover
Date: 2005-08-10 9:24:25 AM
Comment:
Your example is very poorly chosen. One, a standardized artificial key for country codes, the ISO-3166 standard, already exists. And two, you've made it nearly impossible to internationalize the application since you've stored the country names in English.
Title: Let's address those "downsides" you list   
Name: Richard P
Date: 2005-08-09 7:14:04 PM
Comment:
1) Values are not human readable.

Who cares? Why do they need to be stored as human-readable in the DB? That's what the Presentation Layer of your application is for. If your web/GUI/Data Warehouse developer can't join a table with a lookup table, fire them yesterday.

2) Increase in database entities.

Um, no, sorry, you're completely wrong here. You'll need lookup tables as the authoritative "list of valid possible values" anyhow. Or do you advocate not having a list of valid values and just letting every user enter whatever they want ad-hoc? I can't imagine you would. So you have the same number of tables.

3) Added complexity.

Again, I disagree. Some operations that involve displaying the value of the field in the lookup tables are more complex. JOINing and such operations don't give a hoot if you're joining on meaningless integers or character values.

But without meaningless keys, you cannot safely change the display value of the data. If you use the country code instead of a meaningless primary key, you're screwed when, for instance, "Burma" gets changed to "Myanmar". How is "BUR" in a table standing for "Myanmar" human-readable?

4. Replication troubles.

Again, this is a non-issue. Unless someone is talking about "one table for everything" design (in which case they're a complete amateur), you have a list of tables that need to be replicated. You just add the lookup tables to the list.

Finally, you run some benchmarks against values vs. using a lookup table. In 99% of today's applications, you only use the artificial key value in the query anyway.

Take a simple salesman tracking app. User looks up "Joe Smith" and sees he's in the "Southwest" sales region. The "Southwest" region is region 352256 in the DB, but the user doesn't care. He just clicks on the "Find other salespeople in this region" button which then executes

SELECT p.name, p.address, p.phone, p.fax, p.email
FROM salesman s, person p
WHERE s.emplid =
Title: One more thing   
Name: BCoder
Date: 2005-05-22 1:50:32 AM
Comment:
It seems that you are thinking about looking at data right out of the database, kind of like it was done a while ago when 90% of the world's applications were written in COBOL.
Back then ALPHA CODES were in. Just remember that storing an INT 150K times saves more space than storing the real name of the country 150K times.

There is a point at which data normalization does not make any sense. That point is when you have a table filled with all keys and no real data at all. That is where there will be real problems for the developer... PersonFirstNameID, PersonLastNameID... It could look this way if there weren't some balancing done by DBAs.
Title: Something to think about   
Name: BCoder
Date: 2005-05-22 1:43:59 AM
Comment:
When not using a lookup table for the country issue presented in this article, there are a number of other factors to think about. What happens when one country splits up into 2 or even 3 countries? Let’s just say that it changes its name. The update locks records that do not need to be locked if you store the "name" of the country in the primary table instead of an ID in the column. But I guess that doesn’t matter because I’ve saved myself time and I really don’t care about what slows down an application. When maintaining data this makes a big difference. Another thing that you didn’t talk about was multiple users. When a lookup is used a lot it stays in memory, which would actually decrease query time because the IO reads would drop...

Are we tracking historical data? historical changes at the record and field level?

Here is something to think about and lookup... when doing a select query add the NOLOCK option to each referenced table... You can also speed up queries even more if you use derived tables....

If it takes me two more weeks to develop some feature or do something a certain way, it is well worth the time, especially if I never have to go back and touch it again. It is very selfish to think only about the time it would take me to do something one way versus another. In the long run the most appropriate way will reduce my overall work effort.

Just think: lookups are managed by someone other than a developer and other than a DBA. Let’s keep these people working, not cleaning up things that are design flaws.

Think of it this way: if you spend a couple more minutes a day doing something a different way, how much time does that save you in the future when you would otherwise have to correct previous mistakes, assuming you allow yourself to make a mistake?

Anyway, in real business-world situations where DBAs are designing things and programmers do not have any control over this issue, you do what they tell you even if you think they are crazy.
Title: RE: Comments   
Name: J. Ambrose Little
Date: 2005-05-20 9:34:32 AM
Comment:
Thanks all for the comments. It's good to have dissenting opinions to encourage people to think critically about issues.

@ svevarn:
Lao, People's Democratic Republic <-- 33 characters long

The 50 character figure was just being generous. Since we're using VARCHAR, it doesn't matter much if we make it max 50 or max 33...

@ Matthew: Thanks for bringing up that point. Cascading updates on keys will only work in database, but I presume a cascaded key update would also replicate, no?

@ Craig: No, not a joke. I'm just suggesting that there are times when using INT keys is less desirable than not.

@ Arturo: When speaking of architecture, things are rarely as black and white as you seem to think. A lookup table, as its name implies, should be used to look up non-key values about an entity. If you really don't have anything meaningful to lookup beyond a name, I'd suggest a lookup table is likely superfluous.

As for your other comments, I can see you have trouble thinking outside the box and reading things in context, so I'll just let what I've said on those subjects stand.
Title: I disagree   
Name: svevarn
Date: 2005-05-20 9:18:30 AM
Comment:
"Reduced storage space – standard integer columns are four bytes long. A country name, for example, could be up to 50 bytes long. This is another benefit of normalization in general."

1) Use countrycode "SE" "UK" "US".... char(2) 2 bytes less than int.
2) Name 1 country with more than 30 chars?


natural keys should be used as often as possible. Data should make sense, not look "prettier" just because its an number.
Title: Programmer   
Name: Matthew MacFarland
Date: 2005-05-20 5:14:20 AM
Comment:
I agree that subsecond improvements are not important in most cases. The main reason that I use artificial keys is not performance but maintenance. I have used natural keys in the past and these get propagated to other tables as FKs; then users call and want the values updated. In our replicated environment these types of bulk updates are hard to do. Artificial keys shield DBAs and programmers from the volatility of user-controlled data. Don't believe them when they tell you "this won't change".
Title: Hmmmm   
Name: Craig
Date: 2005-05-20 1:52:37 AM
Comment:
I assume this is a late April fools joke.

1. Values are not human-readable. This is a good thing, not a bad thing as you state. A key should not have any business meaning. Period.
2. Increase in total database entities. This is not really an issue IMO. Who cares how many tables you have; it has minimal effect on anything.
3. Added complexity. I think the opposite is true. Having unnormalized data scattered throughout the database is very complex.
4. Replication trouble. What trouble? Just replicate the lookup tables first.
Title: Database normalization   
Name: Arturo Martinez
Date: 2005-05-19 4:30:49 PM
Comment:
The purpose of a lookup table in that case is to ensure that each country name is spelled correctly in every place that is used. Period.

Anyone who has had to do maintenance work on database systems could tell you how hard it is to work with unnormalized databases. Database normalization is not "theory" and is far from "pedantry": it's a set of necessary steps to obtain a database design that allows for consistent storage and efficient access of data in a relational database, and it has been proven over years of use.

"Strangely enough, many developers are still obsessed with performance" What? This should be enough for you to be listed in TheDailyWTF
Title: RE: country abbreviation   
Name: J. Ambrose Little
Date: 2005-05-16 3:20:44 PM
Comment:
Hi again,

Yeah, if you want to constrain entry to a list, you will likely want to store the list somewhere, but that's not necessarily a lookup table. If you store the value you need in the records themselves (instead of a lookup key), you won't need to look them up, regardless of whether your UI provides a list to constrain entry.

I don't use abbreviations for countries because most people don't know all the abbreviations, and I'm fairly sure not all countries' postal services understand them and abide by them. It's just safer to use the country name, IMO, to ensure delivery and to make it more readable for users.
Title: country abbreviation   
Name: James Curran
Date: 2005-05-16 3:00:10 PM
Comment:
Yes, but in this case (and I grant that this only works with certain specific tables), the abbreviation-as-PK is adequately human-readable, and all that is needed most of the time, so no lookup is needed. And other times, you can't get away from a lookup table, even when at first you think you can: for example, for user input, you might think you could just accept plain text, but you will need to create a dropdown from the lookup, unless you could accept a wide variety of names for the same country ("US", "USA", "U.S.A.", "United States", "America", etc.)
Title: RE: The middle ground...   
Name: J. Ambrose Little
Date: 2005-05-16 11:13:09 AM
Comment:
Hi James,

Yes, I didn't address that, but the idea with using the name is, primarily, that no lookup needs to occur for most cases.
Title: The middle ground....   
Name: James Curran
Date: 2005-05-16 11:08:44 AM
Comment:
Oddly, you don't mention the primary key that I would use in such a case: the country abbreviation. It has all of the "pros" of an integer PK, while negating 3 of the 4 "cons". Con #4 remains, but I'm not sure I'd even qualify that as a "con", rather just the price of data integrity.





