[Download Sample]
For the test, I first prepared two separate country tables to benchmark the difference between integer lookups and variable-character lookups. The create scripts for both are listed below.
Countries Tables
if exists (select * from dbo.sysobjects
    where id = object_id(N'[CountriesById]')
    and OBJECTPROPERTY(id, N'IsUserTable') = 1)
    drop table [CountriesById]
GO

CREATE TABLE [CountriesById] (
    [Id] [int] IDENTITY (1, 1) NOT NULL,
    [Name] [varchar](50) NOT NULL,
    [PostalAbbreviation] [char](2) NOT NULL,
    CONSTRAINT [PK_CountryId] PRIMARY KEY CLUSTERED
    (
        [Id]
    ) ON [PRIMARY]
) ON [PRIMARY]
GO
GO
if exists (select * from dbo.sysobjects
    where id = object_id(N'[CountriesByName]')
    and OBJECTPROPERTY(id, N'IsUserTable') = 1)
    drop table [CountriesByName]
GO

CREATE TABLE [CountriesByName] (
    [Name] [varchar](50) NOT NULL,
    [PostalAbbreviation] [char](2) NOT NULL,
    CONSTRAINT [PK_CountryName] PRIMARY KEY CLUSTERED
    (
        [Name]
    ) ON [PRIMARY]
) ON [PRIMARY]
GO
As you can see, the difference is in the primary key: the first table uses an artificial numeric identity column, while the second uses the actual country name. Both tables include a two-character PostalAbbreviation column, which is there mainly so that the second table has some lookup purpose beyond the name itself. I loaded the tables from an existing countries table in another database, which gave me 240 entries. (I’ve exported the CountriesByName table to a CSV, included in the sample download, for anyone who wants to run these tests themselves.)
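If you’re working from the CSV in the sample download rather than an existing countries table, a BULK INSERT along these lines should load it. The file path and CSV layout (comma-delimited, no header row) are my assumptions, so adjust them to match your copy of the file:

```sql
-- Load the exported countries into the name-keyed table.
-- Path and delimiters are assumptions; adjust to your environment.
BULK INSERT dbo.CountriesByName
FROM 'C:\Temp\CountriesByName.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)
GO

-- Populate the id-keyed table from the name-keyed one;
-- the IDENTITY column assigns the Id values automatically.
INSERT INTO dbo.CountriesById ([Name], [PostalAbbreviation])
SELECT [Name], [PostalAbbreviation]
FROM dbo.CountriesByName
GO
```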
The next step was to open up Query Analyzer and modify the Orders table in the Northwind database to add an Entropy column and a CountryId column as seen in the following listing.
Add Orders Columns
USE Northwind
GO
ALTER TABLE dbo.Orders ADD
    Entropy float(53) NULL,
    CountryId INT NULL
GO
I added the Entropy column to give the row values a bit of variety, because I intended to significantly increase the size of the Orders table by copying existing rows. CountryId was added as a foreign key referencing Id in the CountriesById table.
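Note that the listing above only adds the columns; it doesn’t create the foreign key relationship itself. If you want the relationship enforced, a constraint along these lines works (the constraint name is my own invention):

```sql
-- Enforce the CountryId -> CountriesById.Id relationship.
-- NULL CountryId values are still allowed under a foreign key.
ALTER TABLE dbo.Orders
ADD CONSTRAINT FK_Orders_CountriesById
FOREIGN KEY (CountryId) REFERENCES dbo.CountriesById ([Id])
GO
```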
The next step was to change the type of the ShipCountry column to match the type of the Name column in the countries tables. By default (in SQL Server 2000, at least) it is an NVARCHAR column, and joining it to the VARCHAR key in the CountriesByName table forces an implicit type conversion that adds noticeable strain and skews the results. I noted this because I initially did not make the following change, and it increased the execution time by a noticeable amount.
Change ShipCountry
ALTER TABLE dbo.Orders
ALTER COLUMN ShipCountry VARCHAR(50)
GO
The next step was to populate the CountryId values for the rows in the Orders table. Luckily, after changing UK to United Kingdom and USA to United States, all of the values in ShipCountry matched the values in my countries tables, so I was able to massage the data into shape in a few easy statements.
Populate CountryId
-- Update orders to have matching country names and IDs
UPDATE Orders SET
    ShipCountry = 'United States'
WHERE ShipCountry = 'USA'

UPDATE Orders SET
    ShipCountry = 'United Kingdom'
WHERE ShipCountry = 'UK'

UPDATE Orders SET
    CountryId = (SELECT C.[Id]
                 FROM dbo.CountriesById C
                 WHERE C.[Name] = ShipCountry)
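As an aside, the final UPDATE above can also be written using T-SQL’s UPDATE ... FROM join syntax, which does the same work and is often easier to read on larger statements:

```sql
-- Equivalent to the correlated-subquery version above:
-- set each order's CountryId from the matching country name.
UPDATE O SET
    CountryId = C.[Id]
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Name] = O.ShipCountry
```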
Now that the initial 830 rows were prepped, I felt it would be best to increase the size of the Orders table to something most applications never see, in order to get a better feel for the comparative speed of the two approaches. Beware! If you run the following statements, you may as well go make yourself a cup of tea or a pot of coffee: it took my fairly beefy box about five to six minutes to run (and be sure you have a couple of gigabytes free on your data drive).
Bloat Orders Table
UPDATE Orders SET Entropy = ROUND(RAND(OrderID)*1000, 2)
GO
SET NOCOUNT ON
GO
DECLARE @counter smallint
SET @counter = 10
WHILE @counter < 21
BEGIN
INSERT INTO Orders (
CustomerId,
EmployeeId,
OrderDate,
RequiredDate,
ShippedDate,
ShipVia,
Freight,
ShipName,
ShipAddress,
ShipCity,
ShipRegion,
ShipPostalCode,
ShipCountry,
Entropy,
CountryId)
SELECT
CustomerId,
EmployeeId,
OrderDate,
RequiredDate,
ShippedDate,
ShipVia,
Freight,
ShipName,
ShipAddress,
ShipCity,
ShipRegion,
ShipPostalCode,
ShipCountry,
RAND(@counter*Entropy)*10000,
CountryId
FROM Orders WITH (NOLOCK)
SET @counter = @counter + 1
END
GO
SET NOCOUNT OFF
GO
After sipping my cup of tea, I found the Orders table successfully bloated to 1,699,840 rows, which I felt was sufficient for the test. Back in Query Analyzer, I decided to run three tests. First, I would run a query that grouped by Orders.CountryId and Countries.Name, joining on Id for the CountriesById table and on Name for the CountriesByName table, as below.
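For anyone following along, a quick count verifies that the doubling loop produced the expected total (the figure in the comment is from my run):

```sql
-- Sanity check after the doubling loop
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders
-- 1,699,840 on my run
```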
First Test – Joined Grouping
SELECT O.CountryId, C.[Name]
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId
GROUP BY O.CountryId, C.[Name]

SELECT O.CountryId, C.[Name]
FROM dbo.Orders O
INNER JOIN dbo.CountriesByName C ON C.[Name] = O.ShipCountry
GROUP BY O.CountryId, C.[Name]
After warming up the engine by running the queries a couple of times, I compared the execution times using SQL Profiler. Joining on the integer did have an advantage; it took about one second to run the query, with 43,666 reads and 0 writes. Joining on Name took 1.7 seconds to run, incurring just a few more reads (43,733) and still 0 writes.
Clearly, the winner is joining on the integer key; however, I’d like to point out that this advantage gains you only fractions of a second even when querying 1.7 million rows. The ratio between the two remained fairly constant at smaller row counts; for example, on about 150,000 rows the difference was roughly 40 milliseconds.
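If you would rather not set up a SQL Profiler trace, Query Analyzer can report comparable duration and read counts directly. This SET STATISTICS approach is an alternative to how I captured the figures above, not the method I actually used:

```sql
-- Report elapsed/CPU time and logical reads per statement
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO

SELECT O.CountryId, C.[Name]
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId
GROUP BY O.CountryId, C.[Name]
GO

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
```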
The next test was to compare retrieving the complete row set with the same joins.
Second Test – Joined Without Grouping
SELECT O.CountryId, C.[Name]
FROM dbo.Orders O
INNER JOIN dbo.CountriesById C ON C.[Id] = O.CountryId

SELECT O.CountryId, C.[Name]
FROM dbo.Orders O
INNER JOIN dbo.CountriesByName C ON C.[Name] = O.ShipCountry
With the above queries the ratio narrowed, but the absolute gap grew: joining on the integer field took about 6.6 seconds, while joining on the variable-character key took about 8.2 seconds. The difference is still small, but you can see how, when dealing with row sets in the millions, you would want to weigh this efficiency gain against the other factors.
Last Test – No Joining
SELECT O.[ShipCountry]
FROM dbo.Orders O
Now, you may be wondering what this last test is all about. It illustrates a key consideration to keep in mind when deciding whether to use a semantically meaningful variable-character value instead of an artificial key and lookup table, even if doing so goes against second normal form.
Because I’m using such a value, I don’t even need a join to a lookup table to retrieve the same meaningful data. And guess what: this query beats the integer join by 2.2 seconds (and by 2.7 seconds with a simple index on the ShipCountry column), coming in at 4.3 seconds on average without an index and 3.9 seconds with one.
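For reference, the “simple index” on ShipCountry mentioned above isn’t shown elsewhere in the listings; creating it is a one-liner (the index name is mine):

```sql
-- Nonclustered index to speed up scans/seeks on ShipCountry
CREATE INDEX IX_Orders_ShipCountry
ON dbo.Orders (ShipCountry)
GO
```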