Wednesday, June 8, 2016

Relational Database Design:Simple rules for creating primary key

Introduction

Some years ago, I had some discussions with one of my friends about using defining the primary key. In the old days, as a software developer you had to implement your database first so most people though about how does the database works and how they can implement their solution in a suitable way.
In these days, developers use ORMs most of the time which is very good, but at the same time, they start to ignore how their code will implement inside the DB. So it is important to consider how your data will be translated by ORM and what will you have inside your DB afterwards.


Some Rules 
Well I think most people these rules, but there is no harm in mentioning them.

Primary keys should never change

Your RDBMS, uses keys to manage tables, sort them, find them and make relation between them and the most important key obviously is the primary key. If you ever try to change the value of the primary key, it will affect other related tables.

You cannot use a natural key or a key form other system

It is possible for natural keys to change, so you are obviously violation the first rule. You might say, oh someones ID will not change. But it is possible for the government the change the system of producing the ids then you have to change a lot of stuff.

They cannot have any formula

It is also a violation of rule 1, since you may need to change the formula in the feature.

The uniqueness has to be easy to maintain

Your RDBMS will prevent you from putting a duplicate inside your primary key, so if you are generating your key with a method that can create duplicates, you will have lots of problems that you cannot fix easily.

Use short but suitable key type

All RDBMS' use B+ threes as they index structure. They need to put your keys inside a table and fetch them so if you use a big key, your RDBMS can put less items inside a page and therefore, it has to access the disc more times  which is the bottle neck of every business application.

In SQL Server, number of index rows in each page can be calculated using this formula:
Index_Rows_Per_Page = 8096 / (Index_Row_Size + 2)

Which and size of each row equals:

Index_Row_Size = Fixed_Key_Size + Variable_Key_Size + Index_Null_Bitmap + 1 (for row header overhead of an index row) + 6 (for the child page ID pointer)

Considering the above, size of index row for int is
int (4 bytes): 4+3+1+6=14
Which means you can put 506 rows inside a page


What are choises

Considering the above, I will start with the worst one!

Never use (n)char or (n)varchar!

If you are using (n)char or (n)varchar for your key, I am almost sure that you are violating of all rules above, since no one will store a string key generated by his system :)
You will also need to worry about upper and lower case and the size of the key is also obviously big!

For a varchar(50) you have: (2+1+50)+3+6+1 = 63
Which means you have only 124 keys inside a page which is awful

Even for a varchar(20) you have: (2+1+20)+3+6+1 = 33
Which means you have only 231 keys inside a page which is still awful

And also, when you are creating your key, you will be vulnerable to concurrent requests. Like when 2 threads ask for a new key inside your application 


Less than 1% of the times use GUID

GUIDs are good data structure that can help you make sure that your key is unique. But at the same time, they are very big and they cannot be stored as a cluster index because there is no order in generating them.
The good fact about them is that they are easy to move because there would be no conflict. Also, some times, you have to create the key inside a code, then it is of course better to use a GUID to reduce the chance of generating the same key, but I would say, try avoiding them as you can.
For a GUID you have: (2+1+16)+3+6+1 = 29

Which means you have only 261 keys inside a page which is still bad


99% percent of the times use int with identity 

Int with the option of identity will help you to keep everything simple. It is very small and will take not much space. Use it with identity so you can make sure that the code will not generate a duplicate because of concurrent process. And it can also contain more than 2 billion different keys which is enough for most business systems.

Don't use small int or tiny int
Tiny int and small int are too small but their size doesn't have that much effect in comparison with integer.

For smallint (2 bytes): 2+3+1+6=12
which means you can put 578 rows inside a page which is 14% improvement but it can contain only 32,000 different values which is not that much

For tiny int(1 bytes): 1+3+1+6=11
which means you can put 622 rows inside a page which seems to be 7% improvement but since it can contain only 256 values :| you cannot use the other 366! :) 

Less than 1% of the times use Bigint
Ok! In very special projects, you might have a table that can contain more than 2 Billion rows! Like you are working for Amazon :) Then use bigint which I don't think happens for most developers in their professional life time :)

References:
Please find the formula for calculating size of rows, etc here in MSDN
And for size of variables for to this page also in MSDN

1 comment: