Sunday, February 12, 2012

clustered index on IDENTITY column

Hi, what opinion do you people have with clustered index on IDENTITY column?
Is there more advantage or divantage?
Wouldn't it causes all insertion to be added to the same page. Is this 'hot
page' a big issue? Can it helps to prevent fragmentation? How is it so?
Secondly, is adding IDENTITY column itself recommended? Is it because by
plainly using the logical unique (eg. Passport Number that is assumed to be
unique) would be slower, maybe because it is not integer; or maybe composite
primary key? Is a IDENTITY primary key useful because of its use as a foreig
n
in related table?
thanks
EugeneHi:
Using a clustered index on an identity column is recommended. It helps
because all the values of a clustered index based on identity values are
unique.
The issue is a bit more complex, however, it wouldn't cause more insertion
problems or hot pages than using a not so distinct primary key column.
Perhaps, if you had a specific problem you are workign on, the context for
the question will be clearer.
Thanks,
Webmaster
http://www.kdkeys.net
"Eugene" <Eugene@.discussions.microsoft.com> wrote in message
news:E684212D-5AD0-4F00-8549-780DE4225D86@.microsoft.com...
> Hi, what opinion do you people have with clustered index on IDENTITY
> column?
> Is there more advantage or divantage?
> Wouldn't it causes all insertion to be added to the same page. Is this
> 'hot
> page' a big issue? Can it helps to prevent fragmentation? How is it so?
> Secondly, is adding IDENTITY column itself recommended? Is it because by
> plainly using the logical unique (eg. Passport Number that is assumed to
> be
> unique) would be slower, maybe because it is not integer; or maybe
> composite
> primary key? Is a IDENTITY primary key useful because of its use as a
> foreign
> in related table?
> thanks
> Eugene|||Using a clustered index on an IDENTITY column is usually best, provided it
is the primary key for the table. Here are a couple reasons:
(1) identity columns are ususally smaller, so joins are faster. In addition
the clustered index key is used as the row locator in every nonclustered
index, so a smaller clustered index key means smaller nonclustered indexes.
Smaller is always faster, because fewer disk reads are required to access an
index.
(2) identity columns usually increase, therefore new rows are generally
added at the end of the clustered index, thus avoiding page splits and all
of the performance issues associated with them.
In my opinion, it is always better to have a primary key that is guaranteed
to be stable, as opposed to a natural or composite key that can change.
"Eugene" <Eugene@.discussions.microsoft.com> wrote in message
news:E684212D-5AD0-4F00-8549-780DE4225D86@.microsoft.com...
> Hi, what opinion do you people have with clustered index on IDENTITY
column?
> Is there more advantage or divantage?
> Wouldn't it causes all insertion to be added to the same page. Is this
'hot
> page' a big issue? Can it helps to prevent fragmentation? How is it so?
> Secondly, is adding IDENTITY column itself recommended? Is it because by
> plainly using the logical unique (eg. Passport Number that is assumed to
be
> unique) would be slower, maybe because it is not integer; or maybe
composite
> primary key? Is a IDENTITY primary key useful because of its use as a
foreign
> in related table?
> thanks
> Eugene|||Hi Eugene,
A few issue here
Clustered Index are the data itself. So the leaf pages of the index are the
data pages. Having an identity make no difference to this what so every.
All it means in the identity column sequence is the same as the logical
layout of the leaf pages. Can this stop or prevent fragmentation yeah
perhaps but if say your Passport Number increased in value and didn't change
(as primary key) this would do the same thing. The issue really isn't
identity or not it is the candidate for PK shouldn't change over time. If
it is likely to then it should not be PK. Remember that a leaf page is only
8K in size the data in that page for is a little bit smaller. So the number
of rows that can fit in a page depends on the row size and the intital fill
factor.
There big debate between the issue of int ot big int over other data types
for PK (in terms of speed and index size ) I would say that if you are
looking at this as a problem then your scope for issues are too small. Disk
space is too cheap to worry about in most cases and in most cases the size
of a PK or index isn't then performance problem. More importantly is making
sure you have covering indexes or at least indexes on the tables that relate
to the grouping and where statements in your views and sql statements.
kind regards
Greg O
"Eugene" <Eugene@.discussions.microsoft.com> wrote in message
news:E684212D-5AD0-4F00-8549-780DE4225D86@.microsoft.com...
> Hi, what opinion do you people have with clustered index on IDENTITY
> column?
> Is there more advantage or divantage?
> Wouldn't it causes all insertion to be added to the same page. Is this
> 'hot
> page' a big issue? Can it helps to prevent fragmentation? How is it so?
> Secondly, is adding IDENTITY column itself recommended? Is it because by
> plainly using the logical unique (eg. Passport Number that is assumed to
> be
> unique) would be slower, maybe because it is not integer; or maybe
> composite
> primary key? Is a IDENTITY primary key useful because of its use as a
> foreign
> in related table?
> thanks
> Eugene|||Hi All, thanks for the kind reply.
I am not looking at a particular project now, I want to know the
technical/academic knowledge/experience on this consideration.
Brian, you mentioned, adding to the end of the clustered index prevent page
split, how can it be? Isn't that when the page becomes full, the page might
still split if the b-tree node cannot accomodate further? Or are you
comparing between insertion in the middle of the page? Is the difference in
page split potential very big?
Greg, if the passport number is same throughtout the db lifespan, and
incrementing, then it would be in the same sequence as IDENTITY. HOw bout if
it is always same, but not incrementing (meaning it might be added in the
middle)? What is the impact compare to sequential insertion? I think
something like my question to Brian.
Nope, I am not looking at int and bigint comparison. I agree that disk cost
is cheap, but I have reservation that since bigint is larger, it may result
in less rows being packed in a single page, thus more pages, page split :P
especially in the case where each row size is very small (the extra four
bytes would be insignicant if the row size is large, right?) I have a
question here on the int and bigint cpu performance. Is it because our
current computer is 32bit, so it performs better with four bytes int? Then,
next time when most of the computing systems move to 64bit machine, would
bigint be a better choice?
thanks, thanks a great lot
Eugene|||On Sun, 7 Aug 2005 01:29:01 -0700, Eugene wrote:

>Hi All, thanks for the kind reply.
>I am not looking at a particular project now, I want to know the
>technical/academic knowledge/experience on this consideration.
>Brian, you mentioned, adding to the end of the clustered index prevent page
>split, how can it be? Isn't that when the page becomes full, the page might
>still split if the b-tree node cannot accomodate further?
Hi Eugene,
With an insertion at the end of the page, there won't be a page split.
If the page is full, a new page is simply opened, and the new row is
inserted as the first row on the new page. The new page will of course
also have to be inserted in the next higher level of the B-tree index,
but with the insertion at the end of the table, that new index entry
will also be at the end of the page.

> Or are you
>comparing between insertion in the middle of the page? Is the difference in
>page split potential very big?
That is where the difference occurs. If a row needs to be inserted in
the middle of a page that is already full, half of the rows need to be
moved to a new page to make place. That takes more time than simply
opening a new page for the new row.
Run the following example to see the difference between clustering on an
increasing value or clustering on a random value (be sure to run it a
few times, to exclude the influence of other processes running on your
computer).
CREATE TABLE Test (Col1 int NOT NULL PRIMARY KEY NONCLUSTERED IDENTITY,
Col2 int NOT NULL UNIQUE CLUSTERED,
OtherData char(40) NOT NULL)
go
-- Use same seed to start with
SELECT RAND(123)
go
DECLARE @.Start datetime, @.End datetime, @.Done char(1)
SET @.Done = 'N'
SET @.Start = CURRENT_TIMESTAMP
WHILE @.Done = 'N'
BEGIN
INSERT INTO Test (Col2, OtherData)
SELECT RAND() * 2000000000, ''
IF SCOPE_IDENTITY() >= 10000
SET @.Done = 'Y'
END
SET @.End = CURRENT_TIMESTAMP
SELECT @.Start AS Started, @.End AS Finished, DATEDIFF(ms, @.Start, @.End)
AS Elapsed
go
sp_spaceused 'Test', 'TRUE'
go
DROP TABLE Test
go
CREATE TABLE Test (Col1 int NOT NULL PRIMARY KEY CLUSTERED IDENTITY,
Col2 int NOT NULL UNIQUE NONCLUSTERED,
OtherData char(2) NOT NULL)
go
-- Use same seed to start with
SELECT RAND(123)
go
DECLARE @.Start datetime, @.End datetime, @.Done char(1)
SET @.Done = 'N'
SET @.Start = CURRENT_TIMESTAMP
WHILE @.Done = 'N'
BEGIN
INSERT INTO Test (Col2, OtherData)
SELECT RAND() * 2000000000, ''
IF SCOPE_IDENTITY() >= 10000
SET @.Done = 'Y'
END
SET @.End = CURRENT_TIMESTAMP
SELECT @.Start AS Started, @.End AS Finished, DATEDIFF(ms, @.Start, @.End)
AS Elapsed
go
sp_spaceused 'Test', 'TRUE'
go
DROP TABLE Test
go

>Nope, I am not looking at int and bigint comparison. I agree that disk cost
>is cheap, but I have reservation that since bigint is larger, it may result
>in less rows being packed in a single page, thus more pages, page split :P
>especially in the case where each row size is very small (the extra four
>bytes would be insignicant if the row size is large, right?)
Yes, on big rows, the 4 extra bytes won't make much difference (unless
they are just the few bytes that make the difference between three rows
per page or two rows per page, of course).
But don't forget the indexes. The clustered index key is included in
every nonclustered index as well. Indexes are usually small, so the 4
extra bytes do make a difference there.
If you change the code above to use bigint instead of int, you'll see
that the time to insert all rows goes up a bit. But the real difference
will be in retrieving data - especially on queries that might use an
index scan on the nonclustered index.

> I have a
>question here on the int and bigint cpu performance. Is it because our
>current computer is 32bit, so it performs better with four bytes int? Then,
>next time when most of the computing systems move to 64bit machine, would
>bigint be a better choice?
I doubt if that is a factor of importance. It is my experience that
performance is governed by physical I/O first, logical I/O second. The
amount of work the processor has to do is usually irrelevant - the CPU
will probably still spend most of it's time waiting for the next page to
be read from disk.
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||Eugene,
It doesn't prevent page splits, it minimizes them. In a clustered index,
the rows of the table live in the leaf pages, which are all at the same
level in the tree. Rows are added to the leaf pages at the end, in order,
so insertions don't cause page splits for leaf pages. With an integer key,
each index page can store 539 index rows, so splits of index pages are also
minimized. If only inserts occur, then I believe that page splits do not
occur at all, because the tree height is the same and optimal regardless of
whether the rightmost node at each level has 1 key (except the root which
has 2 keys, of course) and all nodes to the left are full, or if each node
at each level has the same number of keys. Index pages are therefore
inserted at the end of each level and a new root page is created when the
index becomes full. I'm pretty sure that's what happens, but it's possible
that the rightmost node at each level splits when inserting into a full
index. Even so, the number of splits is still minimized, because inserts do
not occur in the middle of the structure.
"Eugene" <Eugene@.discussions.microsoft.com> wrote in message
news:1B76D478-166D-4A60-AA49-0A38988AD97E@.microsoft.com...
> Hi All, thanks for the kind reply.
> I am not looking at a particular project now, I want to know the
> technical/academic knowledge/experience on this consideration.
> Brian, you mentioned, adding to the end of the clustered index prevent
page
> split, how can it be? Isn't that when the page becomes full, the page
might
> still split if the b-tree node cannot accomodate further? Or are you
> comparing between insertion in the middle of the page? Is the difference
in
> page split potential very big?
> Greg, if the passport number is same throughtout the db lifespan, and
> incrementing, then it would be in the same sequence as IDENTITY. HOw bout
if
> it is always same, but not incrementing (meaning it might be added in the
> middle)? What is the impact compare to sequential insertion? I think
> something like my question to Brian.
> Nope, I am not looking at int and bigint comparison. I agree that disk
cost
> is cheap, but I have reservation that since bigint is larger, it may
result
> in less rows being packed in a single page, thus more pages, page split :P
> especially in the case where each row size is very small (the extra four
> bytes would be insignicant if the row size is large, right?) I have a
> question here on the int and bigint cpu performance. Is it because our
> current computer is 32bit, so it performs better with four bytes int?
Then,
> next time when most of the computing systems move to 64bit machine, would
> bigint be a better choice?
> thanks, thanks a great lot
> Eugene|||A hotspot at the end of a table does exist, but the impact of this is
generally minimal, especially if your RAID controllers can cache writes as
well as reads. (RAID controllers with a battery can implement a write-back
cache as opposed to a write-through cache, which means that the data to be
written remain in RAM and are only periodically flushed out to disk.) I'm
not really convinced that a "hotspot" affects overall performance anyway,
because having writes spread throughout a table can require the head to move
more often (disk ss), which has a much more detrimental impact on
performance.
"Brian Selzer" <brian@.selzer-software.com> wrote in message
news:O4##Ya1mFHA.2444@.tk2msftngp13.phx.gbl...
> Eugene,
> It doesn't prevent page splits, it minimizes them. In a clustered index,
> the rows of the table live in the leaf pages, which are all at the same
> level in the tree. Rows are added to the leaf pages at the end, in order,
> so insertions don't cause page splits for leaf pages. With an integer
key,
> each index page can store 539 index rows, so splits of index pages are
also
> minimized. If only inserts occur, then I believe that page splits do not
> occur at all, because the tree height is the same and optimal regardless
of
> whether the rightmost node at each level has 1 key (except the root which
> has 2 keys, of course) and all nodes to the left are full, or if each node
> at each level has the same number of keys. Index pages are therefore
> inserted at the end of each level and a new root page is created when the
> index becomes full. I'm pretty sure that's what happens, but it's
possible
> that the rightmost node at each level splits when inserting into a full
> index. Even so, the number of splits is still minimized, because inserts
do
> not occur in the middle of the structure.
>
> "Eugene" <Eugene@.discussions.microsoft.com> wrote in message
> news:1B76D478-166D-4A60-AA49-0A38988AD97E@.microsoft.com...
> page
> might
> in
bout
> if
the
> cost
> result
:P
> Then,
would
>|||On Sat, 6 Aug 2005 11:59:07 -0700, Eugene
<Eugene@.discussions.microsoft.com> wrote:
>Hi, what opinion do you people have with clustered index on IDENTITY column
?
>Is there more advantage or divantage?
It's very fashionable, whatever the theoretical or practical
arguments.

>Wouldn't it causes all insertion to be added to the same page. Is this 'hot
>page' a big issue? Can it helps to prevent fragmentation? How is it so?
Other posters have addressed this.

>Secondly, is adding IDENTITY column itself recommended? Is it because by
>plainly using the logical unique (eg. Passport Number that is assumed to be
>unique) would be slower, maybe because it is not integer; or maybe composit
e
>primary key? Is a IDENTITY primary key useful because of its use as a forei
gn
>in related table?
Just a note that if you have a typically modest application of a few
dozen users on a typically modest database of say under 1gb running on
a typically powerful server of 2 processors, 3ghz, 4gb RAM, RAID5,
mirrored logs, ... then you're about 1000% overpowered and few of
these concerns will ever become visible, assuming your data model and
application are put together at all competently.
Heck, half the apps I see anymore turn out to me missing PKs or other
major indices, accidentally dropped (or duplicated!) during
maintenance over the months or years, and nobody notices for months or
years except for a few grumbles, and then the first thing they usually
do is upgrade the hardware, not audit the system!
J.|||I just googled "a little Dr. Codd" and found nothing... same with BOL.
har har har
"--CELKO--" <jcelko212@.earthlink.net> wrote in message
news:1123458943.236091.255690@.g43g2000cwa.googlegroups.com...
> plainly using the logical unique (eg. Passport Number that is assumed
> to be
> unique) would be slower, maybe because it is not integer; or maybe
> composite
> primary key? <<
> Do not use IDENTITY or other proprietary exposed physical locators in
> an RDBMS. Read a little Dr. Codd for the details of what surrogate is.
> Also look up validation and verification as properties for all data
> elements. Then look up the design of navigational databases which
> newbies mimic with IDENTITY. In fact, making IDENTITY the key would
> allow duplicate passport numbers to go undetected
> 2) Most of the time you do not wish to waste your one clustered index
> on a unique column. The Sybase/SQL Server implementation is based on a
> file system. It uses physically contigous storage and tree indexes.
> This means that you get one and only one clustered index per table. If
> you do a lot of GROUP BY's on one set of columns (totals by city, state
> or whatever), then a table scan on that sorted ordering will be much
> faster than random access. This is where you gain performance.
> 3) The "hot page" is not as big an issue as it has been in earlier
> releases. Data quality and integrity is a much, much bigger issue. If
> you have an (n) column nautral key in your data model, you **must**
> enforce it and IDENTITY will not change this fact of life.
> We live in an age of 64-bit hardware, parallel processors and all that
> jazz. Tuning an RDBMS at the byte-level is a waste of time and
> resources. The right answer is to remove redundancy instead. Do the
> math -- how much time do you need to read a byte off of a hard drive?
> What is the speed of main storage? So even if I need 100 times more
> processor time to do a join, I am ahead of ther game.
>

No comments:

Post a Comment