Sunday, February 12, 2012

clustered index on a identity field

Hi All,
I have heard that if you create a clustered index on a identity field and
its a PK, it reduces the page splits...
But how?
Thanks,
Pradif you mean that the number of pages required to define a table will
grow more slowly, that is correct. but there are many reasons not to do
this. in my experience, clustering any primary key is a pretty bad
idea, especially an identity column. the reasons are twofold:
1) clustering ensures that all like values will be ordered together. if
you apply this to a value that is gauranteed to be unique, all you are
accomplishing is gauranteeing the physical order of the rows on disk,
which would serve no real purpose (i.e. none of your apps should care
what order the rows are on disk).
in my opinion, you should reserve a clustered index for a column that
has non-unique values, and is often used as a search criteria by value.
that way, all of the rows with the same value are physically grouped
together, greatly improving the performance of the disk read to collect
all the data.
2) creating a clustered index on an identity column is very likely to
create insert hotspots, because it is gauranteed that almost every new
row will be written to the same datapage. if you have a large number of
inserts occurring, this can create lock issues, if the current
insertion page is locked for any reason from insertion. it's far
healthier if your database can insert rows into a nicely dispersed set
of data pages, to prevent issues of lock escalation impeding inserts.
anyway, those are just my two cents. i'm sure other people have other
opinions.|||See if this helps:
Tips on Optimizing SQL Server Clustered Indexes
http://www.sql-server-performance.c...red_indexes.asp
AMB
"Pradeep Kutty" wrote:

> Hi All,
> I have heard that if you create a clustered index on a identity field and
> its a PK, it reduces the page splits...
> But how?
> Thanks,
> Prad
>
>|||Can you fix your clock?
"Pradeep Kutty" <pradeepk@.healthasyst.com> wrote in message
news:eCh33%23HtFHA.616@.TK2MSFTNGP11.phx.gbl...
> Hi All,
> I have heard that if you create a clustered index on a identity field and
> its a PK, it reduces the page splits...
> But how?
> Thanks,
> Prad
>|||I disagree.
The physical location on disk of index rows in a clustered index is
undefined. There is no guarantee that the index pages of a clustered index
reside in any specified order on the physical hardware. What is defined is
that the rows within a specific leaf page are in order, and each clustered
index leaf page contains next and previous pointers to locate the next and
previous leaf pages. One of the main benefits of using a clustered index is
that once the first row in an ordered scan is found, SQL Server doesn't need
to walk the b-tree to find the next row. It can use the next pointer to
continue the ordered scan. This can improve query performance considerably.
You should avoid using a non-unique key in a clustered index. If a table
has a clustered index, then every nonclustered index uses the clustered
index key to locate the row in the table. If the key of a clustered index
is not unique, then SQL Server adds a hidden 4-byte uniquifier to the key to
make it unique. The entire key along with the uniqifier is then stored in
every nonclustered index row.
In addition, there is a significant degradation of join performance. If you
put a nonclustered index on the primary key, then every join with related
tables requires an additional lookup to satisfy the query. For each row,
the index for the primary key constraint is consulted yielding the clustered
index key. That key is then used to lookup the row in the clustered index.
Therefore, if you have related tables, then the clustered index should live
on the primary key.
Your assertion about insert hotspots fails to mention the additional index
maintenance (page splits and updates) that is required if your clustered
index isn't on the IDENTITY column. If the clustered index is on an
IDENTITY column, then there will never be any page splits, because inserts
become appends, both in the b-tree index nodes and in the leaf nodes. You
can also use a 100% fillfactor, because there will never be any inserts,
only appends. The tree grows up: at each level of the b-tree, a new page is
appended if it is needed, and if a new page is needed at the root level,
then a new root page is also added. That's why inserts into a table with
only a clustered index on the IDENTITY column perform almost as well as
inserts into a table without any indexes at all. Deletes and updates in
this case require minimal index maintenance, because SQL Server doesn't join
index pages, it only splits them when an insert exceeds the fillfactor
threshold. Deletes just leave holes, so if you expect a lot of deletes,
then you should schedule a periodic DBCC INDEXDEFRAG or DBCC DBREINDEX.
SQL Server uses lazy spooling, which means that changes to the database are
cached and then periodically flushed to disk (checkpointed). (SQL Server
only ensures that writes to the transaction log are flushed to disk before
returning from a commit.) This means that more often than not, the last
page is still resident in memory if there are a large number of inserts,
thereby turning the insert hotspot into a performance improvement.
Finally, unless otherwise forced, the exclusive locks applied by an insert
should not escalate unless the page is full and all of the rows in the page
were just inserted by the current transaction. Therefore, there is little
if any lock contention due to inserts.
"jason" <iaesun@.yahoo.com> wrote in message
news:1126190553.831020.325290@.f14g2000cwb.googlegroups.com...
> if you mean that the number of pages required to define a table will
> grow more slowly, that is correct. but there are many reasons not to do
> this. in my experience, clustering any primary key is a pretty bad
> idea, especially an identity column. the reasons are twofold:
> 1) clustering ensures that all like values will be ordered together. if
> you apply this to a value that is gauranteed to be unique, all you are
> accomplishing is gauranteeing the physical order of the rows on disk,
> which would serve no real purpose (i.e. none of your apps should care
> what order the rows are on disk).
> in my opinion, you should reserve a clustered index for a column that
> has non-unique values, and is often used as a search criteria by value.
> that way, all of the rows with the same value are physically grouped
> together, greatly improving the performance of the disk read to collect
> all the data.
> 2) creating a clustered index on an identity column is very likely to
> create insert hotspots, because it is gauranteed that almost every new
> row will be written to the same datapage. if you have a large number of
> inserts occurring, this can create lock issues, if the current
> insertion page is locked for any reason from insertion. it's far
> healthier if your database can insert rows into a nicely dispersed set
> of data pages, to prevent issues of lock escalation impeding inserts.
> anyway, those are just my two cents. i'm sure other people have other
> opinions.
>|||jason wrote:
> if you mean that the number of pages required to define a table will
> grow more slowly, that is correct. but there are many reasons not to
> do this. in my experience, clustering any primary key is a pretty bad
> idea, especially an identity column. the reasons are twofold:
> 1) clustering ensures that all like values will be ordered together.
> if you apply this to a value that is gauranteed to be unique, all you
> are accomplishing is gauranteeing the physical order of the rows on
> disk, which would serve no real purpose (i.e. none of your apps
> should care what order the rows are on disk).
> in my opinion, you should reserve a clustered index for a column that
> has non-unique values, and is often used as a search criteria by
> value. that way, all of the rows with the same value are physically
> grouped together, greatly improving the performance of the disk read
> to collect all the data.
> 2) creating a clustered index on an identity column is very likely to
> create insert hotspots, because it is gauranteed that almost every new
> row will be written to the same datapage. if you have a large number
> of inserts occurring, this can create lock issues, if the current
> insertion page is locked for any reason from insertion. it's far
> healthier if your database can insert rows into a nicely dispersed set
> of data pages, to prevent issues of lock escalation impeding inserts.
> anyway, those are just my two cents. i'm sure other people have other
> opinions.
I would like to comment on some of what you wrote. Clustering on a
non-unique columns has a few implications.
- SQL Server cannot handle non-unique clustered keys internally and will
automatically add a UNIQUEIDENTIFIER to the rows when necessary. This
increases the key length by 4-bytes for each non-unique value. Using an
IDENTITY value, which is by definition unique, does not have this added
overhead.
- The clustered index key is the pointer into the actual data.
Therefore, the clustered index key is a part of each non-clustered index
key. This can make the non-clustered indexes grow considerably. Given
that using a non-unique set of columns is likely to include at least one
non-integer based column, you are looking at large keys all around. This
will slow down inserting and updating because of the need to hit each
non-clustered index. Clustered IDENTITY columns add the least amount of
overhead to non-clustered indexes.
- A non-unique clustered index key is likely to be more susceptible to
updating. Updating a clustered index key means physically moving the row
to a new location and this will likely cause page splitting, which is a
slow process that causes internal fragmentation. This will slow down
inserting and updating because of the need to locate/relocate the row
and the need to update each non-clustered index key. IDENTITY columns
cannot be updated and are immune to this problem.
- Insert hotspots aren't really a problem any longer like they were in
the old day of page locking. SQL Server uses row locking for inserts.
Keeping the disk heads from moving around can actually increase
performance because of the slow nature of the hard drive. Using an
IDENTITY keeps the disk heads in the same location and prevents page
splitting as new rows are added.
- A non-unique clustered index may require additional maintenance to
avoid page splitting. That is, you may have to implement index rebuilds
using a FILLFACTOR that leaves a certain amount of space available for
inserts on each page. The maintenance is ongoing and is best done during
off-hours; assuming your business has off-hours. This increases the
table size and number of pages and will hurt overall performance.
- Page splitting causes external fragmentation. That is, even if you use
a clustered index on a set of non-unique columns, there's no reason to
assume that pages that contain the same key will be contiguous. So a
read of two pages of a specific clustered index key can cause the disk
heads to move from one end of the table to another. This can be avoided
with some index maintenance. Clustered IDENTITY indexes do not have this
problem.
Having said that, there are reasons to use non-unique clustered indexes.
As you stated, they can, in fact, help the fetching of a number of rows
of a specific key. If your database is used primarily for this type of
operation, a non-unique clustered index may be the way to go. I've
certainly used them this way on many occasions.
David Gugick
Quest Software
www.imceda.com
www.quest.com|||interesting, you clearly have some well informed opinions. they just
happen to disagree with some of the sources i've used to come to my
understanding :) these things happen. i admit, my understanding of
clustered indexing, and indexing in general, is incomplete, but here
are some points that i thought i'd share, to see if perhaps there's an
absolute answer:
* from every text at my disposal, the primary advantage to a clustered
index is the fact that the data is guaranteed to be in the same order
as the index.
* the leaf nodes of a clustered index don't actually point to the next
and previous rows, because they don't have to. the next and previous
values are implied by the next and previous positions on disk.
* nonclustered indexes do point to next and previous data rows, because
the index order may not match the physical order, so a pointer is
required
* the greatest advantages to using clustered indexes, therefore, are
cases where the physical order of the data can increase your
performance. chief among these are (1) columns commonly used in group
by and order by clauses (2) columns on the many side of a one-to-many
relationship. because there is a tremendous increase to the performance
of the reads for these operations (3) columns whose values are
frequently evaluated between a range
* by the same reasoning, clustered indexes are wasted on unique value
columns, as they serve no tangible purpose (except as you mention,
perhaps ameliorating problems using identity columns without a
clustered index)
* using a clustered index on a unique value is wasted. the only reason
is the default for new tables created in enterprise manager is because
it grants advantages to some queries above a table that has zero
clustered indexes (better one than none)
here are my sources for this information:
microsoft sql server 2000 bible, paul nielsen, pp768-771
sql server 2000 for experienced dba's, brian knight, pp252-255
http://msdn.microsoft.com/library/d...ql7perftune.asp
http://msdn.microsoft.com/library/d...br />
apr3.asp
these sources cite specific examples of where and why you would want to
use clustered indexes, and they are all on data that is grouped by,
sorted by, or evaluated in a range. especially date fields (which are
rarely unique, yes?)
one caveat that is mentioned is that you shouldn't apply a clustered
index to a column that is updated often, for the reason you stated,
that the clustered index is appended to every nonclustered index in the
table. but you cite this as a performance hindrance? my undrestanding
is that the presence of the clustered index is actually increasing
performance (if at a small cost of space) by skipping the pointer to
data page step of a lookup.
what are your sources, that we might compare / contrast?
jason|||"jason" <iaesun@.yahoo.com> wrote in message
news:1126206276.049357.153630@.g43g2000cwa.googlegroups.com...
> interesting, you clearly have some well informed opinions. they just
> happen to disagree with some of the sources i've used to come to my
> understanding :) these things happen. i admit, my understanding of
> clustered indexing, and indexing in general, is incomplete, but here
> are some points that i thought i'd share, to see if perhaps there's an
> absolute answer:
> * from every text at my disposal, the primary advantage to a clustered
> index is the fact that the data is guaranteed to be in the same order
> as the index.
>
And the clustered index requires one fewer IO per s since the leaf is the
data page; plus s performance of all non-clustered indexes requires 1-4
extra IO's.

> * the leaf nodes of a clustered index don't actually point to the next
> and previous rows, because they don't have to. the next and previous
> values are implied by the next and previous positions on disk.
>
No. The leaf nodes are data pages. All data pages in a table are arranged
in a doubly linked list: each page has a next page and previous page
pointer. A table scan traverses the table by following the next/previous
pointers from page to page. For a table with a clustered index the order of
rows on a page and the order of pages determined by the next/previous
pointers follow the order of the index. So after you traverse the index
down to the leaf level once you can then follow the next/previous pointers
to follow the index order. So you don't have to go back to the index pages
to satisfy a range query: just s to the first value and the scan to the
end.
A common misconception is that data pages in a clustered index are
guaranteed to be physically ordered. They are not. Only the logical order
of the doubly-linked list of data pages is guaranteed. Difference between
the logical ordering of the pages and the physical order can arise over time
due to index fragmentation.

> * nonclustered indexes do point to next and previous data rows, because
> the index order may not match the physical order, so a pointer is
> required
> * the greatest advantages to using clustered indexes, therefore, are
> cases where the physical order of the data can increase your
> performance. chief among these are (1) columns commonly used in group
> by and order by clauses (2) columns on the many side of a one-to-many
> relationship. because there is a tremendous increase to the performance
> of the reads for these operations (3) columns whose values are
> frequently evaluated between a range
> * by the same reasoning, clustered indexes are wasted on unique value
> columns, as they serve no tangible purpose (except as you mention,
> perhaps ameliorating problems using identity columns without a
> clustered index)
A row in a table with a clustered index can only be accessed through the
clustered index. There's simply no other way to find the row. If you have
a non-clustered unique index, and a non-unique clustered index then
accessing the table through the unique index will be more expensive since
you must traverse the unique index only to get, at the leaf level, a
non-unique clustered index key (plus 4-byte uniqifier), and then you must
traverse the non-unique index down to the leaf level (data page) on which
the row is located. There are situation where this is a net improvement in
performance, but for applications which do primarily single-row
insert/update/delete clustering the primary key optimizes the most important
transaction.

> * using a clustered index on a unique value is wasted. the only reason
> is the default for new tables created in enterprise manager is because
> it grants advantages to some queries above a table that has zero
> clustered indexes (better one than none)
This is not generally the case since choosing the clustered index always
optimizes some access paths and degrades others. You must analyze your
workload and measure the impace of clustering a non-unique tuple.
There are no valid sources for recieved wisdom or rules of thumb for this.
Only knoledge of how SQL Server is implemented and testing.
David|||Perhaps you should bin those books and open up BOL.
SQL Server Architecture/Database Architecture/Physical Database
Architecture/Table and Index Architecture
This section contains information about the physical structure of clustered
indexes, nonclustered indexes, and heaps (tables without clustered indexes).
Creating and Maintaining Databases/Indexes
This section contains information about when to use clustered indexes, why
you should keep the key small, etc.
"jason" <iaesun@.yahoo.com> wrote in message
news:1126206276.049357.153630@.g43g2000cwa.googlegroups.com...
> interesting, you clearly have some well informed opinions. they just
> happen to disagree with some of the sources i've used to come to my
> understanding :) these things happen. i admit, my understanding of
> clustered indexing, and indexing in general, is incomplete, but here
> are some points that i thought i'd share, to see if perhaps there's an
> absolute answer:
> * from every text at my disposal, the primary advantage to a clustered
> index is the fact that the data is guaranteed to be in the same order
> as the index.
> * the leaf nodes of a clustered index don't actually point to the next
> and previous rows, because they don't have to. the next and previous
> values are implied by the next and previous positions on disk.
> * nonclustered indexes do point to next and previous data rows, because
> the index order may not match the physical order, so a pointer is
> required
> * the greatest advantages to using clustered indexes, therefore, are
> cases where the physical order of the data can increase your
> performance. chief among these are (1) columns commonly used in group
> by and order by clauses (2) columns on the many side of a one-to-many
> relationship. because there is a tremendous increase to the performance
> of the reads for these operations (3) columns whose values are
> frequently evaluated between a range
> * by the same reasoning, clustered indexes are wasted on unique value
> columns, as they serve no tangible purpose (except as you mention,
> perhaps ameliorating problems using identity columns without a
> clustered index)
> * using a clustered index on a unique value is wasted. the only reason
> is the default for new tables created in enterprise manager is because
> it grants advantages to some queries above a table that has zero
> clustered indexes (better one than none)
> here are my sources for this information:
> microsoft sql server 2000 bible, paul nielsen, pp768-771
> sql server 2000 for experienced dba's, brian knight, pp252-255
>
http://msdn.microsoft.com/library/d...une.as
p
>
http://msdn.microsoft.com/library/d...l/sql7sapr3.asp[
color=darkred]
> these sources cite specific examples of where and why you would want to
> use clustered indexes, and they are all on data that is grouped by,
> sorted by, or evaluated in a range. especially date fields (which are
> rarely unique, yes?)
> one caveat that is mentioned is that you shouldn't apply a clustered
> index to a column that is updated often, for the reason you stated,
> that the clustered index is appended to every nonclustered index in the
> table. but you cite this as a performance hindrance? my undrestanding
> is that the presence of the clustered index is actually increasing
> performance (if at a small cost of space) by skipping the pointer to
> data page step of a lookup.
> what are your sources, that we might compare / contrast?
> jason
>[/color]|||> There are no valid sources for recieved wisdom or rules of thumb for this.
> Only knoledge of how SQL Server is implemented and testing.
I'm getting that impression, which is why discussions like these are
extremely useful. They can supplement experiences you haven't had yet,
and as you say, give a more complete picture of how SQL Server is
implemented. So thank you very much!
Jason

No comments:

Post a Comment