Wednesday, March 7, 2012

COALESCE to speed up queries?

Another general question. I was tweaking some other queries and I ran
across the following while trying to speed up an inner join. I re-wrote the
query as a sub-query using both IN and EXISTS, trying to force the query to
use the Clustered Primary Key on the second table (Case_Details). When I
run the following query:
SELECT cd.OffenderID,
cd.CaseID
FROM dbo.Case_Details cd
WHERE EXISTS (SELECT 'X'
FROM Offender_Details od
WHERE od.LName = 'Smith'
AND od.FName = 'James'
AND od.DOB_Year = 1960
AND od.OffenderID = cd.OffenderID)
I get this plan:
|--Parallelism(Gather Streams)
|--Hash Match(Right Semi Join,
HASH:([od].[OffenderID])=([cd].[OffenderID]),
RESIDUAL:([od].[OffenderID]=[cd].[OffenderID]))
|--Bitmap(HASH:([od].[OffenderID]), DEFINE:([Bitmap1003]))
| |--Parallelism(Repartition Streams, PARTITION
COLUMNS:([od].[OffenderID]))
| |--Clustered Index
Seek(OBJECT:([OffenderData].[dbo].[Offender_Details].[IX_Offender_Details]
AS [od]), SEEK:([od].[LName]='Smith' AND [od].[FName]='James' AND
[od].[DOB_Year]=1960) ORDERED FORWARD)
|--Parallelism(Repartition Streams, PARTITION
COLUMNS:([cd].[OffenderID]), WHERE:(PROBE([Bitmap1003])=TRUE))
|--Index
Scan(OBJECT:([OffenderData].[dbo].[Case_Details].[s2] AS [cd]))
It seems to be scanning the Case_Details table Non-clustered Index instead
of using the Primary Key which consists of (OffenderID, CaseID) which are a
BIGINT NOT NULL and INT NOT NULL, respectively. The Primary Key on
Case_Details is Clustered. The Offender_Details PK is non-clustered, and
consists of (OffenderID) BIGINT NOT NULL. The Offender_Details PK and
Case_Details PK are related via the OffenderID. Foreign Key constraints are
in place.
This query takes 26,470 ms to complete.
Now all this is to say that when I change the query to the following:
SELECT cd.OffenderID,
cd.CaseID
FROM dbo.Case_Details cd
WHERE EXISTS (SELECT 'X'
FROM Offender_Details od
WHERE od.LName = 'Smith'
AND od.FName = 'James'
AND od.DOB_Year = COALESCE(1960 , 1960)
AND od.OffenderID = cd.OffenderID)
I get this query plan:
|--Nested Loops(Inner Join, OUTER REFERENCES:([od].[OffenderID]) WITH
PREFETCH)
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1015], [Expr1016],
[Expr1017]))
| |--Compute Scalar(DEFINE:([Expr1015]=Convert(If 1 then 1960 else
1960)-1, [Expr1016]=Convert(If 1 then 1960 else 1960)+1, [Expr1017]=If
(Convert(If 1 then 1960 else 1960)-1=NULL) then 0 else 6|If (Convert(If 1
then 1960 else 1960)+1=NULL) then
| | |--Constant Scan
| |--Clustered Index
Seek(OBJECT:([OffenderData].[dbo].[Offender_Details].[IX_Offender_Details]
AS [od]), SEEK:([od].[LName]='Smith' AND [od].[FName]='James' AND
[od].[DOB_Year] > [Expr1015] AND [od].[DOB_Year] < [Expr1016]),
WHERE:(Convert([od]
|--Clustered Index
Seek(OBJECT:([OffenderData].[dbo].[Case_Details].[PK_Case_Details] AS [cd]),
SEEK:([cd].[OffenderID]=[od].[OffenderID]) ORDERED FORWARD)
The COALESCE() function appears to force Case_Details to (properly) use a
Clustered Index Seek instead of a Non-Clustered Index Scan.
This modified query runs in 656 ms.
Any ideas on why a COALESCE(x, x) forces the proper query plan in this
instance?
Thanks.It seems paralellism affects the outcome. Have you tried adding 'maxdop' to
your query and see if it improves it.
Btw, an index scan is not always bad. There are cases where a single scan is
much better than doing thousands of seeks.
--
-oj
"Michael C#" <xyz@.abcdef.com> wrote in message
news:Qud8e.2279$sG3.1410@.fe09.lga...
> Another general question. I was tweaking some other queries and I ran
> across the following while trying to speed up an inner join. I re-wrote
> the query as a sub-query using both IN and EXISTS, trying to force the
> query to use the Clustered Primary Key on the second table (Case_Details).
> When I run the following query:
> SELECT cd.OffenderID,
> cd.CaseID
> FROM dbo.Case_Details cd
> WHERE EXISTS (SELECT 'X'
> FROM Offender_Details od
> WHERE od.LName = 'Smith'
> AND od.FName = 'James'
> AND od.DOB_Year = 1960
> AND od.OffenderID = cd.OffenderID)
> I get this plan:
> |--Parallelism(Gather Streams)
> |--Hash Match(Right Semi Join,
> HASH:([od].[OffenderID])=([cd].[OffenderID]),
> RESIDUAL:([od].[OffenderID]=[cd].[OffenderID]))
> |--Bitmap(HASH:([od].[OffenderID]), DEFINE:([Bitmap1003]))
> | |--Parallelism(Repartition Streams, PARTITION
> COLUMNS:([od].[OffenderID]))
> | |--Clustered Index
> Seek(OBJECT:([OffenderData].[dbo].[Offender_Details].[IX_Offender_Details]
> AS [od]), SEEK:([od].[LName]='Smith' AND [od].[FName]='James' AND
> [od].[DOB_Year]=1960) ORDERED FORWARD)
> |--Parallelism(Repartition Streams, PARTITION
> COLUMNS:([cd].[OffenderID]), WHERE:(PROBE([Bitmap1003])=TRUE))
> |--Index
> Scan(OBJECT:([OffenderData].[dbo].[Case_Details].[s2] AS [cd]))
> It seems to be scanning the Case_Details table Non-clustered Index instead
> of using the Primary Key which consists of (OffenderID, CaseID) which are
> a BIGINT NOT NULL and INT NOT NULL, respectively. The Primary Key on
> Case_Details is Clustered. The Offender_Details PK is non-clustered, and
> consists of (OffenderID) BIGINT NOT NULL. The Offender_Details PK and
> Case_Details PK are related via the OffenderID. Foreign Key constraints
> are in place.
> This query takes 26,470 ms to complete.
> Now all this is to say that when I change the query to the following:
> SELECT cd.OffenderID,
> cd.CaseID
> FROM dbo.Case_Details cd
> WHERE EXISTS (SELECT 'X'
> FROM Offender_Details od
> WHERE od.LName = 'Smith'
> AND od.FName = 'James'
> AND od.DOB_Year = COALESCE(1960 , 1960)
> AND od.OffenderID = cd.OffenderID)
> I get this query plan:
> |--Nested Loops(Inner Join, OUTER REFERENCES:([od].[OffenderID]) WITH
> PREFETCH)
> |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1015],
> [Expr1016], [Expr1017]))
> | |--Compute Scalar(DEFINE:([Expr1015]=Convert(If 1 then 1960
> else 1960)-1, [Expr1016]=Convert(If 1 then 1960 else 1960)+1,
> [Expr1017]=If (Convert(If 1 then 1960 else 1960)-1=NULL) then 0 else 6|If
> (Convert(If 1 then 1960 else 1960)+1=NULL) then
> | | |--Constant Scan
> | |--Clustered Index
> Seek(OBJECT:([OffenderData].[dbo].[Offender_Details].[IX_Offender_Details]
> AS [od]), SEEK:([od].[LName]='Smith' AND [od].[FName]='James' AND
> [od].[DOB_Year] > [Expr1015] AND [od].[DOB_Year] < [Expr1016]),
> WHERE:(Convert([od]
> |--Clustered Index
> Seek(OBJECT:([OffenderData].[dbo].[Case_Details].[PK_Case_Details] AS
> [cd]), SEEK:([cd].[OffenderID]=[od].[OffenderID]) ORDERED FORWARD)
> The COALESCE() function appears to force Case_Details to (properly) use a
> Clustered Index Seek instead of a Non-Clustered Index Scan.
> This modified query runs in 656 ms.
> Any ideas on why a COALESCE(x, x) forces the proper query plan in this
> instance?
> Thanks.
>|||Thanks for the feedback. On the graphical query plan, the index scan
estimates 77 million rows; whereas the Index Seek estimates 226 rows. I
think the Index Seek might be marginally better in this case. The query
plan also shows that parallelism accounts for 4% and the Index Scan accounts
for 91% of the total plan. So I don't know that parallelism is a huge
problem in this case. Since the COALESCE version doesn't use Parallelism, I
don't know that maxdop will help.
Basically, by using COALESCE(x, x) in the example given, I've somehow
dropped the execution time from 26.5 seconds to about 650 ms. If anyone can
explain why this is so, I'd appreciate it. Thanks.
"oj" <nospam_ojngo@.home.com> wrote in message
news:e5NJDTuQFHA.2348@.tk2msftngp13.phx.gbl...
> It seems paralellism affects the outcome. Have you tried adding 'maxdop'
> to your query and see if it improves it.
> Btw, an index scan is not always bad. There are cases where a single scan
> is much better than doing thousands of seeks.
> --
> -oj|||From the information here so far I can only speculate - most probably adding
the COALESCE changed the cardinality estimates of the filter predicate and
that influenced the optimizer to take completely different path.
Parallelism discrepancy may be a byproduct. What are the overall costs of
the two respective query plans?
Lubor
"Michael C#" <xyz@.abcdef.com> wrote in message
news:rOj8e.2075$ZQ1.854@.fe11.lga...
> Thanks for the feedback. On the graphical query plan, the index scan
> estimates 77 million rows; whereas the Index Seek estimates 226 rows. I
> think the Index Seek might be marginally better in this case. The query
> plan also shows that parallelism accounts for 4% and the Index Scan
accounts
> for 91% of the total plan. So I don't know that parallelism is a huge
> problem in this case. Since the COALESCE version doesn't use Parallelism,
I
> don't know that maxdop will help.
> Basically, by using COALESCE(x, x) in the example given, I've somehow
> dropped the execution time from 26.5 seconds to about 650 ms. If anyone
can
> explain why this is so, I'd appreciate it. Thanks.
> "oj" <nospam_ojngo@.home.com> wrote in message
> news:e5NJDTuQFHA.2348@.tk2msftngp13.phx.gbl...
> > It seems paralellism affects the outcome. Have you tried adding 'maxdop'
> > to your query and see if it improves it.
> >
> > Btw, an index scan is not always bad. There are cases where a single
scan
> > is much better than doing thousands of seeks.
> >
> > --
> > -oj
>
>|||Surprisingly, the more COALESCEs I add, the lower the plan cost seems to go.
For the non-COALESCE query, the overall cost is 731.88400. For a single
COALESCE on LName field, it goes down to 265.74600. For A COALESCE on LName
and FName, it goes down to 29.86400.
Thanks.
"Lubor Kollar" <lubork@.online.microsft.com> wrote in message
news:O$s60p2QFHA.2788@.TK2MSFTNGP09.phx.gbl...
> From the information here so far I can only speculate - most probably
> adding
> the COALESCE changed the cardinality estimates of the filter predicate and
> that influenced the optimizer to take completely different path.
> Parallelism discrepancy may be a byproduct. What are the overall costs of
> the two respective query plans?
> Lubor|||Correction, it only works if I use COALESCE twice. If I add a third
COALESCE, the cost starts going back up. Strange.
Thanks.
"Michael C#" <xyz@.abcdef.com> wrote in message
news:FFw8e.1935$V02.1216@.fe08.lga...
> Surprisingly, the more COALESCEs I add, the lower the plan cost seems to
> go. For the non-COALESCE query, the overall cost is 731.88400. For a
> single COALESCE on LName field, it goes down to 265.74600. For A COALESCE
> on LName and FName, it goes down to 29.86400.
> Thanks.
> "Lubor Kollar" <lubork@.online.microsft.com> wrote in message
> news:O$s60p2QFHA.2788@.TK2MSFTNGP09.phx.gbl...
>> From the information here so far I can only speculate - most probably
>> adding
>> the COALESCE changed the cardinality estimates of the filter predicate
>> and
>> that influenced the optimizer to take completely different path.
>> Parallelism discrepancy may be a byproduct. What are the overall costs of
>> the two respective query plans?
>> Lubor
>|||The only explanation for what you see and with limited knowledge of your
schema I think that by adding the COALESCE the initial cost estimate of the
query is higher then without it. This initial cost determines how far will
the optimizer go searching all possible plans. Even if the initial plan
costs with and without the COALESCE may be very close to each other it may
be just enough to "discover" much cheaper plan simply by going a bit further
in the optimization when the COALESCE is used. SQL Server does not show you
the initial costs; the final cost of the query plan is usually lower than
the initial cost. It cannot be higher.|||That makes sense, but it's a little surprising that a WHERE clause
containing [column]=COALESCE('x', 'x') could convince the optimizer to go
further steps than it normally would with [column]='x'; especially since
it's pretty obvious that these are equivalent comparisons. Ah well, it
works and I'm happy :)
Thanks
"Lubor" <Lubork@.online.microsoft.com> wrote in message
news:%23uBfO2WRFHA.3140@.tk2msftngp13.phx.gbl...
> The only explanation for what you see and with limited knowledge of your
> schema I think that by adding the COALESCE the initial cost estimate of
> the
> query is higher then without it. This initial cost determines how far will
> the optimizer go searching all possible plans. Even if the initial plan
> costs with and without the COALESCE may be very close to each other it may
> be just enough to "discover" much cheaper plan simply by going a bit
> further
> in the optimization when the COALESCE is used. SQL Server does not show
> you
> the initial costs; the final cost of the query plan is usually lower than
> the initial cost. It cannot be higher.|||On Wed, 20 Apr 2005 15:10:43 -0400, Michael C# wrote:
>That makes sense, but it's a little surprising that a WHERE clause
>containing [column]=COALESCE('x', 'x') could convince the optimizer to go
>further steps than it normally would with [column]='x'; especially since
>it's pretty obvious that these are equivalent comparisons. Ah well, it
>works and I'm happy :)
Hi Michael,
Here's a thread in .programming that might explain this behaviour. The
start doesn't look similar to this one, but the explanation looks like
it's applicable here.
http://groups-beta.google.com/group/microsoft.public.sqlserver.programming/browse_frm/thread/d0144148fec39a9b/80ab60f02abb8ecd?hl=en#80ab60f02abb8ecd
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||Interesting. Thanks for the pointer. I wonder if auto-parameterization
affects the number of estimated rows/number of rows - that was the major
difference between my queries/execution plans.
Thanks.
"Hugo Kornelis" <hugo@.pe_NO_rFact.in_SPAM_fo> wrote in message
news:s1jd61p4gfp7so2dfkf1qb21mc0dct27u8@.4ax.com...
> On Wed, 20 Apr 2005 15:10:43 -0400, Michael C# wrote:
>>That makes sense, but it's a little surprising that a WHERE clause
>>containing [column]=COALESCE('x', 'x') could convince the optimizer to go
>>further steps than it normally would with [column]='x'; especially since
>>it's pretty obvious that these are equivalent comparisons. Ah well, it
>>works and I'm happy :)
> Hi Michael,
> Here's a thread in .programming that might explain this behaviour. The
> start doesn't look similar to this one, but the explanation looks like
> it's applicable here.
> http://groups-beta.google.com/group/microsoft.public.sqlserver.programming/browse_frm/thread/d0144148fec39a9b/80ab60f02abb8ecd?hl=en#80ab60f02abb8ecd
> Best, Hugo
> --
> (Remove _NO_ and _SPAM_ to get my e-mail address)

No comments:

Post a Comment