How can I get an Index Seek on this PK CLUSTERED index?

I feel like the Clustered Index Scan on [PK_Cash In Reconciliation Fact Inventory] should be a seek and only have to read 11 or so million rows rather than the 2 billion rows that it does with the scan.

Can anyone explain why it is performing a full index scan or help me get a seek out of this plan?

The query ran for 2 days and was only 28% complete. We need this to run in less than 24 hours.

avatar image By SQLHammer 83 asked Nov 07, 2016 at 04:14 PM
more ▼
(comments are locked)
10|10000 characters needed characters left

2 answers: sort voted first

If a seek is possible, you can force it with a FORCESEEK hint.

If the optimizer chose a hash join with bitmap over a nested loops join, it did so for cost reasons. Costing estimates can be wrong, especially for parallel nested loops joins plans that tend to execute faster in practice than the model assumes, due to improvements in hardware since the model was implemented, and the way costs are adjusted for nested loops (i.e. not at all on the inner side).

Nested loops can also be significantly worse than hash join. It depends on the number of CPUs, data distribution, and the number of pages that must be brought in from persistent storage (read-ahead is significantly less effective on the inner side of a nested loops join than for a sequential scan).

Note the scan has an INROW predicate. Rows that cannot possibly join (based on the bitmap constructed on the hash join's build input) are eliminated during the scan. if the bitmap is effective (no way to know since you supplied an estimated plan), the number of rows flowing from the scan across the exchange and into the probe side of the hash join will be significantly less than estimates show (since it is a Bitmap, not an Opt_Bitmap).

That said, the Sort is also likely to be a large runtime cost, especially if insufficient memory is reserved or available. It seems like the source and target Fact tables have different partitioning schemes. The Sort looks to reduce the number of expensive partition switches at the target, by sorting rows in partition ID order.

There might be opportunities to improve on the plan, but this requires deeper knowledge of the schema and the task you are performing. If you are interested in exploring that, please add the table and index DDL for the three tables to your question (as an attachment if it is large) and include the partitioning function and scheme definition(s).

It would also be helpful if you would describe the overall process a bit, and explain in more detail why the WHERE clause looks as it does, and why there is a forced index currently.

avatar image By SQLkiwi ♦ 6.6k answered Nov 07, 2016 at 06:18 PM
more ▼
(comments are locked)
avatar image SQLHammer Nov 07, 2016 at 06:43 PM

Thank you for your analysis up to this point. We are likely going to move forward with a FORCESEEK for our immediate needs but I am going to continue my investigation for academic/educational purposes.

I have attached another estimated plan which came from a different environment which supposedly has an exact copy of the database. I am in the middle of trying to find the differences between the two environments to see why one used a seek and the other did not.

Regarding the index hint, I am reaching out to the developer to find out why that was necessary.

T1.sqlplan

t1.sqlplan (75.5 kB)
avatar image SQLkiwi ♦ Nov 07, 2016 at 07:33 PM

@SQLHammer The usual reasons for a different cost (and therefore plan) are differences in statistics, number of logical processors available to SQL Server, and/or the amount of memory SQL Server is configured to use.

avatar image SQLHammer Nov 07, 2016 at 08:23 PM

Stats appear to be the same but core count and memory are vastly different.

10|10000 characters needed characters left

I have noticed few interesting things in the query, assuming this is same as what runs in production

  • Where clause has multiple conditions on same column f.[Time Key]

  • Between is useless because StartRangeKey and EndRangeKey parameters are the same value, it would work just fine rewriting with equal operator

  • There is total of 19008 iterations of the while loop to complete entire operation, with such large estimate of rows per each iteration I could see why it is taking so long.

I would suggest trying to replace original where clause

   WHERE   f.[Time Key] BETWEEN @startRangeKey AND @endRangeKey
                     AND f.[Time Key] >= 2014100100
                     AND f.[Transaction Date Key] >= 2014100100

with this

   WHERE   f.[Time Key] = @startRangeKey 



avatar image By Vlady Oselsky 156 answered Nov 07, 2016 at 08:08 PM
more ▼
(comments are locked)
avatar image SQLHammer Nov 07, 2016 at 08:24 PM

Thank you for looking into it. I agree that the duplicate Time Key predicates are not required. I expect them to be reduced during plan simplification, anyways. We could change the predicate to

WHERE f.[Time Key] BETWEEN @startRangeKey AND @endRangeKey

but it won't impact the performance at all.

avatar image Vlady Oselsky Nov 07, 2016 at 08:25 PM

What about the fact that Start and End are the same value?

10|10000 characters needed characters left
Your answer
toggle preview:

Up to 50 attachments (including images) can be used with a maximum of 209.7 MB each and 209.7 MB total.