SQL server choosing wrong index to run this query
Instead of TransactionsTMernbrIDTDate, it is using TransTDateTmernbr
The result is a query that takes almost a minute instead of a fraction of a second to complete.
Index Stats are completely up to date.
When running same query on Dev server which has a backup a few weeks old, then the right index is picked by sql server
So, here is what I see and why the optimizer picked the plan you get. After that, I'll show how you can probably fix this.
The statistics for column t_mernbr are included as part of the statistics on index TransactionsTMernbrIDTDate, because t_mernbr is the leading column of that index. For the value in your query (7000043445), no specific entry in the statistics histogram exists. The value falls between two steps, for 7000038795 and 7000053075. The values SQL Server will use are the range values of the latter step. These show that at the last statistics update, there were 516,874 rows between these two values, for 39 distinct values. SQL Server assumes a uniform distribution within each interval, so each of those 39 values is assumed to have exactly 13,205 rows. SQL Server also always assumes that you would only query for data that exists, so it assumes that the value 7000043445 exists in the table – and as mentioned, that there are 13,205 rows with this value.
Based on that expectation, it has to choose between two strategies. The one you see on your test system is to seek index TransactionsTMernbrIDTDate to quickly find the first row with t_mernbr equal to 7000043445, then read all 13,205 of them, sort them by date, and then present you the one row with the lowest date value. This sounds cheap based on the seek, but it still has to read 13,205 rows, AND it has to sort them (which costs a lot of cpu and memory). The alternative option is to use an ordered scan of index TransTDateTmernbr. This index is sorted by date, so in this case it can simply start at the beginning until it hits the first row that has the correct t_mernbr value, return it, and stop immediately. No need to read the rest or to sort anything. But at the price of having to read rows for "random" t_mernbr values until it finds the first match.
At this point the optimizer does some simple math. The expected number of matching rows is 13,205. The total number of rows in the table is 96,957,900. That means that 0.136% is a match. Again assuming that all data has a uniform distribution, this means that the expectation is to read 7,342.5 rows before it finds the first match.
At this point the comparison between the two plans becomes easy. The first plan has to read 13,205 rows from an index that includes a thousand columns (and is therefor quite wide – so 13,205 rows will be a lot of pages) and sort them, before the query can return data. The second plan has to read only 7,342.5 rows from an index that is a lot narrower (one indexed column, one included column, and the clustered index reference that every index needs), and does not need a sort. Easy choice!
Now when you run it, the choice backfires. The root case is a bad assumption by SQL Server. It assumed that the value you are looking for exists, but in reality it doesn't. The plan it chose, to start at the earliest date and stop as soon as it finds a match, ends up reading all 96,957,900 rows before being able to return nothing. The plan you have on Dev uses the index on t_mernbr to find all matching rows, finds nothing, returns nothing to the Sort, that then also finishes extremely quickly and returns nothing.
So now you understand why the "bad" index was chosen. You may also wonder why the same does not happen on Dev. The answer is: even if both the data and the indexes are the same (which you should definitely verify because it's not guaranteed), the statistics most likely are not. And if the statistics feed different data to the optimizer, then the decision process remains the same but the numbers will be different. And in this case the result was that the other plan was estimated to be cheaper.
Of course, what you really want to know is how to fix this. In many cases this is a hard problem, because we have no control over which values will be included in the statistics histogram and which values will be put in interpolation ranges. We also cannot tell SQL Server to stop assuming that the value we are looking for exists. But in your specific case, there luckily is (probably) an easy fix.
Index TransactionsTMernbrIDTDate is currently defined as an index on t_mernbr, with included column t_date. You can change this to be an index on t_mernbr, t_date (so both columns are indexed, with t_mernbr as the leading column). For this specific query, the optimizer now knows that all rows with the same t_mernbr are arranged within the index by date, so it can use the index to find the first matching row and immediately return it. That strategy beats any other strategy so regardless of statistics and estimates should always be chosen. I expect that if you change the index as I recommend, it will always be used for this query and results will always be instantaneous, regardless of the value passed in.
Index changes do always have consequences, so let's see what the price for this change will be.
- The size of the index will grow. This increase in space does not only affect storage size, but also the buffer pool, and the execution time of index scans. However, we are not adding a column but promoting it from included to indexed, which means that the growth will be very small, probably less than 0.1%. And that means that the side effects related to the growth are also minimal.
- Other queries currently using the index can be affected. In this case, the effect should not be detrimental. Queries previously using the index can still use it the same way; and some might even benefit from the additional ordering available. There can always be surprises with the optimizer but I don't expect regresions in other queries.
- If the t_date column is frequently updated, then the change in the index structure can cause additional page splits to occur. If t_date is never or only very infrequently updated, then this is no concern.
- There is no ALTER INDEX statement to promote an included column to an indexed column. You will need to DROP / CREATE (or CREATE with the DROP_EXISTING option) the index. That can cause congestion on your system, and the index will not be available during the process which means that queries normally using the index will be slow. Best to do this during a maintenance window.
- After making index changes, all existing execution plans that use or might use the index will be invalidated in the plan cache and need to be recompiled when executed again. This will cause extra load and may affect performance. And if change the index on a live system, it will even happen twice (when the old index becomes unavailable, and after the new index is ready). If possible, plan to do this before a period of relatively low use, so that SQL Server can rebuild plans during a lull and is ready when the next peek starts.
I have done as you suggested and your prediction is exactly what happened. After dropping and recreating the index with the "t_date" column upgraded from included to indexed, the optimizer now chooses the right index and the query is done in milliseconds instead of minutes.
If I may intrude on your generosity, and in order for me to be able to diagnose this in the future, where exactly can you see this:
The value falls between two steps, for
7000038795 and 7000053075. The values
SQL Server will use are the range
values of the latter step. These show
that at the last statistics update,
there were 516,874 rows between these
two values, for 39 distinct values.
SQL Server assumes a uniform
distribution within each interval, so
each of those 39 values is assumed to
have exactly 13,205 rows.
I can't find this information myself on Plan Explorer, although it is exactly correct.