Right most node has the wrong estimate

Dan Holmes 2013-05-08 13:58:17

The predicate driving that index seek is:
WHERE at.TripDate >= @TimeRangeStart — @TimeRangeStart = '20130506'

However it is always giving an estimated row count of 1 when the actual value is 88. This is the relevant histogram.

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS

2013-05-03 00:00:00.000 261 88 3 87

2013-05-06 00:00:00.000 0 87 0 1

2013-05-07 00:00:00.000 0 88 0 1

Based on the predicate and the histogram i would have expected an estimate of 175. There is a non-seek predicate but why would that reduce the number to 1? I have tried a filtered index too and that wasn't used even when i dropped this one.

SQLkiwi 2013-05-08 15:51:31
###Filtered Index

A filtered index would not be used if the plan is cached for reuse (with different parameter values). This is logical, because the optimizer cannot guarantee the filtered index would be valid for different parameter values. Adding OPTION (RECOMPILE) to the statement would enable use of the filtered index, though you will only see it used in an actual plan, not an estimated one.

Cardinality Estimate

The seek in question may well be estimated at around 175 rows, but the residual predicates will reduce that value, typically by multiplying the expected selectivities together. The cardinality estimate for the following query should be around 175 rows:

SELECT COUNT_BIG(*)
FROM [dbo].[tblActualFRTrips]
WHERE 1 <> 2
AND TripDate >= '20130506';

The 1 <> 2 is there just to prevent simple parameterization (we want estimates for the specific value '20130506').

Add one or both of the residual predicates to the query to see how they affect the estimate.

[ScheduledStart] < dateadd(day,1,'20130507')
[ScheduledEnd] >= '20130507'

You can also run the query for just one of those predicates to see the estimation for each one. You might find adding OPTION (RECOMPILE) to the query results in an acceptable estimate (because the optimizer sees constant-folded values instead of parameters). On the other hand, the estimate may still be too low because cardinality estimation often assumes predicates are independent (and they may well not be in this case).

For completeness, also try the query with trace flag 4199 on (this enables various plan-affecting optimizer and cardinality estimate bug fixes), and/or trace flag 2301. The latter flag is documented, and enables a more advanced treatment of estimates under some common scenarios.

Depending on the outcome of the experiments above, there may be other things we could try to get a better estimate – please let me know how you get on.

Summary from comments:

  • Ensure statistics for all columns are up-to-date (especially ascending keys)
  • Try adding a multi-column statistic to capture the correlations
  • Ultimately, it may not be possible to get a good estimate, so consider materializing the result of the problematic query in a temporary table and start the query from there.
Dan Holmes 2013-05-08 16:05:29
SELECT COUNT_BIG(*)
FROM [dbo].[tblActualFRTrips]
WHERE 1 <> 2
AND TripDate >= '20130506';
Estimate is 175

Adding [ScheduledStart] < dateadd(day,1,'20130507') Estimate is 175 Adding [ScheduledEnd] >= '20130507'
Estimate is 1

Adding [ScheduledEnd] >= '20130507' without [ScheduledStart]
Estimate is 1

OPTION(RECOMIPLE) didn't change the estimate

Adding OPTION(QUERYTRACEON 4199, QUERYTRACEON 2301) didn't change anything unless i need to be looking somewhere other than this node in the plan.

SQLkiwi 2013-05-08 16:09:38
Thanks Dan. Could you also try OPTION(RECOMPILE, QUERYTRACEON 4199, QUERYTRACEON 2301) please just to close that avenue off? And does the histogram for ScheduledEnd have information >= 20130507 (i.e. do we have an ascending key problem)?
Dan Holmes 2013-05-08 16:22:12
OPTION(RECOMPILE, QUERYTRACEON 4199, QUERYTRACEON 2301) didn't change anything.

The histogram for ScheduledEnd
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS

2013-03-26 09:59:50.000 253 3 127 1.992126

2013-03-29 08:59:50.000 253 3 127 1.992126

2013-03-29 17:59:00.000 68 1 34 2

2013-03-29 18:00:00.000 0 3 0 1

There is not data for 20130507, in fact all of april and may is missing.I should have noticed that.

Dan Holmes 2013-05-08 16:25:39
I updated that stat with FULLSCAN and went back to this
SELECT COUNT_BIG(*)
FROM [dbo].[tblActualFRTrips]
WHERE 1 <> 2
AND TripDate >= '20130506'
AND [ScheduledEnd] >= '20130507'

Still the estimate is 1 row. The new histogram
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS

2013-05-06 09:59:00.000 516 1 259 1.992278

2013-05-07 17:59:00.000 150 1 75 2

2013-05-07 18:00:00.000 0 3 0 1

SQLkiwi 2013-05-08 16:31:08
Ok well updating statistic that may help (though prepare yourself for possible disappointment). You could also look at creating a multi-column stat on the three columns, but the optimizer may well choose to continue using the single column stats (complex reasons). Ultimately, the easiest way might be to materialize the result of that seek operator to its own temp table and start the query from there.
Dan Holmes 2013-05-08 18:07:50
I created a stat on TripDate, ScheduledEnd, ScheduledStart and another on Tripdate, ScheduledStart. Neither of those affected the estimated row count.

The performance of this query isn't an issue but the rate at which the estimates differ from actuals is bothersome. Given how seeminly easy this situation was i thought there would be a simple solution. Guess not.