SQL 2014 cardinality estimator causing poor performance and bad plan

KDR 2016-04-01 13:43:48

By adding OPTION (QUERYTRACEON 9481) to the query the results come back in less than 1 second. I'm trying to understand why the new CE is causing such bad performance.
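For reference, the workaround applies the legacy (pre-2014, '70') CE model to a single statement via the trace flag mentioned. A minimal sketch, with placeholder table and column names (QUERYTRACEON normally requires elevated permissions, or can be applied through a plan guide):

```sql
-- Hypothetical query shape; the hint enables trace flag 9481 for this
-- statement only, forcing the legacy ('70') cardinality estimation model.
SELECT o.OrderID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.CustomerID = @CustomerID
OPTION (QUERYTRACEON 9481);
```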

KDR 2016-04-01 13:48:02
Attached is the plan when using the old CE with trace 9481.

SQLkiwi 2016-04-02 18:35:01
The anonymized query appears to reference some moderately complex views, and includes a number of features that pose problems for accurate cardinality estimation, whichever CE model is in use. As far as I can tell from the information provided, the fact that the 70 model CE chooses a 'better' plan is, to a very large extent, simply luck.

The two CE models make different assumptions, and use different calculations, so it is no great surprise that the two plans are different. The fact that one is 'wrong' in ways that ultimately benefit performance is the chance aspect. Regressions when switching to the new CE are expected, and using 9481 is one way to avoid them.

That said, in an effort to say something useful aside from generalities, the query is problematic because it is very hard to estimate cardinality when EXISTS is combined with a disjunction (OR). One unhappy plan shape that often results is shown in the 120 CE plan: a semi join with multiple inputs to a Concatenation on its inner side.
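As an illustration only (the real query is anonymized, so these names are invented), the problematic pattern looks something like this:

```sql
-- Hypothetical sketch of the EXISTS + OR pattern. The disjunction inside
-- the EXISTS is what the optimizer may implement as a semi join with
-- multiple inputs feeding a Concatenation on its inner side.
SELECT c.CustomerID
FROM dbo.Customers AS c
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Orders AS o
    WHERE o.CustomerID = c.CustomerID
    AND (o.Status = 'Open' OR o.ShipDate IS NULL)
);
```

Each branch of the OR can end up as a separate input to the Concatenation, and estimating how soon any one of them will produce a matching row is exactly the part that is hard to get right.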

Some of the trauma that results from trying to assess this shape can be seen when viewing the plan in SSMS – some of the operator cost percentages exceed 100%, a bug that Plan Explorer does its best to correct for. A Merge Join (Concatenation) deep in the tree has a cost percentage of 26,162% (!) for example. The low estimated chance of encountering a row makes this difficult shape look cheap to the optimizer, but the strategy may fail badly at runtime when many more rows are encountered than expected.

Another important effect is that the way cost estimates are combined below the semijoin means that the upper levels of the inputs to the concatenation contribute much more to the expected cost. This leads to a plan optimized for finding a row (to satisfy the semijoin) much earlier than actually occurs. The expensive work hidden lower down on the inputs to the semijoin therefore appears cheap – and this all falls apart when it turns out to be expensive, and executed many times.

The 70 CE happens to assign a higher cost initially, meaning it spends more time looking at alternatives, eventually settling on an implementation where the semijoin is pushed lower in the tree, and implemented with a Hash Match Right Semi Join. Crucially, this choice precludes the use of a series of inputs to a Concatenation operator, because the inputs are no longer correlated. The resulting plan features fewer nested loops at the higher levels, so the risk of excessive iterations is much reduced.

In addition, the higher cost of the 70 CE plan, and the greater time spent on optimization, appears to allow the optimizer to push an important predicate lower in the plan, meaning fewer rows make it past the Segment/Sequence Project combination. This is a pattern often seen with a window function like ROW_NUMBER in a view.
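To illustrate that pattern (names invented), a view containing ROW_NUMBER only filters efficiently when the predicate is on the PARTITION BY column, because only then can it be pushed below the Segment/Sequence Project operators:

```sql
-- Hypothetical view using ROW_NUMBER. A predicate on AccountID (the
-- PARTITION BY column) can be pushed below the window function; a
-- predicate on other columns generally cannot be.
CREATE VIEW dbo.LatestTransactions
AS
SELECT
    t.AccountID,
    t.TransactionDate,
    t.Amount,
    rn = ROW_NUMBER() OVER (
        PARTITION BY t.AccountID
        ORDER BY t.TransactionDate DESC)
FROM dbo.Transactions AS t;
GO
-- The AccountID predicate here can be evaluated before the window
-- function, so fewer rows reach the Segment/Sequence Project pair:
SELECT v.AccountID, v.TransactionDate, v.Amount
FROM dbo.LatestTransactions AS v
WHERE v.AccountID = 12345
AND v.rn = 1;
```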

The plan also shows a number of potential missing indexes and other warnings about cardinality estimation due to type conversions. The Eager Index Spool in the problematic plan is another sign that a potentially important index is missing.
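Type-conversion warnings of this kind typically come from comparisons where the two sides have mismatched types. A purely illustrative example:

```sql
-- Hypothetical example: comparing a varchar column to an nvarchar
-- parameter forces a CONVERT_IMPLICIT on the column side, which can
-- prevent an index seek and degrades cardinality estimation.
DECLARE @Ref nvarchar(20) = N'ABC123';

SELECT o.OrderID
FROM dbo.Orders AS o
WHERE o.ReferenceCode = @Ref; -- ReferenceCode is varchar(20)

-- Matching the parameter type to the column avoids the conversion:
DECLARE @Ref2 varchar(20) = 'ABC123';

SELECT o.OrderID
FROM dbo.Orders AS o
WHERE o.ReferenceCode = @Ref2;
```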

Ultimately, the issues here are far too complex to explore in detail, but I hope some of the above is useful to you. If the 9481 hint is an acceptable workaround for you, that's fine. But be aware that the 'good' plan selection has that element of luck. If you want the best chance of getting good plans reliably from the optimizer as the data changes, you would need to look at improving the indexing, addressing the type conversions, and simplifying the query (perhaps breaking it up into parts the optimizer can estimate correctly).
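On the last point, one common way to break a query like this into parts the optimizer can estimate well is to materialize an intermediate result. A sketch only, with invented names:

```sql
-- Hypothetical decomposition: materializing the EXISTS candidates into a
-- temporary table gives the optimizer real cardinality and statistics
-- for the second step, instead of forcing it to guess across the whole
-- EXISTS + OR construction in one go.
SELECT DISTINCT o.CustomerID
INTO #Qualifying
FROM dbo.Orders AS o
WHERE o.Status = 'Open' OR o.ShipDate IS NULL;

SELECT c.CustomerID, c.CustomerName
FROM dbo.Customers AS c
JOIN #Qualifying AS q
    ON q.CustomerID = c.CustomerID;
```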