What’s wrong with 2014’s Cardinality Estimate here?

mjswart 2018-08-17 18:37:01

Set up a database with these two tables (I've also scripted stats for them)
https://gist.github.com/mjswart/2c0a791710757bd8d8790afd4d4b0d7e#file-weird-estimates

You'll see two queries at the end whose only difference is that the first uses trace flag 9481 (forcing it to use the old 7.0 Cardinality Estimator).
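For reference, the final pair has roughly this shape; the exact SELECT list is in the gist, and the trace flag hint is the only difference between the two:

SELECT QO.ObjectId  -- placeholder column; see the gist for the real list
FROM dbo.QO AS QO
JOIN dbo.QO_SECTIONS AS QS
    ON  QO.CollectionId = QS.CollectionId
    AND (   QO.ObjectId       = QS.SectionId
         OR QO.ParentObjectId = QS.SectionId)
OPTION (QUERYTRACEON 9481); -- forces the legacy (70) CE; the second query omits this hint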

In SQL Server 2014 (version 12.0.4100.1), I see this query plan:
https://gist.github.com/mjswart/d9f89c14578cb5ed311b7e44675c3096

The old estimator has an estimate of 1 row.
The new estimator has an estimate of 100% of the rows.

I wonder if this has something to do with the "Increased Correlation Assumption for Multiple Predicates" that Joe Sack talked about in his white paper.

If so, what changed? On my local 2017 box, I get a much more reasonable estimate: something greater than 1, but less than 100% of the rows:
https://gist.github.com/mjswart/7d73ecd577f72dfbf1a4ad47097751c2#file-weird-estimates-2017-sqlplan

Is this a question for Joe Sack? Paul White? Or maybe it's time to call Microsoft Support? I'd love to know the story here.

Hugo Kornelis 2018-08-17 19:53:49
Michael, I tried to look into this but I was unable to run the script you supplied. I get an error message saying that one of the supplied statistics streams is corrupt.

If it helps, the error is on _WA_Sys_00000012_5F9EF494 ON dbo.QO_SECTIONS(Name)

mjswart 2018-08-17 19:57:57
Thanks for your interest, Hugo!
The script still runs without a problem for me. I'll try to attach the SQL as a file to this incident (to avoid any gist weirdness, maybe)
mjswart 2018-08-17 19:59:48
I've uploaded the same file here: raw-file.txt
If you still have problems, you could choose to eliminate that set of statistics; I don't think that particular one is used by my example

SQLkiwi 2018-08-19 11:35:59
@Hugo The corrupt stats stream message on a string column is normally a sign that the database collation does not match. Michael did not specify a COLLATE clause on the CREATE DATABASE statement, but I believe the correct collation is the US default SQL_Latin1_General_CP1_CI_AS.
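If the server default differs, a quick fix is to recreate the database with an explicit collation, something like this (the database name here is just a placeholder):

CREATE DATABASE WeirdEstimates
    COLLATE SQL_Latin1_General_CP1_CI_AS;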
SQLkiwi 2018-08-19 11:37:08
@mjswart I only had a two-minute look at this, but it seems fixed under TF 4199. I don't have 2014 SP1 to hand, but that is what I see on 2014 SP2 CU12 (latest).
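For anyone following along, a minimal way to flip those fixes on for a test (scope is a matter of taste):

-- Instance-wide (requires sysadmin):
DBCC TRACEON (4199, -1);
-- Or per query, appended to the statement under test:
-- OPTION (QUERYTRACEON 4199);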
Hugo Kornelis 2018-08-19 13:14:01
For the record, on the instances where I tested this (SQL2016 SP1-GDR build 4223; and SQL2017 RTM-CU9-GDR build 3035), switching TF4199 on or off made no difference.
In either case I see the incorrect estimate of 105 when using the new CE, and the correct estimate of 1 when using the legacy version.
SQLkiwi 2018-08-19 15:09:21
4199 produces the more sensible estimate of 105.036 rather than 162438000.
Hugo Kornelis 2018-08-19 09:33:30
(Scroll all the way down for the TL;DR summary)

The second upload causes the exact same error for me. Perhaps there is a line length limitation somewhere in your upload process, or in my download process? Regardless, it is not important. I first thought I was unable to replicate your issue because I saw the same estimates for both queries. Today I repeated the attempt on an older SQL Server version (2016 instead of 2017) and then I did see the difference. And after playing with the code some more, I now also see it on 2017 and am unable to repeat my first attempt where I didn't see it, so I guess I just overlooked something?

I will not be able to provide you with a full explanation. Estimating joins with predicates that combine AND and OR can get pretty complex pretty fast. However, I did make an interesting observation when I was playing with the code. The join condition of the query reads:

ON      QO.CollectionId = QS.CollectionId
AND (   ObjectId        = QS.SectionId
     OR ParentObjectId  = QS.SectionId)

Joining by each of the three parts individually is fairly simple. The estimates are 105 or 103 (depending on CE version) for CollectionId only, 1 for ObjectId (obviously, since it is the PK), and 11.8 for ParentObjectId.
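For anyone reproducing these numbers, the three probes look roughly like this (the SELECT list is immaterial to the estimate):

SELECT QO.ObjectId
FROM dbo.QO AS QO
JOIN dbo.QO_SECTIONS AS QS
    ON QO.CollectionId = QS.CollectionId;    -- estimate: 105 (legacy CE) / 103 (new CE)
-- swap the ON clause for the other two probes:
--  ON QO.ObjectId = QS.SectionId            -- estimate: 1 (ObjectId is the PK)
--  ON QO.ParentObjectId = QS.SectionId      -- estimate: 11.8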

Joining by ObjectId OR ParentObjectId (but still without CollectionId) is where it starts to get weird. The execution plan changes its logical flow: instead of pushing the predicate into the inner input of the Nested Loops operator, it simply scans the entire index and applies the filter as a Predicate within the Nested Loops, so now you need to look at the output of the Nested Loops operator to see the estimate for this predicate. The old CE produces an estimate of 12.8 (which makes sense if you assume that there are no rows in QO with ObjectId equal to ParentObjectId). But with the new CE the estimate shoots up to a whopping 162437000, which is almost the entire table!!! I am pretty sure that this is a bug!
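The OR-only probe, again as a sketch:

SELECT QO.ObjectId
FROM dbo.QO AS QO
JOIN dbo.QO_SECTIONS AS QS
    ON QO.ObjectId        = QS.SectionId
    OR QO.ParentObjectId  = QS.SectionId;
-- legacy CE: 12.8 rows; new CE: ~162437000 rows, visible on the Nested Loops output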

Note that estimates are rounded to six significant digits. The table has a total of 162437504 rows. The unfiltered scan rounds this to 162438000. After filtering in the Nested Loops, the rounded number has become 162437000, so the actual estimate must have been reduced somewhat (by at least 5, if normal rounding rules are used), but it's impossible to tell by how much. So the OR still does not result in an estimate of "all the rows", but it does result in an estimate that is way off, because it is far more than the sum of the two separate estimates (which is the theoretical maximum).

Now let's add back the CollectionId filter to return to the original query. We already know that the filter on CollectionId is very selective (only 103 or 105 rows out of a 162M-row table), and that the combined OR filter on ObjectId/ParentObjectId should be very selective (12.8 out of the same 162M) but is not when the new CE is used. Looking at your original queries, I would expect that combining those two predicates with AND would result in an estimate well below 1, which is always rounded up to 1 by SQL Server. This is exactly what happens when the old CE is used.
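A back-of-envelope version of that expectation, treating the two join predicates as independent (a big simplification, as the note below says): multiply the two single-predicate join estimates and divide by the cross-product size. With N = 162437504 rows in QO and M rows in QO_SECTIONS, the combined estimate is 105 * 12.8 / (N * M):

-- shown here for M = 1, the most pessimistic case; larger M only shrinks the result
SELECT 105.0 * 12.8 / 162437504 AS combined_estimate;
-- ~ 0.0000083, far below SQL Server's minimum estimate, so it is rounded up to 1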

For the new CE, I would expect the bug with the OR to propagate, so that the estimate of 103 is left unchanged or only lowered insignificantly. Instead it is even increased to 105, which is what the old CE produced for filtering on CollectionId only. Apparently the subtle change in behavior that changed 105 to 103 with the new CE is lost in this specific combination. But I'd still call the estimate "as expected" when taking the above-described bug with the OR predicate into account.

NOTE: In the above I used some big simplifications of how cardinality estimation for joins actually works. If you want to know more, I suggest reviewing my slide deck on cardinality estimation (which does not cover anything of this complexity but does get close to it); you can download it from https://sqlserverfast.com/presentations/now-where-did-that-estimate-come-from/

SUMMARY: Based on your code and some experiments I did, it appears that the new cardinality estimator has trouble estimating the cardinality for the join predicate (ObjectId = QS.SectionId OR ParentObjectId = QS.SectionId). This error then propagates to the estimate for your actual query.

The estimate of "1" is the more correct of the two estimates you got, based on the normal estimation rules. The estimate of 105.036 is wrong.

If this is causing issues for you that you need to get fixed right now, I recommend getting Microsoft Support involved. If it is not urgent, then I recommend that you try to simplify the repro as much as possible (ideally by finding a similar issue in a standard demo database such as AdventureWorks or WideWorldImporters) and then use that repro to report this as a bug on UserVoice (https://feedback.azure.com/forums/908035-sql-server). (I can report it too if you prefer, but I do not want to take undue credit for a bug that you discovered.)

I hope this helps!

SQLkiwi 2018-08-19 12:42:34
From a quick look, it appears the seek into QO is estimated as one would expect e.g. for a query like:

DECLARE @CollectionId bigint;

SELECT 1
FROM dbo.QO AS QO
WHERE QO.CollectionId = @CollectionId;

…giving an estimate of 105.036 (all density for the CollectionId stats * table cardinality).
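Working backwards, that means the all density on CollectionId must be about 105.036 / 162437504 ≈ 6.47E-07, which can be confirmed with something like this (the statistics name is a placeholder; use whatever the script created):

DBCC SHOW_STATISTICS (N'dbo.QO', IX_QO_CollectionId) WITH DENSITY_VECTOR;
-- expect: all density ~ 6.47E-07; estimate = all density * 162437504 ~ 105.036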

However, the residual filter:

[QO].[ObjectId]=[QS].[SectionId] OR [QO].[ParentObjectId]=[QS].[SectionId]

…results in a complete guess:

CStCollFudge(ID=8, CARD=1.62438e+008)

Obviously a 100% guess (based on the table cardinality) isn't the smartest, but there we go.

There are various ways to get a better estimate, including enabling QO fixes under TF 4199. This improves things in various ways (including setting an upper bound on a simple join), but most importantly, the guess for the residual respects the limits set by the seek below it:

CStCollFudge(ID=9, CARD=105.036)

You can poke around in the details using the Debug channel extended event query_optimizer_estimate_cardinality and/or undocumented TF 2363.
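A minimal sketch of both routes (the session name is arbitrary; 2363 is undocumented, so test systems only):

-- Extended Events route (the event only fires when a plan is actually compiled):
CREATE EVENT SESSION CE_Details ON SERVER
    ADD EVENT sqlserver.query_optimizer_estimate_cardinality
    ADD TARGET package0.ring_buffer;
ALTER EVENT SESSION CE_Details ON SERVER STATE = START;
-- Trace flag route: append to the query under test; 3604 routes the output to the messages tab
-- OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363);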

I'm not sure if the 4199 fix is available for 2014 SP1, but it is certainly effective on my local instance, which is at SP2 CU12.