How does the optimizer come up with this row estimate?

GokhanVarol 2013-06-17 18:43:07

I would like to find out how the exact estimate of 95665.1 rows for this table is calculated from statistics. I have attached the statistics data in ThisHoldsValuesClauseForTheStatistics_PleaseRunThis.txt; the screenshot is from EstimatedPlan.sqlplan, and the query in the plan is in SourceQuery.txt.

The actual plan, mediacounts_long01_2546_sec_hash_spill.queryanalysis, is from a previous run of a similar query and shows the hash spills.

This is a partitioned table with 3300+ partitions, and the data within the partitions is not balanced: the data belongs to different US counties, and some partitions have only 15K rows while others have 25M+ rows. If there is any way to raise the row count estimates in this query, I would like to attempt it to avoid the spills. If I do not use a driver table to force a nested loops join for this query, I get a query plan like Long_Single_16770_Sec_HashNoSpill_NoDriver.queryanalysis, which takes far more resources and time to execute.
Thank you



SQLkiwi 2013-06-18 10:52:14
> I would like to find out how the exact number 95665.1 estimation for this table is calculated using statistics.

Cardinality estimation for the inner side of a correlated nested loops join uses average values for any correlated parameters.

The cardinality of the table is 147,993,895. The all density figure for column CntyCd is 0.0006464124. Multiplying the two gives the 95665.1 estimate. There is an adjustment to account for the change in table cardinality since statistics were collected, but the effect is too small to see in this case.
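As a quick sketch of that arithmetic (the table and statistics object names below are illustrative placeholders, not taken from the attachments):

```sql
-- Inspect the density vector for the statistics object covering CntyCd
-- (table and statistics names here are placeholders).
DBCC SHOW_STATISTICS ('dbo.PropertyTable', 'StatsOnCntyCd') WITH DENSITY_VECTOR;

-- Estimate = table cardinality * "All density" for CntyCd
SELECT 147993895 * 0.0006464124 AS EstimatedRows;  -- ~= 95665.1
```

The "All density" figure is 1 / (number of distinct CntyCd values), so multiplying it by the table cardinality gives the average number of rows per county.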

Basing the inner side estimates on average values for correlated parameters can cause problems with memory consuming operators where there is a big variance from the average at run time. You could look to optimize the inner side of the loop for the largest partition using the normal OPTIMIZE FOR with a variable trick.
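A minimal sketch of that trick, assuming `CntyCd` is the correlated column (the object names are illustrative; `06037` is the largest county mentioned later in this thread):

```sql
-- OPTIMIZE FOR with a variable: compile the inner-side plan as if every
-- iteration seeks the largest county, sizing memory grants for the worst
-- case. Table and column names are placeholders.
DECLARE @CntyCd char(5);  -- assigned per loop iteration at run time

SELECT P.CntyCd, P.PclId   -- illustrative column list
FROM   dbo.PropertyTable AS P
WHERE  P.CntyCd = @CntyCd
OPTION (OPTIMIZE FOR (@CntyCd = '06037'));
```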

GokhanVarol 2013-06-18 12:48:13
I tried the OPTIMIZE FOR variable trick and did a couple of things. First I added `ISNULL(@FakeVar, CntyCd) AS CntyCd` to the driver table with `OPTIMIZE FOR @FakeVar = 'largest county'`, but that did not bite. Then I added a WHERE clause to the driver table, `WHERE ISNULL(@FakeVar, CntyCd) = CntyCd`, and that did not do it either. How can I make the optimizer treat every partition in the loop as the largest one (or at least assume there is only one partition in the loop, and that it is the largest county)?


SQLkiwi 2013-06-18 13:34:58
GokhanVarol 2013-06-18 14:21:34
The estimated plans in the attachment are still running; I will get the actuals. I have a small database with only 23 partitions, and adding a UNION ALL on that database raised the row count estimate for Property to around 735K, but the same single UNION ALL here only raised the Property estimate to around 191K rows. I then modified the bottom of the UNION ALL to use a VALUES clause with the 40 largest counties (statistics detail and row counts per county are attached; the largest county is 06037, with sys.partitions showing 2,385,881 rows for it).
This is certainly helping the estimates, but I'm not sure why it is still reducing the numbers.
Thank you.

SQLkiwi 2013-06-18 14:28:14
I meant adding a UNION ALL directly to the seek operator with the too-low cardinality estimate, not to the driving table. Increasing the number of iterations is unlikely to help, but increasing the cardinality of each seek iteration likely will.
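An illustrative sketch of that idea, assuming a CROSS APPLY over a driver table (all object names are placeholders): the duplicate seek inflates the optimizer's inner-side estimate, but returns no rows at run time because the variable stays NULL.

```sql
DECLARE @AlwaysNull int = NULL;  -- never assigned, so branch 2 returns nothing

SELECT PA.CntyCd, PA.PclId
FROM   dbo.DriverTable AS D
CROSS APPLY
(
    -- The original seek, with its too-low cardinality estimate
    SELECT P.CntyCd, P.PclId
    FROM   dbo.PropertyTable AS P
    WHERE  P.CntyCd = D.CntyCd

    UNION ALL

    -- Duplicate seek: adds to the estimated cardinality of each
    -- iteration, but the predicate filters every row at run time
    SELECT P.CntyCd, P.PclId
    FROM   dbo.PropertyTable AS P
    WHERE  P.CntyCd = D.CntyCd
    AND    @AlwaysNull IS NOT NULL
) AS PA;
```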
GokhanVarol 2013-06-18 15:04:25
I believe I tried that; it certainly helped with the memory grant, but for some reason the query took longer (I ran it more than 3 times). I attached the plan.


SQLkiwi 2013-06-18 15:41:59
Ok, well I believe the original question about cardinality estimates has been answered. We're not going to solve the performance problems with this query via a comments discussion 🙂