Why is the merge join in Node 48 [lowest right Inner Join] estimating a loss of 20% of the rows?

Kunkel 2015-04-07 15:30:45

The estimated 50 million rows that come out of the clustered index scan of the primary table [Object 2] almost exactly match the final number of rows returned by this query.

Both tables JOINed in the merge [Object 2 and Object 5] are completely repopulated [TRUNCATE then INSERT] a few hours before this SELECT runs, within the same jobstream.

Kunkel 2015-04-07 17:02:02
Here's the non-anonymized version of the analysis, generated by the x64 version of Plan Explorer that I downloaded this morning.


Aaron Bertrand 2015-04-07 17:06:47
Thanks, that's helpful because we can see the query text, but it's still not an actual plan generated by Plan Explorer (it's either estimated or was generated by SSMS and opened in Plan Explorer). It's missing actual rowcounts, duration, reads, etc.
Kunkel 2015-04-07 17:21:37
The query is executed as part of an SSIS package and it takes about 5 hours to run. This is the execution plan generated from "Activity Monitor / Recent Expensive Queries" while the query was running. I'm not sure of a way to get you the Actual Plan.
Kunkel 2015-04-09 12:23:10
Here is a plan with actuals from our Development environment.

SQLkiwi 2015-04-09 13:11:47
Oh my goodness. How many tens of millions of times is this calling the scalar UDFs [dbo].[IsStrNull] and [dbo].[ufn_years_old]!!! Can these be changed to inline functions?
Kunkel 2015-04-09 14:13:09
Well they can, and I'm not an expert at reading execution plans, but that doesn't seem to be the most expensive part of the query. Am I reading the output incorrectly?

I worked to get most of the tables clustered on KeyToSourceFile, or to have a non-clustered covering index on KeyToSourceFile. It would seem that a merge join would be most efficient, and that is what is used in most cases.

But the JOINs to AccountsPayable and ClaimsDtlAuxFile did not produce a merge join like the others, instead producing Index Seek and Nested Loops. And those are two of the more expensive parts of the query that remain.

That's why I wondered if the 20% lower estimation of rows was causing the QO to choose Index Seek and Nested Loops. I hoped that correcting the estimation would push the QO to a merge join, which I'm guessing would run more quickly.

SQLkiwi 2015-04-09 14:39:02
Trust me, the scalar functions are killing your performance here. They are the invisible killers in query plans, unfortunately. Don't be misled by estimated cost percentages in plans. The cost model ignores the (unknown) cost of black-box functions. Happy to look at the reasons for the nested loops joins if you can edit your question to attach the complete source query.
Kunkel 2015-04-09 15:10:08
Here is the complete query, plus some of the table/index information at the top of the file.

SQLkiwi 2015-04-08 01:42:02
> Why is the merge join in node 48 estimating a loss of 20% of the rows?

The most likely causes are:

  1. Sampled or out-of-date statistics on the two large input tables.
  2. The effect on the KeyToSourceFile statistics histogram of the independent filters [ClaimType]='E' AND [PayStatus]<>'X' on the Claims Detail table. Filtered statistics could help there.
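
A sketch of what addressing both causes might look like; the table names (dbo.ClaimsDetail for the Claims Detail table, dbo.PrimaryTable for the other merge input) and the statistics name are placeholders, only KeyToSourceFile, ClaimType, and PayStatus come from the plan:

```sql
-- Cause 1: refresh statistics with a full scan after the TRUNCATE/INSERT reload,
-- rather than relying on the default sampled update.
UPDATE STATISTICS dbo.PrimaryTable WITH FULLSCAN;    -- placeholder table names
UPDATE STATISTICS dbo.ClaimsDetail WITH FULLSCAN;

-- Cause 2: give the optimizer a KeyToSourceFile histogram restricted to the rows
-- that survive the independent filters on the Claims Detail table.
CREATE STATISTICS stat_ClaimsDetail_Key_E_notX       -- placeholder statistics name
ON dbo.ClaimsDetail (KeyToSourceFile)
WHERE ClaimType = 'E' AND PayStatus <> 'X'
WITH FULLSCAN;
```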

Join estimation is complex, but in outline: SQL Server takes the histograms for KeyToSourceFile (scaled for the independent filters, or from a filtered statistics object if available) from both tables, aligns each step (key boundary) using linear interpolation where necessary, then compares step-by-step to estimate how many rows will join.

This is a statistical process, subject to simplifying model assumptions and accumulating errors, so you should not expect a perfect estimate, particularly for large tables and/or where data changes have occurred since the statistics were collected.

FYI the SQL Server 2014 cardinality estimator sometimes produces less 'surprising' estimates in this sort of situation.
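
If a SQL Server 2014 instance is available to test on, the new estimator can be tried database-wide or per query; a minimal sketch (the database and table names are placeholders):

```sql
-- Database-wide: compatibility level 120 enables the SQL Server 2014 cardinality estimator.
ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 120;

-- Per query, while the database stays on a lower compatibility level:
-- trace flag 2312 requests the new CE (9481 forces the legacy CE the other way around).
SELECT COUNT_BIG(*)
FROM dbo.ClaimsDetail
OPTION (QUERYTRACEON 2312);
```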

The post-execution plan will be interesting, though it is likely to show the cause of poor performance is not directly related to this particular join estimate.

Update based on post-execution plan

This query is executing the scalar UDFs [dbo].[IsStrNull] and [dbo].[ufn_years_old] hundreds of millions of times.

The current implementation of scalar UDFs in SQL Server means each execution of the function (per row!) runs within a new execution context (and T-SQL stack frame). This is approximately as bad as running a separate query for each row for each scalar function. These scalar functions should probably be directly in-lined or converted to in-line TVFs.
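
For illustration, this is what the in-line TVF conversion looks like. The real function bodies aren't visible here, so the age calculation below is an assumed stand-in for [dbo].[ufn_years_old], and the table/column names in the usage example are placeholders:

```sql
-- In-line table-valued version: a single RETURN SELECT, which the optimizer expands
-- into the calling query like a view, so there is no per-row execution context.
CREATE FUNCTION dbo.ufn_years_old_inline (@dob date)
RETURNS TABLE
AS
RETURN
    SELECT years_old =
        DATEDIFF(YEAR, @dob, GETDATE())
        - CASE WHEN DATEADD(YEAR, DATEDIFF(YEAR, @dob, GETDATE()), @dob) > GETDATE()
               THEN 1 ELSE 0 END;
GO

-- Usage: CROSS APPLY replaces the scalar call in the SELECT list.
SELECT c.KeyToSourceFile, yo.years_old
FROM dbo.ClaimsDetail AS c                                   -- placeholder table
CROSS APPLY dbo.ufn_years_old_inline(c.DateOfBirth) AS yo;   -- placeholder column
```

The same treatment applies to [dbo].[IsStrNull]; if its logic is as simple as the name suggests, inlining the expression directly into the query may be simpler still.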

As another undesirable side-effect, using a scalar UDF also prevents parallelism, something one would expect to see in a query processing 50 million rows through 16 joins. The problems of scalar (and multi-statement) UDFs have been written about extensively over the years, here is one example from Hugo Kornelis:

http://sqlblog.com/blogs/hugo_kornelis/archive/2012/05/20/t-sql-user-defined-functions-the-good-the-bad-and-the-ugly-part-1.aspx

There are also a couple of odd choices of physical join type (nested loops) where a hash or merge might be expected to be better-performing. The source query text is unavailable (truncated in the provided file) so it's hard to know exactly why that choice is being made. Perhaps an OUTER APPLY construction that the optimizer is unable to unnest.
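
For illustration only: the query shape and column names below are invented (the table names are borrowed from the comments above purely as stand-ins). This is the sort of OUTER APPLY pattern the optimizer typically cannot unnest, forcing a per-row nested loops apply, together with one possible rewrite that leaves it free to pick a hash or merge join:

```sql
-- Pattern that typically cannot be unnested: the correlated TOP forces a per-row apply.
SELECT d.KeyToSourceFile, ap.InvoiceAmount
FROM dbo.ClaimsDetail AS d
OUTER APPLY
(
    SELECT TOP (1) a.InvoiceAmount
    FROM dbo.AccountsPayable AS a
    WHERE a.KeyToSourceFile = d.KeyToSourceFile
    ORDER BY a.InvoiceDate DESC
) AS ap;

-- Rewrite: number the rows once, then join normally, so hash/merge strategies are available.
WITH LatestAP AS
(
    SELECT a.KeyToSourceFile, a.InvoiceAmount,
           rn = ROW_NUMBER() OVER (PARTITION BY a.KeyToSourceFile
                                   ORDER BY a.InvoiceDate DESC)
    FROM dbo.AccountsPayable AS a
)
SELECT d.KeyToSourceFile, l.InvoiceAmount
FROM dbo.ClaimsDetail AS d
LEFT JOIN LatestAP AS l
    ON l.KeyToSourceFile = d.KeyToSourceFile
   AND l.rn = 1;
```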

Kunkel 2015-04-08 14:49:30
Sorry for the lack of clarity; KeyToSourceFile is actually the clustered index on both of those tables.

Per your suggestion, I added filtered statistics with ClaimType and PayStatus as the filter predicates, and the estimated number of rows actually went down a little bit further.

SQLkiwi 2015-04-08 15:35:22
Make sure you create the filtered statistics with FULLSCAN. Check the statistics objects using DBCC SHOW_STATISTICS to see where things might be going wrong. Also added to my answer.
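
For example (the table and statistics names are stand-ins):

```sql
-- Header shows rows vs. rows sampled, the filter expression, and the last update time.
DBCC SHOW_STATISTICS ('dbo.ClaimsDetail', 'stat_ClaimsDetail_Key_E_notX') WITH STAT_HEADER;

-- Histogram shows the KeyToSourceFile step boundaries actually used for the join estimate.
DBCC SHOW_STATISTICS ('dbo.ClaimsDetail', 'stat_ClaimsDetail_Key_E_notX') WITH HISTOGRAM;
```
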
Kunkel 2015-04-09 23:34:32
Removing the UDFs did not change the underestimated row counts, but the Query Optimizer did choose merge joins for the two cases in question and used parallelism for everything.

Now I just have to verify that the inline code I wrote on the fly properly reproduced what the UDFs were doing in each instance.

Thanks for your help, Paul.

SQLkiwi 2015-04-09 23:48:35
That's great to hear. I assume performance improved significantly (from 5 hours, was it?) as well.
Kunkel 2015-04-10 03:11:24
Timings from the development environment, using a direct INSERT statement instead of SSIS:

with UDFs: 6:12

w/o UDFs: 2:36