Why does this execution plan take over 3 hours to complete?

irb56 2015-03-31 10:05:05


I have a SSIS package that ran for a number of days and took 1 minute to complete. A few days ago, it started to take over 3 hours to complete and this has been happening consistently for 4 days now. Using a combination of my SSIS logging data and sys.dm_exec_query_stats I was able to identify that the problem was isolated to a single data flow task in the SSIS package and I was able to pull the execution plan from sys.dm_exec_query_plan (see attached).

The sys.dm_exec_query_stats tells me that the query optimizer is creating a new (bad) plan every day, hence my initial attempt to clear out the plan using DBCC FREEPROCCACHE made no difference.

The actual query returns about 35,000 rows and as recently as Friday 27/03, it took just 10 seconds to complete. I was able to identify the execution plan for that run also (and I can supply a copy of this too if someone thinks it will help). Both execution plans are complex and the only obvious difference I could see was that the "good" plan made a big over estimate of the total number of rows. Specifically, the "good" plan estimated 440,000 rows returned approx and the "bad" plan estimated 66,000 rows approx. I think the difference can probably be accounted for by the introduction of a maintenance job that performs index optimization and updates modified statistics. It seems like making sure the statistics are fresher and indexes are not fragmented has triggered the problem. But this is counter intuitive and I'm assuming it's a coincidence.

Can anyone please advise me on how best to tackle this problem? I might break the query down into two (or more) separate queries and associate them with separate (additional) data flow tasks. I expect this will solve the problem but if it works I'll still be none the wiser as to why the problem happened in the first place and whether or not I could resolve it in a more efficient manner, perhaps using query hints to help the optimizer.

Many thanks in advance.

SQLkiwi 2015-03-31 12:52:32
It is almost certainly not coincidental. Updating statistics causes dependant execution plans to recompile. The idea is that newer, potentially better, information is available so the plan is recompiled for "optimality reasons".

Large and complex execution plans often exhibit poor stability. Small changes in statistical information, can often cause substantial changes in plan shape. Some intuition for this: complex queries have a larger space of possible plans; the optimizer only has time to look at some subset of that space; and larger plans encourage small errors in estimation to grow, often exponentially. Even moderately-sized execution plans are often based on what comes down essentially to guesswork for decisions more than several operations away from a leaf node (where statistics are maintained). The further the optimizer is away from the leaf, the more it is making cost-based decisions based on an estimate based on a previous estimate, which is turn based on a previous estimate … and so on.

To get good plan stability from the optimizer, the query needs to be simpler, as relationally-expressed as possible, and with good supporting statistics and indexes. Breaking queries into multiple steps is one step in achieving this. The benefit of investing this effort is that the QO will continue to make good choices over time, taking changes in data distribution and volume into account.

The other end of the scale is to force a known good plan shape for the monolithic query using a Plan Guide (see Books Online). This guarantees the broad plan shape, but obviously you lose flexibility over time. If the data is likely to change such that a currently good plan might be terrible at some point in the future, a Plan Guide is probably not the way to go.

irb56 2015-03-31 13:29:01
Thanks very much for your advice. It's helpful to know that the updating of statistics probably triggered the re-compilations and destabilized what previously had been an acceptable execution plan. I'll break the query down into two or more simpler ones and expect to resolve the issue that way. Intuitively it was what I thought would be the answer but it's helpful to get the background knowledge on how changes in statistics can wreck what were previously relatively good plans. I don't think I'll go the plan guide route as this will probably just store trouble up for later.

Much appreciated, thank you. 🙂