Trying to figure out how to resolve the data skew

John Couch 2014-03-31 16:31:12

The group by / having clause is causing skew (I believe) and I am not sure how to resolve that. DDL script also attached.

I rewrote the logic 5 different ways. The highest batch cost queries execute the fastest. I have attached all 5 below. The highest batch cost ones also have the better cardinality estimates (although not perfect).

I reached out to the application team to find out why we are doing a count/group by to determine order status vs just checking for a status. I am currently researching alternative coding directions based on their feedback.

So, it turns out, what they are doing is just trying to determine whether the order has been picked up or not. An order in this situation would have a record with a status of 100, but no other records with additional status values. For example, another record with status 101 would mean processing, and a record with status 106 would mean error. So they are looking only for orders with a single record of status 100.

Aaron Bertrand 2014-03-31 16:34:46
For estimated and actual to be off by that much for a simple index or CIX scan (it happens in both plans), I would say you should start by making sure statistics are up to date (and plan to keep them that way). What auto stats options are currently enabled? The faster query manages to filter out the rows from purchase_request first, by isolating the grouping to only the status table. But it still has statistics that are way off, and I think if you correct that problem, SQL Server might have an easier time with the first version of the query too (though I do highly recommend re-writing it in explicit INNER JOIN form).
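
For reference, something along these lines will show the auto-statistics settings and refresh the statistics on the two tables involved (a sketch only: the database name is a placeholder, and the table names are taken from the queries in this thread):

-- Check the database-level auto-statistics options
SELECT
    name,
    is_auto_create_stats_on,
    is_auto_update_stats_on,
    is_auto_update_stats_async_on
FROM sys.databases
WHERE name = N'YourDatabase'; -- placeholder name

-- Refresh statistics with a full scan rather than the default sample
UPDATE STATISTICS dbo.purchase_request WITH FULLSCAN;
UPDATE STATISTICS dbo.purchase_request_status WITH FULLSCAN;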
John M Couch 2014-03-31 18:01:51
I validated statistics are up to date. I even tried running with a FULLSCAN vs. the default SAMPLE. The skew still exists and is considerably worse in the straight inner join syntax.
Aaron Bertrand 2014-03-31 18:06:49
Could you try with a hinted INNER LOOP JOIN as opposed to just INNER JOIN? Perhaps getting rid of that merge join is the key.
SQLkiwi 2014-04-01 06:24:10
The cardinality estimations appear incorrect because the query optimizer has applied a Row Goal when choosing an execution plan. The broad idea of a row goal is to produce a plan that returns the first 'n' rows quickly.

In the original query, this internal row goal was applied during query optimization due to the TOP (8) clause (adding a FAST 8 query hint would have a similar effect).

When a row goal is applied, cardinality estimates are scaled down from the full set on the assumption that potential matches are distributed evenly throughout the data. Where this assumption does not hold (i.e. the matches are skewed toward the end of the set) or where accurate cardinality estimation is difficult, the execution plan ends up processing many more rows than the optimizer expected.

There are several ways to work around this problem when it occurs. One is to execute the query without applying the row goal logic. Microsoft provide documented and supported trace flag 4138 for this purpose. In addition, you can apply this trace flag to a single query using the QUERYTRACEON hint, which is also documented and supported for use with trace flag 4138. The full syntax needed is OPTION (QUERYTRACEON 4138). This will produce a plan optimized for the case where all potential matches need to be checked, rather than stopping after the first 8 are found. To be clear, the query will still only produce a maximum of 8 rows of output – the trace flag just changes the optimizer's approach to finding those rows.
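
As an illustration only, the hint is simply appended to the statement's OPTION clause. The exact text of the application's original query is not reproduced here, so the join and grouping shape below is just representative:

SELECT TOP (8)
    PR.purchase_request_id,
    PR.[type]
FROM dbo.purchase_request AS PR
INNER JOIN dbo.purchase_request_status AS PRS
    ON PRS.purchase_request_id = PR.purchase_request_id
WHERE
    PR.[type] IN (@P1, @P2)
GROUP BY
    PR.purchase_request_id,
    PR.[type]
HAVING COUNT_BIG(*) = 1
ORDER BY
    PR.purchase_request_id
OPTION (QUERYTRACEON 4138); -- optimize without the TOP (8) row goal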

A second solution is to rewrite the query using alternate syntax that happens to generate an execution plan that works better for you in practice. This is where the DBA's knowledge of the data and wider system requirements comes into play. If you know that the query will usually find 8 matches quickly, you might prefer the row-goal plan. If the more normal case is that fewer than 8 matches exist, you could prefer a different plan. Most often, you will end up choosing a plan that works acceptably well for both cases.

There are many different ways to express the same logical requirement in SQL. One that you haven't tried yet is to check for a row with status 100 using EXISTS, and check there is no row with status > 100 using NOT EXISTS. I can't promise this will perform best, but it is something else to try. If you know a row with status 100 must always exist for a row in the related table, you could skip the EXISTS check there.
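
As a sketch of that idea only (the status column name in purchase_request_status is assumed here to be [status]; adjust it to match the real DDL):

SELECT TOP (8)
    PR.purchase_request_id,
    PR.[type]
FROM dbo.purchase_request AS PR
WHERE
    PR.[type] IN (@P1, @P2)
    -- This EXISTS check can be dropped if a status 100 row is guaranteed to exist
    AND EXISTS
    (
        SELECT 1
        FROM dbo.purchase_request_status AS PRS
        WHERE PRS.purchase_request_id = PR.purchase_request_id
        AND PRS.[status] = 100
    )
    AND NOT EXISTS
    (
        SELECT 1
        FROM dbo.purchase_request_status AS PRS
        WHERE PRS.purchase_request_id = PR.purchase_request_id
        AND PRS.[status] > 100
    )
ORDER BY
    PR.purchase_request_id;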

Update

The execution time varies depending on the input parameters. Yes, there are one or two versions that run in 4.5ms, but it is parameter dependent.

Yes. The data distribution means that very different plan shapes are optimal for different parameter values. For example, the purchase_request table contains 3.7 million 'Order' type rows, but only 48 'CloudOrder' type rows. Where both parameter values are set to types with very low cardinality, the optimizer may be able to find a plan that locates those 48 rows in purchase_request first, looks up the matching status rows, and then performs aggregation and filtering to exclude requests with multiple status rows. This is a great execution plan for 48 request rows, but it would be a disaster for 3.7 million (or more):

CloudOrder Plan

Equally, there are many plan shapes that look reasonable to the optimizer based on the statistical information available, the assumptions it makes, and the effect of the row goal as discussed above. These plan shapes may work reasonably well, but they will never be able to compete with the one shown just above, which only ever touches a very small number of rows. If the parameter values are set to 'Order' and 'Quote', the number of request rows to check approaches 4 million.

Finding a good plan for the 'Order' and 'Quote' parameter values is more difficult, because there are so many potential strategies, each of which will perform best given a certain data distribution. From what I can tell, it seems there are relatively few requests that have a single status (13,200 out of 1,089,500). If this is likely to always be the case, a good strategy is to scan the whole status table to find 13,200 requests with a single status, then to loop join into the requests table:

Order Plan

While reasonably efficient given the task at hand, this plan is always going to run longer than the 'CloudOrder' plan. Its performance will also decrease somewhat as the number of status rows increases over time.

Both plans above were generated from the same source query:

SELECT TOP (8)
    PR.purchase_request_id,
    PR.[type]
FROM
(
    SELECT PRS.purchase_request_id
    FROM dbo.purchase_request_status AS PRS
    GROUP BY PRS.purchase_request_id
    HAVING COUNT_BIG(*) = 1
) AS SingleStatus
INNER LOOP JOIN dbo.purchase_request AS PR
    ON PR.purchase_request_id = SingleStatus.purchase_request_id
WHERE 
    PR.[type] IN (@P1, @P2)
ORDER BY
    PR.purchase_request_id
OPTION (RECOMPILE);

Note that the RECOMPILE hint is required to activate certain optimizations for the case where the two parameter values are equal. With the database designed as it currently is, this is probably the query form I would go with.

Because of that, I was looking at the data skew to see if that was causing the issue as it appears to not be requesting enough memory to satisfy the actual data size.

Many of the plans you have provided do not use operators that require a memory grant, so this concern (though valid in other scenarios) is not in play here. Note that only certain operations (hash, sort, exchange) require a memory reservation. In the provided plans where a memory grant was required, e.g. for a hash join, the input cardinality estimates are accurate.
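
If you want to confirm this on your side, in-flight memory grants can be observed while the query runs by checking the memory grants DMV (requires VIEW SERVER STATE); a statement that uses no memory-consuming operators simply will not appear there:

SELECT
    session_id,
    requested_memory_kb,
    granted_memory_kb,
    used_memory_kb,
    ideal_memory_kb
FROM sys.dm_exec_query_memory_grants;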

Ultimately, making this process optimal for all parameter values will require a bit of design work. A common design pattern for applications reading from a queue of work (as appears to be the case here) is for the application to provide a minimum key value from which the search should start.

This can avoid searching the whole table if the application always consumes rows in purchase_request_id order, newer rows always acquire a higher purchase_request_id value than older ones, and old rows cannot subsequently re-qualify for processing.

If this approach is applicable, the application would need to provide the last key value it processed in a new parameter value (say @P3) and a one-line modification to the query could then apply this optimization:

SELECT TOP (8)
    PR.purchase_request_id,
    PR.[type]
FROM
(
    SELECT PRS.purchase_request_id
    FROM dbo.purchase_request_status AS PRS
    WHERE PRS.purchase_request_id > @P3 -- NEW!
    GROUP BY PRS.purchase_request_id
    HAVING COUNT_BIG(*) = 1
) AS SingleStatus
INNER LOOP JOIN dbo.purchase_request AS PR
    ON PR.purchase_request_id = SingleStatus.purchase_request_id
WHERE 
    PR.[type] IN (@P1, @P2)
ORDER BY
    PR.purchase_request_id
OPTION (RECOMPILE);

This allows the search of the status table to be much smaller, leading to a general-case plan that should perform as well as the CloudOrder case:

Last Key Plan

Finally, I can't help but mention that the purchase_request_status table is currently a heap. If this table experiences deletions, it may be consuming much more space than it should because heaps do not usually deallocate pages emptied by delete activity. Consider the benefits of creating a suitable clustered index on this table. There are also some nonclustered indexes that include a unique key in their definition but are not currently marked as unique. These things can be important to the optimizer – you should always declare an index as unique if you can. In addition, you should provide a uniqueness constraint or index across columns that are known to form a key, even if you don't have queries that access that path directly. Giving the optimizer more information is almost always a good thing.
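
As a sketch only (the index names and exact key columns below are assumptions used to illustrate the idea, not taken from your DDL):

-- Assumes (purchase_request_id, [status]) is unique per status row;
-- if it is not, drop the UNIQUE keyword or add a tiebreaker column.
CREATE UNIQUE CLUSTERED INDEX CUQ_purchase_request_status
ON dbo.purchase_request_status (purchase_request_id, [status]);

-- Where an existing nonclustered index already contains a unique key,
-- recreate it declared UNIQUE so the optimizer can use that information.
-- The index name and key columns here are hypothetical.
CREATE UNIQUE NONCLUSTERED INDEX IX_purchase_request_type
ON dbo.purchase_request ([type], purchase_request_id)
WITH (DROP_EXISTING = ON);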

John M Couch 2014-04-01 19:26:36
Thank you for the new insight into row goals. I did try an EXISTS/NOT EXISTS combination, but the execution time was considerably longer than 20 seconds. I am going to try one other approach and will post an update if it resolves the issue.
SQLkiwi 2014-04-02 00:31:14
You're welcome. If you were able to share DDL (ideally a stats-only copy of the tables concerned) I'm sure we'd be able to find an efficient T-SQL construction for you.
John M Couch 2014-04-02 02:40:43
Here is the DDL and all the different versions of the queries I have tried. When you say stats only copy, what do you mean?
SQLkiwi 2014-04-02 03:32:25
John Couch 2014-04-03 15:09:31
Yes, the design is not optimal at all. I have asked the application team to redesign this by dropping the status table and moving those fields back to the project_request table. Then, if the status changes, the old copy of the record would be archived out to another table for reporting, leaving only the current status record in the project_request table.

I was skimming through Plan Explorer and seeing the estimated 287 rows vs. the actual 1,298,922 rows, and trying to get that to a more realistic skew like the est. 9 vs. act. 238 you are getting.

Thank you both for all your time and help with looking at this, and a huge thanks for the explanations on row goals as well as the final update you provided, SQLkiwi. I will be reading this over and over today to make sure I digest as much of it as I possibly can.

Aaron Bertrand 2014-04-03 14:28:21
Wish I could up-vote again.