Query running slow

Arvind 2015-04-09 11:43:49

I have a SQL Server 2008 R2 instance, and the following query is taking a long time to run: around 1 hour 24 minutes. I have updated the statistics on all the tables participating in the query, though we do not use FULLSCAN.
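
For reference, the statistics update we run is the default sampled kind, along the lines of the statement below (the table name is just one of those in the query):

-- Default sampled statistics update; we do not use the FULLSCAN variant
UPDATE STATISTICS Fact.PolicyTransaction;
-- UPDATE STATISTICS Fact.PolicyTransaction WITH FULLSCAN;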

I have captured the actual plan in Plan Explorer (attached). In the plan, I see that the estimated row counts are on the high side, and there are RID Lookups on the Dim.Organization table. I need your help in finding out what the issue is with this plan, and ways to improve the execution time.

Query:

SELECT
    Dim.Organization.Product,
    Dim.Organization.Department,
    Dim.Organization.ProfitCenter,
    COUNT(DISTINCT
        CASE WHEN Dim.Submission.SubmissionCount = 1
        THEN CONVERT(NVARCHAR(100),
            CONVERT(NVARCHAR(100), Dim.Submission.SubmissionNumber) + '-' +
            CONVERT(NVARCHAR(100), Dim.Submission.CustomSubmissionVersionID))
        END),
    COUNT(DISTINCT Dim.Policy.PolicyNumber + Dim.Policy.PolicyDoc),
    CASE WHEN 'Ironshore Agency Share' = 'Iron-Starr Agency Share'
    THEN SUM(Fact.PolicyTransaction.GrossWrittenPremiumRC)
    ELSE SUM(Fact.PolicyTransaction.GrossWrittenPremiumRC * Dim.Policy.IronshoreAgencySharePct)
    END
FROM Dim.Organization
INNER JOIN Fact.PolicyTransaction
    ON (Fact.PolicyTransaction.OrganizationKey = Dim.Organization.OrganizationKey
    AND Fact.PolicyTransaction.EndDate > CONVERT(DATE, '04/05/2015')
    AND Fact.PolicyTransaction.StartDate <= CONVERT(DATE, '04/05/2015')
    AND Dim.Organization.StartDate <= CONVERT(DATE, '04/05/2015')
    AND Dim.Organization.EndDate > CONVERT(DATE, '04/05/2015')
    AND Fact.PolicyTransaction.SourceSystemID NOT IN (9))
INNER JOIN Dim.Submission
    ON (Fact.PolicyTransaction.SubmissionKey = Dim.Submission.SubmissionKey
    AND Fact.PolicyTransaction.EndDate > CONVERT(DATE, '04/05/2015')
    AND Fact.PolicyTransaction.StartDate <= CONVERT(DATE, '04/05/2015')
    AND Dim.Submission.EndDate > CONVERT(DATE, '04/05/2015')
    AND Dim.Submission.StartDate <= CONVERT(DATE, '04/05/2015')
    AND Fact.PolicyTransaction.SourceSystemID NOT IN (9))
INNER JOIN Dim.Policy
    ON (Dim.Policy.PolicyKey = Fact.PolicyTransaction.PolicyKey
    AND Dim.Policy.EndDate > CONVERT(DATE, '04/05/2015')
    AND Dim.Policy.StartDate <= CONVERT(DATE, '04/05/2015')
    AND Fact.PolicyTransaction.EndDate > CONVERT(DATE, '04/05/2015')
    AND Fact.PolicyTransaction.StartDate <= CONVERT(DATE, '04/05/2015')
    AND Dim.Policy.SourceSystemID NOT IN (9)
    AND Fact.PolicyTransaction.SourceSystemID NOT IN (9))
GROUP BY
    Dim.Organization.Product,
    Dim.Organization.Department,
    Dim.Organization.ProfitCenter;
SQLkiwi 2015-04-09 15:58:42
I assume this is a generated query, because that would explain some of its odd features: all predicates placed in the ON clauses (in case a join were an outer or full join), the NOT IN (9) construction (which might sometimes be a longer list), and a comparison between two string literals that will always be false ('Ironshore Agency Share' = 'Iron-Starr Agency Share'). If so, it may limit what you are able to change in some respects, but improvement is still possible.

Anyway, there are two distinct problem areas in the submitted plan. First:

Problem Area 1

Nested Loops Join is an expensive strategy when the outer input is large. 71,764,343 rows definitely counts as large. 499,658,841 rows for the RID Lookup even more so.

To eliminate the RID Lookup, add the StartDate and EndDate columns to the keys of an index on the Organization table, with OrganizationKey as the leading key. Product, Department, and ProfitCenter should be included columns, so the index covers the query. An example of a suitable index:

CREATE INDEX i -- Naming is up to you
ON Dim.Organization
(
    OrganizationKey,
    EndDate,
    StartDate
)
INCLUDE
(
    Product,
    Department,
    ProfitCenter
);

This assumes EndDate will usually be more selective than StartDate. If that is not true for you, reverse the order of those columns. You may even need two indexes with start and end date keys in different orders for best performance overall. That is something only you can determine through analysis and testing.
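
For illustration, the reversed-key variant would look like the sketch below; whether you need it instead of (or as well as) the first index is something only your data and testing can tell you:

CREATE INDEX i2 -- Naming is up to you
ON Dim.Organization
(
    OrganizationKey,
    StartDate, -- reversed: StartDate ahead of EndDate
    EndDate
)
INCLUDE
(
    Product,
    Department,
    ProfitCenter
);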

The indexing on PolicyTransaction is more interesting. The potential key columns are SourceSystemID, StartDate, and EndDate, all of which are compared using inequalities. Given the very low selectivity of the values used in this query, it is a little surprising the optimizer chose a seek over a scan. Perhaps the table is very wide, or has a lot of unused space (a separate issue).

As a fact table, it is unusual for it to have nonclustered indexes at all (aside from a clustered index, which is conspicuous by its absence). The current index on SourceSystemID is not selective, though it does at least cover the columns needed by this query. I hope that is by design, and the index does not include all columns from the base table! It seems possible that SourceSystemID (perhaps with another column to make the index unique) should be the clustered index on this table, not a nonclustered index.

So, consider dropping the existing nonclustered on PolicyTransaction, and replacing it with a clustered index that will be useful to future queries against this table. For example:

-- Suggestion only
DROP INDEX IDX_NC_SourceSystemID ON Fact.PolicyTransaction;
 
CREATE UNIQUE CLUSTERED INDEX CX -- Name this!
ON Fact.PolicyTransaction
(
    StartDate,
    EndDate,
    SourceSystemID
    -- Plus any other column(s) needed to make the index unique, if possible
);

Again, the exact choice of key order depends on your knowledge of the data and querying patterns. SourceSystemID should probably not be the leading key if it is usually specified as an inequality, for instance.

The second problem area of the plan is:

Distinct Aggregates

This plan shape results from the use of multiple DISTINCT aggregates in the query. This can be tough to improve, especially within a generated-query template, but partial pre-aggregation can definitely help. See the article below for details:

http://blogs.msdn.com/b/sqlqueryprocessing/archive/2008/09/22/distinct-aggregation-considered-harmful.aspx
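
To give a flavour of the idea, here is a minimal sketch of partial pre-aggregation using a hypothetical table dbo.T with grouping column g and distinct-counted column x (not the generated query itself). Each distinct aggregate in your query could be fed from its own pre-aggregated derived table in the same way, joined back on the grouping columns:

-- Sketch only: replace COUNT(DISTINCT x) with a pre-aggregation step
SELECT
    pre.g,
    COUNT(pre.x) AS distinct_x -- COUNT ignores NULL, matching COUNT(DISTINCT x)
FROM
(
    SELECT g, x
    FROM dbo.T
    GROUP BY g, x -- partial pre-aggregation: one row per (g, x) pair
) AS pre
GROUP BY pre.g;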

You may get some relief by upgrading to SQL Server 2012 (or later) at some point:

http://sqlblog.com/blogs/paul_white/archive/2011/12/04/is-distinct-aggregation-still-considered-harmful.aspx

Ultimately, though, this sort of query needs the columnstore technology in SQL Server 2012 (and preferably 2014) to really fly. Performance improvements of 50x or more are not uncommon, and this particular query should do better than that. That is for the future though, and columnstore does require Enterprise Edition. Nevertheless, I have to mention it because it is the most complete overall solution.
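
If you do eventually upgrade, the fact table is the natural target. A sketch only: the column list below is an assumption based on the columns this query touches, and note that a nonclustered columnstore index makes the table read-only on SQL Server 2012 and 2014:

-- Sketch: nonclustered columnstore index on the fact table
-- (SQL Server 2012+, Enterprise Edition; the index name is hypothetical)
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_PolicyTransaction
ON Fact.PolicyTransaction
(
    OrganizationKey,
    SubmissionKey,
    PolicyKey,
    StartDate,
    EndDate,
    SourceSystemID,
    GrossWrittenPremiumRC
);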

The existing plan shape (Eager Spools) is very wasteful, writing 71.3 million rows to a tempdb worktable before replaying those rows three times through three very similar aggregates. Investigate the partial aggregation idea to reduce the repeated and wasted effort here.

Arvind 2015-04-10 05:59:11
Paul, thanks for the detailed explanation. I will work on the indexes and will let you know about the performance.
Yes, it is a generated query for a report from Business Objects (SAP BO). This database is part of an ETL workflow, where data is loaded from the data warehouse (GDW) into the data mart database; the BO application then uses it for reporting once the data is loaded.