DISTINCT clause degrading performance

NPS 2018-07-17 14:15:56

Below query took 8 min to execute in staging server. This script is generated by microstrategy tool and hence it has such weird aliases.

select distinct a11.CustomerDivisionKey CustomerDivisionKey,
a11.CustomerDivisionID CustomerDivisionID,
a11.CustomerDivisionName CustomerDivisionName
from Dim_CustomerDivision a11
join Fact_Project a12
on (a11.CustomerDivisionKey = a12.CustomerDivisionKey)
join vwDataPermission_Project_Requsition a13
on (a12.ProjectKey = a13.SSOProjectKey)
where (a11.CustomerDivisionKey not in (-9)
and a13.UserName = N'abcd@gmail.com')
order by 1 asc

If I remove DISTINCT clause from the above script then it executes within 5 sec. But I cannot remove DISTINCT clause because this sql query is created by microstrategy tool. I also updated the statistics of these tables but that did not improve the performance at all.
So please let me know your thoughts as to how can I improve the performance without changing the sql script.
Please find the attachment for its actual execution plan.

Hugo Kornelis 2018-07-20 21:14:49
It's hard to give a complete answer without having access to the execution plans of both versions of the query. And because the execution plan upload function on this forum is currently broken there is unfortunately no easy way to fix this.

Here are some generic thoughts.

First: How many rows did each of the two queries return? If the numbers are different then it shows that the DISTINCT is actuallly needed. If they are the same then it does not show they are not needed – just that it does not do anything for these specific parameters and on your specific data set on the staging server.

Second: DISTINCT will always lower performance because you force the optimizer to do the extra work to check for duplicated data. (Unless there are reasons based on constraints and/or query logic that the optimizer can use to interfere that duplicates are actually impossible). But from 5 seconds to 8 minutes indicates that for some reasont he entire execution plan was completely changed for this, because just adding the logic to remove duplicates to a 5 second execution should at worst double it or so.

Third: I do not know the schema of the two tables involved, nor the definition of the view so I have to do a bit of gambling here. But I guess that each row in Fact_Project joins to exactly one row in Dim_CustomerDivision, whereas each customer division can be related to multiple projects. I also guess that each project relates to multiple rows from the view, but just one of them can be for this particular UserName. The scenario where the query without DISTINCT would return duplicates is when this user has access to two or more projects within the same CustomerDivision.

Even though you join all tables, you only return data from the first. So one other version of the query you can try would be:

SELECT       a11.CustomerDivisionKey,
             a11.CustomerDivisionID,
             a11.CustomerDivisionName
FROM         dbo.Dim_CustomerDivision AS a11
WHERE        a11.CustomerDivisionKey <> -9
AND EXISTS
 (SELECT     *
  FROM       dbo.Fact_Project AS a12
  INNER JOIN dbo.vwDataPermission_Project_Requsition AS a13
        ON   a13.SSOProjectKey = a12.ProjectKey
  WHERE      a12.CustomerDivisionKey = a11.CustomerDivisionKey
  AND        a13.UserName = N'abcd@gmail.com')
ORDER BY     CustomerDivisionKey;

You'll need to test it. Check that I didn't mess up, and it returns the correct results. Also check performance – I HOPE it will be better but no guarantees.

If it does, then the problem is of course: how do you convince your query generating tool to generate a query in this form. That is one of mamy reasons why I dislike code that generates SQL. Some of those tools allow you to hand-craft code or to call stored procedures – perhaps that can be a solution for you?