Business objects query with EXCEPT runs slow

mylifeandsql 2018-06-26 12:55:10

The query below is generated by Business Objects. It runs for over 20 minutes before I cancel it. without the EXCEPT operator, each query runs in about 6 seconds. Is there a reason why it runs so slow with the EXCEPT operator?

SELECT
  L27RTP_STUDENTS.STUDENTS_ID,
  L27RTP_STUDENTS.SHORT_NAME_STU,
  L27RTP_STUDENTS.REUNION_CLASS_STU,
  L27RTP_STUDENTS.STU_ACTIVE_COHORT,
  L27RTP_STUDENT_PROGRAMS_ACTIVE.STPR_ACAD_PROGRAM,
  L27RTP_STUDENTS.STU_ACTIVE_PROGRAM_STATUS,
  STUDENTS_STPR_MINORS_LIST.STU_ACTIVE_MINORS
FROM
  ( 
  SELECT STUDENTS_ID,
    STUFF((SELECT STPR_MINORS FROM 
          (SELECT CHAR(13) + CHAR(10) + STPR_MINORS AS STPR_MINORS
                 FROM L27RTP_STUDENT_PROGRAMS_STPR_MINOR_LIST_ACTIVE
            WHERE STUDENT_PROGRAMS_ID LIKE 
                      STUDENTS_ID + '*%'
           UNION ALL
                   SELECT CHAR(13) + CHAR(10) + ACPG_MINORS
                 FROM L27RTP_STUDENT_PROGRAMS_ACTIVE
                         INNER JOIN L27RTP_ACAD_PROGRAMS_ACPG_MINORS
                           ON STUDENT_PROGRAMS_ID LIKE '%*' + ACAD_PROGRAMS_ID
            WHERE STUDENT_PROGRAMS_ID LIKE 
                      STUDENTS_ID + '*%') MINOR_LIST
FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '') AS STU_ACTIVE_MINORS
FROM L27RTP_STUDENTS
  )  STUDENTS_STPR_MINORS_LIST INNER JOIN L27RTP_STUDENTS ON (L27RTP_STUDENTS.STUDENTS_ID=STUDENTS_STPR_MINORS_LIST.STUDENTS_ID)
   LEFT OUTER JOIN L27RTP_STUDENT_PROGRAMS_ACTIVE ON (L27RTP_STUDENTS.STUDENTS_ID=L27RTP_STUDENT_PROGRAMS_ACTIVE.STPR_STUDENTS_ID)
   LEFT OUTER JOIN ( 
  SELECT STA_STUDENT_ID, STA_ACAD_LEVEL, STA_START_TERM
        FROM L27RTP_STUDENT_ACAD_LEVELS
WHERE STA_START_DATE <= getdate()
  AND (STA_END_DATE > getdate() OR STA_END_DATE IS NULL)
  AND LEN (STA_ACAD_LEVEL) > 0 
  )  STUDENTS_STU_CURRENT_ACAD_LEVEL_MULTI_VALUES ON (L27RTP_STUDENTS.STUDENTS_ID=STUDENTS_STU_CURRENT_ACAD_LEVEL_MULTI_VALUES.STA_STUDENT_ID)
 
WHERE
  (
   L27RTP_STUDENTS.STU_ACTIVE_COHORT  LIKE  'A%'
   AND
   STUDENTS_STU_CURRENT_ACAD_LEVEL_MULTI_VALUES.STA_ACAD_LEVEL  IN  ( 'GR','UG'  )
  )
EXCEPT  
SELECT
  L27RTP_STUDENTS.STUDENTS_ID,
  L27RTP_STUDENTS.SHORT_NAME_STU,
  L27RTP_STUDENTS.REUNION_CLASS_STU,
  L27RTP_STUDENTS.STU_ACTIVE_COHORT,
  L27RTP_STUDENT_PROGRAMS_ACTIVE.STPR_ACAD_PROGRAM,
  L27RTP_STUDENTS.STU_ACTIVE_PROGRAM_STATUS,
  STUDENTS_STPR_MINORS_LIST.STU_ACTIVE_MINORS
FROM
  ( 
  SELECT STUDENTS_ID,
    STUFF((SELECT STPR_MINORS FROM 
          (SELECT CHAR(13) + CHAR(10) + STPR_MINORS AS STPR_MINORS
                 FROM L27RTP_STUDENT_PROGRAMS_STPR_MINOR_LIST_ACTIVE
            WHERE STUDENT_PROGRAMS_ID LIKE 
                      STUDENTS_ID + '*%'
           UNION ALL
                   SELECT CHAR(13) + CHAR(10) + ACPG_MINORS
                 FROM L27RTP_STUDENT_PROGRAMS_ACTIVE
                         INNER JOIN L27RTP_ACAD_PROGRAMS_ACPG_MINORS
                           ON STUDENT_PROGRAMS_ID LIKE '%*' + ACAD_PROGRAMS_ID
            WHERE STUDENT_PROGRAMS_ID LIKE 
                      STUDENTS_ID + '*%') MINOR_LIST
FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '') AS STU_ACTIVE_MINORS
FROM L27RTP_STUDENTS
  )  STUDENTS_STPR_MINORS_LIST INNER JOIN L27RTP_STUDENTS ON (L27RTP_STUDENTS.STUDENTS_ID=STUDENTS_STPR_MINORS_LIST.STUDENTS_ID)
   LEFT OUTER JOIN L27RTP_STUDENT_PROGRAMS_ACTIVE ON (L27RTP_STUDENTS.STUDENTS_ID=L27RTP_STUDENT_PROGRAMS_ACTIVE.STPR_STUDENTS_ID)
   LEFT OUTER JOIN ( 
  SELECT STA_STUDENT_ID, STA_ACAD_LEVEL, STA_START_TERM
        FROM L27RTP_STUDENT_ACAD_LEVELS
WHERE STA_START_DATE <= getdate()
  AND (STA_END_DATE > getdate() OR STA_END_DATE IS NULL)
  AND LEN (STA_ACAD_LEVEL) > 0 
  )  STUDENTS_STU_CURRENT_ACAD_LEVEL_MULTI_VALUES ON (L27RTP_STUDENTS.STUDENTS_ID=STUDENTS_STU_CURRENT_ACAD_LEVEL_MULTI_VALUES.STA_STUDENT_ID)
 
WHERE
  (
   L27RTP_STUDENTS.STU_ACTIVE_COHORT  LIKE  'A%'
   AND
   L27RTP_STUDENTS.STU_ACTIVE_HIATUS_CODE  =  'W'
   AND
   STUDENTS_STU_CURRENT_ACAD_LEVEL_MULTI_VALUES.STA_ACAD_LEVEL  IN  ( 'GR','UG'  )
  )
Hugo Kornelis 2018-06-26 17:23:15
Without knowing your data model (tables, indexes); your data; and without access to the exection plans, there's no way to give a conclusive answer. If possible, please post the plans: actual execution plan for the fast query, preferably actual execution plan for the slow query too but if you can't afford to wait for it to finish we can make do with an estimated plan for this one.

If you want speculation, here goes:
1. EXCEPT means that only distinct rows from the left set are returned. Perhaps removing duplicates takes a lot of effort?
2. EXCEPT also translates to the equivalent of a NOT EXISTS for the right set. NOT EXISTS is sometimes accomplished by using a Nested Loops join and a row goal. This strategy can be very vulnerable to misestimation and to skewed data distribution.

There are probably many ways to optimize this query. But that's not what you're asking for, and since it's a generated query you might not have any control over the query so I'm not going there right now.

Hugo Kornelis 2018-06-27 12:54:18
So here are a few things I see in the execution plan and that are probably NOT the cause of your pain:

  1. Because of the EXCEPT, SQL Server has to find "distinct" values in the upper branch of the T-SQL (which has become the lower branch in the execution plan). It does this with a Hash Match (Aggregate) operator with a high estiamted cost – 90% in SSMS or 89.8% in PlanExplorer. Both tools are lying but that's because they state "estimated" cost and there are lots of estimates way off in this execution plan. (E.g. the Hash Match (Aggregate) itself processes just 324 rows instead of the estiamted 85325). This is added because of the EXCEPT but it is not the cause of the performance difference.
  2. The reason that the two inputs are swapped is that SQL Server estimates that the upper input will return 84,818 distinct rows, and the lower input will return just 4195 rows. It uses a Hash Match (Right Anti Semi Join) to remove rows from the sedcond set (first in the T-SQL) that also appear in the other set, and because Hash Match joins store the upper input in memory SQL Server always puts the set with the lowest estimated cardinality on top. Note that the actual rowcounts from the two parts of your query are 324 and 0; given the small size of these sets there is really very little difference between the various options.

My biggest concern in the excution plan is the top-right operator in both of the branches. It is a Clustered Index Scan on table STUDENT_ACAD_LEVELS. This table is not in your query so I assume that you are working with views. The table holds almost 120,000 rows so using a Clustered Index Scan to retrieve a subset of the rows is already non-optimal. The predicate is on a start date (<=getdate), end date (>getdate OR null), and on a weird substring function that (if I see it correctly) takes a column STUDENT_ACAD_LEVELS_ID and strips it off everything before a '*' character; the result then has start with GR or UG and have non-zero length (which is redundant give the first test).

What mostly bothers is not that no seek is used (which given the predicate would have been impossible), but the terrible cardinality estimation. Estimated rows is 235; actual is 5489 – almost 25 times as many. And given the location of this operator in the plan, the estimate of 235 has driven a lot of decisions. May operators are repeated once per row from this table, which is okay (ish) for 235 rows but can become a major issue with 5489 rows.

Are your statistics up to date? If no, then update them and see if you get better estimations and a better execution plan. Also, the predicate on "the part of a column that follows the * character" is a strong indication that your database is not properly normalized; that part of the column should almost certainly have been a separate column. (Which would have been much easier to index, and to get good cardiniality estimations for).

Another major issue is the STPR_DATES table. This table is accessed four times in total (twice in the upper part, twice in the lower part). All of the times its clustered index is scanned. A seek would be better – and in this case easy to obtain for the predicate of start date < getdate and end date > getdate or NULL by having a covering index on start_date. SQL Server estimates that 7277 rows will be returned per execution; in reality it's just 6357. Now you may think that scanning 205K rows to read 6537 rows may be a minor issue and that would be right if it happens once. But because of the issue above, each of the four copies of this operator runs 5486 times. And reading 200K rows (4 * 5486) is actually a lot of work.

There are also several index spools and table spools in the execution plan. These are sometimes needed for getting correct results, sometimes used to prevent even worse performance, but they can be true performance killers. It's hard to tell if they are actually the main issue here, and even harder to tell you how to fix this without knowing more about your query, view definitions, and data. If you run this query with SET STATISTICS IO ON, then I expect that you will see one or more lines marked Worktable with a high number of logical reads.

I have also looked over the text of your query in amazement. The second part is (unless I miss something) almost exactly the same as the first part, except that in the second part one extra predicate (L27RTP_STUDENTS.STU_ACTIVE_HIATUS_CODE = 'W') is added. Again, knowledge of data is needed to determine wether this is actually exactly the same, but I think that it would be easier to add L27RTP_STUDENTS.STU_ACTIVE_HIATUS_CODE <> 'W' to the first query (plus null handling if needed) and remove the entire except construction.

If you are sure that you really need the EXCEPT condition, then at least rewrite it as NOT EXISTS. The benefit is that now you only need to return the key values instead of all the columns, so you can simplify the second query (e.g. that FOR XML construction is probably not needed a second time at all). This will also greatly reduce the amount of work needed for the optimizer.

These are just a few pointers. As mentioned before, I do not know how your views are defined, what tables you have, how your data is distributed, what constraints are in place, etc.

All of the above does not answer your original question: how it is possible that each of the two queries runs in 6 seconds but the combination takes over 20 minutes. I could not see that from the execution plan. The most liekly explanation is that without the EXCEPT the optimizer somehow comes up with a completely different plan, that works much better – but I have no idea why. (Nor even whether it is true at all since I do not know how the execution plans for the individual queries look).

HTH