High-Performance Method For Joining Large Tables?

This is a user-facing query that runs thousands of times a day. I have managed to improve it in terms of performance, but, frankly, it's ugly now, and I am trying to put some lipstick on this pig. Any help would be very much appreciated!

First, a little background. All the query needs to do, basically, is join two tables containing descriptive data for financial securities. One, called SecurityDaily, holds columns that often change daily, and the other, called SecuritySemiStatic, holds columns that change less often. (About once a month, on average.) This is done mostly to save space, since there are more than two billion rows in the daily table and there are 500+ columns between the two tables. The query I am trying to optimize joins the two tables for all of the dates within a date range, for a given set of accounts. (Securities are identified by an account-ID/security-ID pair. The tables are joined based on the security and a semi-static date. Dates are represented as integers.) I have created "new" tables that only have a subset of the rows while I experiment with the table structure.

What I have done so far is create both tables so that their clustered indexes start with the same three columns: an account ID, security ID, and semi-static date. The clustered index of the daily table then has a fourth column: the "report" date.

In the attached session, you'll see that I've written a query (the "Single-stage query") that takes advantage of the clustered indexes to do a merge join of the two tables. Both sides of the join contain a duplicate sub-tree though, so I tried putting the results of that into a temp table first, in the "Two-stage query". Now SQL Server wants to sort both inputs of the merge sort, even though it shouldn't need to due to the clustered index on the temp table.

avatar image By sam.bishop 54 asked Jan 06 at 11:05 PM
more ▼
(comments are locked)
10|10000 characters needed characters left

2 answers: sort voted first

Joining the #Keys temporary table to either base table cannot produce rows in the order required for a final merge join without a bit of help.

The basic issue is that the final merge requires inputs sorted by accountId, securityId, and semiStaticDate, which the CTEs cannot deliver.

Consider a correlated nested loop join from #Keys (K) to SecurityDaily_new (SDN) on accountId, securityId, and semiStaticDate. Assuming the correlations are known to be unique, the join is capable of providing output sorted on:

  1. K.accountId

  2. K.accountId, K.securityId

  3. K.accountId, K.securityId, K.semiStaticDate

In addition, if the inner side of the join delivers locally sorted order S, this order may be added as a suffix to any of the orders shown above. In your case, the inner side can deliver output sorted on SDN.semiStaticDate, so the additional available orderings are:

  1. K.accountId, SDN.semiStaticDate

  2. K.accountId, K.securityId, SDN.semiStaticDate

  3. K.accountId, K.securityId, K.semiStaticDate, SDN.semiStaticDate

Now accountId and securityId are known to be equal in the join, so K.accountId and K.securityId are interchangeable with SDN.accountId and SDN.securityId. Even so, none of the available orderings exactly match the required: SDN.accountId, SDN.securityId, SDN.semiStaticDate. The closest available is K.accountId, K.securityId, K.semiStaticDate, SDN.semiStaticDate.

This is achievable by projecting an extra column ( K.semiStaticDate ) from the CTEs and adding that to the final join condition. If I understand your requirement correctly, this will not change the result of your query, but it can deliver the merge join you desire. The amended query is below:

 DECLARE 
     @beginDate integer = 5474,
     @endDate integer = 6202;
 
 WITH D AS 
 (
     SELECT 
         SD.*, 
         SSD = K.semiStaticDate -- NEW
         FROM #Keys AS K
         JOIN dbo.SecurityDaily_new AS SD
             ON SD.accountId = K.accountId 
             AND SD.securityId = K.securityId 
             AND SD.semiStaticDate >= K.semiStaticDate 
             AND SD.semiStaticDate <= @endDate
         WHERE 
             SD.isHeld = 1 
             AND SD.reportDate BETWEEN @beginDate AND @endDate
 ), S AS 
 (
     SELECT 
         ss.*, 
         SSD = K.semiStaticDate -- NEW
     FROM #Keys AS K
     JOIN dbo.SecuritySemiStatic_new AS SS 
         ON SS.accountId = K.accountId 
         AND SS.securityId = K.securityId 
         AND SS.semiStaticDate >= K.semiStaticDate 
         AND SS.semiStaticDate <= @endDate
 )
 SELECT 
     SD.accountId, SD.securityId, SD.reportDate, SD.price, 
     SS.maturityDate, SS.SecurityType, SS.currencyId, SS.isOpenRepo, SS.finalMaturityDate
 FROM D AS SD
 INNER MERGE JOIN S AS SS 
     ON SS.accountId = SD.accountId 
     AND SS.securityId = SD.securityId 
     AND SS.SSD = SD.SSD -- NEW, must go exactly here
     AND SS.semiStaticDate = SD.semiStaticDate;

Note: the order of the final merge join clauses is significant.

This produces the plan:

Estimated plan

Please be aware that join hints - INNER MERGE|HASH|LOOP JOIN - come with an implicit FORCE ORDER, meaning the whole plan join order will match the written order in the query. There are other restrictions placed on the optimizer with this implied hint as well, including the ability to introduce or move around optimizing aggregations. Join hints are rarely the optimal solution to a problem. They are much more restrictive to the optimizer than regular hints.

The reason the original query did not suffer from this problem is that the optimizer was able to obtain and infer more information about uniqueness and relationships than is expressed by the single unique index on the replacement temporary table.

All that said, I would probably abandon the idea of a merge join here. Merge join can be very efficient when one-to-many, serial, and ordered inputs do not require a sort. A many-to-many merge is much less efficient, since it uses a worktable to store, rewind, and reply duplicate join keys.

I would look to write a simpler query that does not require a join hint, probably based on cross apply. The basic idea would be to join accounts and securities to the static rows, then expand that for the extra detail rows in the daily table. The question isn't clear enough to me to attempt writing that query, and this site is more about execution plans than query writing anyway. You could ask for query writing help on Database Administrators Stack Exchange.

sp.png (48.2 kB)
avatar image By SQLkiwi ♦ 6.6k answered Jan 09 at 02:53 PM
more ▼
(comments are locked)
avatar image sam.bishop Jan 10 at 04:17 AM

Thank you, Paul! I see why it was adding the sorts now.

Would you be willing to share a little more detail about how you might write this query with a CROSS APPLY? I started with simple JOINs to the static and daily tables, similar to what you've suggested. And, of course, writing a join using CROSS APPLY results in the same query plan as just using JOIN, so you obviously have something extra in mind.

If you don't have time, I understand and will try re-posting on the Stack Exchange site. But I'm sure I would learn something useful even if it doesn't turn out to fit the requirements, so I thought I'd ask. Thanks again.

avatar image sam.bishop Jan 10 at 04:30 PM

Never mind. Any input is welcome, of course, but I think I figured out how to break my remaining questions into smaller pieces, that I'd be more comfortable posting to the Stack Exchange site.

avatar image Rob Farley Jan 10 at 11:46 PM

Let me jump in (sorry Paul, if you're writing something at the moment)

 WITH D AS 
 (
  SELECT 
      SD.*, 
      SSD = K.semiStaticDate -- NEW
         FROM #Keys AS K
      CROSS APPLY
      (SELECT SD.*
       FROM dbo.SecurityDaily_new AS SD
          WHERE SD.accountId = K.accountId 
          AND SD.securityId = K.securityId 
          AND SD.semiStaticDate >= K.semiStaticDate 
             AND SD.semiStaticDate <= @endDate
      ) SD
      WHERE 
          SD.isHeld = 1 
          AND SD.reportDate BETWEEN @beginDate AND @endDate
 ), S AS 
 (
  SELECT 
      ss.*, 
      SSD = K.semiStaticDate -- NEW
   FROM #Keys AS K
     CROSS APPLY 
     (SELECT SS.* 
      FROM dbo.SecuritySemiStatic_new AS SS 
      WHERE SS.accountId = K.accountId 
      AND SS.securityId = K.securityId 
      AND SS.semiStaticDate >= K.semiStaticDate 
      AND SS.semiStaticDate <= @endDate
    ) AS SS
 )
 SELECT 
  SD.accountId, SD.securityId, SD.reportDate, SD.price, 
  SS.maturityDate, SS.SecurityType, SS.currencyId, SS.isOpenRepo, SS.finalMaturityDate
 FROM D AS SD
 INNER MERGE JOIN S AS SS 
  ON SS.accountId = SD.accountId 
  AND SS.securityId = SD.securityId 
  AND SS.SSD = SD.SSD -- NEW, must go exactly here
  AND SS.semiStaticDate = SD.semiStaticDate;

Or even...

 WITH D AS 
 (
  SELECT 
      SD.*, 
      SSD = K.semiStaticDate -- NEW
         FROM #Keys AS K
      CROSS APPLY
      (SELECT TOP (9999999) SD.*
       FROM dbo.SecurityDaily_new AS SD
       WHERE SD.accountId = K.accountId 
       AND SD.securityId = K.securityId 
       AND SD.semiStaticDate >= K.semiStaticDate 
       AND SD.semiStaticDate <= @endDate
       ORDER BY SD.semiStaticDate
      ) SD
      WHERE SD.isHeld = 1 
      AND SD.reportDate BETWEEN @beginDate AND @endDate
 ), S AS 
 (
  SELECT 
      ss.*, 
      SSD = K.semiStaticDate -- NEW
   FROM #Keys AS K
     CROSS APPLY 
     (SELECT TOP (9999999) SS.* 
      FROM dbo.SecuritySemiStatic_new AS SS 
      WHERE SS.accountId = K.accountId 
      AND SS.securityId = K.securityId 
      AND SS.semiStaticDate >= K.semiStaticDate 
      AND SS.semiStaticDate <= @endDate
      ORDER BY SS.semiStaticDate
    ) AS SS
 )
 SELECT 
  SD.accountId, SD.securityId, SD.reportDate, SD.price, 
  SS.maturityDate, SS.SecurityType, SS.currencyId, SS.isOpenRepo, SS.finalMaturityDate
 FROM D AS SD
 INNER MERGE JOIN S AS SS 
  ON SS.accountId = SD.accountId 
  AND SS.securityId = SD.securityId 
  AND SS.SSD = SD.SSD -- NEW, must go exactly here
  AND SS.semiStaticDate = SD.semiStaticDate;

...which explains even more that the data should be pulled out of those tables in a particular order.

avatar image sam.bishop Jan 11 at 05:02 PM

Aha! Thank you, Rob! You got me looking at this in a different way. I've simplified the join conditions, the temp table is gone, and I ended up using a second CROSS APPLY that makes the query scale better if it goes parallel. (Maybe that's what Paul had in mind?)

I'm going to mark this as answered now. My final query is below, if you're curious. Thank you both! This forum is a tremendous resource. I learn a lot every time I use it.


 SELECT ca2.*
     FROM #AccountIds AS ai
     JOIN dbo.SecurityActivityRange AS ar ON ar.accountId = ai.accountId AND ar.startDate <= @endDate AND ar.endDate >= @beginDate
     CROSS APPLY (
         SELECT d.accountId, d.securityId, d.reportDate, d.price, 
                 ss.maturityDate, ss.SecurityType, ss.currencyId, ss.isOpenRepo, ss.finalMaturityDate
             FROM dbo.SecuritySemiStatic_new AS ss
             JOIN dbo.SecurityDaily_new AS d ON d.accountId = ss.accountId
                 AND d.securityId = ss.securityId
                 AND d.semiStaticDate = ss.semiStaticDate
                 AND d.isHeld = 1
                 AND d.reportDate >= @beginDate
                 AND d.reportDate <= @endDate
             CROSS APPLY (
                     SELECT TOP(1) accountId, securityId, semiStaticDate
                         FROM dbo.SecurityDaily_new
                         WHERE accountId = ar.accountId AND securityId = ar.securityId AND reportDate BETWEEN @beginDate AND @endDate
                         ORDER BY reportDate ASC
                 ) AS ca
             WHERE ss.accountId = ca.accountId
                 AND ss.securityId = ca.securityId
                 AND ss.semiStaticDate >= ca.semiStaticDate
                 AND ss.semiStaticDate <= @endDate
         ) AS ca2
avatar image SQLkiwi ♦ Jan 11 at 07:38 PM

That's pretty much exactly what I had in mind. I didn't want to guess without table definitions and example data in case I was misunderstanding. I did look out for a follow up question on dba.se. Anyway, glad you worked it out.

10|10000 characters needed characters left

Look at the "Where (Join Columns)" property of the Merge Join operator, and also on the "Order By" properties of the two Sort operators. You'll see that it's using the accountid and securityid columns of #Keys, but following this by the semiStaticDate column. And that's not in order, based on the inequality joins you have within your CTE statements.

Without a Sort operator, you can't really get data in an order based on columns from two tables. And the QO must think that the third predicate (semiStaticDate) is worth having in the Merge Join's Where property (rather than a Residual Predicate), and that it'll be cheaper to do a Sort.

So...

(What follows is a hunch... hoping that it might improve things, but not completely sure)

Let's try to get it to do the Merge Join on the main two columns, and handle the dates in the Residual Predicate. Your Merge Join is already "Many-to-Many=True", so this might not be too bad. (But yeah - it could be awful if the QO is right.)

One way to persuade a predicate to drop into a residual is to consider that a Merge Join doesn't like inequalities. So try changing AND ss.semiStaticDate = sd.semiStaticDate into AND ss.semiStaticDate <= sd.semiStaticDate AND ss.semiStaticDate >= sd.semiStaticDate, because the QO might not realise they're the same.

You might also find that you need to drop semiStaticDate from the Clustered Index of #Keys, but hopefully you won't need to change that.

Give it a try at least, and let me know how it goes...

avatar image By Rob Farley 196 answered Jan 09 at 08:17 AM
more ▼
(comments are locked)
avatar image sam.bishop Jan 09 at 06:52 PM

Thank you, Rob, for looking at this for me!

The optimizer was fooled when I split up the inequality, and the predicate was made residual, but the duration for that statement almost tripled.

I'm not sure what I'm going to try next, but first I need to process Paul's response. Thanks again!

10|10000 characters needed characters left
Your answer
toggle preview:

Up to 50 attachments (including images) can be used with a maximum of 209.7 MB each and 209.7 MB total.

Follow this question

Topics:

x455
x12
x3

asked: Jan 06 at 11:05 PM

Seen: 987 times

Last Updated: Jan 11 at 07:38 PM