High-Performance Method For Joining Large Tables?
First, a little background. All the query needs to do, basically, is join two tables containing descriptive data for financial securities. One, called SecurityDaily, holds columns that often change daily, and the other, called SecuritySemiStatic, holds columns that change less often. (About once a month, on average.) This is done mostly to save space, since there are more than two billion rows in the daily table and there are 500+ columns between the two tables. The query I am trying to optimize joins the two tables for all of the dates within a date range, for a given set of accounts. (Securities are identified by an account-ID/security-ID pair. The tables are joined based on the security and a semi-static date. Dates are represented as integers.) I have created "new" tables that only have a subset of the rows while I experiment with the table structure.
What I have done so far is create both tables so that their clustered indexes start with the same three columns: an account ID, security ID, and semi-static date. The clustered index of the daily table then has a fourth column: the "report" date.
In the attached session, you'll see that I've written a query (the "Single-stage query") that takes advantage of the clustered indexes to do a merge join of the two tables. Both sides of the join contain a duplicate sub-tree though, so I tried putting the results of that into a temp table first, in the "Two-stage query". Now SQL Server wants to sort both inputs of the merge sort, even though it shouldn't need to due to the clustered index on the temp table.
Without a Sort operator, you can't really get data in an order based on columns from two tables. And the QO must think that the third predicate (semiStaticDate) is worth having in the Merge Join's Where property (rather than a Residual Predicate), and that it'll be cheaper to do a Sort.
(What follows is a hunch… hoping that it might improve things, but not completely sure)
Let's try to get it to do the Merge Join on the main two columns, and handle the dates in the Residual Predicate. Your Merge Join is already "Many-to-Many=True", so this might not be too bad. (But yeah – it could be awful if the QO is right.)
One way to persuade a predicate to drop into a residual is to consider that a Merge Join doesn't like inequalities. So try changing AND ss.semiStaticDate = sd.semiStaticDate into AND ss.semiStaticDate <= sd.semiStaticDate AND ss.semiStaticDate >= sd.semiStaticDate, because the QO might not realise they're the same.
You might also find that you need to drop semiStaticDate from the Clustered Index of #Keys, but hopefully you won't need to change that.
Give it a try at least, and let me know how it goes…
The basic issue is that the final merge requires inputs sorted by accountId, securityId, and semiStaticDate, which the CTEs cannot deliver.
Consider a correlated nested loop join from #Keys (K) to SecurityDaily_new (SDN) on accountId, securityId, and semiStaticDate. Assuming the correlations are known to be unique, the join is capable of providing output sorted on:
- K.accountId, K.securityId
- K.accountId, K.securityId, K.semiStaticDate
In addition, if the inner side of the join delivers locally sorted order S, this order may be added as a suffix to any of the orders shown above. In your case, the inner side can deliver output sorted on SDN.semiStaticDate, so the additional available orderings are:
- K.accountId, SDN.semiStaticDate
- K.accountId, K.securityId, SDN.semiStaticDate
- K.accountId, K.securityId, K.semiStaticDate, SDN.semiStaticDate
Now accountId and securityId are known to be equal in the join, so K.accountId and K.securityId are interchangeable with SDN.accountId and SDN.securityId. Even so, none of the available orderings exactly match the required: SDN.accountId, SDN.securityId, SDN.semiStaticDate. The closest available is K.accountId, K.securityId, K.semiStaticDate, SDN.semiStaticDate.
This is achievable by projecting an extra column ( K.semiStaticDate ) from the CTEs and adding that to the final join condition. If I understand your requirement correctly, this will not change the result of your query, but it can deliver the merge join you desire. The amended query is below:
DECLARE @beginDate integer = 5474, @endDate integer = 6202; WITH D AS ( SELECT SD.*, SSD = K.semiStaticDate -- NEW FROM #Keys AS K JOIN dbo.SecurityDaily_new AS SD ON SD.accountId = K.accountId AND SD.securityId = K.securityId AND SD.semiStaticDate >= K.semiStaticDate AND SD.semiStaticDate <= @endDate WHERE SD.isHeld = 1 AND SD.reportDate BETWEEN @beginDate AND @endDate ), S AS ( SELECT ss.*, SSD = K.semiStaticDate -- NEW FROM #Keys AS K JOIN dbo.SecuritySemiStatic_new AS SS ON SS.accountId = K.accountId AND SS.securityId = K.securityId AND SS.semiStaticDate >= K.semiStaticDate AND SS.semiStaticDate <= @endDate ) SELECT SD.accountId, SD.securityId, SD.reportDate, SD.price, SS.maturityDate, SS.SecurityType, SS.currencyId, SS.isOpenRepo, SS.finalMaturityDate FROM D AS SD INNER MERGE JOIN S AS SS ON SS.accountId = SD.accountId AND SS.securityId = SD.securityId AND SS.SSD = SD.SSD -- NEW, must go exactly here AND SS.semiStaticDate = SD.semiStaticDate;
Note: the order of the final merge join clauses is significant.
This produces the plan:
Please be aware that join hints – INNER MERGE|HASH|LOOP JOIN – come with an implicit FORCE ORDER, meaning the whole plan join order will match the written order in the query. There are other restrictions placed on the optimizer with this implied hint as well, including the ability to introduce or move around optimizing aggregations. Join hints are rarely the optimal solution to a problem. They are much more restrictive to the optimizer than regular hints.
The reason the original query did not suffer from this problem is that the optimizer was able to obtain and infer more information about uniqueness and relationships than is expressed by the single unique index on the replacement temporary table.
All that said, I would probably abandon the idea of a merge join here. Merge join can be very efficient when one-to-many, serial, and ordered inputs do not require a sort. A many-to-many merge is much less efficient, since it uses a worktable to store, rewind, and reply duplicate join keys.
I would look to write a simpler query that does not require a join hint, probably based on cross apply. The basic idea would be to join accounts and securities to the static rows, then expand that for the extra detail rows in the daily table. The question isn't clear enough to me to attempt writing that query, and this site is more about execution plans than query writing anyway. You could ask for query writing help on Database Administrators Stack Exchange.