This is a user-facing query that runs thousands of times a day. I have managed to improve it in terms of performance, but, frankly, it's ugly now, and I am trying to put some lipstick on this pig. Any help would be very much appreciated!
First, a little background. All the query needs to do, basically, is join two tables containing descriptive data for financial securities. One, called SecurityDaily, holds columns that often change daily, and the other, called SecuritySemiStatic, holds columns that change less often. (About once a month, on average.) This is done mostly to save space, since there are more than two billion rows in the daily table and there are 500+ columns between the two tables. The query I am trying to optimize joins the two tables for all of the dates within a date range, for a given set of accounts. (Securities are identified by an account-ID/security-ID pair. The tables are joined based on the security and a semi-static date. Dates are represented as integers.) I have created "new" tables that only have a subset of the rows while I experiment with the table structure.
What I have done so far is create both tables so that their clustered indexes start with the same three columns: an account ID, security ID, and semi-static date. The clustered index of the daily table then has a fourth column: the "report" date.
In the attached session, you'll see that I've written a query (the "Single-stage query") that takes advantage of the clustered indexes to do a merge join of the two tables. Both sides of the join contain a duplicate sub-tree though, so I tried putting the results of that into a temp table first, in the "Two-stage query". Now SQL Server wants to sort both inputs of the merge sort, even though it shouldn't need to due to the clustered index on the temp table.
By sam.bishop 54 asked Jan 06, 2017 at 11:05 PM
The basic issue is that the final merge requires inputs sorted by accountId, securityId, and semiStaticDate, which the CTEs cannot deliver.
Consider a correlated nested loop join from
In addition, if the inner side of the join delivers locally sorted order
Now accountId and securityId are known to be equal in the join, so
This is achievable by projecting an extra column (
Note: the order of the final merge join clauses is significant.
This produces the plan:
Please be aware that join hints -
The reason the original query did not suffer from this problem is that the optimizer was able to obtain and infer more information about uniqueness and relationships than is expressed by the single unique index on the replacement temporary table.
All that said, I would probably abandon the idea of a merge join here. Merge join can be very efficient when one-to-many, serial, and ordered inputs do not require a sort. A many-to-many merge is much less efficient, since it uses a worktable to store, rewind, and reply duplicate join keys.
I would look to write a simpler query that does not require a join hint, probably based on cross apply. The basic idea would be to join accounts and securities to the static rows, then expand that for the extra detail rows in the daily table. The question isn't clear enough to me to attempt writing that query, and this site is more about execution plans than query writing anyway. You could ask for query writing help on Database Administrators Stack Exchange.
By SQLkiwi ♦ 6.6k answered Jan 09, 2017 at 02:53 PM
Look at the "Where (Join Columns)" property of the Merge Join operator, and also on the "Order By" properties of the two Sort operators. You'll see that it's using the accountid and securityid columns of #Keys, but following this by the semiStaticDate column. And that's not in order, based on the inequality joins you have within your CTE statements.
Without a Sort operator, you can't really get data in an order based on columns from two tables. And the QO must think that the third predicate (semiStaticDate) is worth having in the Merge Join's Where property (rather than a Residual Predicate), and that it'll be cheaper to do a Sort.
(What follows is a hunch... hoping that it might improve things, but not completely sure)
Let's try to get it to do the Merge Join on the main two columns, and handle the dates in the Residual Predicate. Your Merge Join is already "Many-to-Many=True", so this might not be too bad. (But yeah - it could be awful if the QO is right.)
One way to persuade a predicate to drop into a residual is to consider that a Merge Join doesn't like inequalities. So try changing
You might also find that you need to drop semiStaticDate from the Clustered Index of #Keys, but hopefully you won't need to change that.
Give it a try at least, and let me know how it goes...
By Rob Farley 196 answered Jan 09, 2017 at 08:17 AM