Changing the primary key from non-clustered to clustered on tables causing drop in Join performance
This query is joining tables on clustered index keys and taking 10 secs and doing a lot more 'worktable' logical reads than a exactly same query joining the tables with nonclustered primary keys on the same columns which took 5 seconds and a lot less 'worktable' logical reads. Joins are using primary-key/foreign-key columns. Please let me know what is going wrong here that is causing performance to drop.
The Non-clustered primary keys query plan is here
While fiddling around with the query using Clustered index, I changed the join order and made the query performance worse. So the JOIN order does matter and the optimizer can't be left to making the smart decision. The plan for changed JOIN order is here
One generic reason why a clustered index can be slower than an equivalent nonclustered index. Remember, a clustered index includes all columns of the table; a nonclustered only the columns that are indexed or included. So if you have an index over three integer columns in a table that also has twenty wide character columns, a clustered index uses far more bytes per row than that nonclustered index.
This is a double-edged sword. Indexes using just the columns in the nonclustered index will be faster becuase they do less I/O (those twenty character columns are never read). But as soon as more data is needed it canbackfire: now the index can still be found to find matching rows, but an additional lookup is added. Depending on how many rows end up qualifying and requiring the lookup, that can become expensive.
That being said: in your case I think there is more going on. I would typically not expect a change from nonclustered to clustered to make THAT much of a difference, and the additional worktable IO also suggests that there is more to this.
Looking at the execution plan, I see that the most likely source of the IO is the set of four Merge Join operators at the bottom right of the execution plan. The first one (in right to left data-flow order) joins the SAMPLES and ARCHIVE_SAMPLES tables, the second one adds HORIZONS, and after that LAS_RESULTS and OBSERVATIONS are added. The join criteria are rather similar in most cases: agency_code, proj_code, s_id, o_id, h_no, and samp_no. (But samp_no is not used in the second join, and both samp_no and h_no are absent in the last one).
Three of these four Merge Join operators are marked as "many to many". I blog about the problem of many to many merge joins here: https://sqlserverfast.com/blog/hugo/2017/12/many-many-reads-many-many-merge-join/.
Short summary: the optimizer does not trust EITHER of the inputs to be unique on the combination of all join columns. If at least one were unique, it could arrange the tables in proper order to do an efficient one-to-many join, which does not use a worktable at all. As soon as neither inputs has this unique guarantee, it has to resort to many-to-many which DOES use a worktable.
A many-to-many merge join is often caused by an index that should be declared as a UNIQUE index, but isn't. In other words, the user knows that the combination of columns identifies guaranteed unique rows, but doesn't tell SQL Server because they didn't include the UNIQUE keyword in the index creation. In your case, since the index implements the primary key, that cannot be the case – primary key indexes are always considered unique.
Another possible problem is a query error. A many-to-many join may mean that you didn't include all the columns you needed in the join, accidentally introducing duplicated results in the output. That is a bit more likely to be the case in your query, though it would not explain why you didn't have the same issue with nonclustered indexes.
And that brings me to the third and (in my opinion) ost likely explanation. Did you perhaps include extra columns in the index key when you created the primary key indexes? So for instance (and this is pure guesswork), perhaps you HAD a unique nonclustered index on just the columns agency_code and o_id in the SAMPLES table, but now you replaced this with a primary key on agency_code, proj_code, s_id, o_id, h_no, samp_no, and at least one more column. The former guarantees uniqueness on agency_code + o_id, and by extension on any set of columns that includes at least those two. The latter guarantees uniqueness on the full set of listed columns, but since your query does not include that one extra column the optimizer cannot rely on the remaining set of columns being unique.
If I were you, I'd look at the EXACT definition of the indexes you replaced. If a nonclustered index is declared as CREATE UNIQUE INDEX ON (, ) and you want to replace it with a clustered primary key, then you should use ALTER TABLE ADD CONSTRAINT PRIMARY KEY (, ) CLUSTERED. Note that I used the SAME combination of columns. (There may be reasons to change the order but that would really mess up your plan for this query so let's not go there!).
If the original nonclustered index had additional INCLUDE columns, ignore them. A clustered index by definition includes all columns. So if your original index was CREATE UNIQUE INDEX ON (, ) INCLUDE (), you should STILL use the same statement as before to create the primary key, and NOT add col3 to the list. (I assume that this is exactly what has happened in your case – but as mentioned above, without seeing the nonclustered index plan a lot of this is guesswork on VERY limited data).
Thanks for your reply and for the time you spent analyzing the issue.
Here is my point wise reply(I am quoting you alongside my comments down below):
If the change from some indexes from
nonclustered to clustered is the only
change, then this is unlikely. But if
you also added a few additional
indexes in the process, then it's
possible that the optimizer simply
didn't see the better plan in time.
No, I did not add another index and i can confirm the only change is going from non-clustered to clustered.
This means that the nonclustered index
aaaaHORIZONS_PK must have been on only
those five columns (or even less). On
the other hand, the new clustered
index HORIZONS_PK is NOT on only those
five columns but on at least one more
– that is the only logical reason why you get a many-to-many Merge Join in
the slower execution plan.
No, both clustered and non-clustered indexes are on the same columns
Well, there is one other option, which
is that you didn't create a PRIMARY
KEY constraint but instead used the
CREATE INDEX syntax with an index name
that makes people think it's an actual
primary key – and in that case you
perhaps simply forgot to include the
No, I specifically created a primary key and thus a unique clustered index created underneath to support it. Here is the snippet:
ALTER TABLE [dbo].[HORIZONS] ADD CONSTRAINT [HORIZONS_PK] PRIMARY KEY CLUSTERED
There does not appear to be a UNIQUE
index on the column combination
agency_code, proj_code, s_id, o_id,
h_no, samp_no, in either the SAMPLES
or the ARCHIVE_SAMPLES table
No, for the Samples table the column combination "agency_code, proj_code, s_id, o_id, h_no, samp_no" is unique
and for the Archive_samples table the column combination "agency_code, proj_code, s_id, o_id, h_no, samp_no, jar_no" is unique (note that jar_no is one extra column here)
ALTER TABLE [dbo].[SAMPLES] ADD CONSTRAINT [PK_SAMPLES] PRIMARY KEY CLUSTERED
ALTER TABLE [dbo].[ARCHIVE_SAMPLES] ADD CONSTRAINT [PK_ARCHIVE_SAMPLES] PRIMARY KEY CLUSTERED