Hash Join sort related

GokhanVarol 2013-07-05 03:20:28

Is it possible to produce sorted output with a serial hash join?

Example: a small table with 10 rows and a large table with a million rows. The small table is hashed into a few buckets, and the large table, which was sorted beforehand, is passed through them. The output should still be sorted, since rows are fed in sorted order from the large table as it probes the hash buckets.

Why do we lose the sort order of the probe table on a serial hash?

Is there any way to force which input should be probe and which should be build for the hash join?

SQLkiwi 2013-07-06 01:06:37
Merged your two related questions.
SQLkiwi 2013-07-06 01:01:20
> Is it possible to produce sorted output with a serial hash join?

No (at least sorted output cannot be guaranteed).

> Why do we lose the sort order of the probe table on a serial hash?

Because the hash might spill at execution time. The order of spilled partitions is not preserved, and the spill might invoke role reversal (swapping the build and probe inputs).

> Is there any way to force which input should be probe and which should be build for the hash join?

The upper input is the build input and the lower input is the probe input, so it depends on the plan shape selected by the optimizer. The usual hints can force the shape to some extent, but spilling and role reversal might still occur.
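For illustration only, a minimal sketch of using hints for this (the table and column names here are hypothetical, not from the thread): a join hint written in the FROM clause also enforces the written join order, so the first-listed table should end up as the upper (build) input.

```sql
-- Sketch: force a hash join and the written join order so that
-- #SmallLookup (listed first) becomes the upper/build input.
-- #SmallLookup, dbo.BigTable and the columns are hypothetical names.
SELECT b.KeyCol, b.Payload, s.Description
FROM #SmallLookup AS s
INNER HASH JOIN dbo.BigTable AS b
    ON b.KeyCol = s.KeyCol;
```

Even with the hint in place, a spill at execution time can still trigger role reversal, as noted above.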

In the case in question, the hash join is not correlated, so there is little benefit to having it inside the APPLY (its hash table will be rebuilt on every iteration). Rewrite the query so the hash join is positioned outside the APPLY. The optimizer should then naturally pick the temporary table as the build input, and a spill should be unlikely (and role reversal even less so, given the relative sizes of the inputs).
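A rough sketch of that rewrite (the #Lender temp table comes from the thread; every other table and column name is a hypothetical stand-in), with the uncorrelated join pulled out of the APPLY so its hash table is built only once:

```sql
-- Hypothetical shape only: dbo.Counties, dbo.Transactions and their
-- columns are stand-ins; #Lender is the temp table from the thread.
SELECT c.CountyID, x.TransactionID, l.LenderName
FROM dbo.Counties AS c
CROSS APPLY
(
    -- the correlated part stays inside the APPLY
    SELECT TOP (100) t.TransactionID, t.LenderID
    FROM dbo.Transactions AS t
    WHERE t.CountyID = c.CountyID
    ORDER BY t.TransactionDate DESC
) AS x
JOIN #Lender AS l            -- small temp table: natural hash build input
    ON l.LenderID = x.LenderID;
```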

GokhanVarol 2013-07-06 01:13:37
When I put the #Lender join outside of the APPLY, it still chooses to build on the large side and adds repartitioning (those are expensive, aren't they?).

(query plan screenshot attached)

SQLkiwi 2013-07-06 02:15:54
Repartitioning that number of rows could be quite expensive, yes. The other thing you could try is the right hash join inside the APPLY. The nested loops join might be the best overall – it incurs a large number of logical I/Os, but no physical I/O.
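If the nested loops route is the one to test, a hedged sketch (same hypothetical names as the earlier rewrite, and assuming #Lender is indexed on LenderID) that requests loops joins and surfaces the logical-versus-physical I/O trade-off:

```sql
-- Same hypothetical query shape as the sketch above. OPTION (LOOP JOIN)
-- requests nested loops joins, so the compact #Lender lookup is probed
-- by per-row seeks instead of a hash build. STATISTICS IO shows the
-- trade-off: more logical reads, but little or no extra physical I/O
-- when the lookup pages stay in cache.
SET STATISTICS IO ON;

SELECT c.CountyID, x.TransactionID, l.LenderName
FROM dbo.Counties AS c
CROSS APPLY
(
    SELECT TOP (100) t.TransactionID, t.LenderID
    FROM dbo.Transactions AS t
    WHERE t.CountyID = c.CountyID
    ORDER BY t.TransactionDate DESC
) AS x
JOIN #Lender AS l
    ON l.LenderID = x.LenderID
OPTION (LOOP JOIN);

SET STATISTICS IO OFF;
```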
GokhanVarol 2013-07-06 02:18:55
The original lookup table has extra columns (timestamp etc.), which is why I created the most compact lookup table possible for seeks. I like the seek approach since in many cases there are 3-6 seeks (lookups), not only one; trying to work them out with hash joins would be more difficult.
Thank you