Is there a better way to search for matches

tbrickle 2016-10-03 18:48:02

SQLkiwi 2016-10-03 20:28:50
This is one of those occasions when it is useful to know details about the data distribution, and whether the temporary tables might contain data for multiple batches or not. It is always useful to see a post-execution plan rather than estimates as well. All that said, the simplest thing to try is some indexing, the obvious ones being:

CREATE CLUSTERED INDEX C ON #Grower_Matching_Source (Batch_ID);
CREATE CLUSTERED INDEX C ON #Grower_Matching_Target (Batch_ID);
CREATE CLUSTERED INDEX C ON #tmpPossibleMatches (Source_Key, Target_Key);

If that is not effective (enough) you will need to rewrite the query so the optimizer has more choices than nested loops join. At the moment, this is the only option because the only simple join equality predicate is on Batch_ID. Since there is also a predicate constraining this to a single value ( @BatchID ), there is nothing suitable left as a simple equality for a hash or merge join.

The most suitable condition for engineering a hash/merge join is the test on City, State OR Zip matching. We can manually rewrite this as two separate queries with a UNION to respect the original semantics of the OR. Generally, A JOIN B ON (X OR Y) translates to (A JOIN B ON X) UNION (A JOIN B ON Y):

SELECT
    GMS.Source_Key,
    GMT.Target_Key,
    Score = 93,
    1,
    1
FROM #Grower_Matching_Source AS GMS
JOIN #Grower_Matching_Target AS GMT
    ON (GMT.City = GMS.City AND GMT.[State] = GMS.[State])
WHERE
    (
        @DeDupe = 0
        OR GMS.Source_Key < GMT.Target_Key
    )
    AND NOT EXISTS
    (
        SELECT * 
        FROM #tmpPossibleMatches AS TPM
        WHERE TPM.Source_Key = GMS.Source_Key
        AND TPM.Target_Key = GMT.Target_Key
    )
    AND GMS.Batch_ID = @BatchID
    AND GMT.Batch_ID = @BatchID
    -- Address match
    AND 
    (
        (
            GMS.StdAddress1 NOT IN ('', 'UNKNOWN')
            AND GMS.StdAddress1 IN (GMT.StdAddress1, GMT.StdAddress2)
        )
        OR
        (
            GMS.StdAddress2 NOT IN ('', 'UNKNOWN')
            AND GMS.StdAddress2 IN (GMT.StdAddress1, GMT.StdAddress2)
        )
    )
    -- Name match
    AND
    (
        (
            GMS.LastName <> ''
            AND GMS.LastName IN (GMT.LastName, GMT.PotentialLastName1, GMT.PotentialLastName2)
        )
        OR
        (
            GMS.PotentialLastName1 <> ''
            AND GMS.PotentialLastName1 IN (GMT.LastName, GMT.PotentialLastName1, GMT.PotentialLastName2)
        )
        OR
        (
            GMS.PotentialLastName2 <> ''
            AND GMS.PotentialLastName2 IN (GMT.LastName, GMT.PotentialLastName1, GMT.PotentialLastName2)
        )
    )
UNION
SELECT
    GMS.Source_Key,
    GMT.Target_Key,
    Score = 93,
    1,
    1
FROM #Grower_Matching_Source AS GMS
JOIN #Grower_Matching_Target AS GMT
    ON GMT.Zip = GMS.Zip
WHERE
    (
        @DeDupe = 0
        OR GMS.Source_Key < GMT.Target_Key
    )
    AND NOT EXISTS
    (
        SELECT * 
        FROM #tmpPossibleMatches AS TPM
        WHERE TPM.Source_Key = GMS.Source_Key
        AND TPM.Target_Key = GMT.Target_Key
    )
    AND GMS.Batch_ID = @BatchID
    AND GMT.Batch_ID = @BatchID
    -- Address match
    AND 
    (
        (
            GMS.StdAddress1 NOT IN ('', 'UNKNOWN')
            AND GMS.StdAddress1 IN (GMT.StdAddress1, GMT.StdAddress2)
        )
        OR
        (
            GMS.StdAddress2 NOT IN ('', 'UNKNOWN')
            AND GMS.StdAddress2 IN (GMT.StdAddress1, GMT.StdAddress2)
        )
    )
    -- Name match
    AND
    (
        (
            GMS.LastName <> ''
            AND GMS.LastName IN (GMT.LastName, GMT.PotentialLastName1, GMT.PotentialLastName2)
        )
        OR
        (
            GMS.PotentialLastName1 <> ''
            AND GMS.PotentialLastName1 IN (GMT.LastName, GMT.PotentialLastName1, GMT.PotentialLastName2)
        )
        OR
        (
            GMS.PotentialLastName2 <> ''
            AND GMS.PotentialLastName2 IN (GMT.LastName, GMT.PotentialLastName1, GMT.PotentialLastName2)
        )
    )
OPTION (RECOMPILE, HASH UNION, HASH GROUP, HASH JOIN);

The RECOMPILE and HASH hints are all optional; test with and without each of these. If the NOT EXISTS clause on #tmpPossibleMatches is not very selective, you could lift that out of the two queries as a final test.

There are a number of combinations to try here, but hopefully that gives you some ideas.

The execution plan for the above query (without indexes) should be something like:

Expected plan 1

With indexes, but no hash join hint:

Expected plan 2