Comparing tables and creating column-level differences: can it be done better?

GokhanVarol 2013-05-23 18:24:23

Our company deals with data; for a given data set we might receive a new edition, and what users would like to know is what changed between editions at the column level.
Basically, a table like (Primary_Key_Columns, ColumnName, OldValue, NewValue) needs to be generated. These tables can get large; a fairly large table has about 3 million rows and 500+ columns. I wrote a SQL statement to do the comparison, similar to the snippet below. It uses a VALUES clause to create multiple rows from the full outer join output.
Of course, this has quite a bit of overhead. In the attached query plans the table has about 470 columns. If I use only the merge join, CPU time is about 28 seconds, and since the query can run in parallel the duration drops to 8.856 seconds. But with the CROSS APPLY doing the row multiplication, the query cannot run in parallel and takes about 93 seconds.
Is there a better way of writing this query?
Thank you

SELECT ISNULL(l.[PclId], r.[PclId]) AS [PclId], ISNULL(l.[PclSeqNbr], r.[PclSeqNbr]) AS [PclSeqNbr], -- Primary keys
    d.ColumnId, d.LeftValue, d.RightValue
FROM TEMPORARY.dbo.DiabloSample l FULL OUTER JOIN [Diablo].[tTax].Property r
    ON l.[CntyCd] = r.[CntyCd] AND l.[PclId] = r.[PclId] AND l.[PclSeqNbr] = r.[PclSeqNbr] -- Join on primary keys
 
OUTER APPLY (SELECT ColumnId, LeftValue, RightValue FROM (VALUES(CASE WHEN l.CntyCd IS NULL THEN -1/*<LMISS>*/ WHEN r.CntyCd IS NULL THEN -2/*<RMISS>*/ END, NULL, NULL),
(CASE WHEN r.CntyCd IS NOT NULL AND l.CntyCd IS NOT NULL AND ((l.[AbsenteeOwnrInd] IS NULL AND r.[AbsenteeOwnrInd] IS NOT NULL) OR (l.[AbsenteeOwnrInd] IS NOT NULL AND r.[AbsenteeOwnrInd] IS NULL) OR l.[AbsenteeOwnrInd] <> r.[AbsenteeOwnrInd]) THEN 1 END,l.[AbsenteeOwnrInd] ,r.[AbsenteeOwnrInd] ),
..................
(CASE WHEN r.CntyCd IS NOT NULL AND l.CntyCd IS NOT NULL AND ((l.[ZnDesc] IS NULL AND r.[ZnDesc] IS NOT NULL) OR (l.[ZnDesc] IS NOT NULL AND r.[ZnDesc] IS NULL) OR l.[ZnDesc] <> r.[ZnDesc]) THEN 868 END,l.[ZnDesc] ,r.[ZnDesc] ))
d(ColumnId, LeftValue, RightValue)
WHERE d.ColumnId IS NOT NULL
and 1 = 0)d
WHERE (l.CntyCd IS NULL OR R.CntyCd IS NULL OR (l.[AbsenteeOwnrInd] IS NULL AND r.[AbsenteeOwnrInd] IS NOT NULL) OR (l.[AbsenteeOwnrInd] IS NOT NULL AND r.[AbsenteeOwnrInd] IS NULL) OR l.[AbsenteeOwnrInd] <> r.[AbsenteeOwnrInd]
OR
(l.[AbsenteeOwnrIrisCd] IS NULL AND r.[AbsenteeOwnrIrisCd] IS NOT NULL) OR (l.[AbsenteeOwnrIrisCd] IS NOT NULL AND r.[AbsenteeOwnrIrisCd] IS NULL) OR l.[AbsenteeOwnrIrisCd] <> r.[AbsenteeOwnrIrisCd]
OR
(l.[AcctNbr] IS NULL AND r.[AcctNbr] IS NOT NULL) OR (l.[AcctNbr] IS NOT NULL AND r.[AcctNbr] IS NULL) OR l.[AcctNbr] <> r.[AcctNbr]
OR
..................
OR
(l.[ZnDesc] IS NULL AND r.[ZnDesc] IS NOT NULL) OR (l.[ZnDesc] IS NOT NULL AND r.[ZnDesc] IS NULL) OR l.[ZnDesc] <> r.[ZnDesc])
OPTION(RECOMPILE)

[attached: query plan images]

GokhanVarol 2013-05-23 19:37:11
I believe I have found a workaround. I declared a variable:
DECLARE @This_Is_to_Fake_the_optimizer_to_get_a_parallel_query_plan INT = 0

First, I changed the CROSS APPLY to OUTER APPLY and added a WHERE clause:
WHERE @This_Is_to_Fake_the_optimizer_to_get_a_parallel_query_plan = 0

and I added the query hint:
OPTION(RECOMPILE, OPTIMIZE FOR (@This_Is_to_Fake_the_optimizer_to_get_a_parallel_query_plan = 1))

Now I get a query plan that scales fairly well.
I am wondering, though: can this be done more efficiently without any CROSS/OUTER APPLY?
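
Pieced together, the reshaped query looks roughly like this (a sketch only; the remaining VALUES rows and the final WHERE clause are unchanged from the original query above):

```sql
DECLARE @This_Is_to_Fake_the_optimizer_to_get_a_parallel_query_plan integer = 0;

SELECT
    ISNULL(l.[PclId], r.[PclId]) AS [PclId],
    ISNULL(l.[PclSeqNbr], r.[PclSeqNbr]) AS [PclSeqNbr], -- Primary keys
    d.ColumnId, d.LeftValue, d.RightValue
FROM TEMPORARY.dbo.DiabloSample AS l
FULL OUTER JOIN [Diablo].[tTax].Property AS r
    ON l.[CntyCd] = r.[CntyCd] AND l.[PclId] = r.[PclId] AND l.[PclSeqNbr] = r.[PclSeqNbr]
OUTER APPLY
(
    SELECT ColumnId, LeftValue, RightValue
    FROM (VALUES
        (CASE WHEN l.CntyCd IS NULL THEN -1 WHEN r.CntyCd IS NULL THEN -2 END, NULL, NULL)
        -- ... one row per column, with the CASE expressions from the original query ...
    ) AS v (ColumnId, LeftValue, RightValue)
    WHERE v.ColumnId IS NOT NULL
    AND @This_Is_to_Fake_the_optimizer_to_get_a_parallel_query_plan = 0
) AS d
OPTION
(
    RECOMPILE,
    OPTIMIZE FOR (@This_Is_to_Fake_the_optimizer_to_get_a_parallel_query_plan = 1)
);
```

The OPTIMIZE FOR hint makes the optimizer cost the plan as if the OUTER APPLY filter were always false, while at run time the variable is 0 and the rows still flow through.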

[attached: query plan image]

SQLkiwi 2013-05-24 04:23:10
The parallel merge join with nested loops into a constant scan is, generally speaking, the right plan shape, but the T-SQL is very long and the CASE expressions aren't necessarily the most efficient.

It seems a little unusual to want to generate per-column differences where complete rows are missing on either side. You might want to look into finding differences between existing rows using one mechanism, and then adding in completely missing rows using separate queries optimized for that task. As always, the best method depends on details only you really know.

Intuitively, the algorithm to find per-column differences is very easy: merge scan through both tables, emitting one row for each column difference. Unfortunately, some aspects of query processing make this a little harder to do well in T-SQL than it ought to be.

If this were my problem to solve, I might look to write a streaming CLR function that would use two connections to merge-scan the two tables and yield return a row for each column difference. That could be a very efficient solution if written well, though not trivial to execute on multiple threads.

As far as reducing the size and complexity of the T-SQL is concerned, consider the two tables below, for which we would like to find per-column differences:

CREATE TABLE dbo.T1
(
    pk      integer PRIMARY KEY,
    col1    integer NULL,
    col2    integer NULL,
    col3    integer NULL
);
 
INSERT dbo.T1
    (pk, col1, col2, col3)
VALUES
    (1, NULL, 1, 1),
    (2, 2, NULL, 2),
    (3, 3, 3, 3),
    (4, 4, 4, 4);
 
CREATE TABLE dbo.T2
(
    pk      integer PRIMARY KEY,
    col1    integer NULL,
    col2    integer NULL,
    col3    integer NULL
);
 
INSERT dbo.T2
    (pk, col1, col2, col3)
VALUES
    (1, 1, 1, 1),
    (2, 2, NULL, 2),
    (4, 4, 0, 4),
    (5, 5, 5, 5);
 
+-------------------------+
| pk | col1 | col2 | col3 |
|----+------+------+------|
|  1 | NULL |    1 |    1 |
|  2 |    2 | NULL |    2 |
|  3 |    3 |    3 |    3 |
|  4 |    4 |    4 |    4 |
+-------------------------+
+-------------------------+
| pk | col1 | col2 | col3 |
|----+------+------+------|
|  1 |    1 |    1 |    1 |
|  2 |    2 | NULL |    2 |
|  4 |    4 |    0 |    4 |
|  5 |    5 |    5 |    5 |
+-------------------------+

One possible approach is to unpivot before performing a full join:

WITH 
    UT1 AS
    (
        -- Unpivot T1
        SELECT
            u.pk,
            u.cname,
            u.cvalue
        FROM dbo.T1 AS t
        UNPIVOT (cvalue FOR cname IN (col1, col2, col3)) u
    ),
    UT2 AS
    (
        -- Unpivot T2
        SELECT
            u.pk, 
            u.cname,
            u.cvalue
        FROM dbo.T2 AS t
        UNPIVOT (cvalue FOR cname IN (col1, col2, col3)) u
    )
SELECT
    pk = ISNULL(UT1.pk, UT2.pk),
    column_name = ISNULL(UT1.cname, UT2.cname),
    old_value = UT1.cvalue,
    new_value = UT2.cvalue
FROM UT1
FULL JOIN UT2
    ON UT2.pk = UT1.pk
    AND UT2.cname = UT1.cname
WHERE 
    NOT EXISTS (SELECT UT1.* INTERSECT SELECT UT2.*);

This code is much more compact, and easily extensible by adding column names to the UNPIVOT list. The output for the test data shown previously is:

+------------------------------------------+
| pk | column_name | old_value | new_value |
|----+-------------+-----------+-----------|
|  1 | col1        | NULL      | 1         |
|  4 | col2        | 4         | 0         |
|  5 | col1        | NULL      | 5         |
|  5 | col2        | NULL      | 5         |
|  5 | col3        | NULL      | 5         |
|  3 | col2        | 3         | NULL      |
|  3 | col1        | 3         | NULL      |
|  3 | col3        | 3         | NULL      |
+------------------------------------------+
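
The `NOT EXISTS (SELECT ... INTERSECT SELECT ...)` predicate used above is a compact NULL-aware inequality test: `INTERSECT` treats NULLs as equal, so the predicate is true exactly when the two sides differ, covering the NULL-versus-non-NULL cases that otherwise require three separate conditions per column. A minimal illustration (hypothetical values, not from the tables above):

```sql
SELECT v.l, v.r,
    CASE WHEN NOT EXISTS (SELECT v.l INTERSECT SELECT v.r)
         THEN 'different' ELSE 'same'
    END AS verdict
FROM (VALUES (1, 1), (1, 2), (NULL, 2), (NULL, NULL)) AS v (l, r);
-- (1, 1) -> same; (1, 2) -> different; (NULL, 2) -> different; (NULL, NULL) -> same
```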

The unfortunate aspect of query processing I mentioned earlier means that a sort is required before the merge join. Depending on your hardware, you might find that OPTION (HASH JOIN, LOOP JOIN) produces a more efficient execution plan.

Neither is perfect, however, which is a shame given that the T-SQL is very much nicer. The pre-merge sort ought to be unnecessary, and a hash full join will require quite a large memory grant. As I said before, if your hardware can provide the memory needed to sort or hash without spilling to disk, it might perform well enough.

Another T-SQL option that uses parallelism naturally:

SELECT
    new_pk = ISNULL(RowDiff.old_pk, RowDiff.new_pk),
    Changed.*
FROM 
(
    SELECT
        old_pk = t1.pk,
        old_col1 = t1.col1,
        old_col2 = t1.col2,
        old_col3 = t1.col3,
        new_pk = t2.pk,
        new_col1 = t2.col1,
        new_col2 = t2.col2,
        new_col3 = t2.col3
    FROM dbo.T1 AS t1
    FULL MERGE JOIN dbo.T2 AS t2
        ON t2.pk = t1.pk
    WHERE
        NOT EXISTS (SELECT t1.* INTERSECT SELECT t2.*)
) AS RowDiff
CROSS APPLY
(
    SELECT column_id = 2, RowDiff.old_col1, RowDiff.new_col1
        WHERE NOT EXISTS (SELECT RowDiff.old_col1 INTERSECT SELECT RowDiff.new_col1)
    UNION ALL
    SELECT column_id = 3, RowDiff.old_col2, RowDiff.new_col2
        WHERE NOT EXISTS (SELECT RowDiff.old_col2 INTERSECT SELECT RowDiff.new_col2)
    UNION ALL
    SELECT column_id = 4, RowDiff.old_col3, RowDiff.new_col3
        WHERE NOT EXISTS (SELECT RowDiff.old_col3 INTERSECT SELECT RowDiff.new_col3)
) AS Changed;

GokhanVarol 2013-05-24 12:08:45
The Filter applied to the Constant Scan before the Nested Loops join helps query performance a lot. Since the tables have many columns, the unpivot creates close to 500 rows per source row, and it seems that generating that many rows from an unpivot is not that fast on our 4-socket Intel E5 hardware. One thing I like about this approach is that there is no hash or sort in the plan. We have about 40 tables belonging to a batch parallelized from SSIS (all of them have to be compared for a county, our batch initiator, and they all have different table structures and different primary keys), using packages with up to 8 threads, and 5-10 packages can run simultaneously; so I believe a large memory grant would not allow us to parallelize much.