Constructing multiple two-way contingency tables currently used multiple scans. Can some of those be combined using windowing functions?

rsjaffe 2017-07-18 19:04:46

I'm accumulating results for multiple contingency tables, one for each Test. As an example:

Contingency Table for TestA (one for each test, and for each word)
                EventsWithWord  EventsWithoutWord
TestA passed          n11                 n10         
TestA failed          n01                 n00

For analysis, I need the number of events matching each of the following categories (this is how R expects the data results):

  • TestA passed for each word: n11
  • TestA results available for each word: n_1 (sum of n11 and n01)
  • TestA passed: n1_ (sum of n11 and n10)
  • TestA results available: n__

Table "LeapFrogResults": columns: FacilityID nvarchar(20) NOT NULL, TestName nvarchar(255) NOT NULL, FacilityResults bit NULL

Table "Data" (one row for each event): Columns: EventID int NOT NULL, FacilityID nvarchar(20) NOT NULL

Table "WordEventMap": Columns: EventID int NOT NULL, Word nvarchar(255) NOT NULL

Is there a way of combining some of the queries (e.g., with windowing functions)? Some of these tables are very large–40 million rows currently in WordEventMap and 1.3 million rows in Data, and I expect the results table will have about 4 million rows (40 different tests, 100,000 different words).

Current query:

WITH CTE1
AS (SELECT
  COUNT(DISTINCT Data.Id) AS n__,
  LeapFrogResults.TestName
FROM LeapFrogResults
INNER JOIN Data
  ON LeapFrogResults.FacilityID = Data.FacilityPsoProviderId
WHERE LeapFrogResults.FacilityResults IS NOT NULL
GROUP BY LeapFrogResults.TestName),
CTE2
AS (SELECT
  COUNT(DISTINCT Data.[Id]) AS n11,
  LeapFrogResults.TestName,
  DataWordMap.Word
FROM LeapFrogResults
INNER JOIN Data
  ON LeapFrogResults.FacilityID = Data.FacilityPsoProviderId
INNER JOIN DataWordMap
  ON Data.[Id] = DataWordMap.EventID
WHERE LeapFrogResults.FacilityResults = 1
GROUP BY LeapFrogResults.TestName,
         DataWordMap.Word),
CTE3
AS (SELECT
  COUNT(DISTINCT Data.[Id]) AS n1_,
  LeapFrogResults.TestName
FROM LeapFrogResults
INNER JOIN Data
  ON LeapFrogResults.FacilityID = Data.FacilityPsoProviderId
WHERE LeapFrogResults.FacilityResults = 1
GROUP BY LeapFrogResults.TestName),
CTE4
AS (SELECT
  COUNT(DISTINCT Data.[Id]) AS n_1,
  LeapFrogResults.TestName,
  DataWordMap.Word
FROM LeapFrogResults
INNER JOIN Data
  ON LeapFrogResults.FacilityID = Data.FacilityPsoProviderId
INNER JOIN DataWordMap
  ON Data.[Id] = DataWordMap.EventID
WHERE LeapFrogResults.FacilityResults IS NOT NULL
GROUP BY LeapFrogResults.TestName,
         DataWordMap.Word)
SELECT
  CTE2.TestName,
  CTE2.Word,
  n11,
  n1_,
  n_1,
  n__
FROM CTE2
INNER JOIN CTE4
  ON CTE2.Word = CTE4.Word
  AND CTE2.TestName = CTE4.TestName
INNER JOIN CTE1
  ON CTE2.TestName = CTE1.TestName
INNER JOIN CTE3
  ON CTE2.TestName = CTE3.TestName