Why does the second query produce a lower estimated row count

Dan Holmes 2013-09-11 13:21:57

This started because i thought i found an optimizer trick when there are matching column constraints on joined tables on that column. In this case, the following two tables are identical except for the constraint on dayofweek.

CREATE TABLE sr (frsubrouteid INT NOT NULL, calendardate DATETIME NOT NULL
    , dayofweek TINYINT NOT NULL 
    , PRIMARY KEY (frsubrouteid, calendardate)

CREATE TABLE sr1 (frsubrouteid INT NOT NULL, calendardate DATETIME NOT NULL
    , dayofweek TINYINT NOT NULL CHECK (dayofweek BETWEEN 1 AND 7)
    , PRIMARY KEY (frsubrouteid, calendardate)

When joined to (abbreviated)

CREATE TABLE dbo.t_FRTripsProperties(
    FRTripsPropertiesID  int IDENTITY(1,1) NOT NULL
        CONSTRAINT UQ_FRTripsProperties UNIQUE CLUSTERED(FRTripsPropertiesID),
        CONSTRAINT FK_FRTripsProperties_FRTrips FOREIGN KEY(FRTripsID)
        REFERENCES dbo.t_FRTrips (FRTripsID)
    EffectiveEndDate DATE NOT NULL
            CONSTRAINT CK_TripsProperty_DayOfWeek_is_valid CHECK (DayOfWeek BETWEEN 1 AND 7)
    , CONSTRAINT PK_FRTripsProperties PRIMARY KEY NONCLUSTERED ( FRTripsID, EffectiveStartDate, dayofweek) 

on DayOfweek. there is a residual predicate on the index scan (exposed by the queryhint) for the first statement. That filter isn't required on the second statement because the constraints on both columns match.

What i don't get is why the estimated row count is lower for the second query. I would have thought the filter would have guessed at some rows being outside the range for statment 1 and reduced its estimate. It didn't. The row counts going into the final HASH join are the same yet the row counts come out different. The second statement has a lower count which is counter-intunitive to me.

The attached 'constraint_test.zip' is a backup of the database. It is very small and contains only the tables necessary to reproduce this plan. It is a 2008r2 database.
link text

SQLkiwi 2013-09-12 00:24:18

The optimizer can introduce implied predicates across some types of join. When joining T1 and T2 on a common column 'x', if there is a predicate T1.x = 100, the optimizer can infer the predicate T2.x = 100.

The same idea applies to check constraints. If T1 has a constraint (1 <= x <= 7), the same predicate is implied for T2. If a suitable constraint also exists on T2, the implied predicate is redundant and simplified away.

Cardinality estimation.

Both estimates are valid, they were just arrived at in different ways.

The optimizer explores many strategies, as you know. There is no general guarantee that cardinality estimates derived on physically different, but logically equivalent, subtrees will be the same. It is a statistical process, after all, and all estimates are guesses to some extent.

It so happens that the presence of the implied constraint predicate is enough to encourage the optimizer through different exploration paths (including star join optimizations) deriving different cardinality estimations along the way.

If the optimizer's internal exploration choices are limited with a FORCE ORDER hint, both queries produce the same cardinality estimation (1379620), regardless of the extra predicate.

In summary, one possible route through the optimizer produces a final result cardinality estimate of 1379620, and the other comes up with 1400850. Intermediate cardinality estimates are worked out independently.

The actual number of rows returned by the query is 2102656, so neither estimate is exactly right (though not bad either, when you consider that all the optimizer has to work with are single-column histograms and density information on the base tables).