Where does the Cardinality Estimate go wrong?

Loz 2016-09-13 19:23:06

Person may have a Case. Occasionally more than one.
Person may have a Licence. Occasionally more than one.

Driver1 and Driver2 contain the same data; the only difference is that Driver2 has a primary key. Each specifies a subset of Cases.

Using a driver with a primary key appears to cause a bad estimate.

Table defs:

create table [dbo].[Person] (
    [PersonID] int not null,
    [LocationID] int not null,
    [Name] varchar(50),
    [DateOfBirth] date,
    constraint [PK_Person] primary key ([LocationID], [PersonID])
);

create table [dbo].[Case] (
    [CaseID] int not null,
    [LocationID] int not null,
    [PersonID] int,
    [CaseDate] datetime,
    constraint [PK_Case] primary key ([LocationID], [CaseID])
);

create table [dbo].[Licence] (
    [LicenceID] int not null,
    [LocationID] int not null,
    [PersonID] int,
    [LicenceNo] varchar(50),
    constraint [PK_Licence] primary key ([LocationID], [LicenceID])
);

create table [dbo].[Driver1] (
    [LocationID] int not null,
    [CaseID] int not null
);

create table [dbo].[Driver2] (
    [LocationID] int not null,
    [CaseID] int not null,
    constraint [PK_Driver2] primary key ([LocationID], [CaseID])
);

Code used to create test data:

set nocount on;

declare
    @location_id int = 1,
    @case_id int,
    @person_id int,
    @licence_id int,
    @chance float;

begin transaction;

while @location_id < 100
begin
    select
        @case_id = 1,
        @person_id = 1,
        @licence_id = 1;

    while @person_id < 5000
    begin
        insert into dbo.Person (LocationID, PersonID, DateOfBirth, Name)
        values (@location_id, @person_id, dateadd(day, @person_id, '20000101'), 'Name ' + cast(@person_id as varchar));

        set @chance = 0.5;
        while rand() < @chance
        begin
            insert into dbo.[Case] (LocationID, CaseID, PersonID, CaseDate)
            values (@location_id, @case_id, @person_id, dateadd(day, @case_id, '20150101'));
            set @case_id += 1;
            set @chance /= 10;

            if rand() < 0.02
            begin
                insert into dbo.Driver1 (LocationID, CaseID) values (@location_id, @case_id);
                insert into dbo.Driver2 (LocationID, CaseID) values (@location_id, @case_id);
            end
        end

        set @chance = 0.8;
        while rand() < @chance
        begin
            insert into dbo.[Licence] (LocationID, LicenceID, PersonID, LicenceNo)
            values (@location_id, @licence_id, @person_id, 'Licence ' + cast(@licence_id as varchar));
            set @licence_id += 1;
            set @chance /= 50;
        end

        set @person_id += 1;
    end

    set @location_id += 1;
end

commit transaction;

create index IX_LocationID_PersonID on dbo.Licence (LocationID, PersonID);

update statistics dbo.[Case] with fullscan;
update statistics dbo.[Licence] with fullscan;
update statistics dbo.[Person] with fullscan;
update statistics dbo.Driver1 with fullscan;
update statistics dbo.Driver2 with fullscan;
Loz 2016-09-13 20:52:50
I note, this is fixed in SQL 2014.
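For anyone wanting to compare the two models side by side: on SQL Server 2014 and later, the cardinality estimator version can be switched per statement using documented trace flags (9481 forces the legacy pre-2014 CE, 2312 forces the new CE). The query below is only a hypothetical shape, since the repro query itself is not shown in this thread; the point is the OPTION clause.

```sql
-- Hypothetical query shape (the actual repro query is not shown above).
-- QUERYTRACEON 9481 runs this statement under the legacy (pre-2014) CE;
-- replace with 2312 to force the new (2014) CE instead.
select count(*)
from dbo.[Case] as c
join dbo.Driver2 as d
  on d.LocationID = c.LocationID
 and d.CaseID = c.CaseID
option (querytraceon 9481);
```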
SQLkiwi 2016-09-14 04:00:18
If you're referring to the final estimate (263,586 rows) bear in mind this is the least important estimate in the plan; no cost-based choices depend on it. The plan-affecting estimates (e.g. choice of join/order strategy) are almost the same in both cases, the one difference being 4,658 vs 5,215 for the last join, which is pretty small.

In that sense, the difference is inconsequential. Both plans execute in reasonable time, with the choice between final hash or loops join not mattering very much.

In terms of indexing, adding LicenceNo as an included column in IX_LocationID_PersonID would encourage the optimizer to use it.
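For example, the change could be made in place with DROP_EXISTING, keeping the index name from the script above:

```sql
-- Rebuild the existing index with LicenceNo as an included column,
-- so it covers queries that filter on (LocationID, PersonID)
-- and return LicenceNo.
create index IX_LocationID_PersonID
on dbo.Licence (LocationID, PersonID)
include (LicenceNo)
with (drop_existing = on);
```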

Foreign keys would help convey the information in your opening paragraph to the optimizer. Or at least they would if the foreign keys were single column. The logic involved does not operate for multi-column relationships. Nevertheless, they would expose the defect in the sample data script that can result in rows in Driver not matching Case.
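As a sketch (constraint name is illustrative), a multi-column foreign key would not feed the optimizer's estimation logic, but adding it WITH CHECK would still surface any Driver rows that reference a non-existent Case:

```sql
-- Validates existing rows (WITH CHECK) against Case's primary key
-- (LocationID, CaseID); fails if Driver2 contains orphaned rows.
alter table dbo.Driver2 with check
add constraint FK_Driver2_Case
    foreign key (LocationID, CaseID)
    references dbo.[Case] (LocationID, CaseID);
```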

As to why the Driver primary key exposes a difference: this is hard to be specific about, since the details of n-ary join expansion are rather opaque in the original CE. Nevertheless, turning off join-order exploration (and the n-ary join structure) with OPTION (FORCE ORDER) removes the 'issue' for the original CE. I don't recommend you use this; I mention it for interest's sake.
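Since the repro query is not shown in the thread, the following is only a hypothetical illustration of how the hint attaches:

```sql
-- Hypothetical query shape only. OPTION (FORCE ORDER) joins the tables
-- in the written order and disables join-order exploration (and with it
-- the n-ary join expansion discussed above). Shown for interest, not
-- recommended for general use.
select p.Name, l.LicenceNo
from dbo.Driver2 as d
join dbo.[Case] as c
  on c.LocationID = d.LocationID and c.CaseID = d.CaseID
join dbo.Person as p
  on p.LocationID = c.LocationID and p.PersonID = c.PersonID
join dbo.Licence as l
  on l.LocationID = p.LocationID and l.PersonID = p.PersonID
option (force order);
```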

Loz 2016-09-14 06:56:34
Hi Paul, thanks for your detailed answer. Yes, it was the final row estimate I was interested in. This was an attempt at a minimal reproduction from a larger query. In the larger query, SQL Server ends up interposing another table, with bad consequences; I think it is this estimate that is leading it down the wrong path. I suspect the answer will be to break the query down into smaller parts. I dream of foreign keys, but it's a legacy system. Although, if foreign keys had fixed this problem, that might have helped sell the necessary changes. The estimate problem was still there after I fixed the bug you pointed out.
SQLkiwi 2016-09-15 03:20:06
I see. Well, yes, CE errors do tend to creep in as an execution plan increases in complexity. I am a fan of breaking up large queries as you suggest, for many reasons.