Where does the Cardinality Estimate go wrong?
Person may have a Licence. Occasionally more than one.
Driver1 and Driver2 contain the same data. The only difference is that Driver2 has a primary key. Each specify a subset of Cases.
Using a driver with a primary key appears to cause a bad estimate.
Table defs:
create table [dbo].[Person] ( [PersonID] int not null, [LocationID] int not null, [Name] varchar(50), [DateOfBirth] date, constraint [PK_Person] primary key ([LocationID], [PersonID]) ) create table [dbo].[Case] ( [CaseID] int not null, [LocationID] int not null, [PersonID] int, [CaseDate] datetime, constraint [PK_Case] primary key ([LocationID], [CaseID]) ); create table [dbo].[Licence] ( [LicenceID] int not null, [LocationID] int not null, [PersonID] int, [LicenceNo] varchar(50), constraint [PK_Licence] primary key ([LocationID], [LicenceID]) ); create table [dbo].[Driver1] ( [LocationID] int not null, [CaseID] int not null ); create table [dbo].[Driver2] ( [LocationID] int not null, [CaseID] int not null, constraint [PK_Driver2] primary key ([LocationID], [CaseID]) );
Code used to create test data:
set nocount on; declare @location_id int = 1, @case_id int, @person_id int, @licence_id int, @chance float; begin transaction; while @location_id < 100 begin select @case_id = 1, @person_id = 1, @licence_id = 1; while @person_id < 5000 begin insert into dbo.Person (LocationID, PersonID, DateOfBirth, Name) values (@location_id, @person_id, dateadd(day, @person_id, '20000101'), 'Name ' + cast(@person_id as varchar)); set @chance = 0.5; while rand() < @chance begin insert into dbo.[Case] (LocationID, CaseID, PersonID, CaseDate) values (@location_id, @case_id, @person_id, dateadd(day, @case_id, '20150101')); set @case_id += 1; set @chance /= 10; if rand() < 0.02 begin insert into dbo.Driver1 (LocationID, CaseID) values (@location_id, @case_id); insert into dbo.Driver2 (LocationID, CaseID) values (@location_id, @case_id); end; end; set @chance = 0.8; while rand() < @chance begin insert into dbo.[Licence] (LocationID, LicenceID, PersonID, LicenceNo) values (@location_id, @licence_id, @person_id, 'Licence ' + cast(@licence_id as varchar)); set @licence_id += 1; set @chance /= 50; end; set @person_id += 1; end; set @location_id += 1; end; commit transaction; create index IX_LocationID_PersonID on dbo.Licence(LocationID, PersonID); update statistics dbo.[Case] with fullscan; update statistics dbo.[Licence] with fullscan; update statistics dbo.[Person] with fullscan; update statistics dbo.[Driver1] with fullscan; update statistics dbo.[Driver2] with fullscan;
In that sense, the difference is inconsequential. Both plans execute in reasonable time, with the choice between final hash or loops join not mattering very much.
In terms of indexing, adding LicenceNo as an included column in IX_LocationID_PersonID would encourage the optimizer to use it.
Foreign keys would help convey the information in your opening paragraph to the optimizer. Or at least they would if the foreign keys were single column. The logic involved does not operate for multi-column relationships. Nevertheless, they would expose the defect in the sample data script that can result in rows in Driver not matching Case.
As to why the Driver primary key exposes a difference: this is hard to be specific about since the details of n-ary join expansion are rather opaque in the original CE. Nevertheless, turning off join order exploration (and the n-ary join structure) with OPTION (FORCE ORDER) removes the 'issue' for the original CE. I don't recommend you use this, I mention it for interests' sake.