Optimizer gets the wrong number of estimated rows even with updated stats
/* various columns from table 1 and table 2 */
FROM dbo.tbl_1 AS a
INNER JOIN dbo.tbl_2 AS b ON a.AccountNumber = b.AccountNumber
WHERE a.RecordActive = 1 AND b.RecordActive = 1;
tbl_1 and tbl_2 are both partitioned views that equally contain 3 yearly base tables (2014-2016), partitioned by month and unioned in the view. The tables are large and 2016's data contains approximately 34,000,000 records for the first 3 months alone.
The query normally takes 10 minutes to run, when the right plan is chosen. However, recently it has been taken well over an hour.
When I check the estimated plan, it uses an index seek using a filtered index WHERE RecordActive = 1 for the table containing 2016's data. The estimated rows is 18,000, but when the query executes the actual number of rows is closer to 2,000,000 – a clustered index scan probably being a better choice.
This is the case despite updating the statistics for the filtered index with a full scan. As the predicate on the tables for column RecordActive is a BIT datatype yielding either 1 or 0, the histogram for the filtered statistic only contains one entry for the value 1, which shows the number of rows equaling that value is 2,000,0000.
I have turned on OPTION(RECOMPILE, QUERYTRACEON 3604, QUERYTRACEON 9204) and I can see that the filtered stats have been considered. So why then does it get the estimate so badly wrong?
Please help, as I am running out of ideas and would like to know what is really happening under the hood!
Notice this is not the filtered index referred to in the question. This index has the same leading key as the desired filtered index.
Given a choice, the optimizer tends to choose a non-filtered index over a filtered one, where there is overlap in the required keys. There is no particular reason for this that I am aware of. As far as I know, it is simply a limitation of the index-matching logic. There is no workaround, aside from ensuring that filtered and non-filtered indexes do not share the same keys.
That said, this may not even be the "problem". For all I know, the statistics on this index also indicate that 2 million or so rows would match IsCurrentPosition = 1 (not RecordActive as stated in the question). The underlying issue may be that the index is also partitioned, and cardinality estimation makes an unsafe 'educated' guess as to the effect of this partitioning. The plan indicates that partitions 1-13 were accessed for this index, which sounds like it might be all partitions. Without a definition for the partition function, it's hard to know.
Tracing the details of a particular cardinality estimate back to the underlying statistics can be challenging in simple cases. Where there is so much complexity, and so many potentially valid ways (in theory) to make an estimate, it becomes essentially impossible. It is even more difficult without all the underlying tables, partitioning functions and schemes, and statistics objects.
It's quite hard to imagine a physical database arrangement that would be less friendly for cardinality estimation: huge partitioned tables embedded in a partitioned view, with very many partitioned, and occasionally filtered, indexes with overlapping keys. Judging from the mismatch between the query objects and the local partitioned view names, there may also be another view layer or synonyms involved. The CE is doing reasonably well here to get within a factor of 100.
The segment 1 & 2016 case does use a filtered index, but it is on the inner side of a nested loops join with AccountNumber and IsCurrentPosition correlated parameters. It estimates one row per iteration, but the driving side of the loops join is already wrong courtesy of the segment 1 view: