Suggestions on a parallel plan on partitioned tables

GokhanVarol 2013-06-18 00:59:30

We have this pattern of joins: two large schemas, Transaction-related and Property-related. The Transaction-related tables are clustered on CntyCd char(5), BatchDt int, BatchSeq int (only the root Trans table has an additional covering index on CntyCd char(5), PclId varchar(45), PclSeqNbr tinyint); the child tables in the Transaction schema have a few more keys appended to this list. There are also Tax-related tables, whose clustered index is CntyCd char(5), PclId varchar(45), PclSeqNbr tinyint.
All of the tables above are partitioned on CntyCd, each with roughly 3336 partitions. These are wide tables: the root Property table (Tax schema) has over 400 columns (160 million rows) and the root Trans table has over 300 columns (600 million rows).
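Roughly, the layout looks like the sketch below (function, scheme, table names, and boundary values are illustrative stand-ins, not the real schema):

CREATE PARTITION FUNCTION PF_CntyCd (char(5))
AS RANGE RIGHT FOR VALUES ('01001', '01003', /* ... ~3336 county codes ... */ '56045');

CREATE PARTITION SCHEME PS_CntyCd
AS PARTITION PF_CntyCd ALL TO ([PRIMARY]);

-- Root transaction table (600M+ rows, 300+ columns), clustered and partitioned on CntyCd
CREATE TABLE dbo.Trans
(
    CntyCd    char(5)     NOT NULL,
    BatchDt   int         NOT NULL,
    BatchSeq  int         NOT NULL,
    PclId     varchar(45) NULL,
    PclSeqNbr tinyint     NULL
    /* ... remaining columns ... */
);
CREATE CLUSTERED INDEX CX_Trans ON dbo.Trans (CntyCd, BatchDt, BatchSeq) ON PS_CntyCd (CntyCd);

-- Root property table (Tax schema, 160M rows, 400+ columns), clustered and partitioned on CntyCd
CREATE TABLE dbo.Property
(
    CntyCd    char(5)     NOT NULL,
    PclId     varchar(45) NOT NULL,
    PclSeqNbr tinyint     NOT NULL
    /* ... remaining columns ... */
);
CREATE CLUSTERED INDEX CX_Property ON dbo.Property (CntyCd, PclId, PclSeqNbr) ON PS_CntyCd (CntyCd);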
Whenever users simply join a few Trans tables and a few Property tables, the plan ends up with a desired memory grant of 1-2 TB+, while at most 40 GB - 190 GB is actually granted, depending on whether they are resource governed (the server has 64 cores and 1 TB of memory). On top of that, query duration is not optimal (the attached "Long_Single_16770_Sec_HashNoSpill_NoDriver.queryanalysis" is a sample query with a duration of 16,770 seconds, a desired grant of 302 GB, and a granted 194 GB at DOP 8).
I am trying to come up with a way to run these queries more efficiently. It seems that if the query starts with a driver table of CntyCd it works better (the screenshot shows a sample of that), but in that case there is a level 2 hash spill I could not avoid. I tried creating empty tables, faking their statistics, and UNION ALLing the empty tables in ("MediaCountsCleanedUpRun2_4457_Sec_Hash_NoSpill_FakeStartsTemp.queryanalysis" is a sample of that); this avoided the hash spill but the query took a lot longer. Of what I have tried so far, "MediaCounts_Usual_Last_2730_Sec.pesession" seems the best, but unfortunately I could not produce that plan without a hash spill.
I also confirmed that all joins are using the partitioning function. The data across the 3336 partitions is not balanced: a few counties have no records, the Property table averages about 95K records per county, and the largest county has 25 million records. Underneath the nested loops join the estimates appear to equal the average (95K), which means the large counties will spill for sure.
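For reference, the skew can be seen with a query along these lines (dbo.Property stands in for the real root Property table name):

SELECT ps.partition_number,
       ps.row_count
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'dbo.Property')   -- hypothetical table name
  AND ps.index_id IN (0, 1)                       -- heap or clustered index only
ORDER BY ps.row_count DESC;                       -- largest counties first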
Any suggestions?
Thank you


SQLkiwi 2013-06-18 07:56:49
I wrote a blog post just a couple of days ago that might be useful to you in this scenario:

Improving Partitioned Table Join Performance

GokhanVarol 2013-06-18 12:31:16
It seems the optimizer already adds the partition function into the plan for this query. I added the partition function's calculated value into the driver table (I tested quite a few variations) and joined on it (commented out in this run), but that dropped the estimates even further, spilled even more, and the duration was higher.
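For clarity, the variation with the partition function value in the driver looked roughly like this (names are hypothetical; PF_CntyCd stands in for the real partition function):

SELECT DISTINCT t.CntyCd,
       $PARTITION.PF_CntyCd(t.CntyCd) AS PtnId    -- precomputed partition number
INTO #Driver
FROM dbo.Trans AS t;

SELECT p.CntyCd, p.PclId, p.PclSeqNbr
FROM #Driver AS d
JOIN dbo.Property AS p
  ON p.CntyCd = d.CntyCd
 AND $PARTITION.PF_CntyCd(p.CntyCd) = d.PtnId;    -- the extra predicate that was tested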
I am thinking of creating a temp table before the hash (or merge with sort) that will have around 162 million records (but only 233 bytes wide), clustering it in the order of the Property table, and doing one more already-sorted merge join (to Property) to see whether all of that improves things. (In this particular case there is not much that can be filtered before the join to Property, but in other cases there are 3-4 tables on the Property side (with the same clustered key, including children of the Property table), which makes the temp-table approach less attractive, since with those tables it is possible to filter down the reads without a full scan of the table.)
Thanks


SQLkiwi 2013-06-18 13:08:41
The point made in my post is that best efficiency comes from using a parallel per-partition plan that uses one or more levels of collocated merge joins, without sorts or hashes. You come closest to this with the merge join plan, but you are still using a sort which spills heavily per thread and partition. The query is too big for me to spend time rewriting it for you, but the idea is pretty clear, I think.
GokhanVarol 2013-06-18 15:17:09
Modifying the driver with UNION ALLs pumped up the estimate a bit, but not much. I added a UNION ALL with the largest county, 06037, which has 23 million+ rows; that roughly doubled the estimate (to 191K). Then I used a VALUES clause with 40 rows of the same large county, and that only increased the estimate to 1.4 million. So the hash spills continued.
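The two attempts looked roughly like this (driver and column names are hypothetical stand-ins):

-- Attempt 1: UNION ALL the largest county onto the driver (estimate roughly doubled to 191K)
SELECT d.CntyCd FROM #Driver AS d
UNION ALL
SELECT CAST('06037' AS char(5));

-- Attempt 2: repeat the largest county 40 times via a VALUES clause (estimate rose only to 1.4M)
SELECT d.CntyCd FROM #Driver AS d
UNION ALL
SELECT CAST('06037' AS char(5))
FROM (VALUES (1), (2), (3), (4), (5) /* ... 40 rows in total ... */) AS n (i);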


SQLkiwi 2013-06-18 15:43:46
Ok, good luck with your tuning efforts. The parallel APPLY is a powerful technique, but you do need to avoid memory-consuming operators on the inner side of the join, or ensure each iteration has a similar number of rows to process.
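A bare-bones shape of that pattern, using stand-in table and column names, would be something like:

SELECT d.CntyCd, ca.PclId, ca.PclSeqNbr
FROM #Driver AS d                      -- one row per county (partition)
CROSS APPLY
(
    SELECT t.PclId, t.PclSeqNbr
    FROM dbo.Trans AS t
    JOIN dbo.Property AS p             -- ideally a collocated merge join per partition,
      ON p.CntyCd = t.CntyCd           -- with no sort or hash on this inner side
     AND p.PclId = t.PclId
     AND p.PclSeqNbr = t.PclSeqNbr
    WHERE t.CntyCd = d.CntyCd
) AS ca
OPTION (MAXDOP 8);                     -- driver rows are distributed across the threads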
GokhanVarol 2013-06-19 00:09:09
You are right. I think the best way is to go with the spilled sort and a merge join, with the partition function values also derived from the driver. One question about the sort order of the driver table: I have the row counts for the tables under the nested loops in the driver table. Should I sort by rows descending and pass the counties from the largest to the smallest row counts, or does it not matter? I heard that MAXDOP rows at a time are pulled from the driver table, each thread finishes one driver-table row, and after all are done another MAXDOP rows are pulled; if that is the case, I think I should order them by rows descending?
SQLkiwi 2013-06-19 08:42:37
I can't think of a reason it would matter.
GokhanVarol 2013-06-19 03:21:56
The only difference in the SQL between the two queries is that the WithPartition plan has the additional $PARTITION function and WithoutPartition does not. It seems to me that in this plan, regardless of $PARTITION function usage, it already eliminates down to the partition (the seek predicates all reference the extracted partition function).
I guess I could choose the WithoutPartition plan even though it uses a hash instead of a merge?
I also noticed that WithPartition (with merge and sort spill) was incrementing task_internal_alloc_gb, but WithoutPartition (with hash spill) did not increment task_internal_alloc_gb; is that how it works?
Thanks


SQLkiwi 2013-06-19 08:46:17
I think I would always prefer the explicit function. The overall effect may be similar in some cases, but it isn't exactly the same. More to the point, using the parallel apply pattern works best if partition sizes are broadly similar, or if the parallel inner side does not use memory-consuming operators like sort or hash. To really make things better, I think you're going to have to address one or more of those issues at some stage. I have no idea about the XE task you mentioned.
GokhanVarol 2013-06-19 13:49:40
I was looking into this DMV:
SELECT user_objects_alloc_page_count
, user_objects_dealloc_page_count
, internal_objects_alloc_page_count     -- worktables, workfiles, and sort runs
, internal_objects_dealloc_page_count
FROM sys.dm_db_task_space_usage
WHERE session_id = @SESSIONID

It seemed to me like it was incrementing internal_objects_alloc_page_count for the sort spill but not for the hash spill.

SQLkiwi 2013-06-19 15:34:58
The documentation for that DMV suggests it should increase for both hash work files and sort runs. Wouldn't be the first time the documentation was inaccurate though. I haven't checked the behaviour myself.