Remote Query Execution versus Local Query Execution
In the uploaded LOCAL_LINKED_SERVER_vs_NO_LINKED_SERVER_plans.pesession file you will find 2 separate actual execution plans captured from within Plan explorer.
The plans are from 2 different versions of a stored procedure called ksp_getAdmissionListForWardsJoiningPMI which differ as follows
1) The plan with History comments LOCAL LINKED SERVER version comes from the original version of the procedure which contains references to linked server tables e.g. [PMILS].[PMI].dbo.app_patient . However, the Data Source the Linked Server [PMILS] connects to is actually just another local database PMI in the same SQL Server instance. I'm told the historical reasoning behind the design decision for this is that the PMI database ever moved to a separate instance of its own for growth reasons then the code would be future-proofed! However, across all our customer base, I haven't ever seen any size or performance metrics presented that will drive such a change now or in the near future.
2) The plan with History comments NO LINKED SERVER VERSION comes from a modified version of the procedure which has had the linked server references replaced with just local database qualified references e.g. [PMILS].[PMI].dbo.app_patient has been replaced by [PMI].dbo.app_patient
Now Just by removing the Linked server references the elapsed execution times for procedures differ as follows
LINKED SERVER version 3.31 seconds
NO LINKED SERVER version 0.834 seconds
Both versions do around 370,000 logical reads
When I grabbed the wait stats for each version (outside Plan explorer in the 1st instance) the LINKED SERVER version is reporting 3.2 seconds of ASYNC_NETWORK_IO time whereas the NON LINKED SERVER version is reporting 0.9 seconds of CXPACKET waits.
I'm puzzled as to how the version of the actual plan capture from Plan Explorer only reports .0123 of NETWORK IO for the LINKED SERVER version?
So after all that Scene Setting my questions are as follows
1) Without finding anything online that explains in depth the under the hood workings of SQL Server's remote query execution engine I'm assuming that because SQL Server (obviously to my mind) assumes that the linked server is on another server the ASYNC_NETWORK_IO waits are unavoidable, because even though the ‘remote' query is actually executing in the local instance, however SQL server is executing the remote query is building packets that have to come back over the network e.g. TCP/ICP? Are there any ‘tricks' you guru's have under your hats to get round this, aside from removing the linked server references?
2) There are a couple of big estimated vs actual row count cardinality issues in the plan
a) The remote query actually bringing back only a single row whereas the estimate is for the 293,870 rows which is the row count of the [PMILS].[PMI].dbo.app_medic table. Clustered PK for this table is GPCode.
b) The Index Seek on [Evolve_Portal].[dbo].[app_admission].[idx_admission_patientward].(NonClustered) where the actual of 100,351 differs from the estimated of 7,913 by 1,168%. Tooltip says this “may indicate a problem with statistics for one or more tables/indexes in the query”.
The statistics for all indexes are up to date and are updated nightly using the Ola Hallengren scripts with @UpdateStatistics = 'ALL' & @StatisticsSample=100 as parameters.
Can you please give me some guidance on how to investigate the statistics to see if there is a problem and steps to try to fix them?
3) The NO LINKED SERVER VERSION of the plan has skewed workloads across the parallel threads. Can you advise what I can do to help balance out the workload?
Only one row is read from app_medic because it is under a merge join, where the second input has no rows. The merge joining process stops at that point – there is no point consuming more rows from the top input if the lower input has no rows to join with.
The seek on app_admission is on an expression value that involves an implicit conversion to nvarchar(100). Adjust the temporary table so that the column (patientWard and ward) data types match if possible. This may or may not improve the cardinality estimate, but is good practice in any case. It is also possible that the estimate is incorrect due to statistics caching on temporary tables in stored procedures.
Regarding question 3:
The parallel section ends up running on a single thread because only a single row is read from #tmp_ward. The optimizer's choice to use parallelism is very likely a knife-edge costing thing. I would concentrate first on tuning the serial (MAXDOP 1) version of the plan before worrying about parallelism at all. You might even find the MAXDOP 1 version is quicker due to the removal of parallel execution overheads.
My initial focus would be on removing the (often datetime) conversions that are currently preventing index seeks, or resulting in filtering operations being stuck higher up the plan than they perhaps need be. Aiming for SARGable predicates and for filtering to occur as early as possible are good general aims.