SQL Server 2016 internal memory pressure leads to the plan cache clearing

gopara 2017-08-03 12:56:07

Two weeks ago we migrated into SQL Server 2016 SP1 CU3 (previously our app was using SQL Server 2014 SP2). The application hasn't been upgraded or changed nor the workload has changed.

Since the migration, every few minutes up to more or less one hour, the proc cache is being flushed (not entirely, but still the majority of plans go away). If I run:

SELECT count (*) FROM sys.dm_exec_cached_plans

…just after clearing happens, the number of plans drops to 1 up to 3 hundred and then gradually increase to more or less 2,000, then usually clearing happens again and so on. It worth to mention that Buffer Pool seems to stay intact and only caches are affected.

The server runs on VMWare, it has 128 GB of RAM (SQL Server max server memory is set to 102 GB, min. server memory is set to 72 GB). Based on the output from SentryOne, I can see that buffer pool consumes ~61 GB.

SQL Server memory usage

My SQL Server version is as follows:

Microsoft SQL Server 2016 (SP1-CU3)
(KB4019916) – 13.0.4435.0 (X64)

It's a Standard Edition.

Finally, I came across an article by Jonathan Kehayias and I decided to check ring buffers and boom!, it turns out that I have notifications from resource monitors saying: 'low physical memory'. Occurrences of this notification fit perfectly to the proc cache clearing. Now the case is how to interpret these results and how to find the responsible process. As you can find in the query result:

    record.value('(/Record/ResourceMonitor/Notification)[1]', 'varchar(max)') as [Type],
    record.value('(/Record/ResourceMonitor/IndicatorsProcess)[1]', 'int') as [IndicatorsProcess],
    record.value('(/Record/ResourceMonitor/IndicatorsSystem)[1]', 'int') as [IndicatorsSystem],
    record.value('(/Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS [Avail Phys Mem, Kb],
    record.value('(/Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [Avail VAS, Kb] FROM (
        DATEADD (ss, (-1 * ((cpu_ticks / CONVERT (float, ( cpu_ticks / ms_ticks ))) - [timestamp])/1000), GETDATE()) AS EventTime,
        CONVERT (xml, record) AS record
    FROM sys.dm_os_ring_buffers
    CROSS JOIN sys.dm_os_sys_info
    WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR') AS tab ORDER BY EventTime DESC;

Ring buffer query results

we can observe 'low physical memory' flag although the value of Available Physical memory stays at the same level. Moreover, if I am not mistaken above results are indicating internal memory pressure (IndicatorsProcess = 2, which is weird to me as Sentry all the time shows that SQL Server doesn't fully utilize the allocated memory. This is memory usage captured by Sentry for a sample taken at 8 AM:

SQL Server memory usage

All lines are pretty flat. What is also weird to me is that group of events:


happens at the same time. So this pressure takes milliseconds or less (perhaps this is also the reason why Sentry doesn't capture anything as it collects data a way less frequently).

I tried to find the reason behind this internal pressure and I checked top 10 memory clerks (in terms of memory consumption) to see if there are any heavy consumers there:

Memory clerks query results

but to be fair I don't see anything suspicious there.

Another thing worth to say is I haven't tried LPIM yet, as it requires SQL Server to be restarted, but even if it's the solution I would really like to understand why this issue happens. Moreover, please correct me if I am wrong, but as buffer pool seems not to be affected by trimming I don't really think LPIM is a solution here.

Now I am completely lost and I don't really know what else should I check in order to find the root cause of the issue. Perhaps I can utilize Sentry One in a better way to investigate this issue. I would really appreciate if some can help me to solve this puzzle.

gopara 2017-08-10 06:39:17
Eventually, LPIM made a trick and solved / masked the issue, but the root cause is still uncertain and it might be even a bug in SQL Server.

If anyone is interested in further findings please navigate to the discussion on MSDN forum:
SQL Server 2016 memory pressure leads to the plan cache clearing