So today, back at the same project I was working on earlier with Linked Servers, I encounter the following. We have a 42 disk aggregate of SAS drives dedicated for SQL Server LUNs set up on our NetApp system. We’re Fiber attached with 10GB links. These are virtual machines on ESX 4.1 (If I remember the version we’re on). The one I’m working on is Windows Server 2008 R2 and SQL Server 2005 SP3 Enterprise (64 BIT).
Today, I’m checking some web based reports from the Filer, and see that this box is keeping the SQL Data lun quite busy… My guess is an average of 1,200 IOPs. Spikes over 2K, and quiet times under 500. So, I immediately think Memory pressure. Log onto the server, and total memory is 12 GB and SQL is set to a max of 3…. so the Sysadmin did a hot add of Memory at some point and didn’t tell me. (There’s no way I’m going to let 8 GB of RAM go free!).
The primary database is a heavy ETL, with one major process loading data sequentially into a table, and another pulling data out (not really sequentially, more grouped). This table is about 20GB of size in Data. So I know that 3GB is going to be low… but checking my Perfmon counters, I couldn’t see it…. Buffer Cache Hit Ratio is fluctuating between 96-99.xx percent. So, I bump the max memory setting to 6GB because I know that we’re getting a ton of paging if IOPs are this high. Oh, 50% of Waits are for PageIOLatch_SH. Anyways, SQL almost immediately starts gobbling up the new memory presented to it, so I know it’s hungry.
I call my SAN admin back, and we re-run the IOPs report, and it has dropped like a lead balloon… from a range of 500 to 2500 Operations per Second, down to off the chart. I’ll monitor for a few days, but I think this has made a huge difference.
Just wondering why the heck the Buffer Cache Hit Ratio was looking so good, and if that’s not valid as a counter for Memory on a SAN attached server, what is?
I’d go for for checking out “Page Life Expectancy” a low figure for that baby, SAN or not, will make the low memory stick out like a sore thumb. Pity that a lot of monitoring tools (like SCOM) don’t monitor this and focus on the Cache hit ratio instead.
96% is kinda low for the cache hit ratio, but personally, I tend to ignore it in favour of the PLE.
Thanks Bob. I’ll add that counter to my SAN attached servers.