Troubleshooting SQL Server Storage IO

Spread the love

Storage performance is something that puzzles a lot of SQL Server professionals. So in this blog, I will cover the basic steps I perform to ensure I get the best performance from the underlying storage.

How to measure the throughput of your Disk at the time of Server Setup?

Brent Ozar covered this question nicely in the blog postHow to Test Your Storage with CrystalDiskMark. To summarize, I keep my local desktop Gen4 SSD’s throughput as my benchmark for the production server’s minimum expected throughput.

Below is the disk throughput I got on my local desktop SSD. SEQ1MQ8T1 is relevant for backup drives. Values of RND4KQ32T16 should be referred for data/log file OLTP workload.

Disk Throughput on Gen4 SSD

How to Evaluate Performance of Disk Drives on Existing Server

To evaluate the performance of disk drives, I like the analogy used by Brent Ozar in the blog postHow Do You Test IOPs, Latency, and Throughput? We need to understand Latency, Throughput, IOPS (based on IO size), and Queue Length.

With modern SSDs, we want to keep Read/Write Latency under 3-5 milliseconds. For storage older than 5 years, I can live with 20 milliseconds of latency. Every component in Storage Subsystem including SQL Server, OS, VM Host, HBA, Network, or Storage has a tipping point after which the response time increases exponentially. To say in simple terms, if we increase Reads/Writes on storage, the latency might also increase. This correlation of throughput and latency (stall) is usually linear until it reaches the tipping point.

Usually, I look at all these metrics using my homegrown open-source SQLMonitor tool. On SQLMonitor, I have 3 dashboards that help me figure out if my server is having IO bottleneck.

Dashboard Monitoring – Perfmon Counters – Quest Softwares – Distributed exposes perfmon metrics suggested in SQL Server Perfmon Counters of Interest poster of Quest Softwares.

On the above dashboard, I can see avg Logical Disk – Latency is well below 5 ms. If I happen to see huge latency, then I like to know if the no of operations on the disk drive has increased. This I see in the top panel Logical Disk (Disk Queue Length) under the same row. Ideally, we want to keep Avg Disk Queue Length under 1. If that’s not the case, then storage is receiving more Reads/Writes than it can handle.

Dashboard Wait Stats can also be used to derive similar conclusions by looking at Storage Related waits like PAGEIOLATCH_* , WRITELOG, IO_COMPLETION, ASYNC_IO_COMPLETION. Most important is to look for PAGEIOLATCH_* wait type’s Avg Time Per Wait. It can be considered Read/Write latency like we had in the above dashboard.

SQL Server Disk Latency using Wait Stats on SQLMonitor

Dashboard t__Database File IO Stats exposes captured information of sys.dm_io_virtual_file_stats DMV in both tabular & Time Scale formats. This dashboard tells me all metrics like Latency, IOPS, Throughput, and total Stalls for each database file.

SQL Server Disk Latency using sys.dm_io_virtual_file_stats DMV on SQLMonitor

This dashboard is very helpful in figuring out what databases or files are doing maximum Reads/Writes, and in the case of disk latency observed, which ones make sense to be moved to their dedicated volumes. This dashboard is very useful in scenarios of SQL Server migration projects from OnPrem to Cloud where we need to figure out IOPS & IO Share among databases.

There are cases when I have to work on a server that is not baselined with SQLMonitor. In those cases, I rely on First-Responder-Kit sp_BlitzFirst. The result has pretty much required data that helps me conclude storage issues in real-time.

Disk Latency using sp_BlitzFirst WAIT STATS result
Disk Latency using sp_BlitzFirst File IO Stats result

How to Resolve High Disk IO Latency Issue

The following are the possible solutions that can be used in the same order to resolve the disk latency issue.

  1. Brent Ozar’s D.E.A.T.H Method of Index Tuning. I learned this from his Mastering Index Tuning classes. Highly recommended.
  2. Dedicated Disk Drives. Place high Reads/Writes database files on dedicated disk drives. This is really helpful if simple tuning techniques like tweaking the read/write bias on the RAID controller’s cache, right-sizing the NTFS allocation units, and disk striping are already in place.
  3. Query Tuning. This step can do magic but could take days, months, or years to work upon.
  4. Add more RAM. Cache more data by adding RAM to reduce the physical IO.
  5. Replace SAN storage 🙂

I hope this would be helpful to SQL DBA DBA suspecting storage bottleneck on their servers.

3 Comments

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.