As a senior DBA, my job responsibility includes ensuring all the SQLServers in my organization are always issue-free and running at peak performance all the time. This was not as easy as it sounds until I created my own free SQLServer monitoring tool SQLMonitor.
At the start of my day, and at regular intervals, I visit the SQLServer / Monitoring – Live – All Servers dashboard. This dashboard has one row exposing the list of SQLServer that seems to be offline/connectivity issues.
Then there is a row that exposes the health metrics of any SQLServer that need some kind of DBA help. For example, the below screenshot shows server SqlPractice with 100% CPU & 6.2 minutes of Waits Per Core Per Minute. It’s very clear that the server is facing a high CPU issue at moment.
But not all problems on the server come in the shape of high CPU, low memory, or Blocking. For example, in below screenshot, all core metrics related to cpu, blocking, memory grant, sql connections look OK.
On this Wait Stats dashboard, In most cases for general performance analysis, I visit details of just the 1st row which exposes Wait Stats since the Startup. But in a specific scenario like this where we are actively looking to find out the issue at moment, I go through the details shared in 2nd row Selective – Wait Stats. Below is the screenshot of wait stats of Server having issue into 2 panel.
The panel Wait Stats ____Since Startup__till__yyyy-mm-dd hh:mm___ shows the wait stats of server since startup till the start time range selected in Grafana. The 2nd panel Wait Stats ___In Select Time Duration ___Since___yyyy-mm-dd hh:mm___till___yyyy-mm-dd hh:mm___ shows the Wait Stats accumulated in the time duration select in Grafana. Looking both the panel at same time gives me idea of how the server is behaving compared to since startup. For example, in above screenshot, usually PAGEIOLATCH_** wait is not the TOP wait on the server, but the same is at top of the wait list during the time of issue. What’s more important in this case is the Avg Time Per Wait. Total Avg Time Per Wait for PAGEIOLATCH_** waits in the above screenshot is 115ms / 374ms / 1.6 Seconds. All these values are HUGE, and should not cross 15-20 milli seconds in general.
As highlighted in the above screenshots, I can directly see that the Mean Latency for tempdb mounted volumes is way past even 100ms. This is one of those moments where even simple scripts like sp_BlitzFirst which usually takes 30-40 seconds would take forever to complete. This was a clear case of a high number of IO requests crossing the tipping point from where the disk latency started increasing exponentially.
In order to bring the server immediately to a healthy state, we were left with no choice but to have an agreed reboot of the box. This brought back the SQLServer to a normal performance state.
I am Microsoft Certified Professional having 10+ years of experience in SQL Server Querying, Database Design, and Administration. I am fond of Query Tuning and like to automate things using TSQL & PowerShell. I also have experience of implementing end-to-end Data Warehouse solution, and Data Migration using ETL tools SQL Server Integration Services (SSIS), Pentaho Business Analytics, and have designed Database Inventory through PowerShell, Python, and Django etc.