How to Monitor All SQLServers LIVE
As a senior DBA, my job responsibility includes ensuring all the SQLServers in my organization are always issue-free and running at peak performance all the time. This was not as easy as it sounds until I created my own free SQLServer monitoring tool SQLMonitor.
At the start of my day, and at regular intervals, I visit the SQLServer / Monitoring – Live – All Servers dashboard. This dashboard has one row exposing the list of SQLServer that seems to be offline/connectivity issues.
Then there is a row that exposes the health metrics of any SQLServer that need some kind of DBA help. For example, the below screenshot shows server SqlPractice with 100% CPU & 6.2 minutes of Waits Per Core Per Minute. It’s very clear that the server is facing a high CPU issue at moment.
But not all problems on the server come in the shape of high CPU, low memory, or Blocking. For example, in below screenshot, all core metrics related to cpu, blocking, memory grant, sql connections look OK.
Still, one key metric Waits Per Core Per Minute in the above screenshot is above 5 minutes. Brent Ozar explained this well in this blog. This is one of those situations where the server is not healthy due to non-common reasons. From here, I click on the highlighted value of Waits Per Core Per Minute which takes me to SQLServer / Wait Stats dashboard showing Wait Stats of the concerned server.
On this Wait Stats dashboard, In most cases for general performance analysis, I visit details of just the 1st row which exposes Wait Stats since the Startup. But in a specific scenario like this where we are actively looking to find out the issue at moment, I go through the details shared in 2nd row Selective – Wait Stats. Below is the screenshot of wait stats of Server having issue into 2 panel.
The panel Wait Stats ____Since Startup__till__yyyy-mm-dd hh:mm___ shows the wait stats of server since startup till the start time range selected in Grafana. The 2nd panel Wait Stats ___In Select Time Duration ___Since___yyyy-mm-dd hh:mm___till___yyyy-mm-dd hh:mm___ shows the Wait Stats accumulated in the time duration select in Grafana. Looking both the panel at same time gives me idea of how the server is behaving compared to since startup. For example, in above screenshot, usually PAGEIOLATCH_** wait is not the TOP wait on the server, but the same is at top of the wait list during the time of issue. What’s more important in this case is the Avg Time Per Wait. Total Avg Time Per Wait for PAGEIOLATCH_** waits in the above screenshot is 115ms / 374ms / 1.6 Seconds. All these values are HUGE, and should not cross 15-20 milli seconds in general.
Above metrics were RED alerts for me suggesting that SQLServer is having IO Latency problems right now. So from here, I usually check Logical Disk Latency on SQLServer / Monitoring – Perfmon Counters – Quest Softwares – Distributed dashboard. Below is how the disk latency on the server.
As highlighted in the above screenshots, I can directly see that the Mean Latency for
tempdb mounted volumes is way past even 100ms. This is one of those moments where even simple scripts like sp_BlitzFirst which usually takes 30-40 seconds would take forever to complete. This was a clear case of a high number of IO requests crossing the tipping point from where the disk latency started increasing exponentially.
In order to bring the server immediately to a healthy state, we were left with no choice but to have an agreed reboot of the box. This brought back the SQLServer to a normal performance state.
For a permanent fix, I already talked about possible solutions in my previous blog post Troubleshooting SQL Server Storage IO.
I hope this would be helpful to SQL DBA DBA who are having hard time figuring out the issues with their SQLServers.