Live Troubleshooting High CPU on SQL Server – Part 3
In my previous blogs Live Troubleshooting High CPU On SQL Server – Part 1 and Live Troubleshooting High CPU On SQL Server – Part 2, I talked about various scenarios of CPU issues. To summarize:
- Ensure the CPU issue is not caused by non-SQL processes.
- If the CPU is being consumed almost entirely by the SQL Server process, use tools like sp_BlitzFirst or the open-source SQLMonitor to check for a deviation in workload (batch requests/sec), high compilations or recompilations (> 20% of batch requests), or high forwarded records compared to previous baselines (see the sketch after this list).
- In rare cases, a single non-optimized query executing at very high frequency can also cause high CPU. This typically happens when that query has issues like string manipulation, implicit conversions, cursors, functions, etc.
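For the second point above, if neither tool is at hand, a minimal sketch (my own, not from Parts 1 or 2) is to read the workload counters straight from sys.dm_os_performance_counters. Note these "per second" counters are cumulative since SQL Server startup, so the ratio below is a long-term average; sample twice and diff the values for a live rate.

-- Rough compile/recompile ratio versus batch requests (cumulative since startup)
select
    max(case when counter_name = 'Batch Requests/sec'      then cntr_value end) as batch_requests,
    max(case when counter_name = 'SQL Compilations/sec'    then cntr_value end) as compilations,
    max(case when counter_name = 'SQL Re-Compilations/sec' then cntr_value end) as recompilations,
    cast(max(case when counter_name = 'SQL Compilations/sec' then cntr_value end) * 100.0
         / nullif(max(case when counter_name = 'Batch Requests/sec' then cntr_value end), 0)
         as decimal(10,2)) as compilations_pct_of_batches
from sys.dm_os_performance_counters
where object_name like '%SQL Statistics%'
and counter_name in ('Batch Requests/sec', 'SQL Compilations/sec', 'SQL Re-Compilations/sec');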
Recently I was pulled into an ongoing high CPU issue on a 16-core server. The server CPU was high most of the time. I was able to validate the CPU usage from the Monitoring – Perfmon Counters – Quest Softwares – Distributed dashboard.
Following the checklist above, I confirmed that the CPU was being consumed entirely by SQL Server queries. I started with my own comprehensive Server Health Check query, described in Troubleshooting High CPU On SQL Server – Part 2. To my surprise, the query did not complete even after a long wait. So I executed sp_WhoIsActive in another window; even after several minutes, sp_WhoIsActive did not complete either. So I decided to check the Wait Stats dashboard.
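If a wait stats dashboard is not available, a rough equivalent (my sketch, not the dashboard query itself) is to read sys.dm_os_wait_stats directly, filtering out a few well-known benign wait types:

-- Top waits since the last restart (or since wait stats were last cleared)
select top (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms / 1000.0        as wait_time_sec,
    signal_wait_time_ms / 1000.0 as signal_wait_sec
from sys.dm_os_wait_stats
where wait_type not in ('SLEEP_TASK', 'BROKER_TO_FLUSH', 'SQLTRACE_BUFFER_FLUSH',
                        'LAZYWRITER_SLEEP', 'XE_TIMER_EVENT', 'XE_DISPATCHER_WAIT',
                        'BROKER_TASK_STOP', 'CLR_AUTO_EVENT', 'DIRTY_PAGE_POLL',
                        'HADR_FILESTREAM_IOMGR_IOCOMPLETION', 'SP_SERVER_DIAGNOSTICS_SLEEP')
order by wait_time_ms desc;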
In the Wait Stats dashboard, I could clearly see PAGELATCH_** waits at the top along with SOS_SCHEDULER_YIELD. The combination of these two waits indicates high CPU caused by contention on access to data file pages. I validated that the PAGELATCH_** waits were present almost all the time on the server. So I decided to check dbo.WhoIsActive, the table where the SQLMonitor tool stores the data captured by sp_WhoIsActive.
use DBA
go

select *
from dbo.WhoIsActive w
where w.collection_time between '2022-12-29 09:00' and '2022-12-29 12:00'
and w.wait_info like '%PAGELATCH%'
go
In the wait_info column of the above dbo.WhoIsActive results, I could clearly see that sub-optimized queries creating/dropping temp tables thousands of times per minute were causing tempdb contention on the server.
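To confirm live that the page latches really are on tempdb allocation pages, a query along these lines can be used (my sketch, not the query from the post). In wait_resource, tempdb is database_id 2, and allocation contention typically shows up on the PFS/GAM/SGAM pages (page ids 1, 2, 3, repeating at fixed intervals through each data file):

-- Sessions currently waiting on tempdb page latches, with their statement text
select
    r.session_id,
    r.wait_type,
    r.wait_resource,   -- format: dbid:fileid:pageid, e.g. 2:1:1 is the PFS page of tempdb file 1
    r.wait_time,
    t.text as sql_text
from sys.dm_exec_requests r
cross apply sys.dm_exec_sql_text(r.sql_handle) t
where r.wait_type like 'PAGELATCH%'
and r.wait_resource like '2:%';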
So I executed the below query to check the tempdb files and their size configuration:
use tempdb;
go

-- Find used/free space in database files
select SERVERPROPERTY('MachineName') AS srv_name,
       DB_NAME() AS [db_name],
       f.type_desc,
       fg.name as file_group,
       f.name,
       f.physical_name,
       [size_GB] = convert(numeric(20,2),(f.size*8.0)/1024/1024),
       f.max_size,
       f.growth,
       [SpaceUsed_gb] = convert(numeric(20,2),CAST(FILEPROPERTY(f.name, 'SpaceUsed') as BIGINT)/128.0/1024),
       [FreeSpace_GB] = convert(numeric(20,2),(f.size/128.0 - CAST(FILEPROPERTY(f.name,'SpaceUsed') AS INT)/128.0)/1024),
       cast((FILEPROPERTY(f.name,'SpaceUsed')*100.0)/f.size as decimal(20,2)) as Used_Percentage,
       CASE WHEN f.type_desc = 'LOG'
            THEN (select d.log_reuse_wait_desc from sys.databases as d where d.name = DB_NAME())
            ELSE NULL
       END as log_reuse_wait_desc
from sys.database_files f
left join sys.filegroups fg on fg.data_space_id = f.data_space_id
order by FreeSpace_GB desc;
go
From the result of the above query, I found that the tempdb database had only a single data file. So, to at least partially reduce the contention, I went ahead and added 7 additional tempdb data files (for a total of 8), ensuring the initial size and auto-growth settings were equal across all the tempdb data files.
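For reference, adding a data file looks roughly like the below. The logical name, path, size, and growth values here are placeholders, not the actual values used on that server; they should match the existing tempdb data file, and the statement is repeated for each additional file.

-- Hedged sketch with placeholder name/path/size/growth values; repeat for files 3..8,
-- keeping SIZE and FILEGROWTH identical across all tempdb data files
alter database tempdb
add file
(
    name       = N'tempdev2',                      -- hypothetical logical name
    filename   = N'T:\tempdb\tempdb_mssql_2.ndf',  -- hypothetical path
    size       = 8GB,                              -- match the existing data file
    filegrowth = 1GB                               -- keep growth equal across all files
);
go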
Conclusion: The SQL Server instance was dealing with tempdb contention and presenting it in the form of high CPU, with SOS_SCHEDULER_YIELD and PAGELATCH_UP as the top waits. Any diagnostic T-SQL tool/query that tried to create/drop #temp tables got stuck (never completing), so only diagnostic queries that avoided tempdb were able to help me figure out the issue. Adding additional tempdb data files got rid of the high CPU issue to a large extent.
To reduce the CPU usage even further, I later ran Pssdiag & SqlNexus, found the top 2 batches/queries by CPU usage, and optimized them by getting rid of parameter sniffing and a Kitchen Sink query design. I will probably write separate blogs on those issues in the future. The XEvent infrastructure of SQLMonitor did not help here, because it captures only queries with a duration of 5 seconds or more and so cannot catch high-frequency queries that finish within 5 seconds.
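If Pssdiag/SqlNexus is not an option, one rough alternative (my sketch, not the method used above) is to pull the top CPU consumers from the plan cache, which also surfaces high-frequency statements that finish well under 5 seconds:

-- Top cached statements by total CPU (total_worker_time is in microseconds)
select top (10)
    qs.execution_count,
    qs.total_worker_time / 1000                          as total_cpu_ms,
    (qs.total_worker_time / qs.execution_count) / 1000.0 as avg_cpu_ms,
    substring(st.text, (qs.statement_start_offset/2) + 1,
        ((case qs.statement_end_offset when -1 then datalength(st.text)
          else qs.statement_end_offset end - qs.statement_start_offset)/2) + 1) as statement_text
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
order by qs.total_worker_time desc;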
I hope this is helpful to the DBAs out there.
Kindly like, subscribe & share if this post helped you in any way.