Troubleshooting High CPU on SQL Server – Part 2
In the last blog post Live Troubleshooting High CPU On SQL Server – Part 1, we worked on a scenario where we saw a high CPU on SQL Server due to some external OS level task. In this blog, we are going to explore a scenario where a high CPU issue is present because of the workload running on SQL Server.
Just like in the last blog post scenario, when I have to troubleshoot a “slow” SQL Server, if my SQL Server is baselined with the SQLMonitor tool, then I first visit my Monitoring – Live – All Servers dashboard which displays all the metrics of specific SQL Servers that need DBA help.
Here under the 3rd panel All Servers – Health Metrics – Require ATTENTION, I notice that there is high utilization for both OS CPU & SQL CPU. This makes it clear that there are some queries/workload running on server that are causing high CPU.
From here, I open Monitoring – Live – Distributed dashboard which helps me check a lot more health metrics for a SQL Server Instance. In this dashboard also, I try to ensure that server is entirely facing CPU issue, and not accompanied by any other indicator like High Memory Grants, or excessive workload than usual. In this scenario, I don’t notice anything apart from high CPU at OS & SQL level.
So it is clear that there are some queries/workloads running that are causing above high CPU. From here, since the issue is LIVE and happening at moment, I can execute my Live Health Check query that displays the workload information in a kind of observable manner.
On the above 4 core server, at this moment, we have about 9 active_request_counts. We also get to see what Logins/programs are contributing how many active requests at present. To me, from the above result, it seems that the resources are evenly distributed among various logins/programs. So rather, I would proceed to check further at query pattern.
From the below result of the Live Health Check query, it seems there is a function named dbo.IsTagBanned which is involved in most of the currently running requests/queries.
Similar information is revealed in the resultset “All-Sessions-SortedBy-cpu” of the Live Health Check query.
So in order to resolve the issue, appropriate action would depend on the consent of application teams. We can go for following solutions –
- Kill all the sessions having specific function/procedure/piece of code which seems to be the main cause of high CPU at the moment.
- Perform RCA for the time duration of high CPU, and analyze the workload of today compared to yesterday. This comparison is easily possible using XEvent infra of SQLMonitor tool which captures sql_batch_completed & rpc_completed xevents into tables dbo.resource_consumption
- In order to fully control the CPU issue caused by sessions of specific login/program/host, use advanced methods like LogOn Triggers or Resource Governor. We will talk about this in future blogs.
- In case, the server is not baselined with SQLMonitor, then utilize sp_BlitzCache to get top CPU consuming queries.
1 |
EXEC master..sp_BlitzCache @Top = 30, @SortOrder = 'CPU' -- queries consuming high cpu time |
Additionally, open source tools like sp_Blitz would tell you if your server is dealing with general problems like Plan Cache in-stability, high adhoc workload, Stored Procedures WITH RECOMPILE hints etc. Similarly, interesting findings may show up in sp_BlitzFirst tool also if we run the same under the CPU issue time. For example, in this workload case, I see the following red flags on my server –
- Forwarded Fetches/Sec High – which indicates there are some active heap tables with high forwarded records in them. Solutions could be to create a Clustered Index or temporarily rebuild the heap table.
- Compilations/Sec High – Usually if the number of compilations is more than 20% of Batch Requests per Sec, then I prefer to check for any stored procedure which is generating Adhoc queries using dynamic SQL within it. This is something where tools like QueryStore or sp_HumanEvents may be helpful, but should be used with precautions.
- Re-Compilations/Sec High – Similar to Compilations/Sec, I prefer to keep this number below 20% of Batch Requests per Sec. To troubleshoot this warning, I prefer to look for stored procedures, or statements having RECOMPILE hints in them. Also, change in statistics of the physical/temp table may also trigger Re-Compilations.
Now, once the high CPU issue is resolved using above mentioned steps, for RCA, I prefer to use XEvent infra of SQLMonitor, and then sp_BlitzCache.
On my server, CPU was almost 100% b/w 3:40 PM IST till 4:53 PM IST.
So below are some different queries I use –
RCA Query 01 – Get top Login/Programs by CPU from XEvent infra of SQLMonitor –
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
CPU Usage Between Time Window By Login/Program">declare @start_time datetime2 = '2022-10-05 15:40:00'; declare @end_time datetime2 = '2022-10-05 16:53:00'; declare @sql nvarchar(max); declare @params nvarchar(max); set @params = '@start_time datetime2, @end_time datetime2' set quoted_identifier off; set @sql = " select rc.username, rc.client_app_name, sum(cpu_time/1000000)/60 as cpu_time_minutes, convert(numeric(20,2),sum(logical_reads)*8.0/1024/1024) as logical_reads_gb, convert(numeric(20,2),sum(logical_reads)*8.0/1024/count(*)) as logical_reads_mb_avg, sum(rc.duration_seconds)/60 as duration_minutes, sum(rc.duration_seconds)/count(*) as duration_seconds_avg ,count(*) as counts , sum(cpu_time/1000000)/count(*) as cpu_time_seconds_avg from dbo.resource_consumption rc where rc.start_time between @start_time and @end_time or rc.event_time between @start_time and @end_time group by rc.username, rc.client_app_name order by cpu_time_minutes desc, logical_reads_mb_avg desc, logical_reads_gb desc, duration_minutes desc " set quoted_identifier on; exec sp_executesql @sql, @params, @start_time, @end_time go |
RCA Query 02 – Find CPU Consuming Queries using XEVent Infra of SQLMonitor –
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
use DBA go declare @start_time datetime = '2022-10-05 15:40:00'; declare @end_time datetime = '2022-10-05 16:53:00'; declare @database_name nvarchar(255) --= 'StackOverflow'; declare @login varchar(255); declare @program varchar(255) --= 'dbatools'; declare @host varchar(255); declare @table_name nvarchar(500) --= 'Users'; declare @str_length smallint = 50; declare @sql_string nvarchar(max); declare @params nvarchar(max); set @params = N'@start_time datetime, @end_time datetime, @database_name nvarchar(255), @login varchar(255), @program varchar(255), @host varchar(255), @table_name nvarchar(500), @str_length smallint'; if object_id('tempdb..#queries') is not null drop table #queries; CREATE TABLE #queries ( [grouping-key] [nvarchar](200), [cpu_time_minutes] [bigint], [cpu_time_seconds_avg] [bigint], [logical_reads_gb] [numeric](20, 2), [logical_reads_gb_avg] [numeric](20, 2), [logical_reads_mb_avg] [numeric](20, 2), [writes_gb] [numeric](20, 2), [writes_mb] [numeric](20, 2), [writes_gb_avg] [numeric](20, 2), [writes_mb_avg] [numeric](20, 2), [duration_minutes] [bigint], [duration_minutes_avg] [bigint], [duration_seconds_avg] [bigint], [counts] [int] ); set quoted_identifier off; set @sql_string = " ;with cte_group as ( select [grouping-key] = (case when client_app_name like 'SQL Job = %' then client_app_name else left("+DB_NAME()+".dbo.normalized_sql_text(sql_text,150,0),@str_length) end), [cpu_time_minutes] = sum(cpu_time/1000000)/60, [cpu_time_seconds_avg] = sum(cpu_time/1000000)/count(*), [logical_reads_gb] = convert(numeric(20,2),sum(logical_reads)*8.0/1024/1024), [logical_reads_gb_avg] = convert(numeric(20,2),sum(logical_reads)*8.0/1024/1024/count(*)), [logical_reads_mb_avg] = convert(numeric(20,2),sum(logical_reads)*8.0/1024/count(*)), [writes_gb] = convert(numeric(20,2),sum(writes)*8.0/1024/1024), [writes_mb] = convert(numeric(20,2),sum(writes)*8.0/1024), [writes_gb_avg] = convert(numeric(20,2),sum(writes)*8.0/1024/1024/count(*)), [writes_mb_avg] = convert(numeric(20,2),sum(writes)*8.0/1024/count(*)), [duration_minutes] = sum(rc.duration_seconds)/60, [duration_minutes_avg] = sum(rc.duration_seconds)/60/count(*), [duration_seconds_avg] = sum(rc.duration_seconds)/count(*), [counts] = count(*) from "+DB_NAME()+".dbo.resource_consumption rc where ( rc.event_time between @start_time and @end_time or rc.start_time between @start_time and @end_time ) "+(CASE WHEN @database_name IS NULL THEN "--" ELSE "" END)+"and rc.database_name = @database_name "+(CASE WHEN @table_name IS NULL THEN "--" ELSE "" END)+"and rc.sql_text like ('%'+@table_name+'%') "+(CASE WHEN @login IS NULL THEN "--" ELSE "" END)+"and rc.username = @login "+(CASE WHEN @program IS NULL THEN "--" ELSE "" END)+"and rc.client_app_name = @program "+(CASE WHEN @host IS NULL THEN "--" ELSE "" END)+"and rc.client_hostname = @host and result = 'OK' group by (case when client_app_name like 'SQL Job = %' then client_app_name else left("+DB_NAME()+".dbo.normalized_sql_text(sql_text,150,0),@str_length) end) ) select * from cte_group ct " set quoted_identifier on; insert #queries exec sp_ExecuteSql @sql_string, @params, @start_time, @end_time, @database_name, @login, @program, @host, @table_name, @str_length; set quoted_identifier off; set @sql_string = " select top 200 [Normalized-Text] = q.[grouping-key], rc.sql_text, q.counts, q.cpu_time_seconds_avg, q.logical_reads_gb_avg, duration_seconds_avg, q.cpu_time_minutes, q.logical_reads_gb, q.duration_minutes, rc.event_name, rc.database_name, rc.client_app_name, rc.username, rc.client_hostname, rc.row_count /* ,q.* ,rc.* */ from #queries q outer apply (select top 1 * from dbo.resource_consumption rc where ( rc.event_time between @start_time and @end_time or rc.start_time between @start_time and @end_time ) "+(CASE WHEN @database_name IS NULL THEN "--" ELSE "" END)+"and rc.database_name = @database_name "+(CASE WHEN @table_name IS NULL THEN "--" ELSE "" END)+"and rc.sql_text like ('%'+@table_name+'%') "+(CASE WHEN @login IS NULL THEN "--" ELSE "" END)+"and rc.username = @login "+(CASE WHEN @program IS NULL THEN "--" ELSE "" END)+"and rc.client_app_name = @program "+(CASE WHEN @host IS NULL THEN "--" ELSE "" END)+"and rc.client_hostname = @host and result = 'OK' and q.[grouping-key] = (case when rc.client_app_name like 'SQL Job = %' then rc.client_app_name else left("+DB_NAME()+".dbo.normalized_sql_text(rc.sql_text,150,0),@str_length) end) order by rc.cpu_time desc ) rc order by q.cpu_time_minutes desc " set quoted_identifier on; exec sp_ExecuteSql @sql_string, @params, @start_time, @end_time, @database_name, @login, @program, @host, @table_name, @str_length; go |
From above result, we can see, 3 stored procedure has very high execution counts with avg CPU time over 1 second. So these are the queries that are causing 100% CPU on my server.
RCA Query 03 – In case we don’t have SQL Server baselined with SQLMonitor tool, then the best option we have is sp_BlitzCache.
1 |
exec sp_BlitzCache @SortOrder = 'cpu' |
Below is the SQL Server wait stats for the duration of the issue that I generally share with the application team as part of my RCA. This we can easily get from the Wait Stats dashboard of SQLMonitor.
Most Important – What causes High CPU consumption within Queries?
While troubleshooting the queries consuming CPU time, we want to look for the below red flags –
- String processing (LIKE ‘%SomeString%’)
- Row-based processing (cursors, functions)
- Sorting/grouping/distinct in queries
- Heaps with forwarded fetches
- Implicit conversions
Finally, Thanks to Arvind Shyamsundar for sharing his TSQLTextNormalizer project code. We are using this TsqlTextNormalizer in SQLMonitor based RCA Query 02 shared above.
I hope this would be helpful to you DBAs out there.
Kindly like, subscribe & share if this alert helped you in any way.