Create & Populate StackOverflow.dbo.PostTags table

Spread the love

As a DBA & instructor, very often I like to generate heavy workload on my SQL Server instances to re-produce scenarios like 100% CPU, zero PLE, high IO, etc. For these activities, I find StackOverflow dump very handy.

In case you are not aware of StackOverflow dump, and want to start fresh, below links might help:-

— 01) Common Queries
https://data.stackexchange.com/stackoverflow/queries

— 02) Brent Introduction to StackOverflow Schema
https://www.brentozar.com/archive/2018/02/gentle-introduction-stack-overflow-schema/

— 03) Database schema documentation for the public data dump and SEDE
https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

— 04) Query Plan Operators
https://sqlserverfast.com/epr/operator-list/

— 05) Database for practice
https://www.brentozar.com/archive/2018/01/updated-and-smaller-stack-overflow-demo-databases/
https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/

Very recently, BrentOzar provided Extra-Large version of StackOverflow database of 381GB size from stack dump of 2020/06: 46GB torrent (magnet).

But, StackOverflow dump does not include separate dump for table dbo.PostTags which is very commonly utilized in various queries. Rather, the same data come as part of column Tags in table dbo.Posts.

In order to create & populate a separate table StackOverflow.dbo.PostTags, I wrote the below query utilizing STRING_SPLIT function which was added in SQL 2016.

USE StackOverflow
go

create table dbo.PostTags 
(	PostId int not null, 
	TagId int not null,
	constraint pk_PostTags primary key (PostId, Tagid),
	constraint fk_PostTags__PostId foreign key (PostId) references dbo.Posts (Id),
	constraint fk_PostTags__TagId foreign key (TagId) references dbo.Tags (Id)
);
go

alter table dbo.PostTags nocheck constraint fk_PostTags__PostId;
alter table dbo.PostTags nocheck constraint fk_PostTags__TagId;
GO

declare @counter int = 1;
declare @batch_size int = 10000;
while ( ((@counter-1)*@batch_size) <= (select MAX(id) from dbo.Posts) )
begin
	insert dbo.PostTags (PostId, TagId)
	select p.Id as PostId, t.Id as TagId
	from dbo.Posts as p
	outer apply
		( select ltrim(rtrim(pt.value)) as PostTag
		  from STRING_SPLIT(REPLACE(p.Tags,'<',''), '>') as pt
		  where ltrim(rtrim(pt.value)) <> ''
		  and p.Id > ((@counter-1)*@batch_size)
		and p.Id <= (@counter*@batch_size)
		) as pt
	join dbo.Tags as t
		on ltrim(rtrim(t.TagName)) = pt.PostTag
	where p.Id > ((@counter-1)*@batch_size)
		and p.Id <= (@counter*@batch_size)

	set @counter += 1;
end
go

alter table dbo.PostTags with check check constraint fk_PostTags__PostId;
alter table dbo.PostTags with check check constraint fk_PostTags__TagId;
GO

With all the required tables, you can starting writing/utilizing common queries to generate/test SQL Server performance issues.

I hope this has been helpful. Happy Coding 🙂

  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.