
OM84 asked:

Deleting 10+ million records from a database table

Hi guys, I am in a situation where I have to delete data older than 6 months from a table. The data in there goes back about 4 years and totals 1.8 billion rows. I have written a DELETE statement to remove the data older than 6 months, but since all deletes are logged in the transaction log, the log will grow a lot. We have autogrow configured at 10 percent, unrestricted growth.

I am basically looking for advice on the best way to approach this problem. Should I delete in multiple stages: data older than 3 years on one day, older than 2 years the next day, and so on? Or should I just run one statement deleting all data older than 6 months? What kind of problems would I hit with something like this?

Please let me know. Thanks in advance.
Tags: t-sql, transaction-log, delete, architecture

Scot Hauder commented:
What is the recovery mode of the database?
Mark answered:
Oleg has a winning [answer to a similar question here](http://ask.sqlservercentral.com/questions/7079/advise-on-deleting-huge-amount-of-rows).

WilliamD answered:
As you already noted, deleting data will cause activity on the transaction log. I would suggest you look at a few things before going ahead with this.

1. Can you try a different solution to deleting? You could select the data you want to keep into a new table, drop the existing table, and rename the new table. This may not be an option if you have foreign keys etc. or the table has to remain online while you do this.
2. Do you have SQL Server Enterprise Edition? If so, have you considered partitioning? You could push the current table into a partition scheme, splitting the newest data (that you want to keep) into a separate partition. You can then delete the old data without causing concurrency issues.
3. If you don't like the two options above, or cannot use them, please take care to pre-size your transaction log file. Ideally you will test the delete on a test machine and find out how large your log file will get.
4. You say the table is 1.8 billion rows. How many of these rows are within the last 6 months, and how many will be deleted? What are the table and index structures like? Do you have an estimated query plan that we could look at to offer some advice on that?
5. Will the table still be in use while the deletion is taking place, or do you have a maintenance window?

Regardless of what you end up doing, deleting large amounts of data is best done in batches. The size of the batch is dependent upon a great many factors, some of which I have mentioned already. The most important part is TEST, TEST, TEST. This will give you an idea of how long the delete will take and how much it will impact your system as a whole. Good luck and let us know if you need more help.
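If the plain DELETE route is chosen, a loop that removes a limited number of rows per transaction keeps each log hit small and lets the log be reused between batches (with CHECKPOINTs under SIMPLE recovery, or log backups under FULL). This is only a minimal sketch; the table dbo.BigTable and column CreatedDate are made-up names standing in for the real schema:

    -- Batched delete: each iteration is its own small transaction,
    -- so the log can be truncated or backed up between batches.
    DECLARE @BatchSize int = 50000;
    DECLARE @Cutoff datetime = DATEADD(MONTH, -6, GETDATE());
    DECLARE @Rows int = 1;

    WHILE @Rows > 0
    BEGIN
        DELETE TOP (@BatchSize)
        FROM dbo.BigTable           -- hypothetical table name
        WHERE CreatedDate < @Cutoff;

        SET @Rows = @@ROWCOUNT;

        -- Optional: CHECKPOINT (SIMPLE recovery) or BACKUP LOG here,
        -- plus a short WAITFOR DELAY to ease blocking on a busy system.
    END

The batch size is something to tune on a test system first: too small and the delete takes forever, too large and you are back to long-running transactions and log growth.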

Oleg commented:
@WilliamD +1 Very good answer, and it looks like I am late to the party (typing too slowly).
Oleg answered:
Well, I think that in this situation, Mark's answer to the question which he linked (which was similar to this one) would be the best approach in your case. Here is why: you have a table with 1.8 billion rows going back 4 years. Assuming the inserts were relatively evenly distributed and you need to keep data for the last 6 months, you have to delete 87.5% of your records and keep 12.5%. This means deleting about 1.575 billion records, and that is not a good idea.

Using Mark's idea, it would be far cheaper to insert the roughly 225 million records you want to keep into a new table, drop the original table, and then rename the new table to the original name. Reapply the indexes, constraints etc. as needed. Deleting 1.575 billion records would be a much harder task to manage, considering the log bloat and index fragmentation it would cause. Suppose you have 100 bucks in one-dollar bills, and you need to buy something which costs $88. Certainly you would count out the 12 bills you keep and hand the salesman the rest, rather than counting out 88 bills one by one, right?

Inserting 225 million records is not an easy task either, but it seems like a much better option. You can break it into several manageable pieces, like one month's worth of records at a time, provided that you go from the earlier to the later period to preserve the natural order of inserts.

Oleg
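A rough sketch of that copy-and-rename idea follows. The names (dbo.BigTable, CreatedDate) are placeholders, and the real script would also recreate indexes, constraints, triggers and permissions on the new table before the swap, ideally in a maintenance window with writes stopped:

    -- Copy only the rows to keep into a new table.
    -- SELECT...INTO is minimally logged under SIMPLE/BULK_LOGGED recovery;
    -- alternatively, create the table first and INSERT one month at a time.
    SELECT *
    INTO dbo.BigTable_new
    FROM dbo.BigTable
    WHERE CreatedDate >= DATEADD(MONTH, -6, GETDATE());

    -- Recreate indexes, constraints, triggers etc. on dbo.BigTable_new here.

    -- Swap the tables (no writes should hit dbo.BigTable during this step).
    EXEC sp_rename 'dbo.BigTable', 'BigTable_old';
    EXEC sp_rename 'dbo.BigTable_new', 'BigTable';

    -- After verifying the new table, back up and drop the old one.
    -- DROP TABLE dbo.BigTable_old;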

Oleg commented:
@WilliamD Of course it was your answer. I just meant to say that you already provided the alternative option to consider (inserting into a new table, dropping the original and renaming the new one, instead of just deleting the existing data). It just so happened that the link in Mark's answer points to a question which has 2 answers - Mark's and mine. My answer happened to be more suitable to the situation in that question, but I wanted to point out that in this specific question, Mark's idea, which matches your answer, might be better, that is all.
Mark commented:
Then I should have just copied my old answer and pasted it here instead of showing the link! Dang! :o)
Oleg commented:
@WilliamD Partially, I can blame the wonders of asynchronous operations for all of the above. This makes it kinda fun to read later, because it is indeed possible for someone to think that I referenced Mark's answer to **this** question, and thereby gave myself an indirect pat on the back because the answer included the link to my own answer(!) Oh boy :)
WilliamD commented:
@Oleg - exactly - everyone who missed this live is just not going to get it.
WilliamD commented:
@Oleg - just for clarity's sake - it was my answer you were writing about. @Mark's answer pointed to one of your answers on another question here.
ozamora answered:
I'm with @WilliamD here. My recommendation is based on these assumptions:

* You have Enterprise Edition.
* You can sustain the non-availability of old data for a few minutes.

One solution:

* Recreate the table (table_new) the way it is now, partitioned by month.
* Plan on copying over the last 6 months' worth of data. Use batch operations (e.g. 1 day at a time).
* If you are running it today, use a stop date of '10/27/2010'.
* Drop the old table.

Ongoing:

* Plan on switching out old partitions and truncating the data (sliding window).
* Ensure that you add new monthly partitions **before** you hit the new month.
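A hedged sketch of the monthly sliding window, assuming Enterprise Edition; the partition function/scheme names, table names and boundary dates below are invented for illustration:

    -- Monthly partition function and scheme for the rebuilt table.
    CREATE PARTITION FUNCTION MonthlyPF (datetime)
    AS RANGE RIGHT FOR VALUES ('20100501','20100601','20100701',
                               '20100801','20100901','20101001');

    CREATE PARTITION SCHEME MonthlyPS
    AS PARTITION MonthlyPF ALL TO ([PRIMARY]);

    -- Ongoing monthly maintenance:
    -- 1. Add the next month's boundary BEFORE the month starts.
    ALTER PARTITION SCHEME MonthlyPS NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION MonthlyPF() SPLIT RANGE ('20101101');

    -- 2. Age out the oldest month: switch it to an empty staging table,
    --    truncate the staging table, then merge away the old boundary.
    -- ALTER TABLE dbo.BigTable_partitioned SWITCH PARTITION 1 TO dbo.BigTable_staging;
    -- TRUNCATE TABLE dbo.BigTable_staging;
    -- ALTER PARTITION FUNCTION MonthlyPF() MERGE RANGE ('20100501');

The staging table has to match the partitioned table's structure and live on the same filegroup as the partition being switched out, as in Scot's example below.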

Scot Hauder answered:

A common technique to age out old data is to swap an empty partition (table) in for the data you want to delete. The SWITCH only changes metadata, so the billion-plus rows are never physically moved on disk. After they are switched you can back up and drop the history table. There is some overhead in dropping and recreating the clustered index, but there are no deletes and no inserting or moving of data.

    CREATE TABLE [dbo].[LargeTable](
    	[RecordDate] [datetime] NOT NULL,
    	[intCol] [int] NULL,
    	[charCol] [char](10) NULL
    ) ON [PRIMARY]
    
    INSERT LargeTable(RecordDate,intCol,charCol)
    VALUES('4/30/2009',1,'historical')
    ,('5/31/2010 23:59:59.997',2,'historical')
    ,('6/1/2010',3,'keep data')
    ,('10/28/2010',4,'keep data')
    
    -- drop your existing CLUSTERED index here.
    --DROP INDEX blah blah
    
    -- create table identical to your original large table, same columns,
    -- same column order, same column nullability, collation etc
    -- it must contain the same primary key constraint, foreign key constraints, check constraints,
    -- non-clustered indexes, clustered index(we dropped ours above, however),
    -- computed columns etc as the original
    
    -- the original table cannot be referenced by a foreign key in another table. drop these then after
    -- the switch recreate them 
    
    -- the date column used for partitioning should be NOT NULL in both tables
    -- the new historical table must be empty and must be on the same filegroup
    -- as the original table
    CREATE TABLE [dbo].[LargeTableHistory](
    	[RecordDate] [datetime] NOT NULL,
    	[intCol] [int] NULL,
    	[charCol] [char](10) NULL
    ) ON [PRIMARY] -- same filegroup
    
    -- cluster our partition column
    CREATE CLUSTERED INDEX cidx ON [LargeTableHistory] (RecordDate)
    
    CREATE PARTITION FUNCTION [RecordDateRangePF] (datetime)
    AS RANGE RIGHT FOR VALUES ('20100601') -- save everything from 6/1/2010 and later
    
    CREATE PARTITION SCHEME [RecordDateRangePS]
    AS PARTITION [RecordDateRangePF]
    ALL TO ([PRIMARY])
    
    -- create interim clustered index to align table with partition scheme
    CREATE CLUSTERED INDEX cidx ON [LargeTable] (RecordDate) 
    ON [RecordDateRangePS]([RecordDate])
    
    -- do magick -- 
    ALTER TABLE [LargeTable] SWITCH PARTITION 1 TO [LargeTableHistory]
    
    SELECT * FROM [LargeTable]
    SELECT * FROM [LargeTableHistory]
    
    -- drop the interim clustered indexes 
    DROP INDEX [LargeTable].cidx
    DROP INDEX [LargeTableHistory].cidx
    
    -- recreate the real clustered index on original table
    CREATE CLUSTERED INDEX cidx ON [LargeTable] (intCol) ON [PRIMARY]
    
    -- backup first, then drop the historical data
    DROP TABLE [LargeTableHistory]
    -- clean up
    DROP PARTITION SCHEME [RecordDateRangePS]
    DROP PARTITION FUNCTION [RecordDateRangePF]
    -- obviously you won't do this :)
    DROP TABLE [LargeTable]
