Optimizing a Delete Operation

Paul McHenry 2017-03-03 16:31:59

Hello,

I'm looking for some advice, please. I've got a nightly job which runs and deletes data from a table for values which are older than a specified number of weeks. The number of rows to be deleted varies, but it can be upwards of 800,000. The table is also physically quite large, roughly 40GB, because some NVARCHAR(MAX) columns hold a lot of text.

The query deletes rows from table 1, joining to table 2 on table 2's primary key, which is a non-key column in table 1, and then filters on a date from table 2. The majority of the query cost is in checking table 2, which has some 50 million rows.

I've tried playing around with it a bit by adding and removing different indexes, updating statistics, adding statistics, and other things such as removing the function from the WHERE predicate in favor of a < [date] comparison, but I haven't really been able to improve the query all that much. There is what appears to be a suitable non-clustered index on table 2, but the query opts for a clustered index seek. I've tried forcing it to use the non-clustered index with a hint, but this results in a worse plan, and for some reason the estimates for other operations become completely inaccurate.

Ultimately my main goal is to reduce blocking, as the query runs for a while and causes a lot of blocking whilst it runs. It's currently running in increments of 1,000, but I'm happy to (and likely will) change this to a higher number once I figure out the best way to optimize the query. I've also looked at articles on fast ordered deletes, derived tables, etc., but when trying to use those techniques I've run into problems because of the join, and I get an error about not being able to delete because it would modify more than one table. Any advice would be appreciated. Thanks.
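
For reference, the delete is shaped roughly like this (simplified; the 14-week cut-off shown here is illustrative, since the real job parameterizes the number of weeks and loops in increments):

DECLARE @cutoff datetime = DATEADD(WEEK, -14, GETDATE());

DELETE TOP (1000) h
  FROM dbo.msg_history AS h         -- "table 1": the 40GB table with the NVARCHAR(MAX) columns
  INNER JOIN dbo.ts_booking AS b    -- "table 2": ~50 million rows
    ON b.id_booking = h.context_id  -- table 2's primary key, a non-key column in table 1
  WHERE b.visit_date < @cutoff;     -- repeated until no qualifying rows remain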

Paul McHenry 2017-03-07 12:02:57
Hey Aaron,

Thanks for the suggestions. I'll keep at it and hopefully get somewhere.

I tried the query as you suggested above, but it doesn't return any rows, as it just selects the top 1,000 rows in the ts_booking table without considering whether those rows exist in the msg_history table, which unfortunately they don't. The ts_booking table is around 50 million rows, whereas the msg_history table is just a fraction of that, at about 1.5 million.

Aaron Bertrand 2017-03-07 16:30:52
I see, I didn't quite grasp the numbers. You could consider adding an alias (AS tsb) inside the CTE, and then AND EXISTS (SELECT 1 FROM dbo.msg_history WHERE context_id = tsb.id_booking) – that will be more expensive, of course. Another thing you could do is traverse the entire table: pick 1,000 rows at a time; if you get some rows this time, great, and if not, try again. You just need to keep track of where your scan has reached as it proceeds. But each individual check shouldn't be too bad in terms of picking the 1,000 rows, especially if a good index supports it – I still think the biggest cost is the actual delete (in spite of what the estimates might say).
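
The EXISTS version would look roughly like this (a sketch only; the @context_type declaration here is just a placeholder for however the real job supplies that value):

DECLARE @cutoff datetime = DATEADD(WEEK, -14, GETDATE());
DECLARE @context_type int = 1;  -- placeholder type/value

;WITH cte AS
(
  SELECT TOP (1000) tsb.id_booking
    FROM dbo.ts_booking AS tsb
    WHERE tsb.visit_date < @cutoff
      -- only pick bookings that actually have history rows to delete
      AND EXISTS (SELECT 1 FROM dbo.msg_history WHERE context_id = tsb.id_booking)
    ORDER BY tsb.id_booking
)
DELETE h
  FROM dbo.msg_history AS h
  INNER JOIN cte
    ON h.context_id = cte.id_booking
  WHERE h.context_type = @context_type;
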
Aaron Bertrand 2017-03-07 01:19:49
That you used an index hint and wound up with a worse plan confirms that SQL Server was making the right choice without your "help." 🙂

I'd be curious to see the plan (and the wait statistics!) resulting from this slight restructuring of the query.

DECLARE @cutoff datetime = DATEADD(WEEK, -14, GETDATE());
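-- @context_type is assumed to be declared/parameterized elsewhere in the job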
 
;WITH cte AS 
(
  SELECT TOP (1000) id_booking
    FROM dbo.ts_booking 
    WHERE visit_date < @cutoff
    ORDER BY id_booking
)
DELETE h
  FROM dbo.msg_history AS h
  INNER JOIN cte
  ON h.context_id = cte.id_booking
  WHERE h.context_type = @context_type;

I'm not altogether surprised that things like adding statistics and messing with indexes haven't helped. After all, the heavy work involved in a delete is actually logging all of the delete information, which isn't really accounted for in the cost-based plan (in particular, it has no idea whether the logging will actually be more expensive than the work done to the index/data pages, say, for example, if your log file is on a slower or more burdened drive).
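
If you want a rough sense of whether log writes are part of the pain, something like this can help (it's instance-wide and cumulative since the last restart or clear, so treat it as a hint rather than proof):

-- Log-related waits accumulated on the instance (not specific to the delete job)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
  FROM sys.dm_os_wait_stats
  WHERE wait_type IN (N'WRITELOG', N'LOGBUFFER')
  ORDER BY wait_time_ms DESC;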

Given the LOB data you are trying to delete, I would say you might get more mileage out of reducing your batch size (say, to 100 or 500 instead of 1,000), running it more frequently, and seeing whether that has a net positive effect on the sum of the operations (as opposed to any individual operation).
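
For instance, something along these lines (a sketch only; the batch size, the delay between batches, and the @context_type placeholder are all things to tune or replace with whatever the real job uses):

DECLARE @cutoff datetime = DATEADD(WEEK, -14, GETDATE());
DECLARE @context_type int = 1;  -- placeholder type/value

WHILE 1 = 1
BEGIN
  ;WITH cte AS
  (
    SELECT TOP (500) id_booking
      FROM dbo.ts_booking
      WHERE visit_date < @cutoff
      ORDER BY id_booking
  )
  DELETE h
    FROM dbo.msg_history AS h
    INNER JOIN cte
      ON h.context_id = cte.id_booking
    WHERE h.context_type = @context_type;

  -- stop once a batch deletes nothing; if many old bookings have no matching
  -- history rows, the EXISTS variant discussed above avoids empty batches
  IF @@ROWCOUNT = 0 BREAK;

  WAITFOR DELAY '00:00:02';  -- short pause so blocked sessions can get through
END;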

Another thing you could consider is storing the wider columns in a separate table, and marking those rows for "soft delete" during this operation. Instead of all the logging required for the LOB deletes, you just update a flag in that table saying the row is no longer valid, and you can have a different job that comes around and cleans this up at a different frequency, only during maintenance windows, etc. Views can be used to make these things look the same to existing applications, and filtered indexes can be used to prevent the invalid rows from impacting your most crucial operations.
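
As a very rough sketch of that idea (every name here is hypothetical, including the msg_id key column; it just shows the moving parts: a side table for the wide columns, a soft-delete flag, a filtered index, and a separate physical cleanup):

-- Hypothetical side table holding the wide LOB column(s)
CREATE TABLE dbo.msg_history_body
(
    msg_id     int            NOT NULL PRIMARY KEY,  -- hypothetical key back to dbo.msg_history
    body       nvarchar(max)  NULL,
    is_deleted bit            NOT NULL DEFAULT 0
);

-- Filtered index so the most important queries only ever touch "live" rows
CREATE INDEX IX_msg_history_body_live
    ON dbo.msg_history_body (msg_id)
    WHERE is_deleted = 0;

-- Nightly job: soft delete, i.e. flip a bit instead of logging LOB deletes
UPDATE b
   SET is_deleted = 1
  FROM dbo.msg_history_body AS b
  INNER JOIN dbo.msg_history AS h ON h.msg_id = b.msg_id      -- hypothetical join key
  INNER JOIN dbo.ts_booking  AS t ON t.id_booking = h.context_id
  WHERE t.visit_date < DATEADD(WEEK, -14, GETDATE())
    AND b.is_deleted = 0;

-- Separate cleanup job (maintenance window): physically remove flagged rows in batches
DELETE TOP (1000)
  FROM dbo.msg_history_body
  WHERE is_deleted = 1;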