Running a query twice, the second one has less IO/cost?

mbourgon 2014-01-29 14:44:31

link textI'm trying to limit the size/time a web query runs to avoid timeouts. Because of the amount of data queried, I run a count first ("select @mycount = count(*)" ). If the count is below our threshold, i run the query again, but with the ultra-wide SELECT that returns all the fields they want from the view and table (view is not indexed; seeing if I can do that). I do it this way to hopefully prevent the extra IO doing the key lookups. I can't think of a way to do it once – a CTE would work, but I can't do the "if less than x rows" with a CTE. Tossing it a temp table would again incur the lookup cost.

But here's the thing I'm actually posting about: even though both queries are the same (aside from the SELECTs), the query costs are vastly different. I'd expect the second to be more expensive, not less, since it's actually looking up all the various key rows. Thoughts? There are no functions in there, though I am using Erland's dynamic WHERE code (and field_a = @field_a_param or @field_a_param is NULL).

Thanks, all!
(plan_anonymized is the new plan, run within Plan Explorer Pro)

D. Ames 2014-01-29 15:30:50
For simplicity:
The first query forces the engine lookup a reference to all columns.
The second one you explicitly told it which ones and it doesn't need to.
This is by design.

Take a look at this article as well.

mbourgon 2014-01-29 16:12:44
Does it ref all columns? The code is literally:

select @returncounts = count(*) from […]
whereas the second one is
select from […]

Aaron Bertrand 2014-01-29 15:40:42
Well, I can't see the query because the plan has been anonymized, but a couple of things you could consider:

  1. You could consider limiting the rows returned to some TOP (@threshold) in all cases. If you return a million rows to the client, are they going to use all the rows anyway? And if they have a million rows with a subset of the data, is that more useful than, say, a thousand or ten thousand rows, with all of the data?
  2. It's not clear if the timeouts are caused because the query ran too long or because it took too long to transfer the data or because the web application couldn't consume the data fast enough. Can you clarify how quickly the query runs on its own, local on the SQL Server, to see if the problem is with the query or external? (You could also eliminate the data transfer variable by generating an actual plan in Plan Explorer – this anonymized plan seems to be missing duration and other runtime metrics, which sounds like you generated the plan in Management Studio and then opened it in Plan Explorer.)
  3. The only standout item to me is the clustered index scan in the first statement has a huge discrepancy in actual vs. estimated rows. While the estimated cost of this statement is very low relative to the rest of the batch, remember that those estimated cost % numbers are based on the estimated rows, and when it's way off, you need to factor that in. You really shouldn't base any decisions or conclusions on those estimated cost % numbers. They are merely a guideline to help point you where to look closer – the real test of which query is "more expensive" is running them and measuring runtime metrics such as duration, CPU, reads, etc. There are some minor discrepancies in statements three and four as well. Are stats up to date?
  4. Since you're using conditional logic in your WHERE clause, you may be getting a very sub-optimal plan some, or a lot, of the time. If these things can vary wildly, I would consider either (a) using dynamic SQL instead, and since you are on SQL Server 2008, you can combine this with the server setting optimize for ad hoc workloads – which will prevent all of the single-use versions of the query from bloating the plan cache, or (b) experimenting with adding OPTION (RECOMPILE) to the statement – this will force a new plan every time, which might avoid the sub-optimal ones based on previous parameters, at the cost of compiling the plan every time (which may or may not be tolerable in your case).
mbourgon 2014-01-29 16:33:33
Thanks for the help, Aaron!
1) The potential for it to return 1m rows is why we're doing "select @returncounts = count(*)" then "if @returncounts <500". I guess theoretically I could do select top 501, and if there's 501 then return an error... but then we have to include ORDER logic, which would slow it down. 2) It's not actually hitting a timeout in this example - but we have it happen when a frequent customer requests a huge order list. Hence the 500. But yes, I can add the actual plan here. I'll modify my post. The problem is usually with the query. They'll happily request 3 months' of data for dozens of locations, which will cause a timeout. 3) I saw that too. The table itself is only 65m rows, 22gb (with another 24gb of indexes), and it's partitioned, so every morning we do a partial stats update as part of the process. A 1% takes 5 minutes or so, but is up to date. I may have to look and see if there's space in our window for a FULLSCAN. But that also means that, if the table's only 22gb, that I might be able to add a covering index, especially if it's compressed. 4) Alas, you may be right. In the past I've been (ahem) vehement with the devs about using dynamic SQL (I know, these days there's not much of a hit), so I dislike being the one using it… call it DBA stubbornness. Also, and this is the bigger deal: customer facing web-site, and I have no insight as to how well it's protected against injection, and didn't want to rewrite the SP to deal with it. All that being said, as per Erland's article, I do have an OPTION (RECOMPILE) on both of the major statements, specifically to make sure we get as good a plan as possible. Adhoc isn't turned on, but only 2% of the cache is used by adhoc plans.
Aaron Bertrand 2014-01-29 18:03:04
Re: ad hoc, my suggestion was because if you change this to be dynamic SQL instead of one big conditional query with recompile, your ad hoc portion of the cache will most certainly ramp up.