Trying to create persisted computed column to avoid non SARGable SUBSTRING in query

Andre Ranieri 2013-10-22 19:08:05

I'm looking at some ugly legacy code that includes a SUBSTRING operator in the WHERE clause:

WHERE ER.TermDate IS NULL AND SUBSTRING(ER.EmpID, 4, 4) = @Tech;

I'm experimenting in the sandbox to determine whether or not i can create a persisted, computed column called MTLogin which has a formula of (substring([EmpID],(4),(4)))

I then modified the WHERE clause to read WHERE ER.TermDate IS NULL AND MtLogin=@Tech; and created a NC index on the two columns.

However, the optimizer is not using my NC index and the index seek (Node 6) is showing the substring in the predicate.

Is my solution a valid one for improving SARGability for this query, or is there a fundamental issue with my line of thinking?

Thanks,

Andre Ranieri

Dan Holmes 2013-10-22 19:16:40
How did you create the NC index? Did it cover all the columns in the ER table requested by the SELECT?
Andre Ranieri 2013-10-22 19:42:50
Dan:

Thanks for taking the time to review my plan.

Here's an SSMS output of my NC index, which should cover everything in the SELECT clause.

I'll also try to upload an update of my plan, I widened the other index so the second key lookup is gone.

Thanks,

Andre Ranieri

/** Object: Index [ix_tblMainEmployeesRemote_TerminationDate_MTLogin] Script Date: 10/22/2013 12:42:42 PM **/
CREATE NONCLUSTERED INDEX [ix_tblMainEmployeesRemote_TerminationDate_MTLogin] ON [dbo].[tblMainEmployeesRemote]
(
[TerminationDate] ASC,
[MTLogin] ASC
)
INCLUDE ( [FirstName],
[Mobile],
[Beeper],
[Subcontractor],
[LastName],
[EmployeeLogin]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Aaron:

I inadvertently lifted the hood on an eight year old horror show with this legacy query. I will absolutely see about moving the CASE statements out of the query and into the app layer.

I'll post the table DDL here for the next couple hours:

https://dl.dropboxusercontent.com/u/37269919/tblMainEmployeesRemote.sql

Here's an updated .sqlplan file (the second key lookup is gone after widening the index for the other table)

https://dl.dropboxusercontent.com/u/37269919/tmp52D4.sqlplan

The termination date NC index only covers the one column. However, experimenting on the sandbox server I disabled all the NC indexes except the one above, and got a plan featuring a clustered index scan. I tried using a FORCESEEK hint on the table (which I would never advocate in production) and the query processor errored out with a message that it could not produce a query plan because of the hints in the query. This last bit intrigued me.

Is there something about the computed / persisted nature of the MTLogin char(4) column that means it can't be indexed?

Thanks,

Andre Ranieri

Aaron Bertrand 2013-10-22 19:49:03
That index doesn't include several of the columns required in the original query, making it less likely to be useful without key lookups (and you already have a couple of those). Since a skinnier index is available (I assume the _TerminationDate index only has the one column), it is more efficient for SQL Server to seek on that index, and use a key lookup to pull the rest of the data. Seeking on a wider index only to pull from the clustered index anyway is far less efficient. Can you please provide the full table structure and all of the CREATE INDEX statements? Can you also explain if some of these CASE expressions and other NULL handling could be handled at the client instead of in the query?
Aaron Bertrand 2013-10-22 20:36:59
You have way too many moving parts here for me to follow. Should I post a much simpler table structure with an indexed persisted computed column to prove that SQL Server will choose to seek in the case where the index actually covers the query? Your main problem is that your index doesn't cover the query, and you haven't found a combination yet to allow SQL Server to seek + lookup instead of seek on other indexes or just scan. Also how many rows in the table, how many rows come back from this query, and how long does the query take to run? The plan XML says "Good enough plan found" so this leads me to believe that the query isn't actually a performance problem and perhaps you are over-optimizing the wrong things?
Aaron Bertrand 2013-10-23 01:33:10
Here is a very simple example that is at least somewhat similar to your example. This demonstrates that (a) you can get a seek on an index even when one of the columns is computed and persisted, and (b) the reason you're not getting a seek is most likely because the columns you're selecting in the query (or in the filter) are not covered by the index you created.

First, create this table and index:

USE tempdb;
GO
CREATE TABLE dbo.tblMainEmployeesRemote
(
  ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  TerminationDate DATETIME, 
  FirstName VARCHAR(32), 
  LastName VARCHAR(32), 
  Phone VARCHAR(20), 
  Username VARCHAR(32), 
  MTLogin AS SUBSTRING(Username, 4, 4) PERSISTED
);
 
CREATE INDEX test1 ON dbo.tblMainEmployeesRemote(TerminationDate, MTLogin) 
 INCLUDE (FirstName, LastName);
GO

Now, insert enough data for SQL Server to care about:

SET NOCOUNT ON;
GO
INSERT dbo.tblMainEmployeesRemote
(TerminationDate, FirstName, LastName, Phone, Username) VALUES
(NULL,      'Bob', 'Frank', '4013335541', 'MyUserNameSmellsFunny'),
(NULL,      'Sam', 'Mitch', '4013335542', 'MyRealNameLooksFunny'),
(NULL,      'Liz', 'Harry', '4013335543', 'MyDadsNameFeelsFunny'),
(NULL,      'Fit', 'Luper', '4013335544', 'MyMomsNameSoundsFunny'),
(GETDATE(), 'Doc', 'Thist', '4013335545', 'MyDogsNameIsFunny'),
(GETDATE(), 'Bip', 'Jones', '4013335546', 'MyCserNameIsFunny');
GO 1000

Now, turn on actual execution plan and run this, or generate an actual execution plan from Plan Explorer:

SELECT TerminationDate, FirstName, LastName, MTLogin -- all cols covered
  FROM dbo.tblMainEmployeesRemote
  WHERE TerminationDate IS NULL AND MTLogin = 'sern';
 
SELECT TerminationDate, FirstName, LastName, MTLogin -- all cols covered
  FROM dbo.tblMainEmployeesRemote
  WHERE TerminationDate IS NOT NULL AND MTLogin = 'sern';
 
SELECT TerminationDate, FirstName, LastName, MTLogin,
    Username, Phone -- two additional columns not covered by the index
  FROM dbo.tblMainEmployeesRemote
  WHERE TerminationDate IS NULL AND MTLogin = 'sern';
 
SELECT TerminationDate, FirstName, LastName, MTLogin,
    Username, Phone -- two additional columns not covered by the index
  FROM dbo.tblMainEmployeesRemote
  WHERE TerminationDate IS NOT NULL AND MTLogin = 'sern';

My results are attached. The first two use an index seek, the second two don't, because it is more efficient for SQL Server to use a scan to pull the other non-covered columns than to perform a seek and 1,000 lookups.

Andre Ranieri 2013-10-23 03:46:21
Aaron:

Thanks for taking the time to work with me on this. Your example worked perfectly, for every reason you explained above.

After going through your example I turned back to my own code, which was still not using the NCI. This really intrigued me. After much experimenting I discovered that the deprecated SET CONCAT_NULL_YIELDS_NULL OFF; in the query code was preventing the plan optimizer from selected the NCI under any circumstance, even with the FORCESEEK hint. Turning SET_CONCAT_NULL_YIELDS_NULL back on resolved the issue for me.

I'll see if I can find any BOL or MSDN materials on this tomorrow.

Cheers,

Andre Ranieri

Aaron Bertrand 2013-10-23 03:53:49
Ok, so, you set concat to the proper setting, and then your NCI was used? With or without the FORCESEEK hint? Does the index cover the rest of the query, or did you get a seek + lookup? (Using FORCESEEK may yield you a seek, but there is probably a good reason SQL Server won't pick that without the hint – a seek isn't always the most efficient path, even when this concat setting issue isn't involved.)
Andre Ranieri 2013-10-23 14:26:51
Once the concat setting was back to normal, the optimizer chose the fully covering non-clustered index without the FORCESEEK hint.

Yesterday I experimented with adding ISNULL() and COALESCE() to the computed column formula, but the optimizer refused the covering NCI in both cases unless I removed the computed column.

I wouldn't normally use hints in code, especially in production, but in this case FORCESEEK was helpful in performing root cause analysis.

Thanks for all your help in figuring this out.

Cheers,

Andre Ranieri