question

yaqubonnet avatar image
yaqubonnet asked

Find a string within a string - SQL takes too much time

I have two tables: **Table 1: CustomerEmails:** CREATE TABLE [dbo].[CustomerEmails]( [id] [int] IDENTITY(1,1) NOT NULL, [datecreated] [datetime] NULL, [UID] [nvarchar](250) NULL, [From] [nvarchar](100) NULL, [To] [nvarchar](100) NULL, [Subject] [nvarchar](max) NULL, [Body] [nvarchar](max) NULL, [Dated] [datetime] NULL, CONSTRAINT [PK_CustomerEmails] PRIMARY KEY CLUSTERED ( [id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) **Table 2: CustomerEmailIds:** CREATE TABLE [dbo].[CustomerEmailIds]( [Email] [varchar](200) NULL ) ON [PRIMARY] You may be thinking that why i am not using the EmailId in table CustomerEmails instead of using the emails itself? what i mean is that i can add a column EmailId (INT) to the table CustomerEmailIds & then i can refer that column instead of CustomerEmails.From & CustomerEmails.To?? Guys the issue is the table CustomerEmails is used by some other application & that application just keeps the track of emails sent/received through OutLook. The table CustomerEmailIds have customer emails & these emails come into system from the application i am working on. So the Table CustomerEmails have 7,00,000+ records while table CustomerEmailIds have 1,00,000+ records. I need to find out the emails from table CustomerEmails, based on the emails in CustomerEmailIds table. The query i am using is: SELECT e.* FROM CustomerEmails e WHERE EXISTS ( SELECT Email FROM CustomerEmailIds c WHERE ( ISNULL(e.[From],'') + '/' + ISNULL(e.[To],'') ) LIKE '%'+c.Email+'%' ) Some facts: 1- I am using SQL Server 2008 2- That the CustomerEmails.**To** can contains multiple comma separated emails like: email1@yahho.com, email2@yahho.com, email3@yahho.com 3- Because of the fact2 the c.Cs3Emails+'%' OR to= c.Cs3Emails will not list the desired results that's why i am using '%'+c.Cs3Emails+'%' Guys the above query return wrong results..... & i don't know why?? But below query works fine: SELECT e.* FROM CustomerEmails e WHERE (ISNULL(e.[From],'') + '/' + ISNULL(e.[To],'')) LIKE '%email1@gmail.com%'
sql-server-2008t-sqlstring
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Kev Riley avatar image Kev Riley ♦♦ commented ·
I'm confused : you mention `CustomerEmails` and `CustomerEmailIds` , but your script refers to `Emails` and `Cs3EmailsForPurge`
0 Likes 0 ·
yaqubonnet avatar image yaqubonnet commented ·
Sorry for that, by mistake the wrong table names were used there. Actually i am trying different things and.. that's why i put the wrong table names there.
0 Likes 0 ·
Kev Riley avatar image
Kev Riley answered
I have tried to mock up your scenario, but fail to get the incorrect results? What are the 'wrong results'? if exists(select OBJECT_ID from sys.objects where type = 'U' and name = N'emails') drop table Emails create table emails (EmailFrom varchar(200), EmailTo varchar(200)) if exists(select OBJECT_ID from sys.objects where type = 'U' and name = N'cs3emailsforpurge') drop table cs3emailsforpurge create table cs3emailsforpurge (cs3emails varchar(200)) insert into emails(EmailTo) select 'email1@yahho.com, email1@gmail.com, email2@yahho.com, email3@yahho.com' insert into cs3emailsforpurge select 'email1@gmail.com' select * from emails e where (ISNULL(e.EmailFrom,'') + '/' + ISNULL(e.EmailTo,'')) LIKE '%email1@gmail.com%' SELECT e.* FROM Emails e WHERE EXISTS ( SELECT Cs3Emails FROM Cs3EmailsForPurge c WHERE ( ISNULL(e.EmailFrom,'') + '/' + ISNULL(e.EmailTo,'') ) LIKE '%'+c.Cs3Emails+'%' )
8 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Kev Riley avatar image Kev Riley ♦♦ commented ·
so all your emails matched %.%
1 Like 1 ·
yaqubonnet avatar image yaqubonnet commented ·
Yes & i have resolved the issue by deleting all the invalid emailIDs, so now all the emailIDs are in this format EmailId@Domain.Com Thanks Kev, you guided me towards the solution.
1 Like 1 ·
yaqubonnet avatar image yaqubonnet commented ·
Thanks for your reply. Yes you are right it works for small number of records but not in my case!! By wrong results i mean that it returns all the records from CustomerEmails table whether or not the email exists in the table CustomerEmailIds.
0 Likes 0 ·
Kev Riley avatar image Kev Riley ♦♦ commented ·
maybe you have values in your CustomerEmailIDs table that could match any or all of the CustomerEmails rows - e.g. '@'
0 Likes 0 ·
yaqubonnet avatar image yaqubonnet commented ·
hmmm... i have checked but found no entry starting from @. all the emails are valid emails like email1@google.com
0 Likes 0 ·
Kev Riley avatar image Kev Riley ♦♦ commented ·
it doesn't have to be '@' - what is the value of the shortest email?
0 Likes 0 ·
yaqubonnet avatar image yaqubonnet commented ·
Thanks Kev Riley, great thanks..... Actually the shortest email length is 1 & it is '.' great thanks again for your valuable time.
0 Likes 0 ·
Kev Riley avatar image Kev Riley ♦♦ commented ·
Happy to have helped!
0 Likes 0 ·
Kevin Feasel avatar image
Kevin Feasel answered
**Update** - Right when I posted this, I saw your response to @Kev Riley, so I removed the first paragraph. Fundamentally, there are several problems here. This is a bit of a rant, but an important one. I know that you said that the data is coming from a different system, but that doesn't make the following any less important. First, you are searching within a string. This makes it very likely that your table is not in first normal form: your e-mail attribute is not atomic. Second, you are using a non-SARGable query. When you do % + value + %, indexes will be of no help and you need to scan every record to find a value. This really is an implication of point #1, and is a very good case for why normalization makes things faster. You mentioned that your table has 7 million records. I'm guessing, based on the rest of this, that there probably aren't any indexes aside from the primary key, so every time you do a search, it needs to scan 7 million rows. And with an EXISTS clause, you could be scanning up to 7 million rows up to 1 million times, because each row in the outer part of your query (the Emails table) needs to do a scan of the exists part (Cs3Emails). Having the exists clause will kill the search once it finds the first record, so that's a little bit better than not stopping, but it's still a horrible performance sink. We can probably help you figure out what's going on and get your system to limp along, but the setup in place now is like shooting yourself in the foot several times and complaining that your revolver's empty. Unfortunately, unless you change the way that e-mail addresses are stored there, you won't get the kind of performance you want. If you do change it so that you have a good data model (which would incorporate somehow a detail table for mail recipients), the query becomes trivial and performance will increase significantly. Even if the other application needs to remain in place, exactly as it was before, it would be beneficial to consider its results "staging" data and break it out into a normalized table structure, treating that as the real data.
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Fatherjack avatar image Fatherjack ♦♦ commented ·
700k, not 7M but points still stand :) +1
0 Likes 0 ·
Kevin Feasel avatar image Kevin Feasel commented ·
True. I saw two commas and went Dr. Evil...
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.