I have two tables:
Table 2: CustomerEmailIds:
You may be thinking that why i am not using the EmailId in table CustomerEmails instead of using the emails itself? what i mean is that i can add a column EmailId (INT) to the table CustomerEmailIds & then i can refer that column instead of CustomerEmails.From & CustomerEmails.To?? Guys the issue is the table CustomerEmails is used by some other application & that application just keeps the track of emails sent/received through OutLook. The table CustomerEmailIds have customer emails & these emails come into system from the application i am working on.
So the Table CustomerEmails have 7,00,000+ records while table CustomerEmailIds have 1,00,000+ records.
I need to find out the emails from table CustomerEmails, based on the emails in CustomerEmailIds table.
The query i am using is:
Guys the above query return wrong results..... & i don't know why??
But below query works fine:
I have tried to mock up your scenario, but fail to get the incorrect results?
What are the 'wrong results'?
answered Jun 23, 2011 at 02:55 AM
Kev Riley ♦♦
Update - Right when I posted this, I saw your response to @Kev Riley, so I removed the first paragraph.
Fundamentally, there are several problems here. This is a bit of a rant, but an important one. I know that you said that the data is coming from a different system, but that doesn't make the following any less important.
First, you are searching within a string. This makes it very likely that your table is not in first normal form: your e-mail attribute is not atomic.
Second, you are using a non-SARGable query. When you do % + value + %, indexes will be of no help and you need to scan every record to find a value. This really is an implication of point #1, and is a very good case for why normalization makes things faster. You mentioned that your table has 7 million records. I'm guessing, based on the rest of this, that there probably aren't any indexes aside from the primary key, so every time you do a search, it needs to scan 7 million rows. And with an EXISTS clause, you could be scanning up to 7 million rows up to 1 million times, because each row in the outer part of your query (the Emails table) needs to do a scan of the exists part (Cs3Emails). Having the exists clause will kill the search once it finds the first record, so that's a little bit better than not stopping, but it's still a horrible performance sink.
We can probably help you figure out what's going on and get your system to limp along, but the setup in place now is like shooting yourself in the foot several times and complaining that your revolver's empty. Unfortunately, unless you change the way that e-mail addresses are stored there, you won't get the kind of performance you want. If you do change it so that you have a good data model (which would incorporate somehow a detail table for mail recipients), the query becomes trivial and performance will increase significantly.
Even if the other application needs to remain in place, exactly as it was before, it would be beneficial to consider its results "staging" data and break it out into a normalized table structure, treating that as the real data.