question

sharon avatar image
sharon asked

Help with writing T-SQL

How would you write a t-sql to pull bad customer address and bad e-mail address of a customer data? If I query to pull customers with no zipcode, they might be international customers and it won't work. It is also tricky what to specify to look for in the query to get bad email addresses. Any idea how to get this right? Thanks!

t-sql
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

TimothyAWiseman avatar image
TimothyAWiseman answered

Without precise definitions of what you call bad, then the address one is hard to address. However, if you have some way of checking for international customers you can look for those without zip codes that are not international to start. Depending on how the addresses are entered in your system, you can probably do this by checking for the country code.

As for bad e-mail addresses, precisely identifying good versus bad ones can be more challenging then it sounds. The easiest way to address this is through regex. http://www.regular-expressions.info/email.html has a regex for a valid e-mail address I have used before. It is not 100% perfect, but it is very close and the site includes a detailed description of when and why it may be wrong.

T-SQL does not directly support regex, but you can use C# or VB.NET to write a CLR function and import that into SQL Server so you can use it that way. Alternately you could write a program to sort through it line by line in any language that works with SQL Server such as Python or Perl. The CLR is probably going to be more computationally effecient, but it requires the proper security setup to be able to use it and for some people writing an entirely external program will take less developer time than writing and importing the CLR.

1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

That's pretty much it. You have to define what "bad" looks like in order to arrive at a WHERE clause.
1 Like 1 ·
Matt Whitfield avatar image
Matt Whitfield answered

+1 to Timothy, and this would have gone in a comment apart from the fact that it didn't fit!

As Timothy correctly points out, you need to know what bad is. Your question isn't really about T-SQL at all, it's about email validation and address validation. So, dropping it into two parts, let's look at email validation.

There are three levels to which you can take it - the first, as described by Timothy answers the question 'does it look like an email address?' The second question to be answered is 'does it exist in a deliverable domain?'. This part involves looking up an MX record for the domain in question. An MX (mail exchanger) record is a specific type of DNS record that, when resolved against the part after the @ in an email address, tells you what mail server to talk to about the address. The third question is 'does that particular email address exist on the target domain?'. This part involves actually connecting to the server, and starting to deliver a message. This part can be made extremely hard by the fact that some SMTP servers won't actually give you an error until the end of the exchange.

In any case, either of those last two steps would be hideous done though SQL, even using a CLR approach, simply because the processing time involved in doing the DNS lookups and SMTP exchange (if you choose to do them) makes it prohibitive. If you want to do either levels 2 or 3, you will want a batch process that validates the addresses offline (i.e. not from within the SQL Server process) and then updates the database when addresses are validated (preferably in batches).

The second part, address validation, is even harder. Email has one address format, defined and adhered to everywhere. Normal addresses have a huge variety of formats, coding schemes and validation methods. You are really really going to be better off getting in a third party piece of software in for this. First of all, the task isn't feasible without interfacing with the mail companies that actually deliver the mail - given that these companies often charge for their data, then you'll have saved money before even doing any coding if you interface with a system written by a company with those agreements already in place. Often, however, the costs of these systems can be prohibitive, and they can sometimes be a right pain to work with. But, again, this is something that you would want to do offline, and update the database in batches when validation completes.

So, after all that waffle, here's the short answer to your question 'How would you write a t-sql...':

I wouldn't.

10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.