question

PeterH avatar image
PeterH asked

Compare unstructured text to indexed field - most elegant way?

On a client workstation, I have just extracted a page of text from a PDF via OCR and stored the results as a CSV. If I parse the file into words, it's probably around 300 to 400 words -- around 1200 to 1800 characters, on average.

In an Internet-facing (non-local) SQL/Server 2008 database I have an indexed field called name. The name could contain one or more words. I could have hundreds or thousands of unique names (probably not more than 1000 in most cases).

I need the most efficient way to find all names that are contained in my CSV.

Since SQL/Server is not local, my idea is to pass the extracted data as a text field to a stored procedure. Then write an SP to the text into words, check if any of my names begin with that word -- and if so get the whole name and see if the next words in the name match. This means running around 300 to 400 simple queries, which should be fast on the server.

Anther idea would be to insert the list of words in my CSV into a SQL/Server table and then use a query to compare the fields for matches.

Or I could pull down all my names to the workstation into an array and do the matching there.

Is there a more elegant way than the ones I listed to accomplish my objective?

sql-server-2008
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

graz avatar image graz ♦ commented ·
Can you give us a few sample rows? And what you'd like the output to look like? Please update your question with this information.
0 Likes 0 ·

1 Answer

·
Scot Hauder avatar image
Scot Hauder answered

Internet-facing SQL Server?..Hope it's on your DMZ behind a firewall...

This assumes no whitespace before or after the (one or more words) that make up a name, only commas, and that there is no text qualifier--meaning there are no commas within each group of words that make up a name. It will use an index scan but should be relatively fast with such few rows.

            
CREATE PROC dbo.GetNameMatches            
     @PDFText varchar(max)            
AS            
            
SET NOCOUNT ON            
            
SELECT *            
FROM NameTable            
WHERE @PDFText LIKE '%,' + Name + ',%'            
            
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.