Jagd asked

Exclude file type from fulltext index

I have a full-text index on a varbinary(max) FILESTREAM column. Both PDF and XML files are stored in this column, so both are being indexed. However, I don't want to index the XML files, because I'm afraid that over time they'll bloat the index and slow it down. What I would really like is a full-text index that covers only the PDF files. Is there a way to do this? Can I somehow disable the XML filter, perhaps? Thanks in advance for any help!
index full-text
3 comments

Oleg commented
If anything is going to slow down your full-text searches, it would be the iFilter for PDF, not the one for XML. I believe the XML iFilter is pretty efficient, especially compared to the PDF one. It is designed to ignore the XML markup, so you should not have to worry about bloat. The PDF iFilter is also designed to ignore the (PDF) markup, but the horrendousness of the latter should be the bigger concern. Just my 2 cents.
Jagd commented
@Oleg - thanks for your two cents. I had read on quite a few websites (forums and such) that the performance of the Adobe PDF iFilter was abysmal, but I haven't really noticed it myself. I'm indexing in the vicinity of 2500 PDFs, some of them hundreds of megabytes in size, and a full-text search never seems to take more than a second.
Oleg commented
@Jagd Oh, I see. This actually makes sense. The possibility of poor performance only affects the engine ***while*** it is indexing the data: it has to take a blob, feed it to the PDF iFilter to extract the useful text (that is the time-consuming part), index the text, and store the index data. Once this is done, the format of the original blob is irrelevant, because search queries run directly against the FT index. As a matter of fact, if the original PDF contains a bunch of big images that carry no text, those bytes are ignored, and the volume of useful information can be considerably smaller than the size of the original PDF blob. So while the engine might have to work harder when indexing PDFs than other document types, the final result is the same: the actual FT search is just as fast :)
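
To make the indexing-versus-searching distinction concrete, here is a rough T-SQL sketch. The table `dbo.Documents` and its columns `DocumentId` and `Content` are hypothetical names, not taken from the question; the catalog view and the dynamic management function are standard in SQL Server 2008 and later.

```sql
-- Which iFilters are registered for the two formats discussed here?
SELECT document_type, path
FROM sys.fulltext_document_types
WHERE document_type IN (N'.pdf', N'.xml');

-- How many keyword entries does the full-text index hold?
-- This reflects the extracted text, not the size of the stored blobs.
-- dbo.Documents is a hypothetical table name.
SELECT COUNT(*) AS keyword_entries
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.Documents'));

-- A search runs against the full-text index, not against the blobs.
-- DocumentId and Content are hypothetical column names.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(Content, N'"annual report"');
```

Comparing the keyword count before and after loading the XML files would be one way to see how much they actually add to the index.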

1 Answer

ozamora answered
From reading [BOL][1], it seems that you cannot. You might want to isolate the indexable data into its own table and create the full-text index there instead.

[1]: http://technet.microsoft.com/en-us/library/ms187317.aspx
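
A minimal sketch of the separate-table idea, under some assumptions: the database already has a FILESTREAM filegroup, a PDF iFilter is installed, and every object name here (`dbo.PdfDocuments`, `PK_PdfDocuments`, `DocumentCatalog`, the column names) is made up for illustration. XML files would go into a different table that has no full-text index.

```sql
-- Hypothetical table that holds only the PDF files.
-- A FILESTREAM table needs a non-null UNIQUEIDENTIFIER ROWGUIDCOL column
-- with a UNIQUE or PRIMARY KEY constraint.
CREATE TABLE dbo.PdfDocuments
(
    DocumentId    UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL
                  CONSTRAINT PK_PdfDocuments PRIMARY KEY,
    FileExtension NVARCHAR(8) NOT NULL,          -- e.g. N'.pdf'; tells the engine which iFilter to use
    Content       VARBINARY(MAX) FILESTREAM NULL
);

-- Hypothetical catalog name.
CREATE FULLTEXT CATALOG DocumentCatalog;

-- Index only this table; the type column selects the filter per row.
CREATE FULLTEXT INDEX ON dbo.PdfDocuments
(
    Content TYPE COLUMN FileExtension LANGUAGE 1033   -- 1033 = English, an assumption
)
KEY INDEX PK_PdfDocuments
ON DocumentCatalog
WITH CHANGE_TRACKING AUTO;
```

An integer key is usually a better full-text key than a GUID; the primary key on the ROWGUIDCOL column is reused here only to keep the sketch short.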
2 comments

Jagd commented
Yeah, I'd considered doing this as well. Someone on a different forum actually suggested indexing a view that only looks at the PDFs (which I could filter by the file extension column). Thanks for your help.
ozamora commented
A full-text index cannot be created on a regular view; it needs a table (or, in SQL Server 2008 and later, an indexed view). Replication can be a good option, since you can replicate based on a filter. A full-text index can then be added to the replicated table. Both tables can coexist in the same database.
