question

guitar_guy avatar image
guitar_guy asked

SQLexpress 2008 doesn't recognize my U0242 (glottal stop)

Hi, I have a problem that is driving me crazy. No matter how I form the query (or what the collation), I can't seem to get the glottal stop character to be searchable. I've tried escaping it, changing collations, changing codepages, custom stoplists, etc. and can't seem to find a solution. Any ideas? Thanks!
sql-server-express
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Can you give us some examples of queries you have tried? Are you using full text indexing?
0 Likes 0 ·
Hi, I am using fulltext indexing on some columns, but it still fails on columns that are not fulltext. So I have tried: Select * from Table WHERE Column LIKE N'ɂ'(0 rows) or N'ɂ%' (all rows) And on the Fulltext column I have tried: Select * from Table WHERE CONTAINS(Column,N'ɂ') and with wildcards, etc. I also tried escaping the character, using the unicode position, etc. Note: I *can* use a REPLACE query to replace the character, so SQL knows it's in there!
0 Likes 0 ·
So it seems that U0242 and U0241 - lowercase and uppercase glottal unicode glottal stops, do not seem to recognized as proper characters in MS SQL 2008. Does anyone know if newer versions of MS SQL server support these characters properly?
0 Likes 0 ·
Oleg avatar image
Oleg answered
@guitar_guy The query you try: Select * from Table WHERE Column LIKE N'ɂ' is supposed to return zero rows because there are no wildcards and therefore the LIKE acts like equal sign, meaning you are trying to find rows which have a value of ɂ exactly. The way the glottal stop character is stored depends on how it was inserted. If the column is NVARCHAR column (2 bytes per letter) then the actual value for glottal stop will be either 0x4202 in case if N'' was used when inserting or 0x3F00 if it was not (sadly, 0x3F is just a regular question mark so if the N'' was not used then the value is lost as it was translated to regular question mark). If the column is VARCHAR column then the value is always 0x3F for that character. Here is the sample to show the difference (please note the 2 different ways used to insert the values): declare @t table (col1 nvarchar(3)); insert into @t values (N'A'), -- ɂ not present (N'Bɂ'), -- begins with ɂ inserted as nvarchar (N'ɂC'), -- ends with ɂ inserted as nvarchar (N'DɂE'), -- contains ɂ inserted as nvarchar ('Fɂ'), -- begins with ɂ inserted as varchar, ɂ = ? due to info loss ('ɂG'), -- ends with ɂ inserted as varchar, ɂ = ? due to info loss ('HɂK'); -- contains ɂ inserted as varchar, ɂ = ? due to info loss select col1, cast(col1 as varbinary(4)) stored_as from @t; -- results: col1 stored_as ---- ---------- A 0x4100 Bɂ 0x42004202 ɂC 0x42024300 DɂE 0x44004202 F? 0x46003F00 ?G 0x3F004700 H?K 0x48003F00 Please note that the glottal stop is either inserted correctly as 0x4202 or as a plain question mark 0x3F00 depending on whether N'' was used or not. Unfortunately, using **LIKE '%ɂ%'** returns incorrect rows (the ones with regular question mark) and using **LIKE N'%ɂ%'** returns all rows as you already discovered. The workaround is to use something else in place of LIKE. For example, the following will do the trick: select * from @t where charindex(0x4202, cast(col1 as varbinary(max))) > 0; -- results are correct (rows 2, 3, and 4 are returned): col1 ---- Bɂ ɂC DɂE You don't have to cast to varbinary(max) but can cast to varbinary(your\_column\_size) instead. **EDIT** Actually, I just tried the binary collation and it works just fine, so it would be much easier than casting to varbinary: select * from @t where col1 like N'%ɂ%' collate SQL_Latin1_General_Cp850_BIN; -- results are also correct (rows 2, 3, and 4 are returned): col1 ---- Bɂ ɂC DɂE Hope this helps. Oleg
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

guitar_guy avatar image
guitar_guy answered
Thanks, Oleg. Actually, I found that it turns out there are 3 unicode code positions for glottal stop: u294, u241, and u242 (don't recall the hex values). First two are uppercase, and 242 is lowercase, but belongs with 241. These are new(er) than u294. You are bang on about the collations. I also found that using Latin_100 also worked. Anything in the "100" collations are unicode 5.0 compliant, so that's what I ended up using. BINs also work, as you found. Cheers, Chris
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.