question

RonaldJJames avatar image
RonaldJJames asked

Collation and quote/hypen characters

I’m using SQL Server 2008 R2 with a default collation of SQL_Latin1_General_CP1_CI_AS and I’m investigating sort ordering using this collation. It seems that the ‘ character (ascii 39) and hyphen (ascii 45) don’t sort consistently depending on where they are in the string. For example: CREATE TABLE AsciiTestNonUnique ( Id INT IDENTITY(1, 1) ,Name NVARCHAR(10) ) INSERT INTO AsciiTestNonUnique VALUES ( 'A''A' ) INSERT INTO AsciiTestNonUnique VALUES ( 'A+A' ) INSERT INTO AsciiTestNonUnique VALUES ( 'A-A' ) INSERT INTO AsciiTestNonUnique VALUES ( 'A0A' ) INSERT INTO AsciiTestNonUnique VALUES ( '_''_' ) INSERT INTO AsciiTestNonUnique VALUES ( '_+_' ) INSERT INTO AsciiTestNonUnique VALUES ( '_-_' ) INSERT INTO AsciiTestNonUnique VALUES ( '_0_' ) SELECT ID , Name ,ASCII(SUBSTRING(name, 2, 1)) FROM AsciiTestNonUnique ORDER BY Name Result: 5 _'_ 39 7 _-_ 45 6 _+_ 43 8 _0_ 48 2 A+A 43 4 A0A 48 1 A'A 39 3 A-A 45 Can anyone explain this behaviour? Many Thanks Ron James
collationascii
1 comment
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

This site runs off of votes. Please mark all helpful answers by clicking on the thumbs up next to those answers. If any one answer lead to a solution, please indicate that by clicking on the check mark next to that answer. And you can mark your own answer.
1 Like 1 ·
JohnM avatar image
JohnM answered
If you change the Name column to VARCHAR, does that then provide the output that you would expect? Got the idea from: https://stackoverflow.com/questions/11107104/sql-server-sort-order-with-nonprintable-characters
4 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Yes, changing the Name column to varchar gives me the expected results - Thanks. But I'm stuck with nvarchar in my database. So a cast on the order by clause also gives me the expected results and I think this will enable me to move forward: SELECT ID , Name , ASCII(SUBSTRING(name, 2, 1)) FROM AsciiTestNonUnique ORDER BY CAST(Name AS VARCHAR(10)) But I'm still curious. Why does the sort order vary depending on the string containing these characters? Anyone?
0 Likes 0 ·
In fact a future project is to support unicode data. In that case the cast(Name as varchar(10)) I mention above won't be viable. So any explanation of the nvarchar behaviour would be much appreciated. Thanks in advance.
0 Likes 0 ·
The answer that I got back on Twitter is that you have to do the cast in order to get the sorting to work. http://sqlmag.com/blog/forcing-collation-where-clause-22-jun-2011 also from Twitter: "You could try ORDER BY CAST(Name as nvarchar(x)) COLLATE Cyrillic_General_CI_AI or.... COLLATE Latin1_General_BIN" and "Check sys.fn_helpcollations() to list different collations options that you can try. (I'm just skimming your question)." The provider of said info: http://twitter.com/angrypets (just disclaiming that it's not me) From another twitter user: "SQL uses the unicode collation algorithm for both [var]char & n[var]char. Sort order not as you'd expect." He referenced " http://www.unicode.org/reports/tr10/" The provider of said info: https://twitter.com/SQLTuna (just disclaiming that it's not me) Hope that helps!
3 Likes 3 ·
I posted a link to your question out on Twitter via the #sqlhelp hash tag. I don't know the answer but I'm sure that someone does. ;-)
0 Likes 0 ·
RonaldJJames avatar image
RonaldJJames answered
I guess this (from [ http://www.unicode.org/reports/tr10/"][1]) answers my original question Collation order is not preserved under concatenation or substring operations, in general. For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary: x < y does not imply that xz < yz x < y does not imply that zx < zy xz < yz does not imply that x < y zx < zy does not imply that x < y [1]: http://www.unicode.org/reports/tr10/
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

srutzky avatar image
srutzky answered

There are three main aspects of Unicode collating that are responsible for this behavior (mainly 1 and 3):

  1. VARCHAR data using a SQL Server collation (i.e. a collation name starting with SQL_ ) uses a static mapping table for sorting and comparison operations. But, both VARCHAR data using a Windows collation (i.e. a collation name not starting with SQL_ ) and NVARCHAR data (regardless of the type of collation) uses Unicode sorting rules. And Unicode uses a multi-level comparison that looks across the entire string for each type of sensitivity, instead of character-by-character. So it will look at all accents across the string, then the casing of each character across the entire string, etc.
  2. "Contextual Sensitivity" allows for behavior to adjust based on adjacent characters.
  3. "Variable Weighting" allows for punctuation to be "effectively" ignored, except as compared to other variable-weighted characters.

According to the official Unicode data, there are many characters marked as "variable", but in Microsoft's implementation of Unicode, it appears to be mainly "apostrophe" and "hyphen-minus".

For example:

DECLARE @SortTest TABLE (Things NVARCHAR(10) COLLATE SQL_Latin1_General_CP1_CI_AS);
INSERT INTO @SortTest ([Things]) VALUES
  (N'AA'),
  (N'AC'),
  (N'AB'),
  (N'ZA'),
  (N'A-A'),
  (N'a''A'),
  (N'A,B'),
  (N'A''''''A'),
  (N'A''-''A'),
  (N'A-B'),
  (N''''),
  (N'-'),
  (N'AA'),
  (N',');
SELECT [Things]
FROM   @SortTest
ORDER BY [Things] -- COLLATE Latin1_General_100_BIN2

That returns:

  1. '
  2. -
  3. ,
  4. A,B
  5. AA
  6. AA
  7. a'A
  8. A'''A
  9. A'-'A
  10. A-A
  11. AB
  12. A-B
  13. AC
  14. ZA

Now, you perhaps could use a binary collation (i.e. name ending in _BIN2; don't use the older _BIN collations) to get those two characters to not be ignored in this context, but that also introduces a different problem. For example, uncomment the COLLATE clause at the end of that query and execute it. It will return the following:

  1. '
  2. ,
  3. -
  4. A'''A
  5. A'-'A
  6. A,B
  7. A-A
  8. A-B
  9. AA
  10. AA
  11. AB
  12. AC
  13. ZA
  14. a'A

Notice how "a'A" comes after "ZA"; it is not grouped with the upper-case "A"s.

You could also try to sort by an expression that converts those 2 characters to something that would sort closer to where you expect them to go. This would allow for the actual values to remain as they are. But I wouldn't recommend this as it can get messy and hard to maintain, and won't likely perform well for large result sets as it likely won't be able to use an index (if you don't have an index on this column then it won't be as bit of a problem).


You can use the International Components for Unicode (ICU) Collation Demo to see a more ideal / faithful implementation of the Unicode specification. Just click on the "ICU Collation Demo" link and do the following:

  1. Replace the default list of items on the left with the following list of 14 items (same initial list that is inserted into the table variable in the example code shown above, but it needs to start out in the order shown below so you can't just copy-and-paste the result lists from above)
    • AA
    • AC
    • AB
    • ZA
    • A-A
    • a'A
    • A,B
    • A'''A
    • A'-'A
    • A-B
    • '
    • -
    • AA
    • ,
  2. In the top right section of options:
    1. select "secondary" as the "strength" setting (making it case-insensitive; default is case-sensitive)
    2. select "shifted" as the "alternate" setting
  3. To the right of the "Sort" button, uncheck the "diff strengths" check-box
  4. Click the "Sort" button

You should get the follow results:

  1. '
  2. -
  3. ,
  4. AA
  5. A-A
  6. a'A
  7. A'''A
  8. A'-'A
  9. AA
  10. AB
  11. A,B
  12. A-B
  13. AC
  14. ZA

Notable differences between ICU collation demo and SQL_Latin1_General_CP1_CI_AS in SQL Server:

  1. All of the punctuation in the ICU demo is ignored, not just apostrophe and hyphen-minus. This is why "A,B" sorts with "AB" in the ICU demo but before "AA" in SQL Server.
  2. The punctuation in the ICU demo is ignored completely, not just mostly. This is why we needed to list "AA" twice, once at the top and once at the bottom, to show that each one sorts according to its placement in the source list, which is why they are separated by other items containing punctuation.

Since you are working with NVARCHAR and intend on fully supporting Unicode (based on your comment on John's answer), here are some notes for working with NVARCHAR data:

  1. Not only do you need to specify variables as NVARCHAR, but string literals need to be prefixed with an upper-case "N". Sure, in this particular case it wouldn't alter any data, but it's a good habit to get into.
  2. When you want to find the numeric value (i.e. Code Point) of a particular character, you need to use the UNICODE() function instead of the ASCII() function.

For more information on working with collations in general, please visit: Collations Info

10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.