There are three main aspects of Unicode collating that are responsible for this behavior (mainly 1 and 3):
VARCHAR
data using a SQL Server collation (i.e. a collation name starting with SQL_
) uses a static mapping table for sorting and comparison operations. But, both VARCHAR
data using a Windows collation (i.e. a collation name not starting with SQL_
) and NVARCHAR
data (regardless of the type of collation) uses Unicode sorting rules. And Unicode uses a multi-level comparison that looks across the entire string for each type of sensitivity, instead of character-by-character. So it will look at all accents across the string, then the casing of each character across the entire string, etc.According to the official Unicode data, there are many characters marked as "variable", but in Microsoft's implementation of Unicode, it appears to be mainly "apostrophe" and "hyphen-minus".
For example:
DECLARE @SortTest TABLE (Things NVARCHAR(10) COLLATE SQL_Latin1_General_CP1_CI_AS); INSERT INTO @SortTest ([Things]) VALUES (N'AA'), (N'AC'), (N'AB'), (N'ZA'), (N'A-A'), (N'a''A'), (N'A,B'), (N'A''''''A'), (N'A''-''A'), (N'A-B'), (N''''), (N'-'), (N'AA'), (N','); SELECT [Things] FROM @SortTest ORDER BY [Things] -- COLLATE Latin1_General_100_BIN2
That returns:
Now, you perhaps could use a binary collation (i.e. name ending in _BIN2
; don't use the older _BIN
collations) to get those two characters to not be ignored in this context, but that also introduces a different problem. For example, uncomment the COLLATE
clause at the end of that query and execute it. It will return the following:
Notice how "a'A" comes after "ZA"; it is not grouped with the upper-case "A"s.
You could also try to sort by an expression that converts those 2 characters to something that would sort closer to where you expect them to go. This would allow for the actual values to remain as they are. But I wouldn't recommend this as it can get messy and hard to maintain, and won't likely perform well for large result sets as it likely won't be able to use an index (if you don't have an index on this column then it won't be as bit of a problem).
You can use the International Components for Unicode (ICU) Collation Demo to see a more ideal / faithful implementation of the Unicode specification. Just click on the "ICU Collation Demo" link and do the following:
You should get the follow results:
Notable differences between ICU collation demo and SQL_Latin1_General_CP1_CI_AS
in SQL Server:
Since you are working with NVARCHAR
and intend on fully supporting Unicode (based on your comment on John's answer), here are some notes for working with NVARCHAR
data:
NVARCHAR
, but string literals need to be prefixed with an upper-case "N". Sure, in this particular case it wouldn't alter any data, but it's a good habit to get into.UNICODE()
function instead of the ASCII()
function.For more information on working with collations in general, please visit: Collations Info
16 People are following this question.