
How would you create a set of all possible Unicode strings?

Ok, so something I need to do is a bit of investigation into how collations are managed, and how you could use .NET code to compare a string in the same way that any given collation would...

So, for this, I need to try to create a data set which contains all possible two-character Unicode strings. Seeing as this will end up being 4 billion rows, it could be quite (very) time-consuming.

So I thought I would ask you guys how you would go about doing it.

A bit of background:

This is because I need to be able to take the output of two ordered SELECT statements from two DBs with identical schema, and in .NET be able to tell which row would come 'later'. This is so that if the databases contain different data, I can read from the side which has extra rows until the streams are back in sync. This is pretty easy for numeric types, but I need to work with string types too.
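For what it's worth, the sync logic itself is just a two-pointer merge walk over the two ordered streams. A minimal sketch in Python (function names and tuple shapes here are my own invention, not from the original post; `key_cmp` stands in for whatever comparison matches the database's collation):

```python
# Walk two ordered row streams; when keys differ, advance the side whose key
# sorts earlier until the streams line up again.
def sync_streams(rows_a, rows_b, key_cmp):
    """Yield ('both', a, b), ('only_a', a, None) or ('only_b', None, b)."""
    it_a, it_b = iter(rows_a), iter(rows_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and key_cmp(a, b) < 0):
            yield ('only_a', a, None)      # extra row on side A
            a = next(it_a, None)
        elif a is None or key_cmp(a, b) > 0:
            yield ('only_b', None, b)      # extra row on side B
            b = next(it_b, None)
        else:
            yield ('both', a, b)           # keys match; streams in sync
            a = next(it_a, None)
            b = next(it_b, None)

# Plain code-point comparison stands in for a binary collation here.
ordinal = lambda x, y: (x > y) - (x < y)

print(list(sync_streams(['a', 'b', 'd'], ['a', 'c', 'd'], ordinal)))
```

The hard part, as the question says, is making `key_cmp` agree with the collation for string keys; that is what the answers below address.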


asked Jan 25, 2010 at 06:42 AM in Default


Matt Whitfield ♦♦


2 answers:

Don't create a table with all the values; make one with all the Unicode characters, nchar(0) to nchar(65535), and join it to itself to get the two-character strings.
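The self-join idea can be sketched outside SQL too. Here is a tiny Python version of the cross join, using a deliberately small character range for the demo (the real range would be nchar(0) through nchar(65535), and the full cross join would be 65,536² ≈ 4.29 billion rows):

```python
from itertools import product

# Build the single-character "table" once, then cross join it with itself --
# the Python analog of the nchar(0)..nchar(65535) self-join.
chars = [chr(cp) for cp in range(0x41, 0x45)]   # 'A'..'D' for the demo
pairs = [a + b for a, b in product(chars, repeat=2)]

print(len(pairs))    # 16 = 4 * 4
print(pairs[:3])     # ['AA', 'AB', 'AC']
```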


answered Jan 25, 2010 at 08:00 AM


Rob Farley

This info is not exactly correct. 1) In the 0 - 65535 range there are at least 2,048 values that have no representation / are not characters. 2) Things have changed since 2010: SQL Server 2012 introduced the _SC collations, which are Supplementary Character-aware, but even in SQL Server 2008 R2 (and likely 2005 and 2008 as well) it is possible to store the full UTF-16 character set (I tried to provide an example here but it can't be saved ;-). And if the data is coming from outside of SQL Server, then those characters could be in the data, meaning that using 0 - 65535, while covering the most common characters, would miss the majority of the Unicode set. I provide more details about this in my answer.

Jul 22, 2015 at 02:24 PM srutzky

    ...I need to do is a bit of investigation into how collations are managed...

What exactly is meant by "managed"? If you want to gain a better understanding of collations in general (their naming conventions, different versions, etc.), the MSDN documentation on collations will help greatly:

I am not certain about getting an approximation for collations with names starting in SQL_, but for non-SQL_ collations you should be able to get the native .NET comparisons to be identical (or nearly identical depending on any differences between Unicode 5.0 and 6.0 since the version being used in .NET is dependent upon OS version and .NET Framework version). If you had the collation name of the field, you should be able to:

  1. Use CompareInfo.GetCompareInfo(int culture) to get a new CompareInfo object set to the correct locale, and then

  2. Figure out the CompareOptions values from the collation-name suffixes: _CI (case-insensitive; equates to CompareOptions.IgnoreCase), _AI (accent-insensitive; equates to CompareOptions.IgnoreNonSpace), etc., and then

  3. Use CompareInfo.Compare(String, String, CompareOptions) to do the same comparison that SQL Server would do.
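There is no CompareInfo in Python, but as a rough stdlib analog of what the _CI / _AI suffixes mean, you can normalize strings before comparing them. This is only an approximation for illustration; it is not exactly what SQL Server or .NET's CompareInfo does:

```python
import unicodedata

def strip_accents(s):
    # Decompose to NFD and drop combining marks -- a rough stand-in for _AI.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

def collation_key(s, case_insensitive=True, accent_insensitive=True):
    # Hypothetical helper: reduce a string to a comparable form.
    if accent_insensitive:
        s = strip_accents(s)
    if case_insensitive:
        s = s.casefold()                  # rough stand-in for _CI
    return s

print(collation_key('Résumé') == collation_key('resume'))   # True
```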

    ...create a data set which contains all possible two-character Unicode strings. Seeing as this will end up being 4 billion rows...

This statement is based on a common misunderstanding of Unicode and SQL Server's implementation of it via the NCHAR, NVARCHAR, NTEXT, and XML datatypes.

There are actually 1,112,064 possible characters that can be represented in Unicode and yes, stored within the XML and N-prefixed string datatypes in SQL Server. Hence, all possible 2-character combinations would amount to 1,236,686,340,096 rows total :-).
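The arithmetic checks out; a quick sanity check in Python:

```python
# Unicode code points run 0..0x10FFFF, minus the 2,048 surrogate values
# that never encode a character on their own.
total_code_points = 0x10FFFF + 1          # 1,114,112
surrogates = 0xDFFF - 0xD800 + 1          # 2,048
usable = total_code_points - surrogates   # 1,112,064

print(usable)        # 1112064
print(usable ** 2)   # 1236686340096 two-character combinations
```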

The 4 billion number comes from thinking in terms of the typical NVARCHAR / NCHAR data being double-byte, which is 65,536 possibilities. The misunderstanding here is:

  • Both .NET and SQL Server (via the XML and N-prefixed string types) store data using the UTF-16 (Little Endian) encoding.

  • UTF-16 uses either 2 bytes or 4 bytes per character.

  • The 2-byte characters were the original total set of characters, which was the UCS-2 encoding (now obsolete).

  • The UCS-2 encoding (values 0 - 65,535 / the first 65,536 values of UTF-16) doesn't use values 55,296 through 57,343 (2048 values) to represent any characters. Hence, if only thinking in terms of the purely double-byte characters, you would still only have a maximum of 63,488 actual characters (65,536 - 2048), and some additional ones are marked as "unused" or "not used", so there might be a total of 63,480 actual, usable characters.

  • The additional characters -- known as Supplementary Characters -- found in UTF-16 that are not in UCS-2 (values 65,536 and above / the 4-byte characters) are created by combining two 2-byte values -- known as surrogate pairs -- that are both in the 55,296 through 57,343 range.

  • Storing these Supplementary Characters in XML and N-prefixed string datatypes is usually not a problem. However, whether or not they are recognized by sorting or the built-in functions depends on the version of the collation being used and whether or not the collation name ends with _SC (introduced in SQL Server 2012).
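To make the surrogate-pair mechanics above concrete, here is the standard UTF-16 derivation in Python (U+1F600, an emoji, chosen arbitrarily as an example Supplementary Character):

```python
# A Supplementary Character (code point above 65,535) is stored in UTF-16 as
# a surrogate pair: two 2-byte values drawn from the 0xD800-0xDFFF range.
cp = 0x1F600                               # example: U+1F600

# Standard surrogate-pair derivation from the UTF-16 spec:
v = cp - 0x10000
high = 0xD800 + (v >> 10)                  # high (lead) surrogate
low  = 0xDC00 + (v & 0x3FF)                # low (trail) surrogate
print(hex(high), hex(low))                 # 0xd83d 0xde00

# The same pair falls out of the UTF-16 LE encoding (4 bytes, not 2):
print(chr(cp).encode('utf-16-le').hex())   # '3dd800de'
```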

    This is because I need to be able to take the output of two ordered SELECT statements from two DBs with identical schema, and in .NET be able to tell which row would come 'later'. This is so that if the databases contain different data, I can read from the side which has extra rows until the streams are back in sync.

I might be missing something, but I don't see why this is necessary. If the two result sets are using the same field to sort on and have the same collation, then you can just look at the row numbers of the result sets. And if the collations are different, then it doesn't matter as there is no way to equate them. If you want to sync the data between two DBs and not worry about collation differences, then just use a collation that has a name ending in _BIN2. The _BIN2 collations are ordered strictly by the code-point values. And, that sorting just happens to be the same as using the CompareOptions.Ordinal option in .NET comparisons :-).
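As a rough illustration of that last point: Python happens to compare strings by code point, which mirrors the _BIN2 / CompareOptions.Ordinal idea of a fixed, culture-free ordering that both sides can reproduce exactly. (Caveat: .NET's Ordinal actually compares UTF-16 code units, so the two orderings can differ where Supplementary Characters are involved; this is only an analogy.)

```python
# Code-point ordering: no culture rules, so uppercase 'Z' (90) sorts before
# lowercase 'a' (97), and accented / supplementary characters sort last here.
rows = ['Zebra', 'apple', 'Äpfel', '\U0001F600']
print(sorted(rows))        # ['Zebra', 'apple', 'Äpfel', '\U0001F600']
print('Zebra' < 'apple')   # True
```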


answered Jul 19, 2015 at 02:03 AM


srutzky


Seen: 1013 times

Last Updated: Jul 22, 2015 at 02:26 PM

Copyright 2016 Redgate Software.