
Matt Whitfield asked

How would you create a set of all possible Unicode strings?

Ok, so I need to do a bit of investigation into how collations are managed, and how you could use .NET code to compare a string in the same way that any given collation would...

So, for this, I need to try to create a data set which contains all possible two-character Unicode strings. Seeing as this will end up being 4 billion rows, it could potentially be quite (very) time-consuming.

So I thought I would ask you guys how you would go about doing it.

A bit of background:

This is because I need to be able to take the output of two ordered SELECT statements from two DBs with identical schema, and in .NET be able to tell which row would come 'later'. This is so that if the databases contain different data, I can read from the side which has extra rows until the streams are back in sync. This is pretty easy for numeric types, but I need to work with string types too.
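Roughly, the kind of read loop I have in mind looks something like this (just a sketch; the `IComparer<string>` that matches the column's collation is exactly the piece I'm missing):

```csharp
using System;
using System.Collections.Generic;

static class OrderedDiff
{
    // Walks two result streams that are each ordered by the same key and
    // reports rows that exist on one side only, staying "in sync" otherwise.
    // The comparer must order keys the same way the databases' ORDER BY did.
    public static void Diff(IEnumerable<string> left, IEnumerable<string> right,
                            IComparer<string> comparer)
    {
        using var l = left.GetEnumerator();
        using var r = right.GetEnumerator();
        bool hasL = l.MoveNext(), hasR = r.MoveNext();

        while (hasL && hasR)
        {
            int cmp = comparer.Compare(l.Current, r.Current);
            if (cmp == 0)     { hasL = l.MoveNext(); hasR = r.MoveNext(); }                   // same row on both sides
            else if (cmp < 0) { Console.WriteLine($"extra on left:  {l.Current}"); hasL = l.MoveNext(); }
            else              { Console.WriteLine($"extra on right: {r.Current}"); hasR = r.MoveNext(); }
        }
        while (hasL) { Console.WriteLine($"extra on left:  {l.Current}"); hasL = l.MoveNext(); }
        while (hasR) { Console.WriteLine($"extra on right: {r.Current}"); hasR = r.MoveNext(); }
    }
}
```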

t-sql, collation

Rob Farley answered

Don't create a table with all the values; make one with all the Unicode characters - nchar(0) to nchar(65535) - and join it to itself to get the two-character strings.
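In .NET terms, the same cross-pairing idea can be sketched as a lazy enumeration rather than a 4-billion-row table (illustrative only; the type and method names here are made up):

```csharp
using System.Collections.Generic;

static class TwoCharStrings
{
    // Yields every possible two-code-unit string (65,536 × 65,536 combinations)
    // lazily, instead of materializing ~4.29 billion rows up front.
    public static IEnumerable<string> All()
    {
        for (int first = 0; first <= 0xFFFF; first++)
            for (int second = 0; second <= 0xFFFF; second++)
                yield return new string(new[] { (char)first, (char)second });
    }
}
```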


srutzky commented
This info is not exactly correct. **1)** In the 0 - 65535 range there are at least 2,048 values that have no representation / are not characters. **2)** I realize things have changed since 2010, given that SQL Server 2012 introduced the `_SC` collations, which are Supplementary Character-aware; but even in SQL Server 2008 R2 (and likely 2005 and 2008 as well), it is possible to store the full UTF-16 character set (I tried to provide an example here but it can't be saved ;-), and if the data is coming from outside of SQL Server, then those characters could be in the data. That means using 0 - 65535, while covering the most common characters, would miss the majority of the Unicode set. I provide more details about this in my answer.
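For reference, the surrogate range and the supplementary-character mechanism described here are easy to see from the .NET side (a small illustrative sketch, not SQL Server-specific):

```csharp
using System;
using System.Globalization;

class SurrogateDemo
{
    static void Main()
    {
        // U+1F600 (GRINNING FACE) is a Supplementary Character: in UTF-16 it is
        // stored as two code units, both from the 0xD800 - 0xDFFF surrogate range.
        string s = char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(s.Length);                        // 2 UTF-16 code units, not 2 "characters"
        Console.WriteLine(((int)s[0]).ToString("X4"));      // D83D (high surrogate)
        Console.WriteLine(((int)s[1]).ToString("X4"));      // DE00 (low surrogate)
        Console.WriteLine(char.IsHighSurrogate(s[0]));      // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));       // True

        // Neither code unit is a character on its own, which is why the
        // 0 - 65535 range does not contain 65,536 usable characters.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1
    }
}
```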
srutzky answered
> ...a bit of investigation into how collations are managed

What exactly is meant by "managed"? If you want to gain a better understanding of collations in general (their naming conventions, different versions, etc.), the following three areas of MSDN documentation will help greatly:

* [Collation and Unicode Support][1] (in "Planning a SQL Server Installation")
* [Collation and Unicode Support][2] (in "Database Engine Instances")
* [Collation][3]

> how you could use .NET code to compare a string in the same way that any given collation would

I am not certain about getting an approximation for collations with names starting in **SQL_**, but for non-**SQL_** collations you should be able to get the native .NET comparisons to be identical (or nearly identical, depending on any differences between Unicode 5.0 and 6.0, since the version being used in .NET depends on the OS version and .NET Framework version). If you have the collation name of the field, you should be able to:

1. Use [CompareInfo.GetCompareInfo(int culture)][4] to get a new `CompareInfo` object set to the correct locale, and then
2. Figure out the [CompareOptions][5] from the `_CI` (for case-insensitive), `_AI` (for accent-insensitive; equates to `CompareOptions.IgnoreNonSpace`), etc.
3. Use [CompareInfo.Compare(String, String, CompareOptions)][6] to do the same comparison that SQL Server would do. (A rough sketch putting these three steps together appears further down, after the Unicode notes below.)

> ...create a data set which contains all possible two-character Unicode strings. Seeing as this will end up being 4 billion rows...

This statement is based on a common misunderstanding of Unicode and SQL Server's implementation of it via the `NCHAR`, `NVARCHAR`, `NTEXT`, and `XML` datatypes. There are actually 1,112,064 possible characters that can be represented in Unicode and, yes, stored within the `XML` and `N`-prefixed string datatypes in SQL Server. Hence, all possible 2-character combinations would amount to 1,236,686,340,096 rows total :-). The 4 billion number comes from thinking in terms of the typical `NVARCHAR` / `NCHAR` data being double-byte, which gives 65,536 possibilities per character (and 65,536² ≈ 4.29 billion pairs). The misunderstanding here is:

* Both .NET and SQL Server (via the `XML` and `N`-prefixed string types) store data using the [UTF-16][7] (Little Endian) encoding.
* UTF-16 uses either 2 bytes or 4 bytes per character.
* The 2-byte characters were the original total set of characters, which was the UCS-2 encoding (now obsolete).
* The UCS-2 encoding (values 0 - 65,535 / the first 65,536 values of UTF-16) doesn't use values 55,296 through 57,343 (2,048 values) to represent any characters. Hence, if only thinking in terms of the purely double-byte characters, you would still only have a maximum of 63,488 actual characters (65,536 - 2,048), and some additional ones are marked as "unused" or "not used", so there might be a total of 63,480 actual, usable characters.
* The additional characters -- known as Supplementary Characters -- found in UTF-16 that are not in UCS-2 (values 65,536 and above / the 4-byte characters) are created by combining two 2-byte values -- known as surrogate pairs -- that are both in the 55,296 through 57,343 range.
* Storing these Supplementary Characters in `XML` and `N`-prefixed string datatypes is usually not a problem. However, whether or not they are recognized by sorting or the built-in functions depends on the version of the collation being used and whether or not the collation name ends with `_SC` (introduced in SQL Server 2012).
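Putting those three steps together, here is a minimal sketch. Assumptions: the collation's LCID has already been looked up on the SQL Server side (e.g. `SELECT COLLATIONPROPERTY(N'Latin1_General_CI_AS', 'LCID');` returns 1033), and the `_CI` / `_AI` flags have been parsed out of the collation name; the helper name and flag mapping below are illustrative only.

```csharp
using System;
using System.Globalization;

class CollationCompareSketch
{
    // Approximates a Windows (non-SQL_) collation comparison in .NET.
    // lcid: the collation's locale ID, e.g. 1033 for Latin1_General_* collations.
    static int CompareLikeCollation(string a, string b, int lcid,
                                    bool caseInsensitive, bool accentInsensitive)
    {
        CompareInfo ci = CompareInfo.GetCompareInfo(lcid);           // step 1: locale-specific CompareInfo

        CompareOptions options = CompareOptions.None;                // step 2: map _CI / _AI to CompareOptions
        if (caseInsensitive)   options |= CompareOptions.IgnoreCase;
        if (accentInsensitive) options |= CompareOptions.IgnoreNonSpace;

        return ci.Compare(a, b, options);                            // step 3: the actual comparison
    }

    static void Main()
    {
        // Expected to be treated as equal (0) under a CI/AI-style comparison:
        Console.WriteLine(CompareLikeCollation("resume", "Résumé", 1033, true, true));

        // Case- and accent-sensitive comparison: non-zero.
        Console.WriteLine(CompareLikeCollation("resume", "Résumé", 1033, false, false));
    }
}
```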
> This is because I need to be able to take the output of two ordered SELECT statements from two DBs with identical schema, and in .NET be able to tell which row would come 'later'. This is so that if the databases contain different data, I can read from the side which has extra rows until the streams are back in sync.

I might be missing something, but I don't see why this is necessary. If the two result sets are using the same field to sort on and have the same collation, then you can just look at the row numbers of the result sets. And if the collations are different, then it doesn't matter, as there is no way to equate them.

If you want to sync the data between two DBs and not worry about collation differences, then just use a collation that has a name ending in `_BIN2`. The `_BIN2` collations are ordered strictly by the code-point values. And that sorting just happens to be the same as using the `CompareOptions.Ordinal` option in .NET comparisons :-).

[1]: https://msdn.microsoft.com/en-us/library/ms143503.aspx
[2]: https://msdn.microsoft.com/en-us/library/ms143726.aspx
[3]: https://msdn.microsoft.com/en-us/library/ff848763.aspx
[4]: https://msdn.microsoft.com/en-us/library/80xbtyz7.aspx
[5]: https://msdn.microsoft.com/en-us/library/system.globalization.compareoptions.aspx
[6]: https://msdn.microsoft.com/en-us/library/0sk4bdtt.aspx
[7]: https://en.wikipedia.org/wiki/UTF-16
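And for the `_BIN2` / `CompareOptions.Ordinal` equivalence mentioned above, a quick sketch of the .NET side (the sample values are only there to show the ordering difference):

```csharp
using System;
using System.Globalization;

class Bin2OrdinalDemo
{
    static void Main()
    {
        string a = "Zebra";
        string b = "apple";

        // Linguistic (culture-aware) comparison: "apple" sorts before "Zebra".
        Console.WriteLine(string.Compare(a, b, StringComparison.InvariantCulture));

        // Ordinal comparison: 'Z' (0x005A) < 'a' (0x0061), so "Zebra" sorts first,
        // matching the order an ORDER BY would give under a *_BIN2 collation.
        Console.WriteLine(string.CompareOrdinal(a, b));
        Console.WriteLine(CultureInfo.InvariantCulture.CompareInfo.Compare(a, b, CompareOptions.Ordinal));
    }
}
```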
