question

Mark avatar image
Mark asked

Distributed Databases and Primary Keys

What should be done to maintain distributed database keys? (More specifically for example, a replicated database that is updated on more than one machine or a table that is not always connected.) GUIDs have their issues, although they're a possibility. Is there a better solution? This is not an emergency. I'm just interested. ;-)
primary-keydistributed
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

TimothyAWiseman avatar image
TimothyAWiseman answered
I concur with Fatherjack and that is generally the solution I have seen used when you want a surrogate key. Another possible answer is to look for a natural key. Not everything has a natural key that is appropriate, but when they do it tends to remove the problem entirely.
4 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

@TimothyAWiseman This is what I tried to describe as the last option in my answer, but could not find the right words to do it, so yours is so much easier to understand.
0 Likes 0 ·
+1 I agree. If you really have to distribute the app, sometimes just using the natural key is much better. I'm not a fan in general because they tend to make the PK so fat AND because they're usually the most common acces path, making them the most likely candidate for the cluster, which makes the cluster fat as well.
0 Likes 0 ·
@Grant Fritchey Sometimes there are ways to put PK on Atkins diet by using not so natural identifier of the location (provided that it is carved in stone though). For example, suppose you have a unit with some friendly name and very cheap ID (like 4-digit number). If you have ID \* some_big_round_number + some_number such that the first 4-digits are still preserved and the whole thing still fits into 8-byte int then you get the best of both: seed separation and identifiable by the naked eye "record location". Something like area named TX has id = 1234 and area named Peoples Republic of Massachusetts has id = 5678. So when both areas' data is loaded into your central database, you can see right away that record 12340000000001 came from TX and record 56780000000001 came from Boston.
0 Likes 0 ·
And the "answer" belongs to you Timothy. You got the most votes thus far and I like the idea of using a natural key. They can never be wrong (short of a typo or lack of data entry validation). On the other hand, if multiple columns are necessary to make the natual key, they can be more difficult to work with.
0 Likes 0 ·
Fatherjack avatar image
Fatherjack answered
Generally with replication you get to set a range of IDs per subscriber so that they are exclusive they get to use up that range between syncs. When they sync and have used thier range they get given the next range after the current largest in use. EG subsvr1 gets ID range 1 - 10000 subsvr2 gets ID range 10001 - 20000 they both use IDs subsvr1 gets to the end of its range 1st and then gets issued 20001 - 30000 and so on. setting the range sizes is where you come in. You have to estimate the suitable range to allow offline functionality. beware large transactions that bump into the end of the range - the whole job gets rolled back ...
4 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

I like that idea (+1), although using ranges like that can be tricky since I've known of little apps I've written to last years longer than I thought their life would be. Not that I'm bragging, the processes simply worked, so no one took the time to rewrite them is all.
1 Like 1 ·
Mark it isnt a case of defining a range for ever, you simply have to estimate the size of the range required to last the app between sync opportunities. It will arrange the hop to the next set of figures automagically. eg if you set the range to be 50 and the sync duration is 7 days and your offline work uses 50 per day then it will break. If you set the range to 500 then you should be OK for most weeks and handle the odd 'big one'. If you set it to 1000 then the subscriber doesnt have to sync for 3 weeks. When it gets to the limit it will roll over to the next available range as required
0 Likes 0 ·
@Fatherjack - good point, so the range of values function as temporary keys until the sync. After the sync, they can be reused. But in that case, you would have to track the sync sequences somehow to prevent duplicate keys. So you would use some sort of batch counter maybe? I've had to use that technique before with intensive data entry systems.
0 Likes 0 ·
@Mark - it depends on which method. Actually setting up replication involves assigning the ranges and progression on the PKs so its handled from SQL or you would have to manage it from the app if you are 'rolling your own'. I would guess (as I have never considered it let alone done it) that you would keep a reference table in the database that the app gets it range from and then marks it as "used" and gets the next range as needed. I have only ever let SQL take care of it.
0 Likes 0 ·
Oleg avatar image
Oleg answered
Starting from SQL Server 2000, there is a way to manage identity keys by letting the replication allocate different ranges to publishers and subscribers. The problem with this approach is the clustering. Identity keys are ideal clustering candidates because they are both ever increasing and never changing. In the distributed environment, the replication of such data will inevitably lead to inconsistent page splits. The same is applied to GUIDs. I believe that the page split damage can and should be minimized by not allowing front ends to feed the database objects with guids but rather by opting to implement newsequentialid() column defaults. This is not a bullet-proof approach of course because the sequential guids are reseeded at reboot, so they do lead to page splits even locally, and certainly in distributed environment. Another approach is the manufacturing of the pseudo-identity values by borrowing some info from other identifiers. This sounds attractive, but is inherently flawed if the subscribers move from one parent host to another. I happen to have practically worked with GUIDS and the last option, never with seeding identities for subscribers. Any scenario will require one to consider fillfactors very carefully. GUID approach is the easiest to maintain, but it cost some extra space because of the size (16 byte guid is twice as wide as 8-byte bigint, which matters when multiplied by all indexes referencing 8 extra bytes on every leaf pages index record of every non-clustered index in the table). The space is cheap enough though, so GUID combined with appropriately chosen fillfactor (usually 80 to 90) seems to be the easiest way to go. Oleg
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image
WilliamD answered
I'm going for door number three! We are using homegrown key management for our system. Ids generated on the fly using number ranges based on location. This is a potential bottleneck for inserts - namely when lots of people try to get the next Id, but our system is not constrained on that side of things, so it is not a problem. The other option we looked at was a composite primary key, where a single tinyint column is prepended to the "real" key and hard-set per node in the replication. Don't really know why we didn't go down that route - as we wouldn't have any further problems with number range crashes down the road.
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Matt Whitfield avatar image
Matt Whitfield answered
For what it's worth, I've done this before using identity seed and increment for surrogate keys - so, in a 5 node system, I might have: * Node1.Table1 - IDENTITY(1, 10) * Node2.Table1 - IDENTITY(2, 10) * Node3.Table1 - IDENTITY(3, 10) * Node4.Table1 - IDENTITY(4, 10) * Node5.Table1 - IDENTITY(5, 10) This left some space for expansion, and made sure there were no issues 'down the line'. However, as Oleg points out (+1), clustering can become an issue if one node accumulates rows at a different rate to the others.
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

David 1 avatar image
David 1 answered
A natural key. A guid makes a good surrogate but natural keys (I prefer the term "business keys", which means the same thing) are just as important in a distributed database as in a non-distributed one - potentially more so.
3 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

David, it's good to see you're back. I started a discusson about NULLs a while back (on the SSC site) and hoped you could chime in, but then you were not to be found.
0 Likes 0 ·
Pretty much everything that can be said about nulls has already been said.
0 Likes 0 ·
OK, that's fine. Most of the points I wanted to draw out have been made. However, I meant to keep working on that and say more about it, but I just haven't made the time for it.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.