x

How to write a merge statement in SQL

I am going to take this one element at a time.

First, I have 8 text files that I have imported into SQL and made separate tables. For sake of simplicity, I have titled them 21,31,41,51,24,34,44,and 54. I am trying to merge those 8 into 1 which would be titled, industry. The reason behind this process is that I have duplicates in various tables. Items in 21 are duplicated in 41. The duplicates do not exist in the tables themselves. The duplicates are not wholly contained in 21, 31, or such. My question for this time is how I begin the merge statement. For now, I will pair it down to merge two items into one.

The primary key is a compound one comprised on nine fields.

   MERGE Industry AS TARGET
   USING industry21 AS SOURCE 
   ON (TARGET.PrimaryKey1 SOURCE.PrimaryKey1)

First, how do I specify the 8 tables in the second? Or do I have to do this one at a time?

Second, I have 9 primary keys. Would I specify all 9 in the third line? If so, would it look like below.

  ON (TARGET.PrimaryKey1 SOURCE.PrimaryKey1, TARGET.PrimaryKey2 SOURCE.PrimaryKey2, and so on)

more ▼

asked Feb 22 at 09:09 PM in Default

avatar image

nevada1978
51 1 6

(comments are locked)
10|1200 characters needed characters left

3 answers: sort voted first

I will borrow the logic from @Usman Butt answer because it provided all necessary points and the logic for a complete solution.

The source tables have an unusually large number of the key columns in their respective composite primary key, so the the join condition looks pretty awkward, but there is not much that can be done about this unless the tables are properly re-designed. In order to remove duplicates (from across all columns perspective), the union may be used in place of the more desired union all. It is still not known whether the situation exists when the data for the same 9-column wide primary key exists in different source tables when the first 9 column values are the same while other columns' values (10 through 22) are different. This scenario needs to be discussed. It does not add much complexity to the solution, but the definition of the rule to use in order to determine which of the rows is the winner is needed. In any case, here is the script which will disclose whether such records exist:

 select
     Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09
     from (
         select * from [21]
         union select * from [31]
         union select * from [41]
         union select * from [51]
         union select * from [24]
         union select * from [34]
         union select * from [44]
         union select * from [54]
     ) t
     group by Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09
     having count(1) > 1;
 go

If the script above produces any records then the duplicate records by key, which are different by others do exist.

This possible problem aside, the merge statement using the logic from @Usman Butt answer complemented with the actual actions to take is below. Please note that the UPDATE part includes the check to ensure that only the rows which truly need to be updated are subjected to the update. Otherwise, all matching destination rows will be updated even though their respective values in the columns 10 through 22 are the same. There are other ways to sniff whether the matching rows will cause the faux update or not (for example, the binary_checksum), but I opted not to use them. Of course it goes without saying that the update on the columns 1 through 9 makes no sense because those are key columns. In other words, if any of the key columns at any of the source tables is updated by some thoughtless process, then such update simply translates to lack of match between source and destination meaning that the destination (industry) row will be deleted and the updated row from the source - inserted. This is normal behaviour because generally, nobody ever updates any key columns in any SQL Server database tables right? If the key column needs to be updated then one would restate such update into delete plus insert anyway, so this is what the MERGE will mimic.

Here is the script which will work as expected after the column names that I used are replaced with their actual counterparts (please note that because TARGET and SOURCE are keywords in T-SQL language, I avoid using these words for aliases. naming the target as t and the source as src):

 merge industry as t
 using (
     -- Remove the rows which are duplicates across all columns by
     -- using union instead of union all. It is cheesy, but it works
     select * from [21]
     union select * from [31]
     union select * from [41]
     union select * from [51]
     union select * from [24]
     union select * from [34]
     union select * from [44]
     union select * from [54]
 ) src
     -- this is very unfortunate, but needs to be done due to fact that
     -- there are 9 columns in the composite primary key of every table
     on t.Col01 = src.Col01 and t.Col02 = src.Col02 and t.Col03 = src.Col03
     and t.Col04 = src.Col04 and t.Col05 = src.Col05 and t.Col06 = src.Col06
     and t.Col07 = src.Col07 and t.Col08 = src.Col08 and t.Col09 = src.Col09
 
     when matched /* comment out these conditions if faux updates are acceptable */
         and (t.Col10 <> src.Col10 or t.Col11 <> src.Col11 or t.Col12 <> src.Col12 
         or t.Col13 <> src.Col13 or t.Col14 <> src.Col14 or t.Col15 <> src.Col15 
         or t.Col16 <> src.Col16 or t.Col17 <> src.Col17 or t.Col18 <> src.Col18 
         or t.Col19 <> src.Col19 and t.Col20 <> src.Col20 and t.Col21 <> src.Col21 
         or t.Col22 <> src.Col22) then update
     set
         Col10 = src.Col10, Col11 = src.Col11, Col12 = src.Col12, Col13 = src.Col13, 
         Col14 = src.Col14, Col15 = src.Col15, Col16 = src.Col16, Col17 = src.Col17,
         Col18 = src.Col18, Col19 = src.Col19, Col20 = src.Col20, Col21 = src.Col21,
         Col22 = src.Col22
     when not matched by target then insert (
         Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09, Col10, Col11,
         Col12, Col13, Col14, Col15, Col16, Col17, Col18, Col19, Col20, Col21, Col22
     )
     values (
         src.Col01, src.Col02, src.Col03, src.Col04, src.Col05, src.Col06, src.Col07, 
         src.Col08,src. Col09, src.Col10, src.Col11, src.Col12, src.Col13, src.Col14, 
         src.Col15, src.Col16, src.Col17, src.Col18, src.Col19, src.Col20, src.Col21, 
         src.Col22
     )
     when not matched by source then delete;

Hope this helps.

Oleg

more ▼

answered Feb 28 at 12:24 AM

avatar image

Oleg
20.6k 3 7 29

@Oleg, thanks for your help. I will go over this with a fine toothed comb tomorrow and try to learn the mechanics of this so that I can possibly avoid having to do this in the future.

Feb 28 at 12:43 AM nevada1978

@Oleg +1 for coming up with the complete script :) My only concern is if delete is to be done as well in this MERGE, then why bothering such complex MERGE statement? A full delete and insert could be better?

Feb 28 at 02:06 PM Usman Butt

@Usman Butt As always, it depends. Usually, the expectation is that the deltas are small comparing to the complete replace of data via delete plus insert, so merge might make sense. What about if there are dependencies between the destination table and some other tables in the destination database? I might be not trivial to take care of all child records if delete all on the parent is used. The only complex part about this merge is the bunch of column to column comparisons for non-key columns (the ones subjected to update), but this is just to avoid faux updates back to the same values for probably most records. The delete is there to handle the true new records and the evil updates on the key columns (which merge sorts out into delete plus insert). The bottom line is that with relatively small deltas, the merge will affect a rather small part of all destination records and it is pretty fast because of it (not too much data has to move).

By the way, I was typing too fast last evening, it was already after hours, so there was a bug in the update part (I used AND in place of OR). The answer has been updated to fix this.

@nevada1978 Please use updated answer. Thank you.

Feb 28 at 03:39 PM Oleg

Not to be difficult but I still had to manually weed out the duplicates. I still get a primary key duplication error. It would seem as if I will have to take this one file at a time.

Mar 01 at 05:59 PM nevada1978

@nevada1978 I thought I mentioned in my answer that is is entirely possible to have the rows in the different source tables which have the same combination of columns 1 through 9 (key columns) with different values in the columns 10 through 22. So, let me please disagree with your assessment. It is still easily possible to "weed out the duplicates" by the means of a tiny touch up to the select statement which is gathering the data from 8 source tables. All that needs to happen is that the rule to use needs to be defined. When you said that you had to manually do it, you probably had to use some rule. This could be "the record in the earlier table wins" or "the record in the later table wins" or "one of the columns is actually the date column so the record with the latest date wins" or whatever other rule. If you let me know what determines which record to keep and which to throw out, I can update my answer. Again, this would be a tiny touch up just to the part selecting rows from 8 source tables, that is all. Please let me know.

Mar 01 at 06:59 PM Oleg
(comments are locked)
10|1200 characters needed characters left

if data types are similar and number of columns are equal, maybe try using select with union all for that table

more ▼

answered Feb 23 at 08:11 AM

avatar image

Lukasz8519
136 2 2

That would be ok if there were not duplicates in most of the 8 files mentioned above. I am hoping to find a way to merge all 8 and not have to de-duplicate them one table at a time.

Feb 24 at 12:02 AM nevada1978
(comments are locked)
10|1200 characters needed characters left

If you already have temporary structures, then it is better to do all the data deduplication/merging on temporary structures before hitting the base table. This strategy will give you various advantages. For e.g., you know which file would have the preference. So you could keep only that file's data with the duplicate key to be merged in the base table. Rest of the files duplicate data could be deleted or marked as inactive in the temporary structures. If the major chunk of data is duplicated, then you could be processing less data for the base table to be merged. Hence, fewer CPU cycles, less processing time etc. could be achieved

I would be tempted to run the process something like the following

  • Merge/Deduplicate all data in the temporary tables preferably by deleting the duplicate data or mark the duplicate data as inactive/deleted.

  • If the data is not huge, you could also merge/deduplicate data in a single temporary table. So then only one merge statement would be required

  • If the data is huge then I would use 8 Merge statements. Yes, I know it won't sound good to you but while working with the huge data it is always best to process it in batches. You would have more granular control, better logging, option to commit/rollback a certain file etc.

  • But after the merging/deduplication, If you still want to go with all the tables in one Merge statement then as @Lukasz8519 already pointed out you could use UNION ALL for all the temporary tables to make a single source

EDIT: Added all composite primary columns

     MERGE Industry AS TARGET
     USING (
     
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table1 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table2 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table3 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table4 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table5 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table6 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table7 UNION ALL
     SELECT PrimaryKeyCol, OtherPrimaryColumns, OtherColumns FROM Table8 
         ) AS SOURCE 
 ON (TARGET.PrimaryKeyCol = SOURCE.PrimaryKeyCol
 TARGET.PrimaryKeyCol2 = SOURCE.PrimaryKeyCol2
 TARGET.PrimaryKeyCol3 = SOURCE.PrimaryKeyCol3
 TARGET.PrimaryKeyCol4 = SOURCE.PrimaryKeyCol4
 TARGET.PrimaryKeyCol5 = SOURCE.PrimaryKeyCol5
 TARGET.PrimaryKeyCol6 = SOURCE.PrimaryKeyCol6
 TARGET.PrimaryKeyCol7 = SOURCE.PrimaryKeyCol7
 TARGET.PrimaryKeyCol8 = SOURCE.PrimaryKeyCol8
 TARGET.PrimaryKeyCol9 = SOURCE.PrimaryKeyCol9
     )


If you are sure that in each file the whole data row would be a complete duplicate and no preference is required, then you could simply use UNION instead of UNION ALL to remove the duplicate rows.

There you see there could be more than one solution or multiple combinations, all depends upon your data and what suits you best. Hope it helps.

more ▼

answered Feb 24 at 09:11 AM

avatar image

Usman Butt
14.9k 6 13 21

@Usman Butt, I have 9 primary keys and 13 foreign keys. Does your mention of "other columns" refer to the additional 8 primary keys or the 13 foreign keys?

Feb 26 at 04:39 PM nevada1978

@nevada1978 What does "additional 8 primary keys" mean? Each table may have but one primary key, that is all. If the number of columns is the same and the data types match then @Usman Butt script already does the trick in the sense that it returns all rows from all tables in question as one set, so you can use a single merge statement. Additionally, he mentions that in case if you need to weed out the duplicates then you can simply use UNION in place of UNION ALL. Of course it goes without saying that the script in the answer is not complete, it just provides the idea on how to merge multiple tables at once. You still have to fill the part on what to do with the data, @Usman Butt provided you with the answer about how to gather it together.

From the questions you ask, it appears that that you might not be familiar with what MERGE statement actually does and how it works. If this is the case then please elaborate a bit by adding additional details to your question (tables columns, what needs to happen, i.e. just insert and update to the target or complete sync, including deletes) and we can then spell out the complete solution for you. Thank you.

Feb 26 at 05:32 PM Oleg

@Oleg, I have added some more details. First, I just want to know if having the compound primary key is of any relevance?

Feb 26 at 08:14 PM nevada1978

@nevada1978 You keep mentioning 9 primary keys, but this is not possible. Any table can have only one primary key. You probably mean to say you have a table with the composite primary key consisting of 9 columns. Of course, if I had to discuss it in the office settings, I would certainly have quite a few cuss words to say about the table with such a peculiar design, but we are not discussing the design but need to simply merge the data from multiple sources into a single destination. This task is not difficult, but can be accomplished only after the requirements are spelled out. Please clarify the following:

  • What are the column names of the source tables? (or at least the number of columns)

  • All source tables have the same number of columns (names and types match) Yes or no.

  • Sadly, each source table can be uniquely identified by the ridiculously large number of columns, 9 to be exact. Yes or no.

  • Should the merge handle inserts and updates, or deletes should be handled as well.

  • What to do if 2 source tables have the "same" record (per 9-column-wide primary key) with different values in "other" columns. Which of the 2 records wins?

Please let me know. Thank you.

Feb 27 at 04:22 PM Oleg
  1. All source tables have 22 columns. The first nine are part of the composite primary key.

  2. Yes.

  3. Yes.

  4. I would hope to do all of the above.

  5. I will research if the scenario that you mention does in fact occur.

Feb 27 at 04:42 PM nevada1978
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x169
x57
x45
x34

asked: Feb 22 at 09:09 PM

Seen: 147 times

Last Updated: Mar 01 at 07:44 PM

Copyright 2018 Redgate Software. Privacy Policy