I am going to take this one element at a time.
First, I have 8 text files that I have imported into SQL Server as separate tables. For simplicity, I have named them 21, 31, 41, 51, 24, 34, 44, and 54. I am trying to merge those 8 into 1 table, which would be named industry. The reason for this process is that I have duplicates across the various tables: items in 21 are duplicated in 41, for example. The duplicates do not exist within any single table, and they are not wholly contained in 21, 31, or any one of the others. My question for now is how to begin the MERGE statement. For the moment, I will pare it down to merging two tables into one.
The primary key is a compound one composed of nine fields.
First, how do I specify the 8 tables in the second line? Or do I have to do this one table at a time?
Second, my primary key has 9 columns. Would I specify all 9 in the third line? If so, would it look like below?
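Something along these lines is what I have in mind; a rough sketch, where k1 through k9 stand in for my actual key columns:

```sql
MERGE industry AS t
USING [21] AS src   -- one source table at a time?
    ON  t.k1 = src.k1 AND t.k2 = src.k2 AND t.k3 = src.k3
    AND t.k4 = src.k4 AND t.k5 = src.k5 AND t.k6 = src.k6
    AND t.k7 = src.k7 AND t.k8 = src.k8 AND t.k9 = src.k9
WHEN NOT MATCHED BY TARGET THEN
    INSERT (k1, k2, k3, k4, k5, k6, k7, k8, k9)
    VALUES (src.k1, src.k2, src.k3, src.k4, src.k5,
            src.k6, src.k7, src.k8, src.k9);
```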
I will borrow the logic from @Usman Butt's answer because it provides all the necessary points and the logic for a complete solution.
The source tables have an unusually large number of key columns in their composite primary key, so the join condition looks pretty awkward, but not much can be done about that unless the tables are properly re-designed. In order to remove duplicates (considering all columns), UNION may be used in place of the more desirable UNION ALL. It is still not known whether rows exist in different source tables that share the same 9-column primary key values but differ in the other columns (10 through 22). This scenario needs to be discussed. It does not add much complexity to the solution, but a rule is needed to determine which of the rows wins. In any case, here is a script which will disclose whether such records exist:
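A sketch of such a check, with k1 through k9 as placeholders for the actual key columns. Since UNION already removes rows that are identical across all columns, any key that still appears more than once must differ somewhere in columns 10 through 22:

```sql
SELECT k1, k2, k3, k4, k5, k6, k7, k8, k9
FROM
(
    SELECT * FROM [21] UNION
    SELECT * FROM [31] UNION
    SELECT * FROM [41] UNION
    SELECT * FROM [51] UNION
    SELECT * FROM [24] UNION
    SELECT * FROM [34] UNION
    SELECT * FROM [44] UNION
    SELECT * FROM [54]
) AS all_sources
GROUP BY k1, k2, k3, k4, k5, k6, k7, k8, k9
HAVING COUNT(*) > 1;
```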
If the script above produces any records, then duplicate records that share the key but differ in the other columns do exist.
This possible problem aside, the MERGE statement using the logic from @Usman Butt's answer, complemented with the actual actions to take, is below. Please note that the UPDATE part includes a check to ensure that only the rows which truly need to be updated are subjected to the update. Otherwise, all matching destination rows would be updated even when their values in columns 10 through 22 are the same. There are other ways to sniff out whether a matching row would cause such a faux update (for example, BINARY_CHECKSUM), but I opted not to use them.

Of course, it goes without saying that updating columns 1 through 9 makes no sense because those are key columns. In other words, if any of the key columns in any of the source tables is updated by some thoughtless process, then such an update simply translates into a lack of match between source and destination, meaning that the destination (industry) row will be deleted and the updated source row inserted. This is normal behaviour because, generally, nobody ever updates key columns in any SQL Server database tables, right? If a key column does need to be updated, one would restate such an update as a delete plus an insert anyway, so this is what the MERGE will mimic.
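As an aside, a NULL-safe way to express the "row truly changed" test inside a MERGE is an EXISTS over EXCEPT, which treats two NULLs as equal; a fragment, with c10 through c22 as placeholder non-key columns:

```sql
WHEN MATCHED AND EXISTS
(
    SELECT src.c10, src.c11, src.c12, src.c13, src.c14, src.c15, src.c16,
           src.c17, src.c18, src.c19, src.c20, src.c21, src.c22
    EXCEPT
    SELECT t.c10, t.c11, t.c12, t.c13, t.c14, t.c15, t.c16,
           t.c17, t.c18, t.c19, t.c20, t.c21, t.c22
) THEN
    UPDATE SET ...   -- same SET list as a plain column-by-column check
```

Plain `<>` comparisons silently skip rows where one side is NULL, so if any of columns 10 through 22 is nullable, this form is safer.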
Here is the script, which will work as expected after the column names I used are replaced with their actual counterparts (please note that because TARGET and SOURCE are keywords in the T-SQL MERGE syntax, I avoid using those words as aliases, naming the target t and the source src):
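A sketch of that script, using placeholder names: k1 through k9 for the key columns and c10 through c22 for the remaining columns; replace them with the actual names, and note the caveat about nullable columns and `<>` below:

```sql
MERGE industry AS t
USING
(
    -- UNION (not UNION ALL) removes rows duplicated across the 8 tables
    SELECT * FROM [21] UNION
    SELECT * FROM [31] UNION
    SELECT * FROM [41] UNION
    SELECT * FROM [51] UNION
    SELECT * FROM [24] UNION
    SELECT * FROM [34] UNION
    SELECT * FROM [44] UNION
    SELECT * FROM [54]
) AS src
    ON  t.k1 = src.k1 AND t.k2 = src.k2 AND t.k3 = src.k3
    AND t.k4 = src.k4 AND t.k5 = src.k5 AND t.k6 = src.k6
    AND t.k7 = src.k7 AND t.k8 = src.k8 AND t.k9 = src.k9
WHEN MATCHED AND
(
    -- only touch rows whose non-key values actually changed
       t.c10 <> src.c10 OR t.c11 <> src.c11 OR t.c12 <> src.c12
    OR t.c13 <> src.c13 OR t.c14 <> src.c14 OR t.c15 <> src.c15
    OR t.c16 <> src.c16 OR t.c17 <> src.c17 OR t.c18 <> src.c18
    OR t.c19 <> src.c19 OR t.c20 <> src.c20 OR t.c21 <> src.c21
    OR t.c22 <> src.c22
) THEN
    UPDATE SET t.c10 = src.c10, t.c11 = src.c11, t.c12 = src.c12,
               t.c13 = src.c13, t.c14 = src.c14, t.c15 = src.c15,
               t.c16 = src.c16, t.c17 = src.c17, t.c18 = src.c18,
               t.c19 = src.c19, t.c20 = src.c20, t.c21 = src.c21,
               t.c22 = src.c22
WHEN NOT MATCHED BY TARGET THEN
    INSERT (k1, k2, k3, k4, k5, k6, k7, k8, k9,
            c10, c11, c12, c13, c14, c15, c16, c17,
            c18, c19, c20, c21, c22)
    VALUES (src.k1, src.k2, src.k3, src.k4, src.k5,
            src.k6, src.k7, src.k8, src.k9,
            src.c10, src.c11, src.c12, src.c13, src.c14, src.c15,
            src.c16, src.c17, src.c18, src.c19, src.c20, src.c21, src.c22)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

One caveat: if any of columns 10 through 22 is nullable, the `<>` comparisons will not catch a change to or from NULL; in that case each comparison needs explicit IS NULL handling.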
Hope this helps.
If the data types are compatible and the number of columns is equal, maybe try using a SELECT with UNION ALL across the tables.
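A minimal sketch of this suggestion, assuming all 8 tables share the same column list and order:

```sql
INSERT INTO industry
SELECT * FROM [21] UNION ALL
SELECT * FROM [31] UNION ALL
SELECT * FROM [41] UNION ALL
SELECT * FROM [51] UNION ALL
SELECT * FROM [24] UNION ALL
SELECT * FROM [34] UNION ALL
SELECT * FROM [44] UNION ALL
SELECT * FROM [54];
```

Since duplicates exist across the tables, swapping UNION ALL for UNION would drop exact duplicate rows and avoid primary-key violations on industry.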
answered Feb 23 at 08:11 AM
If you already have temporary structures, then it is better to do all the data deduplication/merging on the temporary structures before hitting the base table. This strategy gives you several advantages. For example, you know which file takes preference, so you could keep only that file's data for a duplicated key to be merged into the base table; the duplicate data from the rest of the files could be deleted or marked as inactive in the temporary structures. If a major chunk of the data is duplicated, then you would be processing less data when merging into the base table. Hence, fewer CPU cycles, less processing time, etc.
I would be tempted to run the process something like the following:
EDIT: Added all composite primary columns
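A sketch of that process, under two stated assumptions: k1 through k9 are placeholders for the composite key columns, and a lower pref value means that file takes preference (here 21 wins over 31, and so on; adjust to your actual rules):

```sql
-- 1. Stage all rows, tagging each with its source file's preference rank.
SELECT *
INTO #staged
FROM
(
    SELECT 1 AS pref, * FROM [21] UNION ALL
    SELECT 2 AS pref, * FROM [31] UNION ALL
    SELECT 3 AS pref, * FROM [41] UNION ALL
    SELECT 4 AS pref, * FROM [51] UNION ALL
    SELECT 5 AS pref, * FROM [24] UNION ALL
    SELECT 6 AS pref, * FROM [34] UNION ALL
    SELECT 7 AS pref, * FROM [44] UNION ALL
    SELECT 8 AS pref, * FROM [54]
) AS s;

-- 2. Keep only the preferred row for each 9-column key.
WITH ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY k1, k2, k3, k4, k5, k6, k7, k8, k9
                              ORDER BY pref) AS rn
    FROM #staged
)
DELETE FROM ranked WHERE rn > 1;

-- 3. #staged now holds exactly one row per key,
--    so it can be merged into industry without duplicate-key conflicts.
```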
If you are sure that in each file the whole data row would be a complete duplicate and no preference is required, then you could simply use UNION instead of UNION ALL to remove the duplicate rows.
As you can see, there could be more than one solution, or a combination of several; it all depends upon your data and what suits you best. Hope it helps.