question

liton avatar image
liton asked

Remove duplicate using row_number performance

I’m using row_number() to insert distinct orders to the production table from the staging tables. I receive a lot of duplicate records and I only want the distinct records to be in the final table. Below is the cte I’m using to accomplish this task: with cte_orders as (select *, ROW_NUMBER() over (partition by ordno order by ordno) as rownum from Stage_Orders) select * from cte_orders where rownum=1 This is taking a long time to compile. I looked at the execution plan and the sort cost is 48% and table scan cost is 42%. I don’t have index on the staging table because I update that table a lot. Any suggestion on how I can improve the performance? Thanks
performance
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Håkan Winther avatar image
Håkan Winther answered
One issue with your table is that you don't have a clustered index as you get a table scan. In my opinion, every table should have at least a clustered index. You can't create a table in SQL Azure without adding a clustered index. A table without a clustered index is an unsorted hash that you need to scan and very often sort to use. To improve performance of over clause you should add an index on POC columns, where P is partitioning column, O is order by columns and C is other columns you use in the Query. P and O should be index columns and C should be included columns.
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

liton avatar image liton commented ·
But wouldn't creating index will slow down the insert or updates? I know I can disable the index before the insert and then re-build it again but disabling and rebuilding the index takes long time for a large table. I usually insert around million records and the table already has 150 million records.
0 Likes 0 ·
Håkan Winther avatar image Håkan Winther commented ·
It depends. If you have a clustered index and your data is sorted on the same column as your insert, then the insert is almost as fast as no index at all.
0 Likes 0 ·
Håkan Winther avatar image Håkan Winther commented ·
A million records is not that many records, compared To 150 million records in a heap that you try To fetch AND sort.
0 Likes 0 ·
Magnus Ahlkvist avatar image
Magnus Ahlkvist answered
I agree with Håkan. Depending on how you update the existing rows in the staging table, you might or might not want to just add an identity column primary key on it. If you insert more often than you update, or if you update doesn't change the total number of bytes per row, you can have a clustered identity column without causing fragmentation. I wouldn't be too afraid of adding indexes to the staging table. If you have massive loads from SSIS or the likes, you might want to drop indexes, load staging table and rebuild indexes. With enterprise edition, you could look at partition switching and use a "pre staging" table. If your performance problem today is with selecting from the staging table, I would suggest indexing the staging table and test performance on loading and updating rather than assuming indexing the staging table will hit your performance.
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Håkan Winther avatar image Håkan Winther commented ·
Partition switching is an excellent solution, but it requires a clustered index to use for partitioning. Another option is to use temporary tables for pre-staging.
0 Likes 0 ·
Jeff Moden avatar image
Jeff Moden answered
> But wouldn't creating index will slow > down the insert or updates? I know I > can disable the index before the > insert and then re-build it again but > disabling and rebuilding the index > takes long time for a large table. I > usually insert around million records > and the table already has 150 million > records. @liton I realize I'm responding to an older post but let's back the performance truck up for a minute. Are you saying that your 150 million row table does NOT have a clustered index??? The other thing I want to ask is about your original post. You're all worried about some large % of cost numbers in the execution plan but how long does your query, which is patently on the million row staging table, actually take??? On my humble laptop, your query only takes 4 seconds on a million rows.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.