question

Head_Contusion avatar image
Head_Contusion asked

TOP N - in query bogs down SQL Server - WHY?

Hi, I'm using SQL Server 2008. I have a query that looks like: SELECT TOP 8 Table1.Key FROM Table1 WHERE Table1.Key NOT IN (SELECT DISTINCT Table2.Key FROM Table2) When I run the query without the 'TOP N' it returns 86 rows from our Table1 in less than 1 second. When I add the 'TOP N' into the query, it runs for minutes - it's been gone for as long as 4 minutes - without returning a result. Strangely, when I do: SELECT TOP 8 * FROM Table1 (or) SELECT TOP 8 * FROM Table2 The top 8 rows are returned from either table in less than 1 second. Is there some sort of problem with SQL Server? Is there a problem with my T-SQL? Is there something else going on "beneath the surface"? Any answers/ideas would be greatly appreciated.
t-sqltop
12 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Oleg avatar image Oleg commented ·
@Grant Fritchey
assuming they're selective enough
This is the key to such decision, which also depends on how big is the table and how narrow is the key. Now I am working on modifying a table designed like this: it has a PK enforced by a unique nonclustered index, and it is clustered by a foreign key column, which is meant to be 16-byte GUID, but is stored as varchar(50), yes, var-bloody-char(50). There are anywhere between 1 and 500 records per same foreign key value, and the number of records is always in 6 digits. The table also has 6 other nonclustered indexes. Now, if I let this thing to continue, I will have to live with 36 (guid cast as varchar(50)) + 8 (uniquefier) = 44 bytes - 4 = 40 bytes of pure waste per each leaf page index record, totalling 7 * 300,000 * 40 ~ 100 MB of waste. Needless to say, this design is going straight out the window, and the FK becomes actual GUID (16 bytes) and it also becomes a column on the NON-clustered index. PK goes clustering. DMLs are mostly about single FK, so 500/300 K is ~ 0.1 percent to let seek + lookup take place. Again, all we can do is speculate as far as the actual question is concerned.
2 Likes 2 ·
Grant Fritchey avatar image Grant Fritchey ♦♦ commented ·
@Oleg don't throw the baby out with the bathwater. If the FK is the most frequently used access path, it might make a lot more sense as the cluster, GUID or not. I'd test it before I tossed it. But, I agree with you, VARCHAR(50) is really scary and needs to go.
1 Like 1 ·
Head_Contusion avatar image Head_Contusion commented ·
Oleg, Grant F, and all, Thanks for the responses. The underneath of SQL Server is so....interesting. When we sat down and took a closer look at the execution plan - we found a green line telling us about a missing index! In Table2 our clustered index was based on the table's primary key, not the foreign key that we joined tables on! HAHAHA! :) The reason I missed it is because it seems that 2008 doesn't script indexes into the 'script table as' option. So I didn't see any indexes. I had to go to the indexes folder for the table, and look at each script individually to see what the execution plan was referring to. Thank you all for the information, advice, and help. I appreciate all of your input. Thanks again.
0 Likes 0 ·
Mark avatar image Mark commented ·
HA! So Grant, since you suggest looking at the executoion plan in your very first sentance, you get the thumbs up from me. (+1)
0 Likes 0 ·
Oleg avatar image Oleg commented ·
@Head_Contusion But you cannot have your Table2 clustered on the foreign key, at least I hope you don't think that you can. Table1 has the clustered key named [Key], and Table2, which should be clustered by its own key, has the column named [Key] which is a foreign key rererencing column named [Key] of Table1. This column of the Table2 should be the single (or at least prefix) column of the NON-clustered index. Otherwise, you will cluster your table on non-unique column, which will tremendously bloat the size of any other non-clustered indexes of this table because the engine will add 6 to 8 byte uniquefier to EVERY value of the clustered column, which will travel to every index record of all other indexes' leaf pages.
0 Likes 0 ·
Show more comments
Grant Fritchey avatar image
Grant Fritchey answered
Right out of the gate, take a look at the execution plans for the various queries. Also, a TOP without an ORDER BY is going to return random rows. Also, a NOT IN with a DISTINCT could be replaced with a LEFT OUTER JOIN and a WHERE Table2.Key IS NULL to eliminate the bad records. The DISTINCT is an aggregate operation that's just going to slow down the process. But, to really know for sure what's happening, you have to check out the execution plans.
11 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

GPO avatar image GPO commented ·
"...a TOP without an ORDER BY is going to return random rows..." It's my understanding that they won't necessarily be random. They will be indeterminate which is different. You could argue that indeterminate is worse than random because at least with random results people would wake up more quickly that something was awry.
3 Likes 3 ·
TimothyAWiseman avatar image TimothyAWiseman commented ·
The big problem with calling it random is that there are times when you truly want random results (sampling for instance) and this will not give it to you. Here, you just want know what the order will be ahead of time.
2 Likes 2 ·
Grant Fritchey avatar image Grant Fritchey ♦♦ commented ·
You're right. Random was absolutely the wrong word to use. Basically, you can't know which rows will be returned or what order they will be in.
1 Like 1 ·
Matt Whitfield avatar image Matt Whitfield ♦♦ commented ·
I don't know - Random to me is a good choice - because random can be the same each time... :)
1 Like 1 ·
Grant Fritchey avatar image Grant Fritchey ♦♦ commented ·
@Matt Well, it's technically true that it can't be random. There will be an order applied, probably based on the current storage order or the order within the pages of an index, but you have no way of knowing what that order will be, and, that order can change in ways that are, effectively random. However, I still think @GPO is correct, the word random isn't the right word to describe the situation, even if it does describe the effects.
1 Like 1 ·
Show more comments
Oleg avatar image
Oleg answered
This behaviour is rather strange. The answer depends on how the data is indexed in both tables. You can try restating your query to any one of the following two (the result should be the same, but the time might differ, so you can pick whichever seems to be the best): select top 8 t1.[Key] from Table1 t1 where not exists ( select [Key] from Table2 where [Key] = t1.[Key] ); select top 8 t1.[Key] from Table1 t1 d left join Table2 t2 on t1.[Key] = t2.[Key] where t2.[Key] is null; Oleg
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Blackhawk-17 avatar image
Blackhawk-17 answered
How long does this query take to run on its own? SELECT DISTINCT Table2.Key FROM Table2 Could you also post an example of a Key and let us know the data types of both columns?
4 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Matt Whitfield avatar image Matt Whitfield ♦♦ commented ·
How up to date are your statistics?
1 Like 1 ·
Head_Contusion avatar image Head_Contusion commented ·
Blackhawk-17, For the query you posted, when we run it through our tables, it returns more than 60,000 rows in less than 1 second. For a little more information, the primary key in Table1, and the foreign key in Table2 are both integers. So the foreign key in Table2 may look like: 68291. Which is what a corresponding primary key in Table1 may look like (from the query I originally posted above - so Table2.Key above is a foreign key. Sorry if I was unclear above). Matt W, I actually did do an outer join (initially - and was chastised for it, which is why I moved to using the NOT IN query that I originally posted). Ironically, the LEFT OUTER JOIN with a TOP 8 ran very slow as well. However when we made (much later after I posted the question) the ORDER BY a descending one, what was taking around 11 minutes resolved in 7 seconds. Stranger still, when we ran the essentially same query on a different pair of tables (that are structured similarly), the TOP N queries all returned in less than 1 second despite the tables having more rows. The only difference I've noticed so far was in the table script for the similar table on its primary key in a constraint has a NOT FOR REPLICATION. It's the only difference I've noticed (unless I'm digging too deep)? Everyone else, thanks for responding! I'll look into this some more.
0 Likes 0 ·
Grant Fritchey avatar image Grant Fritchey ♦♦ commented ·
I can probably explain the ORDER BY making it run faster. Without an ORDER BY statement, it's just using whatever it needs to access the data. An ORDER BY statement could change the indexes used, and, in your case, the index made a difference. Of course, that's just speculation without seeing an execution plan.
0 Likes 0 ·
Blackhawk-17 avatar image Blackhawk-17 commented ·
I would say then we need to follow Grant's macro suggestion and examine your query plans.
0 Likes 0 ·
Matt Whitfield avatar image
Matt Whitfield answered
You could try the TOP in an outer query: SELECT TOP 8 Key FROM ( SELECT Table1.Key FROM Table1 WHERE Table1.Key NOT IN (SELECT DISTINCT Table2.Key FROM Table2)) iDat ORDER BY Key
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.