SQL Server - operator used tempdb to spill data



  • This is our "Data Warehouse" (in quotes because I'm sure it doesn't fit the actual definition of a data warehouse; regardless, it's where we combine data from multiple other sources and try to make pretty Power BI dashboards to keep the executives happy).

    I have a fairly complex view joining 5 data tables and a couple of lookups. The initial query, without considering optimization at all (and joining a few extra tables that I ultimately didn't need), took ~45 seconds.

    Through indexing, breaking up the query, using CTEs, etc., I got it down to 11 seconds.

    But looking at the execution plan, I have the warning, "operator used tempdb to spill data."

    I did all the stuff that a Google search provides (updating statistics for all tables involved with FULLSCAN).

    This made no difference (and gave the message that the statistics were updated).
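    For reference, the update I ran per table looked something like this (the table name here is a placeholder):

```sql
-- Rebuild statistics for one table, sampling every row instead of a subset.
-- Repeated for each table the view touches; dbo.FactTable is a made-up name.
UPDATE STATISTICS dbo.FactTable WITH FULLSCAN;
```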

    Next I asked to get more RAM. Got an increase from 16 GB to 32 GB.

    This reduced query time to 6 seconds (which for now is workable but is only going to increase).

    I look at the execution plan and it is different from the previous plan but still has the "operator used tempdb to spill data."

    It is spill type 1.

    SQL Server version 2016.

    Rows returned is less than 500K. (That will be doubled when I add another source of data.)



  • @Karla Sometimes I find that materializing CTEs into temp tables will boost performance, especially if they're complex or recursive. Is this something you've tried?

    What's the bottleneck in the query plan?

    With SET STATISTICS IO ON
    and SET STATISTICS TIME ON, what are the bottlenecks?
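    Roughly what I mean by materializing, with made-up names -- pull the CTE body into a temp table first, then join against that:

```sql
-- Instead of: WITH Complex AS (SELECT ...) SELECT ... FROM Complex JOIN ...
-- materialize the intermediate result once (all names here are hypothetical)
SELECT f.FoodId, f.FoodName
INTO #ComplexPart
FROM dbo.Food AS f
WHERE f.isActive = 1;

-- Temp tables get their own statistics, which can fix bad downstream estimates
CREATE CLUSTERED INDEX IX_ComplexPart ON #ComplexPart (FoodId);

SELECT cp.FoodName, m.SomeColumn
FROM #ComplexPart AS cp
JOIN dbo.MainTable AS m ON m.FoodId = cp.FoodId;
```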


  • I survived the hour long Uno hand

    @Karla
    If you’re looking at (or can get) an actual execution plan, are the actual row counts close to the estimated row counts, or off by a factor of 10 or more?



  • @Groaner said in SQL Server - operator used tempdb to spill data:

    @Karla Sometimes I find that materializing CTEs into temp tables will boost performance, especially if they're complex or recursive. Is this something you've tried?

    What's the bottleneck in the query plan?

    With SET STATISTICS IO ON
    and SET STATISTICS TIME ON, what are the bottlenecks?

    I get something like this (I've never looked at these before, so I'm not completely sure how to use the information):

    SQL Server parse and compile time:
    CPU time = 250 ms, elapsed time = 250 ms.

    (348727 row(s) affected)
    Table 'Worktable'. Scan count 1991, logical reads 734237, physical reads 0, read-ahead reads 5689, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't1'. Scan count 1, logical reads 1344, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't2'. Scan count 1, logical reads 2404, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't3'. Scan count 1, logical reads 493, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't4'. Scan count 1, logical reads 120, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't5'. Scan count 1, logical reads 1376, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't6'. Scan count 1, logical reads 609, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't7'. Scan count 1, logical reads 78, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't8'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 't9'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:
    CPU time = 4390 ms, elapsed time = 6992 ms.

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 0 ms.



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Karla
    If you’re looking at (or can get) an actual execution plan, are the actual row counts close to the estimated row counts, or off by a factor of 10 or more?

    By a factor of 10


  • Java Dev

    @blakeyrat wanted an answer forwarded:

    Skärmavbild 2019-01-16 kl. 00.07.47.png



  • @Karla said in SQL Server - operator used tempdb to spill data:

    @Groaner said in SQL Server - operator used tempdb to spill data:

    @Karla Sometimes I find that materializing CTEs into temp tables will boost performance, especially if they're complex or recursive. Is this something you've tried?

    What's the bottleneck in the query plan?

    With SET STATISTICS IO ON
    and SET STATISTICS TIME ON, what are the bottlenecks?

    I get something like this (I've never looked at these before, so I'm not completely sure how to use the information):

    Basically, you're looking for stuff that has the largest amount of IO and CPU time. Which, in this case, appears to be 'Worktable,' which I assume is a CTE or other intermediary. No surprises there. Everything else is tiny in comparison.


  • I survived the hour long Uno hand

    @Groaner
    Or the worktable could be the spill itself.

    @Karla

    • What operator is spilling to tempDB? A sort? A read?
    • Do any of the joins or WHERE clauses have something along the lines of WHERE column = @parameter OR @parameter IS NULL? Are any of the joins or WHERE clauses using a function to change the value of a column before comparing it to something else?

    Edit: If the plan itself doesn't have any sensitive company information in it (ha!), you could post it at https://www.brentozar.com/pastetheplan/ and then I can look through it in more detail. IMPORTANTLY: Any plan you paste there is 100% public for their retention interval, so if there's PII in the query itself at the top of the plan or if the table names are things you can't share publicly, then don't post it.
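    To spell out the pattern in that second bullet (made-up names), plus the usual band-aid if you do have it -- OPTION (RECOMPILE), so the optimizer sees the actual parameter values at compile time:

```sql
-- The "catch-all" filter shape that wrecks cardinality estimates:
SELECT p.PersonId
FROM dbo.Person AS p
WHERE (p.LastName = @LastName OR @LastName IS NULL)
OPTION (RECOMPILE); -- compiles a fresh plan per execution, at some CPU cost
```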



  • @Groaner said in SQL Server - operator used tempdb to spill data:

    @Karla said in SQL Server - operator used tempdb to spill data:

    @Groaner said in SQL Server - operator used tempdb to spill data:

    @Karla Sometimes I find that materializing CTEs into temp tables will boost performance, especially if they're complex or recursive. Is this something you've tried?

    What's the bottleneck in the query plan?

    With SET STATISTICS IO ON
    and SET STATISTICS TIME ON, what are the bottlenecks?

    I get something like this (I've never looked at these before, so I'm not completely sure how to use the information):

    Basically, you're looking for stuff that has the largest amount of IO and CPU time. Which, in this case, appears to be 'Worktable,' which I assume is a CTE or other intermediary. No surprises there. Everything else is tiny in comparison.

    I wasn't sure what worktable was because I didn't name anything worktable.

    I'll try again tomorrow with parts of it.



  • @Atazhaia said in SQL Server - operator used tempdb to spill data:

    @blakeyrat wanted an answer forwarded:

    Skärmavbild 2019-01-16 kl. 00.07.47.png

    Thank you.

    Though I did that.

    I did all the stuff that a Google search provides (updating statistics for all tables involved with FULLSCAN).

    This made no difference (and gave the message that the statistics were updated).

    The DBA also repeated this today, after the RAM was added, for all tables in the database.


  • I survived the hour long Uno hand

    @Karla
    Worktable is SQL Server's internal name for tempdb, more or less. There are some specific self-inflicted things that can use tempdb off the top (temporary tables, table variables; generally not CTEs, as they almost always get inlined into your main query) and generate a Worktable on their own as well.



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Groaner
    Or the worktable could be the spill itself.

    @Karla

    • What operator is spilling to tempDB? A sort? A read?

    sort -- though I am not sorting

    • Do any of the joins or WHERE clauses have something along the lines of WHERE column = @parameter OR @parameter IS NULL?

    The only WHERE clauses are isActive = 1 and isDeleted = 0.

    Are any of the joins or WHERE clauses using a function to change the value of a column before comparing it to something else?

    The only function is in a select.

    Edit: If the plan itself doesn't have any sensitive company information in it (ha!), you could post it at https://www.brentozar.com/pastetheplan/ and then I can look through it in more detail. IMPORTANTLY: Any plan you paste there is 100% public for their retention interval, so if there's PII in the query itself at the top of the plan or if the table names are things you can't share publicly, then don't post it.

    Let me check tomorrow.



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Karla
    Worktable is SQL Server's internal name for tempdb, more or less. There are some specific self-inflicted things that can use tempdb off the top (temporary tables, table variables; generally not CTEs, as they almost always get inlined into your main query) and generate a Worktable on their own as well.

    Got it.


  • I survived the hour long Uno hand

    @Karla
    If you're picking up a sort operator without an ORDER BY clause, then that probably means you're using SELECT DISTINCT, which can contribute to the estimation problem that's causing Estimated vs Actual rows to be off by so much.

    If DISTINCT is in the query, that's a code smell to me - it usually gets slapped on to cover for the fact that a join is causing unwanted row fanout, rather than fixing the join so that it doesn't fan out. So, if there is a DISTINCT, it's worth checking to see if it's actually needed (especially after you've changed all the joins around).
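    With made-up names, the two shapes look like this -- the second never produces the duplicates in the first place:

```sql
-- Fanout plus band-aid: each Person row duplicates per matching Order,
-- then DISTINCT de-dupes them again (and that's the sort that can spill)
SELECT DISTINCT p.PersonId, p.Name
FROM dbo.Person AS p
JOIN dbo.Orders AS o ON o.PersonId = p.PersonId;

-- No fanout: just test for existence, one output row per Person
SELECT p.PersonId, p.Name
FROM dbo.Person AS p
WHERE EXISTS (SELECT 1 FROM dbo.Orders AS o WHERE o.PersonId = p.PersonId);
```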



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Karla
    If you're picking up a sort operator without an ORDER BY clause, then that probably means you're using SELECT DISTINCT, which can contribute to the estimation problem that's causing Estimated vs Actual rows to be off by so much.

    If DISTINCT is in the query, that's a code smell to me - it usually gets slapped on to cover for the fact that a join is causing unwanted row fanout, rather than fixing the join so that it doesn't fan out. So, if there is a DISTINCT, it's worth checking to see if it's actually needed (especially after you've changed all the joins around).

    Yeah, there are distincts. And I've been trying to figure out how to remove them.

    I think it is partially the nature of the data. I'll try and give a contrived example.

    I want distinct personIds of 2 types that are loosely hierarchical and I don't care about the IDs.

    But we have a Produce table that has something like:
    ProduceId, ProduceName, OtherProperty
    1, Fruit, Type1
    2, Fruit, Type2
    3, Fruit, Type3

    Food table:
    FoodId, FoodName, OtherProperty
    1, Apple, Type1
    2, Apple, Type2
    3, Orange, Type3
    4, Orange, Type4

    Then we have a mapping table Produce-Foods
    ProduceId,FoodId, AnotherId
    1,1,1
    1,1,2
    1,2,1
    1,2,2

    Both of these tables have legacy mappings, so I don't think I will be able to change them. (And while apples are obviously fruit, the actual data is not so obvious and the tables are a bit more complicated -- I'll check tomorrow to see if I left out anything important.)

    There's lots more like this but this is a simplified version of the simplest part.

    Maybe if I understand how to eliminate distinct from this, I can figure out the other parts.


  • I survived the hour long Uno hand

    @Karla
    I'm not sure that example is concrete enough to get my head around it, maybe if you can post the anonymized join clauses it would help me visualize it. But if it's something like "For all records in Produce-Foods with AnotherId IN (1,2), find ProduceNames or FoodNames that match, removing duplicates", then you may be stuck either moving the de-dupe out into the application layer or finding a way to move the DISTINCT to a different point in the query (usually later, if you're not actually RAM bound and it's just a bad estimate) so that SQL can get the right amount of RAM to handle the Distinct Sort.

    The other thing you miiiight be able to do...

    SELECT DISTINCT
        f.FoodName
    FROM Food AS f
    WHERE FoodId IN ( SELECT FoodId FROM [Produce-Foods] pf WHERE pf.AnotherId IN (1,2) )
    UNION ALL
    SELECT DISTINCT
        p.ProduceName
    FROM Produce AS p
    WHERE ProduceId IN ( SELECT ProduceId FROM [Produce-Foods] pf WHERE pf.AnotherId IN (1,2) )

    Not sure how ugly that query plan is really going to wind up though, especially if that segment of the query is actually wider (needs more columns than just the name).



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Karla
    I'm not sure that example is concrete enough to get my head around it, maybe if you can post the anonymized join clauses it would help me visualize it. But if it's something like "For all records in Produce-Foods with AnotherId IN (1,2), find ProduceNames or FoodNames that match, removing duplicates", then you may be stuck either moving the de-dupe out into the application layer or finding a way to move the DISTINCT to a different point in the query (usually later, if you're not actually RAM bound and it's just a bad estimate) so that SQL can get the right amount of RAM to handle the Distinct Sort.

    The other thing you miiiight be able to do...

    SELECT DISTINCT
        f.FoodName
    FROM Food AS f
    WHERE FoodId IN ( SELECT FoodId FROM [Produce-Foods] pf WHERE pf.AnotherId IN (1,2) )
    UNION ALL
    SELECT DISTINCT
        p.ProduceName
    FROM Produce AS p
    WHERE ProduceId IN ( SELECT ProduceId FROM [Produce-Foods] pf WHERE pf.AnotherId IN (1,2) )

    Not sure how ugly that query plan is really going to wind up though, especially if that segment of the query is actually wider (needs more columns than just the name).

    I'll try and get an example tomorrow.

    But I don't want the union, I need the join.

    Counts of people with each valid Food-Produce combo. And I am only filtering Food-Produce combos by what matches in the people table.

    So there are multiple combinations of Food-Produce with different IDs and different other properties, but I only care about the counts of the distinct Food-Produce name combos.

    And now that I'm tired, I'm not sure this is the problem.

    I know I get a different number of rows when I do distinct. I can explain why a bit more tomorrow.
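    If what I want really is counts per distinct name pair, maybe a GROUP BY over the names expresses it directly (contrived names again, so this may not match the real tables):

```sql
-- Count people per distinct (ProduceName, FoodName) pair, collapsing the
-- duplicate IDs before counting
SELECT pr.ProduceName, f.FoodName, COUNT(DISTINCT pp.PersonId) AS People
FROM dbo.PersonProduceFood AS pp
JOIN dbo.Produce AS pr ON pr.ProduceId = pp.ProduceId
JOIN dbo.Food    AS f  ON f.FoodId     = pp.FoodId
GROUP BY pr.ProduceName, f.FoodName;
```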


  • I survived the hour long Uno hand

    @Karla
    Ah. So, yeah, that sounds to me like the "underlying" issue is that the Food and Produce tables aren't fully normalized, so you have the Key-Value stuff going on and then the join column into the Produce-Foods table isn't deterministic (e.g. Apple might be 1 in some joins and 2 in others, when it's really just one food that has two "other properties" that could be normalized into a FoodProperties table).

    I assume the contrived example was contrived, and the actual Produce and Foods table don't have contiguous ranges for each natural key (Produce or Food, respectively)? e.g. Apple might have FoodId 1, 2, 79, and 1727?



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Karla
    Ah. So, yeah, that sounds to me like the "underlying" issue is that the Food and Produce tables aren't fully normalized, so you have the Key-Value stuff going on and then the join column into the Produce-Foods table isn't deterministic (e.g. Apple might be 1 in some joins and 2 in others, when it's really just one food that has two "other properties" that could be normalized into a FoodProperties table).

    I assume the contrived example was contrived, and the actual Produce and Foods table don't have contiguous ranges for each natural key (Produce or Food, respectively)? e.g. Apple might have FoodId 1, 2, 79, and 1727?

    Sadly, Apple might have multiple IDs... even worse, Apple could be in more than one group (even with the same ID). Plus there is another key connecting the tables that I don't care about.

    I've managed to isolate these lookups into a view and it takes less than 1 second. Also, I realized I was using a table I didn't need. This gets back to how :wtf:y the database design is (I pulled the joins from another view to figure out the mappings of Food/Produce).

    I'm pretty sure I'm not going to have much influence in changing the design.


    Some time later

    Now, trying to isolate the problem in the bigger view (and ignoring the above Food/Produce lookups), I'm getting rid of all the CTEs to simplify.

    Joining the 2 main data tables without distinct is fine; with distinct it has spill levels 1 and 4.


    Some time later

    Realized the distinct I was using on the data tables was because of some bad data. That data is older than we need to report on, so I can exclude it (now I'm under 200K rows, though this will double with the other source).

    The view above has a distinct but that is isolated to the lookup mapping tables.

    Now the join with all the data I want takes 3 seconds. This is good.

    It is still spilling on the sort (though it is closer: the actual is only 50% more than the estimated). It only spills when joined to the view of the lookups (which is the only place I have a distinct now).

    So the :wtf:ery of the lookup data is the cause now.


  • I survived the hour long Uno hand

    @Karla
    Sadly, the normal tool for “I have this view that works pretty well, but using it doesn’t solve the spill” (persisting the view to disk with an Indexed View) won’t work in this case, because you can’t index a view with a DISTINCT.

    How often does the data in the view change, compared to when it’s needed? Could you persist the view to a new table with an ETL process in a periodic batch job, and then reference the table rather than the view?
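    Something like this, say as a nightly SQL Agent step (table and view names are hypothetical):

```sql
-- Periodic refresh: rebuild the persisted copy, then have the big query
-- reference the table (which can be indexed) instead of the DISTINCT view
TRUNCATE TABLE dbo.FoodProduceLookup;

INSERT INTO dbo.FoodProduceLookup (ProduceName, FoodName)
SELECT ProduceName, FoodName
FROM dbo.vw_FoodProduceLookup; -- the existing view with the DISTINCT
```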



  • @izzion said in SQL Server - operator used tempdb to spill data:

    @Karla
    Sadly, the normal tool for “I have this view that works pretty well, but using it doesn’t solve the spill” (persisting the view to disk with an Indexed View) won’t work in this case, because you can’t index a view with a DISTINCT.

    How often does the data in the view change, compared to when it’s needed? Could you persist the view to a new table with an ETL process in a periodic batch job, and then reference the table rather than the view?

    Yeah, I can probably do that. I'll wait to see what happens when we get the additional data.

    While I will be somewhat obsessed until I eliminate the spill...the fact that I got the query down to 2 seconds is probably good enough for everyone above me and will make working with it in Power BI far less painful.

    I'm sure someone won't be happy I spent as much time as I did to get it from 7 down to 2. I'll count some of that time as validating the numbers (since I did find some issues with the query that may have resulted in incorrect numbers).

