.net: Most performant way to pull arbitrarily large datasets from SQL


  • Garbage Person

    Parameter in: A chunk of SQL code. "Select foo from bar"
    Executing this SQL against a given MSSQL server will return a dataset of arbitrary size - up to the 4096 column limit and an arbitrarily vast number of rows. Total size can exceed available RAM and swap but is guaranteed not to exceed available scratch disk.

    We need to stage this in chunks through a datatable or similar in memory, do some shit to it (make note of certain values, remove some, etc.) and pass the chunk on to an output formatter that will eventually put it to disk in some file format or another.

    Thoughts?



  • Use C# or PowerShell. The SqlDataReader class only holds a window of data (usually one row) in memory while you process it.
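
    A minimal sketch of that pattern, assuming the System.Data.SqlClient provider; connectionString, foo and bar are placeholders:

        using System.Data.SqlClient;

        string connectionString = "..."; // placeholder

        // Stream the result set; only the current row is ever held in memory.
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT foo FROM bar", conn))
        {
            conn.Open();
            cmd.CommandTimeout = 0; // no command timeout for a long-running export
            using (var reader = cmd.ExecuteReader())
            {
                int fooOrdinal = reader.GetOrdinal("foo");
                while (reader.Read())
                {
                    object foo = reader.GetValue(fooOrdinal);
                    // Do the per-row work here, then hand the row off to the output formatter.
                }
            }
        }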



  • Agreed. With data of this size, reading it row by row with SqlDataReader is really the only option.



  • It kind of depends on what you mean. I agree with the others that it sounds like what you want to be using is probably SqlDataReader.

    SqlDataReader is easily the most efficient, and will probably perform the best. The caveat is that the data connection stays open the entire time the reader is running, which can result in blocking other queries if you're reading from a table that's getting inserts, updates or deletes. You could use WITH (NOLOCK) on your SELECT queries, but you need to be certain that's really what you want, because it's usually considered bad practice. You could also do a hybrid of sorts by staging data to temp tables or permanent staging tables to relieve locking pressure from the SqlDataReader.

    SqlDataAdapter will minimize the amount of time you're connected to the database and minimize possible locking issues, but you're limited to pulling the entire result set into memory to do that. If your query can easily be broken up into repeatable windows that you're confident won't cause issues, this might actually work better.
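
    A rough sketch of what that windowing might look like, assuming SQL Server 2012+ for OFFSET-FETCH and a stable ORDER BY key; connectionString, id, foo and bar are placeholders:

        using System.Data;
        using System.Data.SqlClient;

        string connectionString = "..."; // placeholder
        const int chunkSize = 50000;     // tune to what fits comfortably in memory

        // Pull the result set in fixed-size windows so only one chunk is resident at a time.
        // Note: assumes the underlying data doesn't shift between windows.
        for (int offset = 0; ; offset += chunkSize)
        {
            var chunk = new DataTable();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT foo FROM bar ORDER BY id " +
                "OFFSET @offset ROWS FETCH NEXT @chunk ROWS ONLY", conn))
            using (var adapter = new SqlDataAdapter(cmd))
            {
                cmd.Parameters.AddWithValue("@offset", offset);
                cmd.Parameters.AddWithValue("@chunk", chunkSize);
                adapter.Fill(chunk); // opens and closes the connection itself
            }

            if (chunk.Rows.Count == 0) break;
            // Process the chunk and hand it to the output formatter here.
        }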

    However, this:

    We need to stage this in chunks through a datatable or similar in memory, do some shit to it (make note of certain values, remove some, etc.) and pass the chunk on to an output formatter that will eventually put it to disk in some file format or another.

    Says SSIS to me. In my experience this will work better than anything else.


  • Garbage Person

    We arrived at a series of SqlDataAdapter DataTable fills, but hadn't thought of trying a bare SqlDataReader. Sometimes the answer is staring you right in the face. We'll benchmark it.

    Our current solution is actually SSIS. We have an elaborate module that dynamically builds and executes packages. The SQL end is fine, but SSIS has a lot of ugly kinks and limitations in the file handling end. Our options basically boiled down to "write better file I/O components for SSIS" or "reimplement in the raw". Given SSIS' obscenely complex and overwrought API and onerous licensing costs, just doing it from scratch is way cheaper.



  • What exactly are you trying to do with data files that SSIS can't handle? I don't care, I'm just curious.


  • Garbage Person

    Various formatting idiocy and edge cases, mostly.

    We ingest and emit more or less arbitrary files to/from our clients and need to be able to adapt to whatever idiot ❄ wrinkle the client has decided they want.

    CSV is mercifully easy formatting-wise, but SSIS doesn't handle obscenely wide columns well at all.

    SSIS Excel support is godforsaken terrible and breaks at the slightest provocation. And is 32bit only.

    There are other issues, but those are the common ones.



  • @Weng said:

    And is 32bit only.

    There are ways to make it 64-bit with some voodoo around the MS Access database engine redistributable, but they are a yuuuuuge pita. (After you install both drivers, you need to use OLE DB to make the connection with a conn string instead of an Excel connector.) Then you run in 32-bit mode in Visual Studio and 64-bit mode on the SSIS server. But I wouldn't recommend it.



  • @Weng said:

    We need to stage this in chunks through a datatable or similar in memory, do some shit to it (make note of certain values, remove some, etc.) and pass the chunk on to an output formatter that will eventually put it to disk in some file format or another.

    Can you do a bit of analysis on the data set first, and then partition your query based on a WHERE clause that gives you back a set size of data? SSIS does this automatically with the buffer, so it actually isn't a terrible solution. I wouldn't use a cursor, even a FAST_FORWARD one. You could try OFFSET-FETCH...

    This is a place where SQL Server could take a page from Oracle's book and provide BULK COLLECT cursors.

    One other thing: you could try doing most of the processing in a table-valued CLR function. You can yield rows as a stream so you don't have to load the whole thing into RAM, and you can collect things as the data is processed.

    See also MSDN
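
    A minimal sketch of that streaming TVF pattern; the class name, row schema and per-row work here are placeholders:

        using System.Collections;
        using System.Data.SqlTypes;
        using Microsoft.SqlServer.Server;

        public class ExportFunctions
        {
            // Rows are yielded one at a time, so the full result set never sits in memory.
            [SqlFunction(FillRowMethodName = "FillRow",
                         TableDefinition = "value NVARCHAR(100)")]
            public static IEnumerable ProcessRows()
            {
                for (int i = 0; i < 3; i++)      // stand-in for the real per-row work
                    yield return "row " + i;
            }

            public static void FillRow(object row, out SqlString value)
            {
                value = (string)row;
            }
        }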



  • Do you have to do the processing in .net? Why not make SQL do the work? It's good at it.



  • @Weng said:

    Various formatting idiocy and edge cases, mostly.

    Why don't you do that separately from the data processing bit?

    So it sounds like you're doing two things here, and correct me if I'm being an idiot:

    a) Doing some huge data processing/aggregation operation that loads in tons and tons of data from the DB
    b) Formatting it into a report

    SSIS is perfect for stuff like a) but not very good at doing stuff like b). A tool like Tableau is great at doing b) but not as good at doing a) (although probably better than you'd imagine).

    Well the obvious solution to me is to do two different steps. Have SQL or SSIS create a reduced dataset containing only the data that goes into the report; write that data into another table. Then create your .net office automation, or Tableau report, or whatever, based on the reporting table.



  • Interesting. I've only ever needed SSIS for serialization, not reporting, so an export to Excel is something I've not only never needed, it's something I've never even considered.

    I have had problems where .csv files for importing have had embedded carriage returns and SSIS pukes on those, so I will agree that SSIS's text file handling is fairly mediocre. It also really flips out when the number of columns changes. I have a package that imports the output of some cloud data reports for Exchange, and Microsoft likes to add and remove fields from it all the time and the guy generating the data report insists on using -Property * in his PowerShell. SSIS pukes when that happens, even though it's well aware of what the headers are.


  • Garbage Person

    Nah, it's straight "reporting". The only processing we're doing in this direction is just taking note of some field values as they go by so we can say "that record was reported in such and such file, dipshit" and suppressing the column with those identifiers in it because ❄

    We do that in the framework code instead of the SQL for a bunch of reasons, prime among them being that asking the guys developing on the framework to make sure to write code to support non-core functionality is a surefire way to make sure that functionality doesn't function.

    Now, "reporting" is in quotes because 99.97 percent of the time, it's actually outputting a data interchange file for the client's systems. And those systems are always preexisting and frozen, so bugs in their file handling are permanent.

    Yes, Excel is routinely used as an interchange format. Yes, I rant about this every time. We have yet to onboard a single client who didn't find an obscure edge case in SSIS' (and by extension, JET's) Excel handling.

    I don't actually have an immediate response to your suggestion beyond "complicated, more dependencies, probably put us in the same boat with different details"



  • @Weng said:

    Nah, it's straight "reporting". The only processing we're doing in this direction is just taking note of some field values as they go by so we can say "that record was reported in such and such file, dipshit" and suppressing the column with those identifiers in it because

    ... and instead of just using the WHERE clause to do that, you want to run literally every row of the table through a .net program?

    @Weng said:

    We do that in the framework code instead of the SQL for a bunch of reasons, prime among them being that asking the guys developing on the framework to make sure to write code to support non-core functionality is a surefire way to make sure that functionality doesn't function.

    I'm having trouble parsing this sentence, it sounds like you're saying that if you ask the framework guys to do the work it'll guarantee it won't work right, and that's why you ask them to do it? (I'm not sure what part of what we're talking about is "non-core functionality", so maybe I'm misinterpreting?)

    @Weng said:

    Now, "reporting" is in quotes because 99.97 percent of the time, it's actually outputting a data interchange file for the client's systems. And those systems are always preexisting and frozen, so bugs in their file handling are permanent.

    That's still reporting. Although not the kind Tableau is good at, so.

    @Weng said:

    Yes, Excel is routinely used as an interchange format.

    Because you can or because you must?

    Or to rephrase it: you are using Excel as an interchange format. Is that because there's no other choice? Or just because "we've always done it that way"?

    @Weng said:

    I don't actually have an immediate response to your suggestion beyond "complicated, more dependencies, probably put us in the same boat with different details"

    I know people think I'm a Nazi for saying you should always do things the right way, so you'll probably dismiss my opinion out-of-hand.

    But replacing a few lines of WHERE clause with hundreds of lines of .net code written by developers who are incompetent doesn't seem to me like a good trade-off. Am I crazy here?


  • Garbage Person

    @blakeyrat said:

    and instead of just using the WHERE clause to do that, you want to run literally every row of the table through a .net program?

    Oh. No, the data is already filtered in the SQL. Every row SQL emits ends up in the output. Basically, each row also has a serial number column which we suppress from the output and stuff back into the database to record the association with the report.
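
    A hypothetical sketch of that step, reusing the SqlDataReader approach from earlier in the thread; "SerialNumber" and the writeRow callback are stand-ins for the real column and output formatter:

        using System;
        using System.Collections.Generic;
        using System.Data.SqlClient;

        // Note each row's serial number, suppress that column, and pass the rest on.
        static List<long> StreamAndSuppressSerial(SqlDataReader reader, Action<object[]> writeRow)
        {
            var serials = new List<long>();
            int serialOrdinal = reader.GetOrdinal("SerialNumber"); // hypothetical column name

            while (reader.Read())
            {
                serials.Add(reader.GetInt64(serialOrdinal));

                var output = new object[reader.FieldCount - 1];
                for (int i = 0, j = 0; i < reader.FieldCount; i++)
                    if (i != serialOrdinal)
                        output[j++] = reader.GetValue(i);

                writeRow(output);
            }

            return serials; // written back to the database afterwards to record the association
        }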

    @blakeyrat said:

    I'm having trouble parsing this sentence, it sounds like you're saying that if you ask the framework guys to do the work it'll guarantee it won't work right, and that's why you ask them to do it? (I'm not sure what part of what we're talking about is "non-core functionality", so maybe I'm misinterpreting?)

    Two teams: The framework team, which builds and owns the data export component as a whole. The app team, which builds client apps and the SQL that runs on the framework. Inserting the aforementioned database link between the report and the individual records supports a framework feature, and an obscure one at that. So if asked to do that in their SQL, the app team would 'forget' every time.

    @blakeyrat said:

    Because you can or because you must?

    Or to rephrase it: you are using Excel as an interchange format. Is that because there's no other choice? Or just because "we've always done it that way"?

    No other choice in that our clients dictate the interchange specs long before anyone is involved. This is mediated by salesmen who do not sell custom software but instead our physical end products. They aren't interested in the software end of things and know Excel, so it comes up every time the client hasn't involved their own software people.

    I scream and rant literally every time, but WtfCorp politics mean we can't try to fix the problems.



  • @Weng said:

    Oh. No, the data is already filtered in the SQL. Every row SQL emits ends up in the output.

    Then how does this:

    Executing this SQL against a given MSSQL server will return a dataset of arbitrary size - up to the 4096 column limit and an arbitrarily vast number of rows. Total size can exceed available RAM and swap but is guaranteed not to exceed available scratch disk.

    End up in Excel format? Like... how is this working now?

    I'm not trying to grill you to death, I'm honestly trying to understand the scenario we're trying to solve here.


  • Garbage Person

    We dynamically build an SSIS package that runs the SQL, has a custom component in the middle to do our internal stuff and outputs to an Excel (or whatever) destination, and then execute that package.

    This also brings up another problem with using SSIS like this: the libraries leak memory somewhere. After a few tens of thousands of packages are built and executed, it hits the 32-bit process memory limit.



  • I think you missed my point. Excel format can't store an "arbitrary vast number of rows", so I'm asking how it could possibly work as-is.

    It's like you're saying you take this 50,000 gallon tank of oil, and need to empty it in this 55 gallon drum-- what's the best way of doing that?


  • Garbage Person

    Oh. That. Presumably it blows up in a fun and entertaining manner if you exceed its limits.

    We're trying to decouple the file end of things from the SQL end of things so we can relatively painlessly add more formats in the future. Each format may obviously have limitations lower than the generalized SQL limitations.



  • @Weng said:

    We dynamically build an SSIS package that runs the SQL, has a custom component in the middle to do our internal stuff and outputs to an Excel (or whatever) destination, and then execute that package.

    If you're already writing a bunch of .net code for your custom generation of an SSIS package, why not a CLR table-valued function that does the work for you and writes your file out? Your query just joins to your function and you're done (you can pass in the file name, etc. to the function).

    Note: if you're modifying database state this might not work, as I'm pretty sure CLR TVFs can't do that.


  • Garbage Person

    The SQL user cannot access the necessary storage, and DBA politics prevent me from bringing it into automated deployment or being responsive to issues in less than 30 days (which is their minimum deployment turnaround for SQL CLR).

    Note the current SSIS solution runs on separate application servers for exactly those reasons.



    FYI, SqlDataAdapter uses a SqlDataReader internally to fill the DataSet. So apart from the problem of needing to keep the connection open, and the fact that you may need to run SqlCommand yourself to update the data, it should be alright. (Beware of deadlocks if other things will be updating the table, though.)

    For the problem with data consistency that BaconBits mentioned, if your SQL Server can tolerate it, consider running the SqlDataReader on a snapshot, then generating a batch of UPDATE statements and running them against the actual database.
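
    A sketch of that approach, assuming ALLOW_SNAPSHOT_ISOLATION has already been enabled on the database; connectionString, foo and bar are placeholders:

        using System.Data;
        using System.Data.SqlClient;

        string connectionString = "..."; // placeholder

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction(IsolationLevel.Snapshot))
            {
                using (var cmd = new SqlCommand("SELECT foo FROM bar", conn, tx))
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Process the row; queue up the keys that need UPDATE statements later.
                    }
                }
                tx.Commit(); // read-only transaction; committing just releases the snapshot
            }
        }
        // Run the queued batch of UPDATEs on a separate connection/transaction afterwards.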


  • Garbage Person

    Oh, yeah. We've long since figured out the contention issue. Liberal nolocking, mostly - our use case is fine with that, since Bad Things have already happened if anything is concurrently processing the same group of records, and the only UI doesn't care if the data is inconsistent at that level.



  • @cheong said:

    For the problem on data consistency BaconBits mentioned

    SqlDataReader shouldn't cause consistency problems by itself. Using WITH (NOLOCK) will cause consistency problems because it allows reading dirty data. It's availability problems that normal READ COMMITTED behavior will cause, because it won't let you read dirty data.

    For example, say your whole operation takes 1 ms to process 1,000 records and you're just SELECTing things, and you've got 500 million records. That's about 8 and a half minutes. So, you start your DataReader running, and the query executes and gets a shared lock on every relevant row, page, or table (depending on what the query engine decides is best).

    Thirty seconds later, someone tries to update that table. Except... an UPDATE requires an exclusive lock. The transaction can't get an exclusive lock, however, because there's already a shared lock from another transaction. So the giant SELECT transaction is now blocking the UPDATE transaction, and the UPDATE query waits. Most queries time out after 5 minutes at the longest, so the UPDATE would fail... at the end of the timeout, possibly causing the whole transaction to roll back depending on what your application decides. You can set the timeout lower to make it fail faster, but maybe you don't want to do that. If the UPDATE started 8m into the 8m 30s operation, it'd instead get to complete when the block was freed 30s later.

    You can use SNAPSHOT ISOLATION and you can use READ COMMITTED SNAPSHOT (they're not the same), but either way you have to turn them on database-wide (with the ALLOW_SNAPSHOT_ISOLATION and READ_COMMITTED_SNAPSHOT options, respectively). The reason you might not want to do that is because they have caveats like increased storage requirements, fundamentally altering the behavior of the locking so you can get different results, changing how rows are saved on disk, slowing updates, slowing data reads, increasing tempdb contention, etc., so they're not something you just want to enable without planning or testing. Like that 500 million record query? READ COMMITTED is the default behavior I described above. Under READ COMMITTED SNAPSHOT, it saves the row version data to the tempdb so that it knows which ones to send, but the UPDATE won't block. Under regular SNAPSHOT isolation, the UPDATE also won't block, but the system will create the snapshot by dumping the whole query to the tempdb behind the scenes. Depending on your hardware and query, that could be a lot of I/O and a lot of storage space. Is that bad? I don't know. Maybe your large SELECT is aggregating and sorting, so it's already using a ton of tempdb and you've planned for that. It depends on what you're doing, how accurate you need your data, and what kind of pressures you've got.

    That's all just what my understanding is, too; my understanding here certainly isn't perfect.

    And, yes, everybody is annoyed that Microsoft decided to call their second type of row versioning READ COMMITTED SNAPSHOT ISOLATION when they already have isolation levels READ COMMITTED and SNAPSHOT, and had the command SET ALLOW_SNAPSHOT_ISOLATION ON to enable the SNAPSHOT isolation level. The name makes sense, because it alters how the READ COMMITTED level behaves, but it makes it really hard to understand what people are talking about without being very specific.
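
    For reference, a sketch of turning those two options on from .net; the database name is a placeholder, and READ_COMMITTED_SNAPSHOT needs the database free of other active connections unless you force it with WITH ROLLBACK IMMEDIATE:

        using System.Data.SqlClient;

        string connectionString = "..."; // placeholder; needs ALTER DATABASE permission

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // Enables the SNAPSHOT isolation level for explicit use.
            new SqlCommand("ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON", conn)
                .ExecuteNonQuery();
            // Switches default READ COMMITTED behavior to row versioning.
            new SqlCommand("ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON", conn)
                .ExecuteNonQuery();
        }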



  • Remember that all the SNAPSHOT stuff was bolted on in SQL 2005 to mimic Oracle's locking model so the workloads that used to work a lot better on Oracle's database can now compete on equal footing in SQL Server. It's not a surprise that they don't fit in smoothly.

    In other news, when first working with SQL Server, Oracle people always freak out when they can't roll back a transaction they didn't explicitly begin. When first working with Oracle, SQL Server people can't figure out why Oracle sometimes commits their transactions on them at what seems like random times.



    No, I'm not saying the problem comes from SqlDataReader. As I said, SqlDataAdapter uses SqlDataReader internally to read data, so if it introduced a data consistency problem, using the other methods wouldn't help either.

    I mention "data consistancy problem you mentioned" because that's what I see to summerize that paragraph. Perheps I should just the more accurate one - "problem with locks".

    Btw, my suggestion to use a snapshot is primarily to avoid the exclusive-lock contention without the risk of dirty reads.

    It's difficult to post a relatively long paragraph and still keep track of the logic when I get interrupted 2-3 times while writing it.

