When is parallelization not a performance gain?


  • Notification Spam Recipient

    So recently an unexplained performance issue has appeared with a daily data load from our Microsoft SQL Servers.

    The data load is a pretty simple set of Data Flow tasks inside an SSIS package, literally a bunch of table copies preceded by a truncate statement on the receiving end. This typically takes 30-45 minutes on average and all is fine and dandy.

    As of 7/22, this 30-minute-long process has ballooned into 6-hour-long-:wtf: and the DBAs are criticising the package as being at fault because everything is ordered in a line in a sequence (i.e. no parallelization is attempted because only one table gets processed at a time).

    Their suggestion? Split the data loads into four groups to "take advantage of" parallelization.

    IMHO this is BS, because it's not like four network streams are going to be any more efficient overall than a single one (there are only two servers in this chain), and I believe the only effect this will have is to table-lock up to four tables at a time instead of just one.

    Am I bonkers? Does this actually make sense from a practical standpoint?

    IIRC parallelization was nice inside queries where you were handling multiple joins and stuff, but straight data loads shouldn't have any performance impact, right?


  • ♿ (Parody)

    If the 4 parallel processes will be working on different tables it might be faster if the tables are physically on different disks. But it could also make it slower of they aren't.

    How unexplained is the problem? Don't know root cause or can't tell what is taking so long? Like.. I/O vs CPU?

    If you've exhausted your ability to analyze the problem, just trying stuff like this will tell you something, even if it's only that it doesn't help.



  • It depends entirely on where the bottleneck is. If, as you seem to think, the network is the bottleneck, then doing the load in parallel chunks isn't really going to help. The only thing I can think of that would be helped by parallelizing the load is if the bottleneck is the CPU (or, like boomzilla said, it's writing to several different disks).

    If the time went from 30 minutes to 6 hours in one day, then I would say you should figure out what changed on that day.


  • ♿ (Parody)

    @Dragnslcr said in When is parallelization not a performance gain?:

    I would say you should figure out what changed on that day.

    We've had problems with backups sometimes running long or something and causing word behavior.


  • Notification Spam Recipient

    @boomzilla said in When is parallelization not a performance gain?:

    If the 4 parallel processes will be working on different tables it might be faster if the tables are physically on different disks.

    Yeah, I'm pretty sure the DBAs haven't set up partitioning like that at the table level. It's all one database, pretty sure it's on its' own disk extent at least (though, not granted sufficient access to check that).

    @boomzilla said in When is parallelization not a performance gain?:

    How unexplained is the problem?

    So unexplained that when I asked them to check using logs they responded "We only keep logs for two weeks and everything we might have had about that day has already been deleted." Because keeping logs for a reasonable amount of time is a Bad Idea apparently (Yes, it's taken two weeks before they responded to the initial inquiry).

    Essentially, without us (meaning me or my team) touching anything, this server and at least one other (unrelated to this particular process, but it was noted in the email chain) have suddenly experienced performance degradation overnight (the source server seems fine).

    @boomzilla said in When is parallelization not a performance gain?:

    Don't know root cause or can't tell what is taking so long? Like.. I/O vs CPU?

    Yeah, the best we've got is "we'll watch it overnight and see if it runs faster", meaning the SSIS package log, which (since they're not tracing it or anything) doesn't really say much except "Package started" and "Package ended". I'm doubtful that they have the right tools to properly diagnose the issue at all (well, unless they're savvy with their SQL and interpreting raw data, maybe).

    @boomzilla said in When is parallelization not a performance gain?:

    just trying stuff like this will tell you something, even if it's only that it doesn't help.

    Yup. Their suggestion (since we can't deploy the "parallelized" version for at least another three weeks) was to move it a little further into the day (when the main Batch Processing is supposed to be completed, therefore trying to avoid possible contention). No, it didn't help, though it pushed processing well into the work day. SMH.

    @Dragnslcr said in When is parallelization not a performance gain?:

    If the time went from 30 minutes to 6 hours in one day, then I would say you should figure out what changed on that day.

    Yeah, and for who knows why, nobody can tell us this seemingly simple thing.

    Like, we have regularly scheduled deployments, but nobody was touching the affected servers, and we don't have visibility on other departments activities like that, so I believe it will eventually boil down to a bunch of finger-pointing before we can start talking about talking about a solution.


  • Notification Spam Recipient

    @boomzilla said in When is parallelization not a performance gain?:

    @Dragnslcr said in When is parallelization not a performance gain?:

    I would say you should figure out what changed on that day.

    We've had problems with backups sometimes running long or something and causing word behavior.

    Yeah, and that was noticed and intermittent, and these hips don't lie:
    0_1470700291594_upload-be9008d4-e7a9-489e-be55-d7881cb95e56

    Yet it's been fighting teeth and nails to get them to acknowledge that there's even a problem.



  • @Tsaukpaetra Hmm, consistently ~21 minutes, then 6 – 6.5 hours for 3 days, back to normal for 3 days, then consistently bad since then. I have no idea what changed, but it appears to have changed three times. That might make it easier to track down, but given your lack of log files, unfortunately it probably won't help.



  • @Tsaukpaetra I'd say it depends on how/if the tables interact with each other.

    SQL Server ought to be perfectly find loading data into unrelated tables simultaneously. If those tables hold indexes to each other, you'll be in lock-hell though.

    The other question is what's capping performance. If it's, say, your network connection, 4 streams won't help. (Unless each one is going to a different network interface, possibly.)



  • @boomzilla said in When is parallelization not a performance gain?:

    We've had problems with backups sometimes running long or something and causing word behavior.

    Like... spell checking?



  • When is parallelization not a performance gain?

    When met with the full force of Amdahl's Law.



  • @Dragnslcr even if the bottle neck is network, it might still be better to parallel the query because you might free up contesting queries which are blocking each other.

    Here's a question: have you tried "set transaction level read uncommitted"?

    If your query has several sub queries, or several joins, or hits a hot table that's used by others this could "magically" speed things up.

    This would be mostly for testing though, because if it does improve things you need to figure out how to share your space. Are dirty reads fine? Read uncommitted. Do they require exact? Have the non important stuff take a back seat in priority. Everything critical? Review your indexes so you can row lock instead of page/table lock.

    Already at row level locks, and everything is critical? Talk to your managers to establish a pecking order, get better hardware, get more hardware and merge data in a queue, etc.

    Also, do everything in your power to not drink the ocean. Filter to as specific data as you can, utilize batching in 100k row updates, get with your co workers and make sure they do the same.

    Make use of fetch, offset

    Consider batch loading to a temp table where there will be no contention, and running a merge (or insert, update). If you're coming from c# or similar, use transactions for many row insert/updates, sql is always better with data sets, row manipulation is slow compared to bulk operations.

    Have them run and save a running sql query monitor every hour / half hour and save the results. If it's causing production issues it should be pretty easy to get management sign off. You'd be looking for locks or high load from rogue user sql aka euc access applications



  • @Tsaukpaetra said in When is parallelization not a performance gain?:

    So unexplained that when I asked them to check using logs they responded "We only keep logs for two weeks and everything we might have had about that day has already been deleted." Because keeping logs for a reasonable amount of time is a Bad Idea apparently (Yes, it's taken two weeks before they responded to the initial inquiry).

    Essentially, without us (meaning me or my team) touching anything, this server and at least one other (unrelated to this particular process, but it was noted in the email chain) have suddenly experienced performance degradation overnight (the source server seems fine).

    If they're so cautious about the logs, maybe you should check the log for varchar(max) fields and the sort. These will cause bad I/O performance.

    And if the performance issue occurs just overnight (i.e.: the previous day runs okay), the first thing I'd check is system event log for disk error.



  • Is there any explanation for the anomalously late start time on 7/19? While it is unlikely to be anything more than it appears to be - a late start of an otherwise typical run - the proximity to the beginning of the problem demands that it at least be accounted for. Even if it has no direct bearing on the change, it might point to something that does. I would feel uncomfortable not knowing what it was, even if I thought it had nothing to do with it (I don't, but still).



  • @ScholRLEA said in When is parallelization not a performance gain?:

    a late start of an otherwise typical run

    Looks to me like a second, extra run; there was a normal run at 01:00 the same day. No idea why there was an extra run, but it does appear to be extra rather than late.


  • BINNED

    @Tsaukpaetra said in When is parallelization not a performance gain?:

    So recently an unexplained performance issue has appeared with a daily data load from our Microsoft SQL Servers.

    The data load is a pretty simple set of Data Flow tasks inside an SSIS package, literally a bunch of table copies preceded by a truncate statement on the receiving end. This typically takes 30-45 minutes on average and all is fine and dandy.

    As of 7/22, this 30-minute-long process has ballooned into 6-hour-long-:wtf: and the DBAs are criticising the package as being at fault because everything is ordered in a line in a sequence (i.e. no parallelization is attempted because only one table gets processed at a time).

    Their suggestion? Split the data loads into four groups to "take advantage of" parallelization.

    IMHO this is BS, because it's not like four network streams are going to be any more efficient overall than a single one (there are only two servers in this chain), and I believe the only effect this will have is to table-lock up to four tables at a time instead of just one.
    IIRC parallelization was nice inside queries where you were handling multiple joins and stuff, but straight data loads shouldn't have any performance impact, right?

    Network is pretty fast, compared to any other IO. More network streams makes the task CPU-bound instead of IO-bound. If you have some idle cores (and memory), there is a chance to get some performance gains just by breaking your data into streams.

    It is something to test, either it gets slightly faster, or starts thrashing and you come here and educate us :--D


  • Notification Spam Recipient

    @HardwareGeek said in When is parallelization not a performance gain?:

    @ScholRLEA said in When is parallelization not a performance gain?:

    a late start of an otherwise typical run

    Looks to me like a second, extra run; there was a normal run at 01:00 the same day. No idea why there was an extra run, but it does appear to be extra rather than late.

    Yes, that was a run done due to deployment on that day. Essentially, on deployment days we're given until end of business day to call out any deployment issues, and a 1am run would be outside their window to correct any mistakes.


  • Notification Spam Recipient

    @dse said in When is parallelization not a performance gain?:

    Network is pretty fast, compared to any other IO. More network streams makes the task CPU-bound instead of IO-bound. If you have some idle cores (and memory), there is a chance to get some performance gains just by breaking your data into streams.
    It is something to test, either it gets slightly faster, or starts thrashing and you come here and educate us :--D

    Ah yes, I knew I forgot something. This is between servers in Arizona and Florida.


  • Notification Spam Recipient

    @Matches said in When is parallelization not a performance gain?:

    If your query has several sub queries, or several joins, or hits a hot table that's used by others this could "magically" speed things up.

    In theory there are no subspecies. It's essentially
    Insert into blah select * from blek



  • I just read this article yesterday about how processor cache can make a seemingly parallel algorithm run as slow as a serial one.

    Probably not relevant in this particular case but it answers the title so well that I had to post it.


  • Discourse touched me in a no-no place

    @Tsaukpaetra said in When is parallelization not a performance gain?:

    This is between servers in Arizona and Florida.

    That makes things harder, as the bottleneck could be on the network between the endpoints and you almost certainly don't control that utterly and can't attach load monitors to all the routers on the path. But it isn't necessarily the problem anyway.

    You need to get some monitoring in place on the systems that you do control so that you can find out what's blocking things. Is it CPU? Is it disk? Is it local net bandwidth? Is it memory? Also be careful of anyone who they did nothing while looking shifty; changing the definition of the monitoring in use can have profound performance impacts (usually to the negative) even though it doesn't necessarily “change” anything. Detailed logging has a wonderful habit of causing problems so you should be very suspicious. (I once dealt with a system which was incredibly slow until we turned down the logging level, which stopped it from effectively serializing everything in the one thread…)

    Did any table or index definitions change? That sort of thing ought to be kept around despite the short logging period as it's pretty damn significant! Has the updated version of the code changed the nature of the jobs being run? Have you had a big campaign to get new business resulting in there just being a higher workload than before? :) Let's not forget the possibility that things might be taking longer because there's just more work to do.



  • @Tsaukpaetra make sure you're not using select * into thing from source somewhere, that's frequently a performance killer as sql server waits for a lock on temp db. (Doesn't happen often, but will for sufficiently complex sql, or busy boxes.)


  • Notification Spam Recipient

    @Matches said in When is parallelization not a performance gain?:

    you're not using select * into thing from source somewhere

    I'm not sure I'm interpreting that correctly, but I assure you, I'm not actually inserting stuff into temp tables. It was the shorted representable line I could think of to demonstrate the simplicity of the operation.

    Essentially, it is "Copy all data from table1 from Florida into SYSNAME_table1 in Arizona"


  • Winner of the 2016 Presidential Election Banned

    @Tsaukpaetra said in When is parallelization not a performance gain?:

    So unexplained that when I asked them to check using logs they responded "We only keep logs for two weeks and everything we might have had about that day has already been deleted." Because keeping logs for a reasonable amount of time is a Bad Idea apparently (Yes, it's taken two weeks before they responded to the initial inquiry).

    :wtf:⁉



  • @Tsaukpaetra Using

    'Insert into table values (value1, 2,3) select * from something'

    is different than

    'select * into table from something'

    The first one is much better in terms of lock contention.

    I mainly mention it because people tend to get lazy and use the second form, which fucks up badly once things get busy.


  • Notification Spam Recipient

    @Matches said in When is parallelization not a performance gain?:

    @Tsaukpaetra Using

    'Insert into table values (value1, 2,3) select * from something'

    is different than

    'select * into table from something'

    The first one is much better in terms of lock contention.

    I mainly mention it because people tend to get lazy and use the second form, which fucks up badly once things get busy.

    Yeah. I'm pretty sure SSIS is specifying all the wanted columns individually in the select statement, and in the insert statement it's doing it in batches.
    0_1470765123113_upload-e1c494f0-c357-4d6d-a53c-668f76cc41e6