Stream compress s3 file in python



  • So I have a bash script that works fairly well, but I want to try to do it in mostly pure Python. I am running into a lot of roadblocks, and none of the GPTs are being very helpful.

    aws s3 cp s3://bucket/file.txt - | zstdmt -9 | aws s3 cp - s3://bucket/file.txt.zst
    

    This bash command works and is super fast and can saturate all the CPUs on a fast enough connection to s3.

    I know Python has a GIL, but allegedly the zstandard library uses C threads, so it can use more than a single CPU, and boto3 is threaded on the download and upload, but only for download_file and download_fileobj, which deal with "files".

    Download an S3 object to a file.
    
            Variants have also been injected into S3 client, Bucket and Object.
            You don't have to use S3Transfer.download_file() directly.
    
            .. seealso::
                :py:meth:`S3.Client.download_file`
                :py:meth:`S3.Client.download_fileobj`
    

    Unfortunately this method doesn't seem to have been injected:

    import boto3
    
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket='mybucket', Key='mykey')
    data = response['Body'].read()  # pulls the entire object into memory
    

    The response['Body'] is a stream, which would've been perfect to wrap a ZstdCompressor around and pass to the upload, which is also not properly threaded when used with a stream.

    I thought it would be trivial to find something in Python that can simulate a file object using a stream or something similar. Named pipes are my next idea, but even then, dealing with multiple processes seems to be a PITA. I really thought this would be easier to replicate in Python.
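
    What I had in mind is roughly this (a sketch, not battle-tested; threads=-1 asks python-zstandard for one compression thread per CPU, and the bucket/key names are the same placeholders as above):

    import boto3
    import zstandard as zstd

    s3 = boto3.client('s3')
    body = s3.get_object(Bucket='mybucket', Key='file.txt')['Body']

    # stream_reader wraps the botocore stream; read() calls return compressed bytes
    cctx = zstd.ZstdCompressor(level=9, threads=-1)
    reader = cctx.stream_reader(body)

    # upload_fileobj accepts anything with a read() method, but the parallel
    # transfer machinery is aimed at seekable files, not a stream like this
    s3.upload_fileobj(reader, 'mybucket', 'file.txt.zst')

    This runs, but as I said, the upload side doesn't get the threaded treatment.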


  • Banned

    @dangeRuss said in Stream compress s3 file in python:

    zstdmt

    Do they think it'll compress better if the command line lacks vowels?

    Anyway. Python has EEEEXTREEEEMEEEELYYYYY slow built-in networking libraries. There is absolutely nothing whatsoever you can do - BTDT. Every method of achieving more than dial-up speeds requires using subprocess module to spawn a program written in non-retarded language.


  • Discourse touched me in a no-no place

    @Gustav said in Stream compress s3 file in python:

    Python has EEEEXTREEEEMEEEELYYYYY slow built-in networking libraries.

    If you are willing to work with the absolute bottom layer of the networking libraries, Python is only very slow. And forget about using threads to help, at least until there's strong evidence that they've actually successfully taken the ol' GIL out in the backyard with a shotgun and shovel.



  • @Gustav said in Stream compress s3 file in python:

    @dangeRuss said in Stream compress s3 file in python:

    zstdmt

    Do they think it'll compress better if the command line lacks vowels?

    Anyway. Python has EEEEXTREEEEMEEEELYYYYY slow built-in networking libraries. There is absolutely nothing whatsoever you can do - BTDT. Every method of achieving more than dial-up speeds requires using subprocess module to spawn a program written in non-retarded language.

    the aws cli I'm using is written in python and is able to saturate the network bandwidth on an EC2 instance. I don't think I buy your claim about the networking bits being slow (although I'm sure it's not as fast as C or Rust).


  • Banned

    @dangeRuss said in Stream compress s3 file in python:

    @Gustav said in Stream compress s3 file in python:

    @dangeRuss said in Stream compress s3 file in python:

    zstdmt

    Do they think it'll compress better if the command line lacks vowels?

    Anyway. Python has EEEEXTREEEEMEEEELYYYYY slow built-in networking libraries. There is absolutely nothing whatsoever you can do - BTDT. Every method of achieving more than dial-up speeds requires using subprocess module to spawn a program written in non-retarded language.

    the aws cli I'm using is written in python

    Or so you think. See also: numpy, and everything else the Python crowd claims shows Python is suited for high-performance applications.


  • Banned

    @dangeRuss said in Stream compress s3 file in python:

    I don't think I buy your claim about the networking bits being slow

    It was the very first post I made on this forum, in fact!

    https://what.thedailywtf.com/post/2028908



  • @Gustav said in Stream compress s3 file in python:

    @dangeRuss said in Stream compress s3 file in python:

    @Gustav said in Stream compress s3 file in python:

    @dangeRuss said in Stream compress s3 file in python:

    zstdmt

    Do they think it'll compress better if the command line lacks vowels?

    Anyway. Python has EEEEXTREEEEMEEEELYYYYY slow built-in networking libraries. There is absolutely nothing whatsoever you can do - BTDT. Every method of achieving more than dial-up speeds requires using subprocess module to spawn a program written in non-retarded language.

    the aws cli I'm using is written in python

    Or so you think. See also: numpy, and everything else the Python crowd claims shows Python is suited for high-performance applications.

    Python is not meant to be high performance; it's meant to be easy to use. We're not trying to win any awards here. I would be happy with 100MB/s transfer speed; I just want to be able to process the file at a reasonable speed without having to use a ton of memory/disk space.


  • Banned

    @dangeRuss yeah, that's a wrong SI prefix for anything Python. How about 100kB/s instead? See linked thread. I'm sorry but sometimes the solution is to stop using the wrong tool (and Python is a wrong tool for downloading anything larger than 32kB).


  • Banned

    @dangeRuss if you really want to do it in Python, I'd consider spawning an aws process, attaching its stdout to a named pipe and reading from the other end, then writing to another named pipe attached to the stdin of another aws process. That's just about the only way to make it take less than a day.
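
    A sketch of that shape, using anonymous pipes via subprocess instead of actual named pipes (simpler, same idea; the bucket and key names are placeholders):

    import subprocess

    import zstandard as zstd

    # downloader: aws writes the object to its stdout
    dl = subprocess.Popen(
        ['aws', 's3', 'cp', 's3://mybucket/file.txt', '-'],
        stdout=subprocess.PIPE,
    )
    # uploader: aws reads the compressed data from its stdin
    ul = subprocess.Popen(
        ['aws', 's3', 'cp', '-', 's3://mybucket/file.txt.zst'],
        stdin=subprocess.PIPE,
    )

    # copy_stream pumps dl.stdout through the multi-threaded compressor into ul.stdin
    cctx = zstd.ZstdCompressor(level=9, threads=-1)
    cctx.copy_stream(dl.stdout, ul.stdin)

    ul.stdin.close()  # EOF tells the uploader to finish
    dl.wait()
    ul.wait()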


  • Discourse touched me in a no-no place

    @Gustav said in Stream compress s3 file in python:

    How about 100kB/s instead?

    I've saturated a 100Mbit local ethernet system with UDP messages generated by Python. And then discovered that I couldn't scale that up in practice because of a lot of other problems unrelated to Python (and more to do with the details of the protocol being used and the way modern ethernet routers work that nobody on the team — or among anyone I've spoken to more widely within WTF-U — knew before we found out).

    The higher level networking libraries in Python are much slower. Getting good speed out of I/O is, as usual, all about minimising buffer copies and memory management. It was ever thus.



  • So I just wrote some wrappers and was able to get it to do a 14GB file in under 3 mins.

    real    2m37.386s
    user    5m29.740s
    sys     0m39.521s
    

    Keep in mind it downloaded the file, compressed it using zstandard level 8, and uploaded it, all at the same time, during those 2.6 minutes.

    So about 90MB/s. Not too shabby.

    Edit: it actually outperformed the original piped approach:

    real    4m25.256s
    user    3m55.383s
    sys     0m39.822s
    
    

  • Discourse touched me in a no-no place

    @dangeRuss Not too bad if that's going over the open internet.



  • @dangeRuss but tell me again how slow Python is, hmm?



  • @Gustav said in Stream compress s3 file in python:

    @dangeRuss said in Stream compress s3 file in python:

    I don't think I buy your claim about the networking bits being slow

    It was the very first post I made on this forum, in fact!

    https://what.thedailywtf.com/post/2028908

    It sounds to me like that's a specific problem with urllib. Which is probably part of the reason (the other part is that it's really dumb) most people have switched to using requests, or at least the urllib3 that it is based on. Of course that kinda defeats the point of “Python comes with batteries included”, because it doesn't, but that's the 🥃 🌮 🦊 state of software engineering these days.


  • Discourse touched me in a no-no place

    @Bulb It's the exact same thing as always, as I said up thread:

    @dkf said in Stream compress s3 file in python:

    Getting good speed out of I/O is, as usual, all about minimising buffer copies and memory management. It was ever thus.

    That applies to all languages. Python has some particular nuances (it's got a brutally thick layer between what happens at user level and the base level of system calls or even just the real bytecode engine) but getting buffer management right is always important for IO. Understanding how to handle streaming of things matters a lot too (as then you can avoid a lot of memory churn) but that's often harder in Python than you might think unless you dive down to the bottom layer of its networking libraries.

    Still, nothing quite beats the server-side networking problems I was finding in Ruby a decade ago, with common libraries setting up huge caches only for them to be used exactly once... I've not looked since, but Ruby always was famous for being resource hungry so I suspect that's not gone away.
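
    To make the buffer-copy point above concrete: the low-level trick is a preallocated buffer filled in place rather than a fresh bytes object per read (a generic socket sketch; the endpoint is a placeholder):

    import socket

    CHUNK = 1 << 20
    buf = bytearray(CHUNK)        # allocated once, reused for every read
    view = memoryview(buf)

    sock = socket.create_connection(('example.com', 80))
    sock.sendall(b'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n')

    while True:
        n = sock.recv_into(view)  # fills the existing buffer: no per-read allocation
        if n == 0:
            break
        # process view[:n] here without copying
    sock.close()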



  • @dkf said in Stream compress s3 file in python:

    @Bulb It's the exact same thing as always, as I said up thread:

    @dkf said in Stream compress s3 file in python:

    Getting good speed out of I/O is, as usual, all about minimising buffer copies and memory management. It was ever thus.

    That applies to all languages. Python has some particular nuances (it's got a brutally thick layer between what happens at user level and the base level of system calls or even just the real bytecode engine) but getting buffer management right is always important for IO. Understanding how to handle streaming of things matters a lot too (as then you can avoid a lot of memory churn) but that's often harder in Python than you might think unless you dive down to the bottom layer of its networking libraries.

    Still, nothing quite beats the server-side networking problems I was finding in Ruby a decade ago, with common libraries setting up huge caches only for them to be used exactly once... I've not looked since, but Ruby always was famous for being resource hungry so I suspect that's not gone away.

    So I ended up subclassing io.RawIOBase and implementing a buffer using a queue.Queue with a chunk of data in each item. Seems to be working fairly well. I was able to create the object, give it to zstandard to write the compressed data to and to the upload thread to upload the data from, then give the compressor stream to the download thread to write to, and all seems well.
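
    Roughly this shape (a stripped-down sketch of the idea, not my exact code; the real version has more error handling, and the helper name, chunk size, and queue depth here are arbitrary):

    import io
    import queue
    import threading

    import boto3
    import zstandard as zstd

    class QueueIO(io.RawIOBase):
        """File-like FIFO: one thread write()s chunks, another read()s them."""

        def __init__(self, maxsize=16):
            super().__init__()
            self.q = queue.Queue(maxsize)  # bounded, so the download can't outrun the upload
            self._leftover = b''
            self._eof = False

        def writable(self):
            return True

        def readable(self):
            return True

        def write(self, b):
            self.q.put(bytes(b))
            return len(b)

        def close(self):
            if not self.closed:
                self.q.put(None)  # sentinel: the writer is done
                super().close()

        def read(self, size=-1):
            size = -1 if size is None else size
            buf = self._leftover
            while not self._eof and (size < 0 or len(buf) < size):
                chunk = self.q.get()
                if chunk is None:
                    self._eof = True
                else:
                    buf += chunk
            if size < 0:
                self._leftover = b''
                return buf
            self._leftover = buf[size:]
            return buf[:size]

    def stream_compress(bucket, src_key, dst_key, level=8):
        s3 = boto3.client('s3')
        pipe = QueueIO()
        cctx = zstd.ZstdCompressor(level=level, threads=-1)
        writer = cctx.stream_writer(pipe)  # compressed output lands on the queue

        def download():
            body = s3.get_object(Bucket=bucket, Key=src_key)['Body']
            for chunk in iter(lambda: body.read(1 << 20), b''):
                writer.write(chunk)
            writer.close()  # flushes the final zstd frame
            pipe.close()    # no-op if the writer already closed it

        t = threading.Thread(target=download)
        t.start()
        s3.upload_fileobj(pipe, bucket, dst_key)  # reads compressed chunks as they arrive
        t.join()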

