Bus for messages and massive files?



  • Hey there! I'm looking for a message broker that can disentangle how we move data between hosts. Basically we have data coming in over FTP (yes), SFTP, HTTP, and a host of proprietary protocols. That data is currently copied via SSH/rsync to the devices that need it. So the systems that receive it and the systems that process it have to know about each other, which is a pain.

    So I started looking into message brokers. At first I thought Kafka would fit our use-case very well. Unfortunately it doesn't do big files. And we do have packs in the multi-gigabyte range. Though the more frequent case is 1 MB. I saw some creative people split their files to funnel them through Kafka. I find that distasteful. Another way would be to have a storage server where large files are pushed and only the reference passed via Kafka. That would add complexity too.

    So is there a message broker that:

    • Can do pub/sub
    • Has mature libraries for Python/PHP/Shell
    • Transfers huge messages/files
    • No hard latency requirements, "eventually" is good enough
    • Optional: Can archive and replay

    I don't really know much about the topic so please do comment if you see another solution than a message broker.


  • Fake News

    @gleemonk said in Bus for messages and massive files?:

    Another way would be to have a storage server where large files are pushed and only the reference passed via Kafka. That would add complexity too.

    This is not a new thing, it's called the Claim Check pattern. We used it in some cases where files were a couple of megabytes because message brokers still hate it when the message is larger than a few hundred kilobytes.

    The only thing to watch out for is that this storage server will now become a potential bottleneck, and all components need to agree on an access method which will then be "locked" into the system.

    You could add another layer of indirection where the reference first has to be looked up to even know which storage server to use (storing extra information about the access method then becomes possible), but that has to be designed ahead of time, because now the lookup mechanism itself is locked into the system.
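    A minimal sketch of the claim-check flow described above. This is purely illustrative: a local temp directory stands in for the storage server and an in-process `queue.Queue` stands in for the broker; all names here are made up for the example.

```python
import json
import queue
import tempfile
import uuid
from pathlib import Path

# Stand-ins for illustration only: a local directory plays the storage
# server, an in-process queue plays the message broker.
storage = Path(tempfile.mkdtemp())
broker = queue.Queue()

def publish(payload: bytes) -> None:
    """Producer: park the payload on 'storage', send only a small claim check."""
    claim_id = uuid.uuid4().hex
    (storage / claim_id).write_bytes(payload)
    broker.put(json.dumps({"claim_id": claim_id, "size": len(payload)}))

def consume() -> bytes:
    """Consumer: read the claim check, then fetch the real payload."""
    ticket = json.loads(broker.get())
    return (storage / ticket["claim_id"]).read_bytes()

publish(b"x" * 1_000_000)  # the 1 MB payload never travels through the queue
data = consume()
print(len(data))  # 1000000
```

    Note how the broker only ever sees a small JSON ticket; the "agree on an access method" problem shows up in the consumer, which must know how to resolve `claim_id` against the storage location.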


  • Trolleybus Mechanic

    The claim check pattern seems reasonable. I would just try to round robin among a few file servers so you don't have that bottleneck.

    If you have a ton of data (in both file size and file count), maybe use a distributed file server cluster, some sort of Hadoop/HDFS-lite type system?

    If you're already using AWS you could store in S3, so you'd get horizontal scaling pretty easily.
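    The round-robin idea is simple enough to sketch. Here three local temp directories stand in for three file servers (an assumption for illustration; in practice these would be separate hosts), and the reference handed to consumers records which "server" holds the file:

```python
import itertools
import tempfile
from pathlib import Path

# Three local directories stand in for three file servers.
servers = [Path(tempfile.mkdtemp()) for _ in range(3)]
rr = itertools.cycle(range(len(servers)))

def store(name: str, payload: bytes) -> str:
    """Write the payload to the next server in rotation; return a
    reference that encodes which server holds it."""
    idx = next(rr)
    (servers[idx] / name).write_bytes(payload)
    return f"{idx}/{name}"

def fetch(ref: str) -> bytes:
    idx, name = ref.split("/", 1)
    return (servers[int(idx)] / name).read_bytes()

refs = [store(f"file{i}.bin", bytes([i]) * 10) for i in range(6)]
# Writes alternate across the three servers:
print([r.split("/")[0] for r in refs])  # ['0', '1', '2', '0', '1', '2']
```

    Embedding the server in the reference is the simplest scheme, but it means a reference goes stale if a server is retired, which is the "locked into the system" concern from the previous post.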



  • @JBert Thanks for the feedback. I'm not worried about bottlenecks because a single fileserver can easily withstand our burstiest times. We're not moving that much. Packs in the multi-gigabyte range only occur a few times per hour. It's all about simplifying the process. So I'd like to avoid having two protocols involved. But yeah if there is no other way we'll have to use a fileserver. Though I wonder if the leanest thing to do in that case is just to poll the fileserver and forget about a dedicated message broker.

    @mikehurley I'm not worried about performance because a single fileserver on a gigabit link is sufficient for our purposes. It would be neat to have failover though.
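    The "just poll the fileserver" idea is roughly this, with a local temp directory standing in for the file server share (an assumption for illustration). A real version would also have to ignore partially written files, e.g. by having producers write to a temp name and rename when done:

```python
import os
import tempfile
from pathlib import Path

inbox = Path(tempfile.mkdtemp())  # stands in for the file server share
seen = set()

def poll_once():
    """Return files that appeared since the last poll."""
    new = []
    for entry in os.scandir(inbox):
        if entry.is_file() and entry.name not in seen:
            seen.add(entry.name)
            new.append(Path(entry.path))
    return new

(inbox / "pack-001.bin").write_bytes(b"data")
first = poll_once()   # picks up pack-001.bin
second = poll_once()  # nothing new
print([p.name for p in first], len(second))  # ['pack-001.bin'] 0
```

    Every consumer ends up carrying this bookkeeping (plus a poll interval, plus cleanup of `seen`), which is part of what a broker would otherwise absorb.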


  • Discourse touched me in a no-no place

    @gleemonk said in Bus for messages and massive files?:

    Though I wonder if the leanest thing to do in that case is just to poll the fileserver and forget about a dedicated message broker.

    The problem with that is that clients end up having to do a lot of work because of communications they don't care about.


  • BINNED

    @gleemonk said in Bus for messages and massive files?:

    just to poll the fileserver and forget about a dedicated message broker.

    Technically yes, but there are reasons for inserting an intermediary that might apply:

    • the source or end-points are under different control (as in organisational, or simply not your software)
    • stability, recoverability, support, ...: a central point can easily add logging/tracing and play messages back, e.g. re-send a batch of messages, maybe even manipulate them before re-sending. This is of course closely tied to the first point. Even just knowing exactly when the stream stopped, and that restarting things won't leave gaps in the stream, can be important enough.
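    The replay/no-gaps property boils down to the intermediary keeping an archive of everything that passed through. A minimal sketch, assuming an append-only log file stands in for that archive (names and format are made up for the example):

```python
import json
import tempfile
from pathlib import Path

# An append-only log file stands in for the intermediary's archive.
log = Path(tempfile.mkdtemp()) / "stream.log"

def record(msg: dict) -> None:
    """Every message passing through the intermediary is appended."""
    with log.open("a") as f:
        f.write(json.dumps(msg) + "\n")

def replay(start: int = 0):
    """Yield messages from a given offset, e.g. after a consumer crash."""
    with log.open() as f:
        for offset, line in enumerate(f):
            if offset >= start:
                yield offset, json.loads(line)

for i in range(5):
    record({"seq": i, "body": f"msg-{i}"})

# A consumer that last processed offset 2 resumes without gaps:
resumed = [m["seq"] for _, m in replay(start=3)]
print(resumed)  # [3, 4]
```

    This offset-based replay is essentially what Kafka offers natively, and it covers the OP's optional "archive and replay" requirement.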

    Sending big files over such a thing is not an easy task in any system. We work with MS BizTalk, Mirth, a third one I forgot, and our own solution. None of them can handle larger files reliably. Generally, large files end up stored on a file server instead. For specific purposes (WAN/cloud transfer of big files), someone at some point created a transfer tool based on BITS.



  • Just for your information, we're now looking into Apache ActiveMQ Artemis, because their docs say:

    The only realistic limit to the size of a message that can be sent or consumed is the amount of disk space you have available.

    This gives me some confidence it will fit our use-case well.