Algorithm to obtain union of sets of pieces of data over unreliable network



  • I've been thinking this problem over for a while, but haven't arrived at a satisfactory solution so far.

    Let's suppose that Alice and Bob both have sets of pieces of data (each piece up to 64kB in size, usually much less; the whole set is less than 10⁵ pieces), and the sets are mostly similar, but there's a small difference. Alice and Bob have somehow established a connection over an unreliable network: it may break after transmitting a few pieces of data, but should suffice for at least one. Alice and Bob want to obtain a union of those sets. What should they do?

    The problem looks like it's already been solved, but I've never found anything exactly matching my requirements. Which algorithm named after a well-known programmer did I miss?

    My attempt at solving this: let's add a unique ID to each of the data pieces (for example, an integer timestamp and/or a small hash should be enough in my case) and produce an array of these IDs, sorted, on both sides. Now:

    1. Alice computes a rolling checksum (using e.g. librsync) of the array and transfers it to Bob
    2. Bob computes the difference between Alice's checksum and his own array and sends it to Alice
    3. Alice uses the computed difference to reconstruct Bob's array of IDs
    4. Having both arrays, Alice can determine which pieces of data to send and which ones to request, which she does (see the sketch after this list)
    5. Repeat until the rolling checksum shows no difference.
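
    In code, step 4 would look roughly like this (a minimal sketch, assuming the IDs fit into 64 bits; the librsync plumbing of steps 1-3 is omitted):

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    // Which IDs to send and which to request, given both sorted arrays.
    struct Plan {
        std::vector<std::uint64_t> to_send;     // Alice has these, Bob doesn't
        std::vector<std::uint64_t> to_request;  // Bob has these, Alice doesn't
    };

    Plan plan_exchange(const std::vector<std::uint64_t>& mine,
                       const std::vector<std::uint64_t>& theirs) {
        Plan p;
        // std::set_difference requires both inputs to be sorted,
        // which they are by construction.
        std::set_difference(mine.begin(), mine.end(),
                            theirs.begin(), theirs.end(),
                            std::back_inserter(p.to_send));
        std::set_difference(theirs.begin(), theirs.end(),
                            mine.begin(), mine.end(),
                            std::back_inserter(p.to_request));
        return p;
    }
    ```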

    If I ever manage to find the time to implement that, it would need to be portable/embeddable enough to run on both PCs and smartphones (which means C++, right?) and leave the network layer to the user (or at least make it possible to do a relatively complex setup involving both TLS client certificates and NAT traversal).

    Is there a simpler way to solve the problem?


  • BINNED

    Peer-to-peer file sharing solves a problem that is at least somewhat similar: It works over an unreliable network, between different / unreliable peers in arbitrary order, and individual parts need to be verified for integrity. The integrity is afaik verified by trees of cryptographic hashes (i.e. many hashes of small blocks, fewer hashes of those hashes, etc.).
    What it doesn't cover is that, unlike in your problem, there's only one set of data instead of two different ones. But isn't rsync made exactly for that?


  • Java Dev

    @aitap Use part of rsync but not just use rsync for an rsync job? And why a rolling checksum, rather than just checksumming each piece of data?

    Various cloud storage apps may have solved this, but probably with the added assumption that their proprietary server is Bob.


  • Discourse touched me in a no-no place

    @PleegWat said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    rsync

    That was going to be my suggestion for pointers as well.



  • Based on your problem description the network is reliable for each chunk of information, but can only send a few at a time. So... putting a simple hash on each chunk should be perfectly sufficient.

    The only issue that's left is determining how many chunks there are. Which is also easily solved. Your packet looks like:

    Machine1|76/12937|3HhwapIouwhsds|(actual data)

    You say what machine it is, how many chunks it has to send (in this case, this is chunk 76 out of 12937) and the hash of the chunk so the receiver can determine if it already has it or not. Right? That's all you need. If you're just querying to find out if both sides already have that chunk, you can leave out the actual data. (But based on how the network works, there might not be any point to that-- especially with tiny 64k chunks.)
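
    In struct form it'd be something like this (field types are my guesses, obviously; adjust to taste):

    ```cpp
    #include <cstdint>
    #include <string>
    #include <vector>

    // One packet: sender, position in the sequence, hash, payload.
    // For a "do you already have this?" query, leave `data` empty.
    struct ChunkPacket {
        std::string machine;              // "Machine1"
        std::uint32_t chunk_index;        // 76
        std::uint32_t chunk_count;        // 12937
        std::string chunk_hash;           // "3HhwapIouwhsds"
        std::vector<std::uint8_t> data;   // actual data, up to 64k
    };
    ```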

    I'm not sure why the rolling checksum would be useful since both sides, presumably, could have data that the other side doesn't. Also you never said the chunks had to be delivered in any particular order anyway.

    But this is all too hypothetical, because what's the actual problem you're trying to solve? BitTorrent already exists, just use that. If this is preventing you from writing an actual product, tell us what the product needs to do and I'm sure there are a hundred libraries to solve that problem and run in your C++ mobile environment or whatever.



  • Thank you all for your replies!

    @PleegWat said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    Use part of rsync but not just use rsync for an rsync job?

    I know that, in theory, I could get two rsync binaries to talk over whatever pipe/connection I establish and not just ssh, but I don't have those pieces of data as separate files.

    And why a rolling checksum, rather than just checksumming each piece of data?

    @topspin said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    one set of data instead of two different ones. But isn't rsync made exactly for that?

    Rsync sends a full list of files, then negotiates changes to individual files.

    I don't want to transmit the full list of pieces of data because its size may well exceed the size of the sole piece of data I need to transfer. I assume that a rolling checksum of the list is smaller and works well when changes can be expressed as insertions and deletions.

    @blakeyrat said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    I'm not sure why the rolling checksum would be useful since both sides, presumably, could have data that the other side doesn't.

    Yeah, that's the part I like the least: Alice uses a rolling checksum to reconstruct Bob's list of chunks only to manually compute differences again before she and Bob can exchange actual data. Also, the copy of Bob's array that Alice reconstructs this way is packed, which may complicate comparing it against her own array.

    tell us what the product needs to do and I'm sure there are a hundred libraries to solve that problem

    It's a kind of homework question: I want to teach myself programming networked apps, so I'm writing a messenger with reliable message delivery. In particular, I'm interested in getting the "reliable" part to work (I don't expect to have free time to finish the rest). Which means that I should optimize for the case when the difference is only one or a few messages, but the same algorithm should be able to reconstruct full message history, should one of the participants lose it.


  • Java Dev

    @aitap If you can reliably partition the data, you could hash the partitions and communicate that. Then drill down into partitions with mismatches.
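
    Roughly this shape, with the hashing and the network round-trip abstracted into callbacks (names made up):

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // A hash over a half-open range [lo, hi) of the ID array. One callback
    // hashes the local copy; the other asks the peer for the same range
    // (a network round-trip in real life).
    using RangeHash = std::function<std::uint64_t(std::size_t, std::size_t)>;

    void find_mismatches(std::size_t lo, std::size_t hi,
                         const RangeHash& local_hash,
                         const RangeHash& remote_hash,
                         std::vector<std::size_t>& out) {
        if (lo >= hi || local_hash(lo, hi) == remote_hash(lo, hi))
            return;                            // partitions agree: skip
        if (hi - lo == 1) {                    // narrowed to one slot
            out.push_back(lo);
            return;
        }
        std::size_t mid = lo + (hi - lo) / 2;  // drill down into both halves
        find_mismatches(lo, mid, local_hash, remote_hash, out);
        find_mismatches(mid, hi, local_hash, remote_hash, out);
    }
    ```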


  • Considered Harmful

    @aitap said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    It's a kind of homework question: I want to teach myself programming networked apps, so I'm writing a messenger with reliable message delivery. In particular, I'm interested in getting the "reliable" part to work (I don't expect to have free time to finish the rest). Which means that I should optimize for the case when the difference is only one or a few messages, but the same algorithm should be able to reconstruct full message history, should one of the participants lose it.

    That case has some constraints that make it easier:

    • The set of blocks is actually an ordered list of messages, so you don't have to assume someone would add a message in the middle of a list that both participants already have. Otherwise I don't think there would be a way around sending the whole list of IDs since neither participant knows how many there should be in the first place.
    • It makes no sense to sync the entire set every time, so each participant could just send the peer/publish (if it's a multiuser chat) the latest message they have locally. If the others notice they don't have anything with that timestamp, they'd ask for all messages since the last timestamp they have from this participant.

    To get around the unreliability I'd just use some TCP-like resend-until-ACK, so it would be assured that messages are added in chronological order without gaps.
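
    The announce handler would be something like this (a sketch with invented names; assumes the log is append-only and sorted by timestamp, per the first bullet):

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct Message {
        std::uint64_t timestamp;
        std::string body;
    };

    // A peer announced the timestamp of its newest message. If we already
    // have a message with that timestamp we're up to date; otherwise we
    // answer with the newest timestamp we *do* have, meaning "send me
    // everything after this".
    std::optional<std::uint64_t>
    on_announce(const std::vector<Message>& log,  // sorted by timestamp
                std::uint64_t announced) {
        bool have = std::any_of(log.begin(), log.end(),
                                [&](const Message& m) {
                                    return m.timestamp == announced;
                                });
        if (have)
            return std::nullopt;  // nothing to ask for
        return log.empty() ? std::uint64_t{0} : log.back().timestamp;
    }
    ```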



  • @PleegWat said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    If you can reliably partition the data, you could hash the partitions and communicate that. Then drill down into partitions with mismatches.

    Binary search, but with timestamps instead of array indices as reference? Good idea, thank you. Much better than linearly exchanging differences if network latency is high.

    @LaoC said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    The set of blocks is actually an ordered list of messages, so you don't have to assume someone would add a message in the middle of a list that both participants already have.

    True, unless someone has a horribly skewed clock. But then all kinds of things start breaking down, including TLS, so there's no need to account for that.

    It makes no sense to sync the entire set every time, so each participant could just send the peer/publish (if it's a multiuser chat) the latest message they have locally.

    Yes! A participant who has just created a message should just send it and only bother with synchronization if some checksum doesn't match.

    If the others notice they don't have anything with that timestamp, they'd ask for all messages since the last timestamp they have from this participant.

    It may get complicated if you allow participants to have multiple devices. (Is it possible to create a situation where one would end up writing into the middle of the message history by losing a device and getting a new one?) I'm afraid of creating a corner case where earlier messages are lost because one of the participants believes they have already been delivered (and offers later messages) and the other one can't ask for them. That's what the synchronization is for, I guess.

    To get around the unreliability I'd just use some TCP-like resend-until-ACK, so it would be assured that messages are added in chronological order without gaps.

    A simple ACK would be "here's how many messages I think there are in our message history". A better one would be "here is a collision-resistant (but not necessarily cryptographically strong) hash of the list of message IDs", but it's O(message count) to calculate.
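
    For example (a sketch; FNV-1a here is only a stand-in for whatever hash actually gets picked):

    ```cpp
    #include <cstdint>
    #include <vector>

    // O(n) checksum over the whole ID list: fold every byte of every
    // message ID into one 64-bit value.
    std::uint64_t hash_id_list(const std::vector<std::uint64_t>& ids) {
        std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV-1a offset basis
        for (std::uint64_t id : ids) {
            for (int i = 0; i < 8; ++i) {          // feed in each byte
                h ^= (id >> (8 * i)) & 0xffu;
                h *= 0x100000001b3ULL;             // FNV-1a prime
            }
        }
        return h;
    }
    ```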


  • Discourse touched me in a no-no place

    @aitap said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    A better one would be "here is a collision-resistant (but not necessarily cryptographically strong) hash of the list of message IDs", but it's O(message count) to calculate.

    Might as well use something of cryptographic strength. Those algorithms are pretty well tested and you can pick one to fit the message length and trade that off against collision likelihood (which is pretty low with structured input, FWIW).

    The real question I have is whether the difference rate is low enough that you should use a divide and conquer algorithm or whether you're stuck with a more linear algorithm. You could compute the SHA-1 hash of each data item, put those hashes in order in a list, partition the list into chunks of “appropriate size”, hash those list chunks, and then compare the list of those hashes to see what you need to do. This would cheaply let you at least see whether there's a long list of items at the front and end of the overall sequence that don't need special work. Better partitioning (perhaps depending on other inherent properties of the data?) would likely get you bigger wins. For example, partitioning by (rounded) timestamp might work particularly well if you expect almost all the changes in the most recent period (a common feature of most dynamic datasets). Understanding the data will help…
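
    A sketch of that partitioning step, with the digest left as a parameter (use SHA-1 or whatever your crypto library provides):

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // item_hashes: the ordered list of per-item hashes. Returns one hash
    // per run of `chunk_size` entries; two peers compare these short
    // lists first and only dig into the chunks that differ.
    std::vector<std::string>
    chunk_hashes(const std::vector<std::string>& item_hashes,
                 std::size_t chunk_size,
                 const std::function<std::string(const std::string&)>& digest) {
        std::vector<std::string> out;
        for (std::size_t i = 0; i < item_hashes.size(); i += chunk_size) {
            std::string concat;  // concatenate one chunk's worth of hashes
            std::size_t end = std::min(i + chunk_size, item_hashes.size());
            for (std::size_t j = i; j < end; ++j)
                concat += item_hashes[j];
            out.push_back(digest(concat));
        }
        return out;
    }
    ```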

    A simple ACK would be "here's how many messages I think there are in our message history".

    It's usual to put the channel/acknowledgement code at a lower level than the sync protocol. Yes, you don't need to, but it's much more difficult to write the much more asynchronous protocols that arise from that. (This stuff definitely gets brain-boggling, especially when you've got more than two parties communicating, and everything is about a thousand times worse if there's an antagonist about too…)


  • Java Dev

    @dkf And what if the reason the communication is lossy is that some intermediate router will drop any UDP packet containing one of the 7 naughty words?


  • Discourse touched me in a no-no place

    @PleegWat said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    one of the 7 naughty words?

    What are the other six? I can see dropping B•••••m but you've obviously got a grander concept in mind…

    (FWIW, intermediate routers do worse things to UDP than just dropping it.)



  • @dkf said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    Might as well use something of cryptographic strength.

    You are right. Any possible savings from using a meticulously picked hash algorithm are going to be negligible.

    You could compute the SHA-1 hash of each data item, put those hashes in order in a list, partition the list into chunks of “appropriate size”, hash those list chunks, and then compare the list of those hashes to see what you need to do.

    A few days ago I accidentally discovered that there is a data structure called a Merkle tree, which is basically what you describe but repeated until only one hash is left (then participants can exchange hashes of sub-trees, skipping branches where hashes match, just like @PleegWat wrote).
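
    As far as I understand the construction, it's roughly this (a sketch, with the digest again left as a parameter):

    ```cpp
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // levels[0] is the per-item hash list; each level above hashes pairs
    // from the one below until a single root remains. Peers compare roots
    // first and walk down only into subtrees whose hashes differ.
    std::vector<std::vector<std::string>>
    build_merkle_levels(std::vector<std::string> leaves,
                        const std::function<std::string(const std::string&)>& digest) {
        std::vector<std::vector<std::string>> levels{std::move(leaves)};
        while (levels.back().size() > 1) {
            std::vector<std::string> above;
            const std::vector<std::string>& below = levels.back();
            for (std::size_t i = 0; i < below.size(); i += 2) {
                // an odd node out gets paired with itself
                const std::string& right =
                    (i + 1 < below.size()) ? below[i + 1] : below[i];
                above.push_back(digest(below[i] + right));
            }
            levels.push_back(std::move(above));
        }
        return levels;
    }
    ```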

    This definitely calls for some experimentation with different partitioning schemes, chunk sizes and tree depths. Your post has been very inspiring.

    It's usual to put the channel/acknowledgement code at a lower level than the sync protocol. Yes, you don't need to, but it's much more difficult to write the much more asynchronous protocols that arise from that.

    Thanks for the warning. I had assumed that setting a timeout and restarting the procedure if it expires would be enough, and TCP would handle the rest. Implementing a protocol with ACKs over UDP seems way out of my depth, because I would also need to watch for packet sizes (a 512-byte payload is the maximum "safe" size with IPv4, but people have sent kilobyte-sized packets and got away with it!), duplicates and probably a lot of stuff I'm not even aware of.


  • Discourse touched me in a no-no place

    @aitap said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    people have sent kilobyte-sized packets and got away with it!

    That depends on the minimum MTU of all the links that you are using between the sender and receiver hosts. Ethernet has an MTU of around 1500, but it's sometimes lower (occasionally much lower) for other comms technologies. Packet splitting and reassembly is possible… but not best relied upon. And I've seen even very respected companies run afoul of this (and fixing it to make Games For Windows Live able to update itself was intensely annoying; I'm guessing they were using TFTP since TCP doesn't have this problem as it can auto-negotiate the transfer unit size).

    I wish I didn't know so much about this. Low-level networking is pretty arcane stuff. :(


  • Java Dev

    @dkf I typically like working with our low-level stuff, but that's in part because I don't spend much time there. And we do analysis, which is a different beast than normal communication.


  • Discourse touched me in a no-no place

    @dkf said in Algorithm to obtain union of sets of pieces of data over unreliable network:

    but it's sometimes lower (occasionally much lower) for other comms technologies

    From memory, one CDMA provider in Norway had it < 1000.

    Not fun, since that was the lowest common denominator that we had to use on our systems, despite other providers being more… reasonable.

