Setting up basic synchronization



  • In our setup, each user has a bunch of client devices that may be synced through a central server.

    The server has no visibility into the data being synced across devices, so it only serves as a transportation and storage point (data is encrypted on each device before transport).

    Is there a standard architecture that we should be using to sync the data between devices? All types of modification actions are possible (Create, Update, Delete).

    Edit: We are thinking of something like:

    Encrypted Client Data
    Encrypted Client Journal
    Unencrypted Client List
    Unencrypted Journal Fingerprint (e.g., a SHA-1 hash).

    The journal would contain entries like:

    Action ID
    Action (e.g. Create, Update, Delete)
    Change Contents (new data to add, item to delete, etc.)
    UTCDateTime of the action (for ordering)
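
    For concreteness, a minimal sketch of one journal entry and the unencrypted fingerprint, in Python (the field names are illustrative, not from any standard):

    ```python
    import hashlib
    import json
    from datetime import datetime, timezone

    # Illustrative journal entry; field names are ours, not from any standard.
    entry = {
        "action_id": 42,
        "action": "Update",  # Create / Update / Delete
        "change": "<ciphertext produced on the client before upload>",
        "utc_datetime": datetime.now(timezone.utc).isoformat(),
    }

    # The unencrypted fingerprint: hash the serialized journal so peers can
    # cheaply compare state without decrypting anything.
    journal = [entry]
    fingerprint = hashlib.sha1(json.dumps(journal, sort_keys=True).encode()).hexdigest()
    ```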


  • Discourse touched me in a no-no place

    The sort of thing Dropbox does? I normally sync stuff between computers using that (manually encrypting some of it).



  • That's... really vague.

    What protocol are they using? TCP/IP? Can you go higher-level to HTTP? Or is it something entirely different?

    Can you give us an idea of what these "client devices" are, what kind of data is being synched, how much of it is there, and what protocols the "client devices" are capable of communicating in?

    Because I can guarantee the answer's going to be different if they're $35 barcode scanners that work on a custom RF network that isn't wifi than if they're $1500 laptops with Windows 8.



  • @blakeyrat said:

    That's... really vague.

    What protocol are they using? TCP/IP? Can you go higher-level to HTTP? Or is it something entirely different?

    It's over HTTPS on TCP/IP; usually they send JSON messages to a web API, which receives the data and inserts it into a database.



  • @blakeyrat said:

    That's... really vague.

    What protocol are they using? TCP/IP? Can you go higher-level to HTTP? Or is it something entirely different?

    Can you give us an idea of what these "client devices" are, what kind of data is being synched, how much of it is there, and what protocols the "client devices" are capable of communicating in?

    Because I can guarantee the answer's going to be different if they're $35 barcode scanners that work on a custom RF network that isn't wifi than if they're $1500 laptops with Windows 8.

    Client devices could be anything - right now laptops, but we need to support mobile devices in the future.



  • Oh well then it's easy. "Anything".

    You just need the world's simplest REST application that only has one endpoint and 3 verbs. You can write that in almost any web-capable language in like an hour.
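
    Roughly this, say (sketched with Flask; the framework choice and the blob-per-user shape are assumptions, not a spec):

    ```python
    # One possible reading of "one endpoint, three verbs".
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    blobs = {}  # user_id -> opaque encrypted blob; a real server would persist this

    @app.route("/blob/<user_id>", methods=["GET", "PUT", "DELETE"])
    def blob(user_id):
        if request.method == "GET":
            return jsonify(blobs.get(user_id))
        if request.method == "PUT":
            blobs[user_id] = request.get_json()  # server stores it without decrypting
            return "", 204
        blobs.pop(user_id, None)  # DELETE
        return "", 204
    ```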



  • @rad131304 said:

    Client devices could be anything - right now laptops, but we need to support mobile devices in the future.

    Can you guarantee they all speak TCP/IP and HTTP? "Mobile devices" is still pretty goddamned vague.

    You have to set SOME kind of limit here.



  • @blakeyrat said:

    @rad131304 said:
    Client devices could be anything - right now laptops, but we need to support mobile devices in the future.

    Can you guarantee they all speak TCP/IP and HTTP? "Mobile devices" is still pretty goddamned vague.

    You have to set SOME kind of limit here.

    All our devices will transmit JSON messages over HTTPS. They will all be communicating with the same API.

    Our issue is how do we deal with a corrupt data set when the server doesn't know what's actually in the data? Generally it's just syncing an encrypted blob.

    Should we just force devices to check for updates before any data modifications are allowed?



  • @rad131304 said:

    All our devices will transmit JSON messages over HTTPS. They will all be communicating with the same API.

    Ok! The teeth are pulled!

    @rad131304 said:

    Our issue is how do we deal with a corrupt data set when the server doesn't know what's actually in the data?

    I don't think you can, even in theory. Maybe some expert in hashing here has a brilliant solution for you?



  • @blakeyrat said:

    @rad131304 said:
    All our devices will transmit JSON messages over HTTPS. They will all be communicating with the same API.

    Ok! The teeth are pulled!

    @rad131304 said:

    Our issue is how do we deal with a corrupt data set when the server doesn't know what's actually in the data?

    I don't think you can, even in theory. Maybe some expert in hashing here has a brilliant solution for you?

    That's why I was thinking of having an extra journal - at least that way if somebody gets really out of sync they can follow the journal to catch up. Once all the devices catch up, the data can be rebased and the journal truncated.

    Devices are required to register to gain access, so the list of devices with data access is known.



  • @rad131304 said:

    That's why I was thinking of having an extra journal - at least that way if somebody gets really out of sync they can follow the journal to catch up.

    It makes sense to store your change history, but since every change is 100% different than the last (due to encryption) it means you can't use diffs of any kind and your storage costs will balloon up fairly quickly.

    @rad131304 said:

    Once all the devices catch up,

    How could the server possibly know that?

    EDIT:

    @rad131304 said:

    Devices are required to register to gain access, so the list of devices with data access is known.

    Ok; but:

    1. The situation where one device rolls back the history, then tries to update the data before a second device has seen the roll-back, is going to be a really tricky, nasty edge case. You're going to need the ability to "fork" a history, and also a way for clients to select which fork they want to be looking at (see the sketch after this list).

    2. If a device dies, HD fails, battery conks out, etc., your server's going to be forever stuck in a "haven't updated all clients!" state.
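
    A hypothetical shape for such a forkable history (names are illustrative):

    ```python
    # Every entry points at its parent, so divergent histories can
    # coexist and each client picks which head to follow.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Entry:
        action_id: int
        parent: Optional["Entry"]  # None for the first entry in a history
        payload: bytes             # encrypted change contents

    root = Entry(1, None, b"...")
    fork_a = Entry(2, root, b"...")     # device A's next change
    fork_b = Entry(2, root, b"...")     # device B, unaware of A, forks here
    heads = {"a": fork_a, "b": fork_b}  # clients choose which head to follow
    ```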



  • @blakeyrat said:

    @rad131304 said:
    That's why I was thinking of having an extra journal - at least that way if somebody gets really out of sync they can follow the journal to catch up.

    It makes sense to store your change history, but since every change is 100% different than the last (due to encryption) it means you can't use diffs of any kind and your storage costs will balloon up fairly quickly.

    @rad131304 said:

    Once all the devices catch up,

    How could the server possibly know that?

    It's highly unlikely more than one device would be performing data modification simultaneously; the devices would be in charge of updating the data/journal. I suppose we could just encrypt the action/data of the journal.



  • @rad131304 said:

    It's highly unlikely more than one device would be performing data modification simultaneously;

    "highly unlikely" means your solution can be inefficient, it doesn't mean the solution can be non-existent.

    @rad131304 said:

    the devices would be in charge of updating the data/journal. I suppose we could just encrypt the action/data of the journal.

    Since they're all communicating over HTTPS, presumably with a valid cert, why not ditch the client-side encryption altogether? The transmissions back-and-forth will be secure, and now you can store real diffs and the server can do some kind of data consistency checking for you.



  • @blakeyrat said:

    "highly unlikely" means your solution can be inefficient, it doesn't mean the solution can be non-existent.

    I'm not sure I understand this statement.

    @blakeyrat said:

    Since they're all communicating over HTTPS, presumably with a valid cert, why not ditch the client-side encryption altogether? The transmissions back-and-forth will be secure, and now you can store real diffs and the server can do some kind of data consistency checking for you.

    The data we are transmitting is PII, and we would prefer that, in the event we are compromised, any data acquired be difficult to turn into something usable.



  • @rad131304 said:

    I'm not sure I understand this statement.

    There's two types of issues:

    1. Impossible - these you don't need to solve

    2. Possible - these you do

    "highly unlikely" is still "possible", therefore you need to write your code to cope with that situation cleanly.

    @rad131304 said:

    The data we are transmitting is PII, and we would prefer that, in the event we are compromised, any data acquired be difficult to turn into something usable.

    Right...?

    I didn't say "don't encrypt the data". I said "don't encrypt the data on the client." HTTPS will encrypt it on the wire, the server can encrypt it "at rest" (the term the HIPAA rules use), and the clients can also encrypt it "at rest" if you feel that's necessary.

    Basically, HTTPS already does what you want to do separately, and you're already using HTTPS, so it makes no sense to encrypt data, then encrypt the data again to transmit it, then have a bunch of annoying engineering challenges because the server has to deal with encrypted data.



  • @blakeyrat said:

    @rad131304 said:
    I'm not sure I understand this statement.

    There's two types of issues:

    1. Impossible - these you don't need to solve

    2. Possible - these you do

    "highly unlikely" is still "possible", therefore you need to write your code to cope with that situation cleanly.

    Which is what we are trying to come up with. That's why I was asking if there was some standard architecture.

    @blakeyrat said:

    Right...?

    I didn't say "don't encrypt the data". I said "don't encrypt the data on the client." HTTPS will encrypt it on the wire, the server can encrypt it "at rest" (the term the HIPAA rules use), and the clients can also encrypt it "at rest" if you feel that's necessary.

    Basically, HTTPS already does what you want to do separately, and you're already using HTTPS, so it makes no sense to encrypt data, then encrypt the data again to transmit it, then have a bunch of annoying engineering challenges because the server has to deal with encrypted data.

    If we are compromised, those storage encryption keys can be taken when the data is taken.


  • I survived the hour long Uno hand

    @rad131304 said:

    If we are compromised, those storage encryption keys can be taken when the data is taken.

    If you are compromised, the client's encryption keys can be taken as well. If you're compromised, the game is basically up.



  • @Yamikuronue said:

    @rad131304 said:
    If we are compromised, those storage encryption keys can be taken when the data is taken.

    If you are compromised, the client's encryption keys can be taken as well. If you're compromised, the game is basically up.

    We don't hold those.



  • @rad131304 said:

    That's why I was asking if there was some standard architecture.

    Well, I think a Lotus Notes guy would say Domino Server is ideal for this, but.

    I don't know of any good ones that aren't awful "enterprise-y" bloatware. Are there any? Probably.

    @rad131304 said:

    If we are compromised, those storage encryption keys can be taken when the data is taken.

    Who's "we"?

    If you mean the client, then add on a requirement for full-disk encryption in addition to your requirement for TCP/IP and HTTPS support.

    If "we" means the server then... well, the server certainly should have some form of full-disk encryption on it.

    If the thief can crack Bitlocker (or equivalent), they deserve access to the data. You're roughly 47,324,342 times more likely to be compromised via social engineering.

    Since you keep using the word "we", I assume you're in a company. Talk to your security guy, ask him what the requirements are. I can guarantee that laptops with full-disk encryption and data passing over HTTPS are both well-within HIPAA guidelines for PII.



  • @blakeyrat said:

    @rad131304 said:
    That's why I was asking if there was some standard architecture.

    Well, I think a Lotus Notes guy would say Domino Server is ideal for this, but.

    I don't know of any good ones that aren't awful "enterprise-y" bloatware. Are there any? Probably.

    @rad131304 said:

    If we are compromised, those storage encryption keys can be taken when the data is taken.

    Who's "we"?

    If you mean the client, then add on a requirement for full-disk encryption in addition to your requirement for TCP/IP and HTTPS support.

    If "we" means the server then... well, the server certainly should have some form of full-disk encryption on it.

    If the thief can crack Bitlocker, they deserve access to the data. You're roughly 47,324,342 times more likely to be compromised via social engineering.

    How about a flaw in our API architecture that might leak data we didn't intend to leak? How about a 0-day that gives a malicious attacker access to the server while online? Those are both more likely than physical compromise of cold storage.



  • @rad131304 said:

    How about a flaw in our API architecture that might leak data we didn't intend to leak?

    You don't need to worry about that; you require HTTPS, remember?

    You only need to worry about a flaw in HTTPS' encryption. And it's pretty damned proven at this point.

    @rad131304 said:

    How about a 0-day that gives a malicious attacker access to the server while online?

    How do you secure your existing servers? Why would the solution for this one be any different?

    @rad131304 said:

    Those are both more likely than physical compromise of cold storage.

    The first, definitely not. The second... maaaaaaybe? If you're using a shitty server OS? And have no firewall? And your security guy is a drunk?

    Look, millions of companies have secured servers. This isn't an impossible problem.

    If you really, really, really, really, really, really, really think you 100% absolutely need client-side encryption, then knock yourself out. But also realize the trade-offs you're making in the complexity of the solution.

    (For example, what if the flaw in your API architecture you talk about is only present because your decision to do client-side encryption made it far more complicated than it needed to be?)



  • @blakeyrat said:

    If the thief can crack Bitlocker (or equivalent), they deserve access to the data. You're roughly 47,324,342 times more likely to be compromised via social engineering.

    This is actually the biggest concern and why we want client-side encryption. If one of our (meaning the company's) employees who has access to the servers gets compromised through some sort of social engineering attack, we'd rather that failure not leak our clients' data. If the compromised employee can lead to privileged access, that's game over in your encryption scenario.



  • Is this literally the first time your company has set up a secured server to store PII?

    If that's the case, you might consider hiring a consultant.

    Our company is beholden to HIPAA, so ours are set up securely, but I haven't participated in that part of the business so I really can't give any advice on that matter.

    However, how exactly you configure the server is almost completely orthogonal to what you're asking in the first post. The server would have to be configured that way regardless of what protocol/product/database you choose.

    (But I think you're vastly underestimating the security of Bitlocker if you think it's your biggest risk factor. It's probably not even in the top 10.)



  • @blakeyrat said:

    @rad131304 said:
    How about a flaw in our API architecture that might leak data we didn't intend to leak?

    You don't need to worry about that; you require HTTPS, remember?

    You only need to worry about a flaw in HTTPS' encryption. And it's pretty damned proven at this point.

    What? People make mistakes programming APIs and end up giving users access to data they're not supposed to have all the time. Remember AT&T when they leaked Apple IMEIs? Or any time you could just increment the id of a query string by 1 to get to the next user's account?

    I'm not saying we'd do that on purpose, I'm saying mistakes happen. Look at OpenSSL and Heartbleed, or any number of other programming mistakes that led to compromises.



  • @rad131304 said:

    What? People make mistakes programming APIs and end up giving users access to data they're not supposed to have all the time. Remember AT&T when they leaked Apple IMEIs? Or any time you could just increment the id of a query string by 1 to get to the next user's account?

    You're already sharing the encryption key among a group of "client devices", so that any one of them can decrypt the messages of any other of them. I don't see the practical difference between that and not doing client-side encryption at all, risk-wise.

    But hey. Talk to your security guy, see what he thinks, and do what you gotta do.

    I'm going to work.



  • @blakeyrat said:

    @rad131304 said:
    What? People make mistakes programming APIs and end up giving users access to data they're not supposed to have all the time. Remember AT&T when they leaked Apple IMEIs? Or any time you could just increment the id of a query string by 1 to get to the next user's account?

    You're already sharing the encryption key among a group of "client devices", so that any one of them can decrypt the messages of any other of them. I don't see the practical difference between that and not doing client-side encryption at all, risk-wise.

    I think you've misunderstood the architecture. Each user is only authorized to have access to their own data. Though there are multiple users, there should not be a way for user B to ever see user A's data. Though a user may wish to view or change their own data on multiple devices, we have no need for visibility into it, and we would prefer to provide our users with assurances that a compromise of the servers does not leak data in an unencrypted form.



    I think the problem, at least from my understanding of what you are trying to do, is that you are refusing to let the servers these clients send their data to access that data. This seems like the sticking point that is going to cause you the most problems.

    Is there really no way you can communicate with a trusted server to validate the data and report back if a client is corrupted? It makes all of your complexity vanish.


  • Java Dev

    That seems pretty much the case. You'll have to trust someone to access and validate the data.

    If you trust the server, then you can validate and merge according to complex business rules there.
    If you trust the client, then all the server can do is check that the client's upload is based on your most recent revision.



  • @PleegWat said:

    That seems pretty much the case. You'll have to trust someone to access and validate the data.

    If you trust the server, then you can validate and merge according to complex business rules there.
    If you trust the client, then all the server can do is check that the client's upload is based on your most recent revision.

    So it sounds like the journal function I edited into the OP is going to be our best solution. With, possibly, electing one of the user's nodes to be the authoritative device that does any rebase of the data root (just to keep the journal manageable).



  • @rad131304 said:

    Should we just force devices to check for updates before any data modifications are allowed?

    You push a notification about the update. Otherwise, how often are you going to check?

    Now if you're talking about transmission corruption, the target device has to handle that.



  • @xaade said:

    @rad131304 said:
    Should we just force devices to check for updates before any data modifications are allowed?

    You push a notification about the update. Otherwise, how often are you going to check?

    Now if you're talking about transmission corruption, the target device has to handle that.

    The data should remain fairly static - less than 1 update per day across all devices. We planned to check at, no worse than, every login, but we want to find a way to keep polling to a minimum. I don't think we can rely on WebSocket support everywhere yet, though. That's why I was thinking about a "check before create/update/delete" condition.



    I still think pushing update notifications is better. It's not going to compromise the data, and what if your user logs in before the update but after your top of day? I'm speaking from dealing with car dealers. Unpredictable.

    Another idea for data corruption.

    You can transmit parity information alongside the encrypted information, so that it can be checked without knowing how to decrypt the data.



  • @xaade said:

    I still think pushing update notifications is better. It's not going to compromise the data, and what if your user logs in before the update but after your top of day? I'm speaking from dealing with car dealers. Unpredictable.

    IOW, just stop being dumb and use the WebSocket API - my concern had been that not all devices would support the protocol. I've not dug deeply enough into it, though, to know how good or bad the polyfills are (if they even exist).



  • @rad131304 said:

    (if they even exist).

    Depends on how big the shims are.



  • @blakeyrat said:

    the server can encrypt it "at rest" (the term the HIPAA rules use)

    Encryption "at rest" always raises the question "how are you going to manage the keys?" Short of working with HSMs, that key-management problem then basically becomes the focal point for attacks on your system's data-at-rest. Solutions like BitLocker basically tie it into the OS credentialing system, which is fine provided the OS credential system is stronger than the encryption on the data-at-rest. This may not be the case, though -- a malicious user (whether an insider or an intruder) who has or gains local admin anywhere in an AD forest can bring down the house on you in certain circumstances using a pass-the-hash attack. Linux systems may or may not be vulnerable to similar attacks against Kerberos (the focus of Kerberos pass-the-ticket attacks has been the NT Kerberos implementation, which makes it hard to discern if they're a threat to Kerberized Linux systems as well) -- relying on separated credentials for these servers may be wise.

    @xaade said:

    You can transmit parity information alongside the encrypted information, so that it can be checked without knowing how to decrypt the data.

    Well, you really want a cryptographic hash or digital signature there....
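
    For instance, a plain hash over the ciphertext lets the server detect accidental corruption without decrypting anything; a digital signature would additionally catch deliberate tampering, since whoever can alter the blob can recompute a bare hash. A minimal sketch:

    ```python
    import hashlib

    def fingerprint(ciphertext: bytes) -> str:
        # The server can check this without any ability to decrypt the blob.
        return hashlib.sha256(ciphertext).hexdigest()

    def is_intact(ciphertext: bytes, claimed: str) -> bool:
        # Catches accidental corruption; use a signature against tampering.
        return fingerprint(ciphertext) == claimed
    ```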




  • @xaade said:

    @rad131304 said:
    (if they even exist).

    Depends on how big the shims are.

    Looks like it should be OK to use WebSockets; all major browsers and mobile devices generally seem to have some form of support.
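
    A rough sketch of a client waiting for update pushes (the `websockets` package and the message shape are assumptions):

    ```python
    import asyncio
    import json
    import websockets

    async def listen(url: str) -> None:
        async with websockets.connect(url) as ws:
            async for raw in ws:
                msg = json.loads(raw)
                if msg.get("type") == "journal-updated":
                    print("new journal entries available from", msg["position"])

    asyncio.run(listen("wss://sync.example.invalid/notifications"))
    ```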



  • Way to use "solved problem"!



  • @xaade said:

    Way to use "solved problem"!

    I think ultimately we will use an authoritative client + root data + journal + notification approach.

    • Offline clients will just update at login, and then listen for notifications.
    • Each client will register its position in the journal with the server.
    • Once all clients reach a point in the journal, the authoritative client can rebase the root data and discard the journal items.
    • If a notification is somehow missed by a client that's supposed to be receiving notifications (because of some sort of connectivity issue), then the client can use the journal to retrieve all of the necessary changes to update properly.
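
    A rough sketch of the server-side bookkeeping this implies (class and method names are illustrative, not an existing API):

    ```python
    class JournalTracker:
        def __init__(self, client_ids):
            self.positions = {cid: 0 for cid in client_ids}

        def report(self, client_id: str, position: int) -> None:
            # Each client registers how far through the journal it has read.
            self.positions[client_id] = position

        def safe_rebase_point(self) -> int:
            # Everything at or before this index has been seen by all
            # clients, so the authoritative client may fold it into the
            # root data and those journal entries can be discarded.
            return min(self.positions.values())
    ```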


  • He miss journal and notificationings?
    How is that?

    Automatic system update.

    Just get the root!



  • @rad131304 said:

    I think you've misunderstood the architecture.

    If so, it's only because you've been Vague-y McVagues-a-lot when asking this question.

    @rad131304 said:

    Each user is only authorized to have access to their own data.

    See, that's a new wrinkle that hasn't been previously said anywhere in the thread. Until this point, you've only been talking about "client devices", this is the first you've mentioned having user accounts of some sort.

    @rad131304 said:

    So it sounds like the journal function I edited into the OP is going to be our best solution.

    I highly disagree. I think you have way more complexity than you should and...

    @rad131304 said:

    With, possibly, electing one of the user's nodes to be the authoritative device that does any rebase of the data root (just to keep the journal manageable).

    that's going to add HEAPS of new complexity on top of everything.

    The less code you have, the fewer machines executing code you have, the easier it'll be to keep things secure.

    @rad131304 said:

    The data should remain fairly static - less than 1 update per day across all devices.

    Look, instead of dropping-in all these little hints one at a time, why not just explain to us, specifically, what you're trying to accomplish?

    @tarunik said:

    Encryption "at rest" always raises the question "how are you going to manage the keys?"

    Yes yes yes, but the point is: thousands of companies are successfully doing it, so difficulty-aside, it's definitely possible.

    Rad131304 is so dedicated to not trusting his server, it boggles the mind; sooner or later that server's going to have to deal with the PII, and it's a heck of a lot better to concentrate your security in that one place than on X "client devices".


  • 🚽 Regular

    @xaade said:

    You can transmit parity information alongside the encrypted information, so that it can be checked without knowing how to decrypt the data.

    We compute a SHA-1 hash on the data our mobile things send and append it to the binary data blob that gets sent; even on an embedded device it isn't computationally intensive enough to cause issues.
    There are a lot of free implementations and I used one of those rather than try and roll my own out of RFC 3174.

    We have never had any issues with corruption making it through that, but we have had bad data come off the wire and be caught (which surprised me a little, as we use TCP).

    Edit: Bad as in the client's log shows the correct bytes and the correct number of bytes were received by the server but it was mangled.
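
    A sketch of the idea; the exact framing (digest appended as the last 20 bytes) is an assumption about one possible layout, not necessarily ours:

    ```python
    import hashlib

    def frame(payload: bytes) -> bytes:
        return payload + hashlib.sha1(payload).digest()  # SHA-1 digest is 20 bytes

    def unframe(message: bytes) -> bytes:
        payload, digest = message[:-20], message[-20:]
        if hashlib.sha1(payload).digest() != digest:
            raise ValueError("payload corrupted in transit")
        return payload
    ```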



  • @blakeyrat said:

    @rad131304 said:
    I think you've misunderstood the architecture.

    If so, it's only because you've been Vague-y McVagues-a-lot when asking this question.

    Sorry, I didn't know what exactly would be germane to the problem, I was trying not to over-complicate the basics.

    @blakeyrat said:

    @rad131304 said:
    Each user is only authorized to have access to their own data.

    See, that's a new wrinkle that hasn't been previously said anywhere in the thread. Until this point, you've only been talking about "client devices", this is the first you've mentioned having user accounts of some sort.

    I thought it was obvious; again, sorry.

    @blakeyrat said:

    @rad131304 said:
    So it sounds like the journal function I edited into the OP is going to be our best solution.

    I highly disagree. I think you have way more complexity than you should and...

    @rad131304 said:

    With, possibly, electing one of the user's nodes to be the authoritative device that does any rebase of the data root (just to keep the journal manageable).

    that's going to add HEAPS of new complexity on top of everything.

    The less code you have, the fewer machines executing code you have, the easier it'll be to keep things secure.


    Totally agree - less complexity is better, which is why I'm asking the question. I thought we were trying to do it wrong, so I was looking for a fresh approach. Again, that's why I was vague before. Sorry.

    @blakeyrat said:

    @rad131304 said:
    The data should remain fairly static - less than 1 update per day across all devices.

    Look, instead of dropping-in all these little hints one at a time, why not just explain to us, specifically, what you're trying to accomplish?

    Because the rate of data updating shouldn't be important?

    We are trying to sync user data between client devices, and the sync network shouldn't have visibility into the data. What are possible architectures that could work? That's what we are trying to do. There are implementation details that might make us pick one architecture over another, but I wasn't asking you guys to do my job for me; I wanted a pointer to a place where we could look over our options and choose the best one for us.

    I was trying to balance giving enough information to solve the problem, without you guys having to wade through a pile of requirements that may or may not be important. I clearly failed at that.



  • @rad131304 said:

    We are trying to sync user data between client devices, and the sync network shouldn't have visibility into the data. What are possible architectures that could work?

    Right; but then right away you said, "oh and BTW the server should be able to validate the data".

    I was really just trying to point out the incompatible requirements. Either you live in a world where the server can validate the data, or you live in a world where the clients all independently encrypt the data. Those two worlds don't intersect.

    And to be perfectly honest, I'm not even sure what you mean by "architecture". Are you looking for an off-the-shelf product? Or a database schema? Or debating between REST and SOAP?

    @rad131304 said:

    I was trying to balance giving enough information to solve the problem, without you guys having to wade through a pile of requirements that may or may not be important. I clearly failed at that.

    Well, mentioning that the "client devices" were all laptops and smartphones that understood TCP/IP and HTTPS was pretty critical to the question asked. 😄



  • Listen, I tried to apologize for not giving you enough information; criticism of our business requirements is unhelpful.



  • @xaade said:

    He miss journal and notificationings?
    How is that?

    Automatic system update.

    Just get the root!

    I'm concerned with change conflict; how might two conflicting changes be resolved? I suppose we could always follow the workflow:

    1. check status and update if needed
    2. update local copy
    3. push update and notification

    and simply not allow changes if the server can't be reached? At that point, change conflict is incredibly unlikely and the client that discovers the issue could just prompt the user to choose how to deal with the issue.
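
    Sketched out, with hypothetical endpoints and a server that's assumed to reject stale base versions with HTTP 409:

    ```python
    import requests  # the endpoints below are illustrative, not an existing API

    class ConflictError(Exception):
        pass

    def modify(base_url: str, user_id: str, local_version: int, new_blob: str) -> None:
        # 1. Check status and update if needed.
        state = requests.get(f"{base_url}/blob/{user_id}").json()
        if state["version"] != local_version:
            local_version = state["version"]  # real code would replay the journal here
        # 2./3. Push the update and notification; a stale base version gets a 409.
        resp = requests.put(
            f"{base_url}/blob/{user_id}",
            json={"base_version": local_version, "blob": new_blob},
        )
        if resp.status_code == 409:
            raise ConflictError("another device pushed first; re-sync and retry")
    ```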



  • @rad131304 said:

    simply not allow changes

    That's not the best option.

    And you'll still have conflict, because you can have two people editing a record simultaneously, unless you are going to do checkouts.



  • @xaade said:

    @rad131304 said:
    simply not allow changes

    That's not the best option.

    And you'll still have conflict, because you can have two people editing a record simultaneously, unless you are going to do checkouts.

    Well, this is another case of not explaining the architecture well - only one physical person should ever have access to that data; the problem only arises if they did something like make a modification on their phone and laptop simultaneously, which is unlikely but possible.

    I agree; I would rather not use some form of checkout system, though.



  • @rad131304 said:

    only one physical person should ever have access to that data

    How are you guaranteeing that, though?



  • @xaade said:

    @rad131304 said:
    only one physical person should ever have access to that data

    How are you guaranteeing that, though?

    There is no guarantee; it just wouldn't be particularly prudent for the user to share their access, hence my concern with conflicts.



