Reversing a several stage string transformation



  • OK, so I have a program that takes a string, runs it through a cipher. It then takes the output of the cipher, and converts it to a phonetic notation via a series of regular expressions. Then it uses some more regexes to 'clean up' the phonetic string into something pronouncable. Then it turns the phonetic notation into a human-readable string.




    Now, the hard part: How do I reverse this process? I've thought about it long and hard, and have no idea how I'm going to do this.




    Oh and it's in Java.




    The library to do this (code is inside the jar-within-a-jar): https://sites.google.com/site/starfox008/DesktopXlat.zip?attredirects=0&d=1




    This is a request for an action plan, don't think I'm just going "do my job and then plz send me teh codez".



  • @008 said:

    OK, so I have a program that takes a string, runs it through a cipher.

    Is the cipher reversible?

    @008 said:

    It then takes the output of the cipher, and converts it to a phonetic notation via a series of regular expressions. Then it uses some more regexes to 'clean up' the phonetic string into something pronouncable.

    Are the regexes reversible?

    @008 said:

    Then it turns the phonetic notation into a human-readable string.

    Is that process reversible?

    @008 said:

    I've thought about it long and hard, and have no idea how I'm going to do this.

    Find the answer to the three questions above.



  • @blakeyrat said:

    Is the cipher reversible?

    Yes. This is the so-called "easy part." If I can manage to reach this stage I'm pretty much home-free.
    @blakeyrat said:
    Are the regexes reversible?

    Information loss occurs here. Reversing this is the so-called "hard part." This is the one that has me wracking my brains trying to figure out how to proceed.
    @blakeyrat said:
    Is that process reversible?

    Yes. This is another "easy part".



  • @008 said:

    @blakeyrat said:
    Are the regexes reversible?

    Information loss occurs here. Reversing this is the so-called "hard part." This is the one that has me wracking my brains trying to figure out how to proceed.

    Ok, well, then change the RegEx so it's reversible.

    So what do you want from us, exactly?



  • If by "change the regexes so they're reversible" you mean make it so they don't lose information, that's kind of difficult; the output data of the algorithm is supposed to be pronounceable by a human voice and not deviate too far away from the cipher string. Even making (ciphertext) "x" become a vowel sound in some places is pushing it. (Those who have played Star Fox Adventures and looked at the code will know immediately what the cipher is supposed to be.)



  • @008 said:

    If by "change the regexes so they're reversible" you mean make it so they don't lose information, that's kind of difficult;

    Well, then you either have to write the difficult code, or you have to find some other way of accomplishing your goal that isn't as difficult.

    I really don't know what you're asking of us... are you expecting someone to crack open that .jar file, swoop down, and fix all your difficult problems? Because your original post said you weren't going for a "send me the codes" situation. Ok; so what are you looking for?

    @008 said:

    (Those who have played Star Fox Adventures and looked at the code will know immediately what the cipher is supposed to be.)

    Is there anybody on this forum who isn't a furry?



  • @blakeyrat said:

    Is there anybody on this forum who isn't a furry?

    No. Next question.

    @blakeyrat said:

    I really don't know what you're asking of us...

    A suggestion that is a bit more substantial than echoing back the problem scenario in the form of an imperative, for one.



  • If the regexes lose information, then by definition you can't reverse them in all cases. If you need to reverse them, you need to change them so that they do not lose information. That should be obvious.

    The problem then is how to reversibly transform the ciphertext into something pronounceable. I don't really have any experience in this, but here are some thoughts. You need to draw up a 1-1 mapping between ciphertext elements and target elements. In some cases the target element will be longer than the ciphertext element. In other cases they may be the same size and even the same value.

    I'm thinking it might be best to use two or three character elements in your ciphertext and map most that are reasonably pronounceable through as themselves. You need to reserve some pronounceable target blocks for translated ciphertext elements - these blocks will be the longer ones. So it could be that, for example, when you see an element starting with "AG" you know it's a four-letter element instead of a two-letter one, and a lookup from the next two characters tells you what the original ciphertext element was. You wouldn't use all the target elements in the AG block since you don't want to have, e.g. "AGKF" appearing in your target string. You'd probably need several blocks reserved for this, or else have this block be larger (e.g. a six-letter element).

    Generally speaking, if you're using two-element blocks, if you only use elements with one vowel and one consonant you should wind up with something pretty pronounceable (maximum of two consecutive vowels or consonants). This might be too restrictive, though. Longer blocks give you more flexibility but may be harder to arrange so that you always get something pronounceable. See how you go, I guess.



  • So the proposed solution is to basically throw out everything after the cipher stage and rewrite it so that it loses no information.



    However, I have a limited output set (limited to what phonemes my voice can readily produce on demand):

    a ɛ e ɪ i y u o ɒ ʔ b ʃ d ɸ g h ʒ k l m n p r s t β w x j = 29



    This is my input set:

    a b c d e f g h i j k l m n o p q r s t u v w x y z ø = 27



    Are you suggesting I treat two or more input tokens as their own token or something?



  • @008 said:

    So the proposed solution is to basically throw out everything after the cipher stage and rewrite it so that it loses no information.



    However, I have a limited output set (limited to what phonemes my voice can readily produce on demand):

    a ɛ e ɪ i y u o ɒ ʔ b ʃ d ɸ g h ʒ k l m n p r s t β w x j z= 30



    This is my input set:

    a b c d e f g h i j k l m n o p q r s t u v w x y z ø = 27



    Are you suggesting I treat two or more input tokens as their own token or something?

    I forgot one, and the edit button disappeared.



  • Your output key space is larger than your input key space; are you sure you are losing information? Is it infeasible to run a computationally complete set of input values through the regex system? If not, why not just do that and see if you have any collisions, if not then you know you have a reversible map and just store it in a table for later use.



  • Under the current system, yes information gets lost; several combinations of input values produce identical output when shoved through the system. A lot (the second and beyond of a group of duplicated consonants, "h" when not before a vowel letter, etc.) simply produce "". In theory it is technically possible for an input to have an infinite number of "h"s allowed after the end of it. I tried getting a little bit of this information back as syllable stress (determined by adding up letters in the cipher string according to an arbitrary value, then applying stress to the syllable val%(number of vowel sounds in output+1), or nowhere if that result is out of range.



  • you can't reverse an irreversible transformation. Your options are to either switch to a reversible transformation, or retain the original text elsewhere so that when it comes time to go back to the original text, you still have it and don't need to reverse anything.



  • Since it isn't possible to guarantee that the original ciphertext is retrievable ("you only have a sound recording. What now?"), I guess I'll have to rewrite the transformation to be lossless. Now the question is what to do with them and try not to disturb the current output too much. Some inputs are obvious (k=>k), some are not (c=>???). And then there's the issue of mapping five vowel letters to 9 sounds.



  • Well, how essential is the information lost? Could you still reverse the cipher if you just assumed some reasoneable defaults for the irreversible parts? e.g. assume that the original text didn't contain any silent "h" and stuff like that.

    Alternatively, do you have any other information about the original text? E.g. if you know the language and vocabulry, you might try to use a dictionary.

    If you can't constrain the problem any further, then so far you sound as if your bosses tasked you with developing a perfect speech recognition engine.



  • The input text is supposed to be a subset of linguistic interlinear gloss that's in a specific order (at least if the output is to make sense for what it's for) with English content words. It's trivial to apply the cipher to this input and reverse that. The hard part is getting the input back after it's been turned into a string of phonemes (meant to be spoken by a human, or a computer program that can speak with raw IPA/X-SAMPA input.) The phonemes get turned into a written form (such as the Latin or Cyrillic alphabets) as the final output that most people would experience. As far as going the dictionary file route, such files are big and incomplete.

    As for assuming things about the input beforehand, since a (potentially silent) "h" comes from a (non-silent in all cases unless you're British) "r" for example, it would make the output difficult to parse by a human at best, complete gibberish at worst.

    I have the power to modify the phonetic stages, but since there's a (albeit small) corpus of text using the old system (pretty much tests of the system); they'd be rendered invalid.



  • @008 said:

    Since it isn't possible to guarantee that the original ciphertext is retrievable ("you only have a sound recording. What now?"), I guess I'll have to rewrite the transformation to be lossless. Now the question is what to do with them and try not to disturb the current output too much. Some inputs are obvious (k=>k), some are not (c=>???). And then there's the issue of mapping five vowel letters to 9 sounds.

    That may actually help you. For instance, you said one of the problems was repeated consonants getting lost. So instead of dropping the extras, take one of the additional vowel sounds and shove that in between.


    For characters where there's no clear mapping, you can pick another of the unmapped vowel sounds to indicate "the next character uses an alternate mapping", and map the remaining characters that way.



  •  Why can't you simply keep the original text when you transform it, so that when you need to go back you just look at the original?



  • @008 said:

    Since it isn't possible to guarantee that the original ciphertext is retrievable ("you only have a sound recording. What now?")

    That's why. On the plus side, using the ciphertext directly as a representation makes reverse-translation easy. On the minus side, it's also a huge pain in the ass to attempt to read and pronounce.



  • @008 said:

    @008 said:
    Since it isn't possible to guarantee that the original ciphertext is retrievable ("you only have a sound recording. What now?")
    That's why. On the plus side, using the ciphertext directly as a representation makes reverse-translation easy. On the minus side, it's also a huge pain in the ass to attempt to read and pronounce.
    Also, I belive you noted that distinct chains of characters can reduce to a single phoneme, which would mean that for multiple distinct words you might have the same pronunciation (think "plane" and "plain" in English for a simple example). You might need to do something with the probability of Markov chain (both in word and multi-word) occurrence here in order to come up with good guesses on those types of reductions.

    I guess what I'm saying is that you might want to look at it like this: 

    You've got a subset of the English wordspace, and since that set of words is at least countably infinite and probably finite, you should be able to come up with a list of all of the highly probably words and their word chains to Nth order (i.e. one wouldn't expect the phrase "antidisestablishmentarianism lamp" to be highly probable assuming we are working with a real linguistic sentence here). Given this information, you should be able to map the phoneme sets backwards with a high degree of accuracy - even more so when you introduce that weighting you were talking about introducing for dropped consonants.

    I'm thinking about it this way - each resultant phoneme has a finite set of probable source nodes with different probabilities; those source nodes aren't in a random order and for certain source nodes, the preceding/following nodes may provide you with a reasonably good way to guess which path to use as your map. This might be further refined by looking not just within word boundaries but across word boundaries since word order is also not a random process for most languages.

    edits: fixed some of my Engrish (I'm sure there's more)



  • Re: Reversing a several stage string transformation [SOLVED]

    Thank you guys for your help. I've updated the file (link is still valid) with what I ended up doing.


Log in to reply