Optimization/compress algorithm problem.



  • I am having a problem I am trying to figure out how to make a algorithm to do find the best shortest code with a given input, as follows.

    We have a code of the commands "-", "[", and "]". The "-" is followed by a word of one to four letters, doing the action of that word. The brackets do the bracket actions (which are not considered to be words).

    Now, to compress, we add two registers and four commands: "=" and "+" are followed by a word, doing the action of that word but also setting the first or second register. The "." and "," do whatever is stored in the corresponding register. In addition, the "[" and "]", in addition to their actions, push both registers to stack, and pop registers from stack, respectively.



  • I don't think that's enough information to answer the question. How many "actions" are possible? How are variables stored/retrieved? What's the practical purpose of this new language, and would you be better-served by using an existent scripting language?

    It sounds like what you're looking for is an algorithm to remove code that affect the output of the function, is that correct?



  • @blakeyrat said:

    I don't think that's enough information to answer the question. How many "actions" are possible? How are variables stored/retrieved? What's the practical purpose of this new language, and would you be better-served by using an existent scripting language?
    This code is actually hypothetical; it is not actually a scripting language at all. Instead, it is simply an equivalent problem to the one I am trying to solve. What the "actions" are is irrelevant, and there are too many to count. But none of them affect the stack or either of the two registers. (I think I did describe everything relevant. And, it isn't actually new. Is it simply something I made up in order to describe the problem.)

    @blakeyrat said:

    It sounds like what you're looking for is an algorithm to remove code that affect the output of the function, is that correct?
    Now I do not understand what you mean, either. What I am looking for is an algorithm to change some "-" commands to "+", "=", ",", "." in order to shorten the code while keeping the same sequence of "actions".



  • Ok, well, I'm not qualified I guess because I still don't get it. Sorry, my brain doesn't work until I know exactly what the big picture is, exactly what it is we're trying to accomplish.



  • So... if you have a sequence like "-word1[-word1]" you can replace it with "=word1[.]", but if it was "-word1]-word1[" there's not much you can do, is that correct?

    It seems to me that by linking the bracket actions with push/pop of registers you're severely limiting how useful these optimisations are in practice. The exception would be if the bracket actions are fairly rare, in which case in between each ] action you can use the registers.

    You've given us no indication that the [ and ] actions are structured in a way that would make sense to link them with the stack. For instance, you might have a preponderance of one or the other, leaving the stack in an invalid state. What do you do if at some point in the string you have processed more ] actions than [ actions? Do you populate the registers with random values?

    Unless the bracket actions are much more structured than you've said, I can't see much scope for storing registers on the stack for later retrieval. All you can really do is see which words are the most common (weighted by length, I guess) between where you are and the next ] action, and use the registers for them.

    Note that you can still do several words in a block with registers if the pattern changes enough to make it useful. Suppose we have words AAAA, BBBB, CCCC and DDDD and you have a pattern like
    -AAAA-BBBB-AAAA-BBBB-AAAA-CCCC-AAAA-CCCC-CCCC-DDDD-CCCC-DDDD-DDDD]
    You could reduce this to
    =AAAA+BBBB.,.+CCCC.,,=DDDD,..]

    But any specifics to get to this point will depend on what sort of distribution patterns your words have, and crucially on the frequency of the ] action.



  • You've given us no indication that the [ and ] actions are structured in a way that would make sense to link them with the stack. For instance, you might have a preponderance of one or the other, leaving the stack in an invalid state. What do you do if at some point in the string you have processed more ] actions than [ actions? Do you populate the registers with random values?
    Actually the brackets are always matching up correctly (I am sorry I didn't indicate, and I don't actually know the frequency of them). Also I have figured out that instead of a stack, you can see it as a tree for the purpose of this optimization. So, in "A[B]C" you might consider A to be at the root of the tree, and B and C to be its branches. Because, the value of the registers is the same at the beginning of B as at the beginning of C, which is what is necessary for this kind of optimization.


  • OK, this does open up a couple of extra possibilities. Actually, though, on reflection I don't think it's all that important unless your branches are relatively short. The reason is that if you have to execute a non-register operation, it costs no extra to store it to a register. Thus, it's cheap to switch to a new set of register values. So if there's one set of registers that works well for B, and a different set for C, go ahead and use what works best in each branch. But if the branches are short (so that optimising across both branches is competitive with optimising each one separately), or if there's no clear preferred set of register values in one branch, the stack may be useful.

    The simplest method of optimisation would be to look at each branch, for each word determine the frequency F and the number of characters C, then calculate (F-1) * C - this is the number of characters you can save by storing it in a register. If the word is already in one register at the beginning of the branch, use F instead of F-1. Take the two words with the highest result, stick them in the registers when you first encounter them (if not already in registers) and use those register values for the whole branch.

    However, you can probably do better with a block approach, if the branches are long. The size of the block would depend on the number and length of words in use and their relative frequencies and clustering. I don't know enough about it to get into specifics, though. Ideally you'd want to make it adaptive so it can say "OK, for the first 22 words we should use these values in the registers, then for the next 14 words we should use these two...."

    If there's a marked difference in frequencies between actions, the most important optimisation - which I would hope has already been done - is to make the most commonly used actions the shortest. That's obviously outside the scope of what you're asking for, of course, and probably not within your purview anyway, but I just thought I'd throw it in there. If it's not the case, you should certainly be able to make some decent savings.



  • Just reimplement Forth.



  • Thanks for this information. I will try to figure out what I can do with this information. I don't know much about adaptive algorithms though.


Log in to reply