This is one of the “wow, this is so simple” problems that turns out to be utterly horrendous when you get around to actually writing the code for it. It seems like pretty much every would-be programmer tries to write their own text editor at some point, and roll their own syntax highlighting while they’re at it; I know I did.
I took several passes at this algorithm:
First, I did it brute-force: I listened for changes to the document and recolored the entire thing each time, using a series of documentText.indexOf(keywords[i])-style calls. It worked, kind of, but once the document grew past a few lines, it was slower than a really slow thing.
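For illustration, that brute-force pass might have looked roughly like the sketch below. This is not the original code: the `BruteForceHighlighter` class, the `Span` type, and the three-entry keyword list are all hypothetical, and a real editor would apply the spans to its document model instead of returning them.

```java
import java.util.ArrayList;
import java.util.List;

/** Brute-force keyword scan: walks the whole document again on every call. */
class BruteForceHighlighter {

    /** A half-open range [start, end) of the document to paint as a keyword. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical keyword list; a real editor would load the full set. */
    static final String[] KEYWORDS = {"int", "return", "class"};

    /** Would be called from the document-change listener after every edit. */
    static List<Span> findKeywords(String documentText) {
        List<Span> spans = new ArrayList<>();
        for (String keyword : KEYWORDS) {
            int from = 0;
            int hit;
            // One full pass over the document per keyword.
            while ((hit = documentText.indexOf(keyword, from)) >= 0) {
                spans.add(new Span(hit, hit + keyword.length()));
                from = hit + keyword.length();
            }
        }
        return spans;
    }
}
```

The cost is one scan of the entire text per keyword, per keystroke. Plain indexOf also matches inside longer identifiers (the "int" in "print", say), so a usable version needs word-boundary checks on top.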
So I figured I’d speed things up and use Apache’s RegEx package. This did give me a noticeable speed boost, but it was still unacceptably slow.
This led me to coloring only the line that was changed. This works, in most cases, and is as fast as I’d ever need it to be. The problem comes from nested syntax modes; a change on a line of JavaDoc code might turn it into Java code, and that might mean one, two, or N lines would need to be recolored. You don’t want to recolor the entire document, but you want to make sure you color everything that has changed.
What I ended up doing was storing a token on each line, telling me what kind of code it was: JavaDoc, Java, String, etc. When a line was changed, it was colored immediately. I would then color each subsequent line until I reached one whose type did not change; e.g., it was Java code before the edit and was still Java code afterward.
Just to clarify: when I say a line’s type, I am referring to the type of the last character on the line.
For example, the line
/* this is my uber-cool int */ int reallyBigIntJustInCase = 0;
would be considered Java code, while the line
int reallyBigIntJustInCase = 0; /** this is my uber-cool int */
would be considered JavaDoc code.
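The per-line token scheme above can be sketched in a few dozen lines. This is a minimal illustration, not the original implementation: the class, enum, and method names are invented, and string literals are ignored to keep it short. One assumption differs from the text and is worth stating loudly: the sketch stores the mode the next line would start in, so a line that ends with a closed `*/` is recorded as Java here, not JavaDoc.

```java
import java.util.ArrayList;
import java.util.List;

/** Recolors only as many lines as an edit can affect (hypothetical sketch). */
class IncrementalHighlighter {

    enum State { JAVA, COMMENT, JAVADOC }

    private final List<String> lines;
    private final List<State> endStates = new ArrayList<>();

    IncrementalHighlighter(List<String> initialLines) {
        lines = new ArrayList<>(initialLines);
        State state = State.JAVA;
        for (String line : lines) {
            state = scanLine(line, state);
            endStates.add(state);
        }
    }

    /** Returns the mode in effect after the line, given the mode it starts in. */
    static State scanLine(String line, State start) {
        State state = start;
        int i = 0;
        while (i < line.length()) {
            if (state == State.JAVA) {
                if (line.startsWith("//", i)) {
                    break; // a line comment cannot change the carried-over mode
                } else if (line.startsWith("/**", i) && !line.startsWith("/**/", i)) {
                    state = State.JAVADOC;
                    i += 3;
                } else if (line.startsWith("/*", i)) {
                    state = State.COMMENT;
                    i += 2;
                } else {
                    i++;
                }
            } else if (line.startsWith("*/", i)) {
                state = State.JAVA; // closes either kind of block comment
                i += 2;
            } else {
                i++;
            }
        }
        return state;
    }

    /** Replaces one line and returns how many lines had to be recolored. */
    int replaceLine(int index, String newText) {
        lines.set(index, newText);
        State state = (index == 0) ? State.JAVA : endStates.get(index - 1);
        int recolored = 0;
        for (int i = index; i < lines.size(); i++) {
            State before = endStates.get(i);
            state = scanLine(lines.get(i), state);
            endStates.set(i, state);
            recolored++; // a real editor would repaint line i here
            if (state == before) {
                break; // the following lines start in the same mode as before
            }
        }
        return recolored;
    }

    State endStateOf(int index) {
        return endStates.get(index);
    }
}
```

With four lines of plain Java where the third begins with a stray `*/`, changing the first line to open a `/*` comment recolors three lines and then stops, because the third line ends in Java mode both before and after the edit.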
Numerics can be treated as a special type of keyword; just add another RegEx to the list. Strings can be treated as a syntax mode unto themselves… at which point you get to add escape characters to your algorithm.
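Here is a rough sketch of what “another RegEx in the list” and escape-aware strings could look like. It uses the standard java.util.regex classes, not the Apache package mentioned earlier, and the `TokenPatterns` class, the pattern names, and the short keyword list are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Classifies strings, numbers, and keywords on a single line (hypothetical sketch). */
class TokenPatterns {

    // Opening quote, then any mix of escaped pairs (\" \\ \n ...) or ordinary
    // characters, then the closing quote.
    static final String STRING = "\"(?:\\\\.|[^\"\\\\])*\"";

    // Integer or simple decimal literal; just one more pattern in the list.
    static final String NUMBER = "\\b\\d+(?:\\.\\d+)?\\b";

    // Hypothetical keyword list; a real editor would load the full set.
    static final String KEYWORD = "\\b(?:int|return|class)\\b";

    // The string alternative comes first so that keywords and digits inside
    // a literal are consumed as part of the string.
    static final Pattern TOKENS = Pattern.compile(
            "(?<string>" + STRING + ")|(?<number>" + NUMBER + ")|(?<keyword>" + KEYWORD + ")");

    /** Returns entries such as "keyword:int", in order of appearance. */
    static List<String> classify(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKENS.matcher(line);
        while (m.find()) {
            String kind = m.group("string") != null ? "string"
                    : m.group("number") != null ? "number"
                    : "keyword";
            out.add(kind + ":" + m.group());
        }
        return out;
    }
}
```

The escape handling lives entirely in the `\\\\.` alternative: a backslash always swallows the character after it, so an escaped quote cannot end the literal early.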
This all sounds fairly straightforward, but the actual implementation spanned several classes and a few hundred lines of code.
Of course, having proved to myself that I could implement a decent editor-highlighter, I went back to using jEdit.