Syntax highlighting: its been bothering me



  • I've been programming for quite some time now (since 97 in some form or another) and over this time i've programmed in many languages including C, and more recently VB, VbScript, SQL, HTML (even though some of those are technically scripting)

    During this time i've been involved with a few dev tools including a generic editor or 2.

    One thing that's bugged me most of this time, is how to write an efficent syntax highlighting engine.

    I mean, you can use the obvious (vb mode)

    for each line in richtext1.lines

      for each token in tokens

          'highlight word and set colour.

       next

    next

     

    But that's going to be incredibly slow and painful to write.

    Ideally, i'd be looking to write some form of engine that was language generic - meaning it could highlight the following language, scripts and markup:

    XML, HTML, XSLT, VbScript, Visual Basic, SQL

     

     

    Anyone got any useful suggestions?



  • > Anyone got any useful suggestions?



    I'm not sure you will consider this as a usefull suggestion, but who knows :

    do not  write a syntax highlighter. Use an efficient editor, emacs for example.



    If you really want to write one, then search for "lexical analysis" in google,

    or for "flex". The implementation you proposed has an awfull complexity but

    that's the idea. A lexical analyzer generator will reduce this complexity drastically

    -- you will just have to write the regular expressions that describes best your tokens,

    the tool will generate an FSM that you will use to match the word on your lines.



    But I repeat, if you want syntax highlighting, smart indentation and many more

    usefull things, use emacs, SciTe or VIM.

    Drop notepad / visual studio.





  • Another suggestion is search for a program called notepad2. Its like
    notepad, but lots of extras, including syntax highlighting.



  • I assume he means for a function on his code-related website, or somesuch, not for his own personal code use.


    Because then he'd have to write a text-editor, too.



  • @Mike R said:

    Another suggestion is search for a program called notepad2. Its like
    notepad, but lots of extras, including syntax highlighting.

    If you need to include Syntax Highlighting in a custom text editor, a good idea could be to use the Scintilla control (if it's available in your graphical toolkit).



  • I've tried doing it with regular expressions in VB.net. (I wrote a database tool that was much faster than enterprise manager on MSDE an could do almost all the (simple) things I do on databases). It turned out to be very very slow (one of the problems is the richttextedit field, it's not very fast). If you want to go through with this I suggest you do syntax checking bases on time, not on character input (or do it after a person has not pressed a key for a while). This will keep typing quick and crisp, and it'll highlight when there's some spare time.

    Drak



  • Well, that only keeps the typing "quick and crisp" if hilighting is non-blocking or if typing cancels the hilighting step.  In both of those cases, (the latter especially) it would be just as crisp if you calculated per keystroke rather than per 1-second-of-not-typing.

    If you use a timer to do the 1 second delay, you still block execution of the update loop deep in the program.

    Ideally, you only want to hilight that which has changed.  So, it may be a good idea to shadow a copy of the document as an array of lines. and only update the one that has focus.  This won't save much however, unless you either A) write your own text control, or B) can block paint events during the resetting of the text and cursor position in the text box.

    Bsaically, what I am saying is that if you really want crisp, responsive typing in a syntax hilighting control, you prety much have to code the eintire thing yourself.



  • Don't discard the initial solution of using the rich text control so fast. I did just that for syntax highlighting a few years. The trick is realizing that most of the time you only need to update the current line (the rest of the time you will have to update two lines or the entire text). Typical formatting time was only a few milliseconds per line highlighting the usual operator/string/numeric tokens and ~500 reserved word/functions. This will be unnoticable if you only update the current line though it is long enough to cause a problem if you have several thousand lines of code to update at once.

    I made no (or very little) attempts to optimize things so I'm sure it can be made faster. With a sufficiently efficient token parser and keyword lookup I would think you could do quite a bit with this method. If you needed faster the only other approach I can think of would be writing your own text control that manually draws the text...fast but non-trivial to write.



    The specific source code is on SourceForge at <A HREF="http://cvs.sourceforge.net/viewcvs.py/mwedit/mwedit/EsmScriptDlg.cpp?rev=1.4&view=markup" EsmScriptDlg.cpp, though its not particularily pretty.



  • Forgot to mention the method of interest would be

    <font><font color="#b22222">CEsmScriptDlg::ParseLine()</font></font>



  • @DaveHumphrey said:

    Don't discard the initial solution of using the rich text control so fast. I did just that for syntax highlighting a few years. The trick is realizing that most of the time you only need to update the current line (the rest of the time you will have to update two lines or the entire text). Typical formatting time was only a few milliseconds per line highlighting the usual operator/string/numeric tokens and ~500 reserved word/functions. This will be unnoticable if you only update the current line though it is long enough to cause a problem if you have several thousand lines of code to update at once.

    I made no (or very little) attempts to optimize things so I'm sure it can be made faster. With a sufficiently efficient token parser and keyword lookup I would think you could do quite a bit with this method. If you needed faster the only other approach I can think of would be writing your own text control that manually draws the text...fast but non-trivial to write.

    The specific source code is on SourceForge at , though its not particularily pretty.

    Indeed, i think that the rich text control is particularly well suited to syntax hilighting IF you optimize it well enough.

    I used it for a BrainF**k interpreter for a while (yeah, BrainF**k with syntax hilighting...) and it was decently fast, but I ended up disabling it when i started needing to load and parse files upwards of 12MB



  • @Otac0n said:

    @DaveHumphrey said:

    Don't discard the
    initial solution of using the rich text control so fast. I did just
    that for syntax highlighting a few years. The trick is realizing that
    most of the time you only need to update the current line (the rest of
    the time you will have to update two lines or the entire text). Typical
    formatting time was only a few milliseconds per line highlighting the
    usual operator/string/numeric tokens and ~500 reserved word/functions.
    This will be unnoticable if you only update the current line though it
    is long enough to cause a problem if you have several thousand lines of
    code to update at once.

    I made no (or very little) attempts to
    optimize things so I'm sure it can be made faster. With a sufficiently
    efficient token parser and keyword lookup I would think you could do
    quite a bit with this method. If you needed faster the only other
    approach I can think of would be writing your own text control that
    manually draws the text...fast but non-trivial to write.

    The specific source code is on SourceForge at <>, though its not particularily pretty.

    Indeed, i think that the rich text control is particularly well suited to syntax hilighting IF you optimize it well enough.

    I used it for a BrainF**k interpreter for a while (yeah, BrainF**k with syntax hilighting...) and it was decently fast, but I ended up disabling it when i started needing to load and parse files upwards of 12MB



    Ha! A Brainfuck program of 12 MB, that's interesting; could you consider most of it as "data"? (As in >++>+++++>++)


  • +>+++>+++>+++++++



  • 1337



  • @Joost_ said:


    Ha! A Brainfuck program of 12 MB, that's interesting; could you consider most of it as "data"? (As in >++>+++++>++)

    Yeh, the 12MB program was generated (from C++).

    but, if you want a real big Brainfuck program, try "LostKingdom", a game by Jon Ripley: http://jonripley.com/brainfuck/games/

    if you want my interpreter, you can contact me too.



  • Damn forums!

    Thats supposed to be "a real, big BF program."  It's only about 96kb.

    and, btw, the 12MB program generated a bitmap image, but as i said, it was generated BF code.

     



  • I've written several BF interpreters myself (one is about 200 bytes when assembled) and a BF to C translator too (in brainfuck). This post put the itch back into me for writing a BF compiler, preferably in brainfuck. If this Jon Ripley fellow wrote his >100KB brainfuck programs by hand, he should join the Marines or something.

    But first I'm going to do some research on what's really supposed to happen when "<FONT face="Courier New">.</FONT>" sees EOF and EOL. And what should happen after EOF? For example, I noticed that Jon Ripley's bf2* translators all start with the instructions "<FONT face="Courier New">[.]</FONT><FONT face="Times New Roman">". That's weird. In most of my implementations, that would eat up all input, never to be seen again. After the EOF has passed, a 27 or a 0 remains in the first memory location, and subsequent "<FONT face="Courier New">.</FONT>" instructions return without changing anything, I think.</FONT>



  • @noe said:



    But I repeat, if you want syntax highlighting, smart indentation and many more
    usefull things, use emacs, SciTe or VIM.

    VIM ... arrggghhhhhhhh !!!

    I installed SUSE 10... just had seen linux 2 or 3 times before (Mandrake)... I took more than one hour to learn how to save on VIM... grrrrrr... ESC ESC :w ENTER

    But instead I took an incredible lightning fast setup of a 3G pcmcia card (Option / Qualcomm) from TMN of 15 minutes... the ISP, told me it was not possible to setup on Linux!

    Don't you hate those dumb helpdesk girls ?

    LOL :D

     



  • The most efficient way I know to go about syntax highlighting is using a finite state machine with output (deterministic or non-deterministic, and Mealey or Moore types, if you want to search for really nasty details).

    FSMs (Finite State Machines) are automata used for representing regular expressions. But instead of saying a*b, you have to say something like
    - we start in state 1
    - if in state 1, and you encounter an "a", go back to state 1
    - if in state 1, and you encounter a "b", go to state 2
    - if in state 2, the pattern has been recognized

    Instead of recognizing the expression, you can attach actions to the states (or the symbols recognized). E.g, if you attach the actions "color the previous character red" to state 1 and "color the previous character blue" to state 2, then a's will get labeled red when followed by a b, and all b's will get labeled blue.

    To make it slightly more efficient, you remember the state of your automaton for every block of characters (e.g. 1000 or so), and when a change has been made you start from the last unchanged point. This is lightning fast.

    You can generate an FSM from every regular expression. There are enough libraries that can do this. They usually do not support actions, but tools like flex can do quite complex things.



  • This is one of the “wow, this is so simple” problems that turns out to be utterly horrendous when you get around to actually writing the code for it.  It seems like pretty much every would-be programmer tries to write their own text editor at some point, and roll their own syntax highlighting while they’re at it; I know I did.

    I took several passes at this algorithm:

    <o:p> </o:p>First, I did it brute-force; I listened for changes to the document, and colored the entire thing each time, using a series of documentText.indexOf(keywords[i]) type calls.  It worked, kind of, but when you got over a few lines, it was slower than a really slow thing.

    <o:p> </o:p>So I figured I’d speed things up and use Apache’s RegEx package.  This did give me a noticeable speed bump, but it was still unacceptably slow.

    <o:p> </o:p>This led me to coloring only the line that was changed.  This works, in most cases, and is as fast as I’d ever need it to be.  The problem comes from nested syntax modes; a change on a line of JavaDoc code might turn it into Java code, and that might mean one, two, or N lines would need to be recolored.  You don’t want to recolor the entire document, but you want to make sure you color everything that has changed.

    <o:p> </o:p>What I ended doing was storing a token on each line, telling me what kind of code it was; JavaDoc, Java, String, etc.  When a line was changed, it was colored immediately.  I would then color each subsequent line, until I encountered a line that did not change type; e.g., it started as Java code, and ended as Java code.

    <o:p> </o:p>Just to clarify; when I say a line’s type, I am referring to the type of the last character on the line.  For example, the line

    /* this is my uber-cool int */ int reallyBigIntJustInCase = 0;

    would be considered Java code, while the line

    int reallyBigIntJustInCase = 0; /** this is my uber-cool int */

    would be considered JavaDoc code.

    Numerics can be treated as a special type of keyword; just add another RegEx to the list.  Strings can be treated as a syntax mode unto themselves… at which point you get to add escape characters to your algorithm.

    This all sounds fairly straight foreword, but the actual implementation spanned several classes, and a few hundred lines of code.

    Of course, having proved to myself that I could implement a decent editor-highlighter, I went back to using Jedit.



  • hi, XoK

    I ahve just brought a 3G option card and i am using suse 10, Is great to hear that you get it work. But my ISp
    also tols me that it will not work with linux. Can you share with me how you make this work?

    Thank You


    raid



  • @noe said:

    > Anyone got any useful suggestions?

    I'm not sure you will consider this as a usefull suggestion, but who knows :
    do not  write a syntax highlighter. Use an efficient editor, emacs for example.

    Someone does have to write those though.
    If noone ever wrote something because something existed that does it already there would hardly be any progress in software development.
    Why create a new programming language, Bjarne Stroustrup and Bill Gossling for example would have thought, there's C already...
    Linus would never have bothered with Linux because there's SunOS, etc. etc.



  • The most efficient way I know to go about
    syntax highlighting is using a finite state machine with output
    (deterministic or non-deterministic, and Mealey or Moore types, if you
    want to search for really nasty details).





    just out of curiosity... how is a FSM any different than a RE in terms of what you're talking about?



  • @hoens said:

    The most efficient way I know to go about
    syntax highlighting is using a finite state machine with output
    (deterministic or non-deterministic, and Mealey or Moore types, if you
    want to search for really nasty details).





    just out of curiosity... how is a FSM any different than a RE in terms of what you're talking about?

    A RE is merely a pattern matching, it has no memory or state, a state machine (coupled with a tokenizer before it to simplify the work) will consider the current input based on the previous linear one (e.g. the current state of the machine), and will be able to do syntax analysis and validation to boot.

    I'd recommand reading Text Processing in Python for more informations about that kind of stuff, it has lightweight introduction to REs and state machines, their use cases and the advantages/disadvantages of both.

    For syntax analysis and syntactic coloration, the point of REs is that they can only look for fairly simple structures, or match predefined keywords, state machines are much more flexible and powerful.



  • Assuming you're outputting the code to HTML (can be changed easily enough for anything else):

    replace

    /(\W)(if|then|for|while|more|words|go|here)(\W)/gi

    with

    $1<font color="red">$2</font>$3

    and repeat for every color.  For quotes:  /("[^"]+")/gi  (same for 's)


    if you want it language specific, first match:

    (<script( language="javascript">)?>.*?</script>), same for vbscript.  Anything
    not matched gets the "Default", HTML syntax.



  • Also, check out starlight: http://dean.edwards.name/my/behaviors/#star-light.htc

    it's a CSS-based highlighter for just about any language.


Log in to reply