Syntax highlighting: its been bothering me

agrath

I've been programming for quite some time now (since 97 in some form or another) and over this time i've programmed in many languages including C, and more recently VB, VbScript, SQL, HTML (even though some of those are technically scripting)

During this time i've been involved with a few dev tools including a generic editor or 2.

One thing that's bugged me most of this time, is how to write an efficent syntax highlighting engine.

I mean, you can use the obvious (vb mode)

for each line in richtext1.lines

for each token in tokens

'highlight word and set colour.

But that's going to be incredibly slow and painful to write.

Ideally, i'd be looking to write some form of engine that was language generic - meaning it could highlight the following language, scripts and markup:

XML, HTML, XSLT, VbScript, Visual Basic, SQL

Anyone got any useful suggestions?

noe

> Anyone got any useful suggestions?

I'm not sure you will consider this as a usefull suggestion, but who knows :

do not write a syntax highlighter. Use an efficient editor, emacs for example.

If you really want to write one, then search for "lexical analysis" in google,

or for "flex". The implementation you proposed has an awfull complexity but

that's the idea. A lexical analyzer generator will reduce this complexity drastically

-- you will just have to write the regular expressions that describes best your tokens,

the tool will generate an FSM that you will use to match the word on your lines.

But I repeat, if you want syntax highlighting, smart indentation and many more

usefull things, use emacs, SciTe or VIM.

Drop notepad / visual studio.

Mike_R

Another suggestion is search for a program called notepad2. Its like
notepad, but lots of extras, including syntax highlighting.

dhromed

I assume he means for a function on his code-related website, or somesuch, not for his own personal code use.

Because then he'd have to write a text-editor, too.

masklinn

@Mike R said:

Another suggestion is search for a program called notepad2. Its like
notepad, but lots of extras, including syntax highlighting.

If you need to include Syntax Highlighting in a custom text editor, a good idea could be to use the Scintilla control (if it's available in your graphical toolkit).

Drak

I've tried doing it with regular expressions in VB.net. (I wrote a database tool that was much faster than enterprise manager on MSDE an could do almost all the (simple) things I do on databases). It turned out to be very very slow (one of the problems is the richttextedit field, it's not very fast). If you want to go through with this I suggest you do syntax checking bases on time, not on character input (or do it after a person has not pressed a key for a while). This will keep typing quick and crisp, and it'll highlight when there's some spare time.

Drak

Otac0n

Well, that only keeps the typing "quick and crisp" if hilighting is non-blocking or if typing cancels the hilighting step. In both of those cases, (the latter especially) it would be just as crisp if you calculated per keystroke rather than per 1-second-of-not-typing.

If you use a timer to do the 1 second delay, you still block execution of the update loop deep in the program.

Ideally, you only want to hilight that which has changed. So, it may be a good idea to shadow a copy of the document as an array of lines. and only update the one that has focus. This won't save much however, unless you either A) write your own text control, or B) can block paint events during the resetting of the text and cursor position in the text box.

Bsaically, what I am saying is that if you really want crisp, responsive typing in a syntax hilighting control, you prety much have to code the eintire thing yourself.

DaveHumphrey

Don't discard the initial solution of using the rich text control so fast. I did just that for syntax highlighting a few years. The trick is realizing that most of the time you only need to update the current line (the rest of the time you will have to update two lines or the entire text). Typical formatting time was only a few milliseconds per line highlighting the usual operator/string/numeric tokens and ~500 reserved word/functions. This will be unnoticable if you only update the current line though it is long enough to cause a problem if you have several thousand lines of code to update at once.

I made no (or very little) attempts to optimize things so I'm sure it can be made faster. With a sufficiently efficient token parser and keyword lookup I would think you could do quite a bit with this method. If you needed faster the only other approach I can think of would be writing your own text control that manually draws the text...fast but non-trivial to write.

The specific source code is on SourceForge at <A HREF="http://cvs.sourceforge.net/viewcvs.py/mwedit/mwedit/EsmScriptDlg.cpp?rev=1.4&view=markup" EsmScriptDlg.cpp, though its not particularily pretty.

DaveHumphrey

Forgot to mention the method of interest would be

<font><font color="#b22222">CEsmScriptDlg::ParseLine()</font></font>

Otac0n

@DaveHumphrey said:

Don't discard the initial solution of using the rich text control so fast. I did just that for syntax highlighting a few years. The trick is realizing that most of the time you only need to update the current line (the rest of the time you will have to update two lines or the entire text). Typical formatting time was only a few milliseconds per line highlighting the usual operator/string/numeric tokens and ~500 reserved word/functions. This will be unnoticable if you only update the current line though it is long enough to cause a problem if you have several thousand lines of code to update at once.

I made no (or very little) attempts to optimize things so I'm sure it can be made faster. With a sufficiently efficient token parser and keyword lookup I would think you could do quite a bit with this method. If you needed faster the only other approach I can think of would be writing your own text control that manually draws the text...fast but non-trivial to write.

The specific source code is on SourceForge at , though its not particularily pretty.

Indeed, i think that the rich text control is particularly well suited to syntax hilighting IF you optimize it well enough.

I used it for a BrainF**k interpreter for a while (yeah, BrainF**k with syntax hilighting...) and it was decently fast, but I ended up disabling it when i started needing to load and parse files upwards of 12MB

Joost_

@Otac0n said:

@DaveHumphrey said:
Don't discard the
initial solution of using the rich text control so fast. I did just
that for syntax highlighting a few years. The trick is realizing that
most of the time you only need to update the current line (the rest of
the time you will have to update two lines or the entire text). Typical
formatting time was only a few milliseconds per line highlighting the
usual operator/string/numeric tokens and ~500 reserved word/functions.
This will be unnoticable if you only update the current line though it
is long enough to cause a problem if you have several thousand lines of
code to update at once.

I made no (or very little) attempts to
optimize things so I'm sure it can be made faster. With a sufficiently
efficient token parser and keyword lookup I would think you could do
quite a bit with this method. If you needed faster the only other
approach I can think of would be writing your own text control that
manually draws the text...fast but non-trivial to write.

The specific source code is on SourceForge at <>, though its not particularily pretty.

Indeed, i think that the rich text control is particularly well suited to syntax hilighting IF you optimize it well enough.

I used it for a BrainF**k interpreter for a while (yeah, BrainF**k with syntax hilighting...) and it was decently fast, but I ended up disabling it when i started needing to load and parse files upwards of 12MB

Ha! A Brainfuck program of 12 MB, that's interesting; could you consider most of it as "data"? (As in >++>+++++>++)

dhromed

+>+++>+++>+++++++

Joost_

1337

Otac0n

@Joost_ said:

Ha! A Brainfuck program of 12 MB, that's interesting; could you consider most of it as "data"? (As in >++>+++++>++)

Yeh, the 12MB program was generated (from C++).

but, if you want a real big Brainfuck program, try "LostKingdom", a game by Jon Ripley: http://jonripley.com/brainfuck/games/

if you want my interpreter, you can contact me too.

Otac0n

Damn forums!

Thats supposed to be "a real, big BF program." It's only about 96kb.

and, btw, the 12MB program generated a bitmap image, but as i said, it was generated BF code.

Joost_

I've written several BF interpreters myself (one is about 200 bytes when assembled) and a BF to C translator too (in brainfuck). This post put the itch back into me for writing a BF compiler, preferably in brainfuck. If this Jon Ripley fellow wrote his >100KB brainfuck programs by hand, he should join the Marines or something.

But first I'm going to do some research on what's really supposed to happen when "." sees EOF and EOL. And what should happen after EOF? For example, I noticed that Jon Ripley's bf2* translators all start with the instructions "[.]". That's weird. In most of my implementations, that would eat up all input, never to be seen again. After the EOF has passed, a 27 or a 0 remains in the first memory location, and subsequent "." instructions return without changing anything, I think.

XoK

@noe said:

But I repeat, if you want syntax highlighting, smart indentation and many more
usefull things, use emacs, SciTe or VIM.

VIM ... arrggghhhhhhhh !!!

I installed SUSE 10... just had seen linux 2 or 3 times before (Mandrake)... I took more than one hour to learn how to save on VIM... grrrrrr... ESC ESC :w ENTER

But instead I took an incredible lightning fast setup of a 3G pcmcia card (Option / Qualcomm) from TMN of 15 minutes... the ISP, told me it was not possible to setup on Linux!

Don't you hate those dumb helpdesk girls ?

LOL :D

TGV

The most efficient way I know to go about syntax highlighting is using a finite state machine with output (deterministic or non-deterministic, and Mealey or Moore types, if you want to search for really nasty details).

FSMs (Finite State Machines) are automata used for representing regular expressions. But instead of saying a*b, you have to say something like
- we start in state 1
- if in state 1, and you encounter an "a", go back to state 1
- if in state 1, and you encounter a "b", go to state 2
- if in state 2, the pattern has been recognized

Instead of recognizing the expression, you can attach actions to the states (or the symbols recognized). E.g, if you attach the actions "color the previous character red" to state 1 and "color the previous character blue" to state 2, then a's will get labeled red when followed by a b, and all b's will get labeled blue.

To make it slightly more efficient, you remember the state of your automaton for every block of characters (e.g. 1000 or so), and when a change has been made you start from the last unchanged point. This is lightning fast.

You can generate an FSM from every regular expression. There are enough libraries that can do this. They usually do not support actions, but tools like flex can do quite complex things.

thomas_galvin

This is one of the “wow, this is so simple” problems that turns out to be utterly horrendous when you get around to actually writing the code for it. It seems like pretty much every would-be programmer tries to write their own text editor at some point, and roll their own syntax highlighting while they’re at it; I know I did.

I took several passes at this algorithm:

<o:p> </o:p>First, I did it brute-force; I listened for changes to the document, and colored the entire thing each time, using a series of documentText.indexOf(keywords[i]) type calls. It worked, kind of, but when you got over a few lines, it was slower than a really slow thing.

<o:p> </o:p>So I figured I’d speed things up and use Apache’s RegEx package. This did give me a noticeable speed bump, but it was still unacceptably slow.

<o:p> </o:p>This led me to coloring only the line that was changed. This works, in most cases, and is as fast as I’d ever need it to be. The problem comes from nested syntax modes; a change on a line of JavaDoc code might turn it into Java code, and that might mean one, two, or N lines would need to be recolored. You don’t want to recolor the entire document, but you want to make sure you color everything that has changed.

<o:p> </o:p>What I ended doing was storing a token on each line, telling me what kind of code it was; JavaDoc, Java, String, etc. When a line was changed, it was colored immediately. I would then color each subsequent line, until I encountered a line that did not change type; e.g., it started as Java code, and ended as Java code.

<o:p> </o:p>Just to clarify; when I say a line’s type, I am referring to the type of the last character on the line. For example, the line

/* this is my uber-cool int */ int reallyBigIntJustInCase = 0;

would be considered Java code, while the line

int reallyBigIntJustInCase = 0; /** this is my uber-cool int */

would be considered JavaDoc code.

Numerics can be treated as a special type of keyword; just add another RegEx to the list. Strings can be treated as a syntax mode unto themselves… at which point you get to add escape characters to your algorithm.

This all sounds fairly straight foreword, but the actual implementation spanned several classes, and a few hundred lines of code.

Of course, having proved to myself that I could implement a decent editor-highlighter, I went back to using Jedit.

raid

hi, XoK

I ahve just brought a 3G option card and i am using suse 10, Is great to hear that you get it work. But my ISp
also tols me that it will not work with linux. Can you share with me how you make this work?

Thank You

raid

jwenting

@noe said:

> Anyone got any useful suggestions?

I'm not sure you will consider this as a usefull suggestion, but who knows :
do not write a syntax highlighter. Use an efficient editor, emacs for example.

Someone does have to write those though.
If noone ever wrote something because something existed that does it already there would hardly be any progress in software development.
Why create a new programming language, Bjarne Stroustrup and Bill Gossling for example would have thought, there's C already...
Linus would never have bothered with Linux because there's SunOS, etc. etc.

hoens

The most efficient way I know to go about
syntax highlighting is using a finite state machine with output
(deterministic or non-deterministic, and Mealey or Moore types, if you
want to search for really nasty details).

just out of curiosity... how is a FSM any different than a RE in terms of what you're talking about?

masklinn

@hoens said:

The most efficient way I know to go about
syntax highlighting is using a finite state machine with output
(deterministic or non-deterministic, and Mealey or Moore types, if you
want to search for really nasty details).

just out of curiosity... how is a FSM any different than a RE in terms of what you're talking about?

A RE is merely a pattern matching, it has no memory or state, a state machine (coupled with a tokenizer before it to simplify the work) will consider the current input based on the previous linear one (e.g. the current state of the machine), and will be able to do syntax analysis and validation to boot.

I'd recommand reading Text Processing in Python for more informations about that kind of stuff, it has lightweight introduction to REs and state machines, their use cases and the advantages/disadvantages of both.

For syntax analysis and syntactic coloration, the point of REs is that they can only look for fairly simple structures, or match predefined keywords, state machines are much more flexible and powerful.

cwolves

Assuming you're outputting the code to HTML (can be changed easily enough for anything else):

replace

/(\W)(if|then|for|while|more|words|go|here)(\W)/gi

with

$1$2$3

and repeat for every color. For quotes: /("[^"]+")/gi (same for 's)

if you want it language specific, first match:

(<script( language="javascript">)?>.*?</script>), same for vbscript. Anything
not matched gets the "Default", HTML syntax.

cwolves

Also, check out starlight: http://dean.edwards.name/my/behaviors/#star-light.htc

it's a CSS-based highlighter for just about any language.