.NET string.ToLower() == TooSlower() ?



  • Or: How I reduced my code execution time from 500ms to 0ms by not using the string.ToLower() method

     

    OK to put this in context, this was related to a method I wrote in C# that highlighted text elements in a block of text, for ease of searching. Specifically I used it in my site (www.dominicpettifer.co.uk), notice when you search for say 'Ajax' (minus quotes) it will highlight the word ajax in the text (that site is still using an older unoptimisd version of the highlighting method btw). This function was probably the most difficult peice of code I've ever written in my programming career (it's trickier than you think, don't even think of using replace() method!!). But when I finally got it sorted and bug free, I noticed the page execution time on my page jumped to something like 530ms when searching for   a e i o u n m p s t l y , . w h d r v c  I know searching for 20 individual letters seems silly, but it's only 3 blogs...I had to stress test it somehow. Commenting out the method reduced the execution time to 16-32ms so it was definitely the highlight text function slowing things down. Here's the code (unoptimised version) in full...

    public static string HighlightText(string textBlock, List<String> itemsToHighlight, string startTag, string endTag)
    {
        List<String> orderedItems = new List<String>();

        int itemsCount = itemsToHighlight.Count;

        for (int i = 0; i < itemsCount; i++)
        {
            string temp = "";
            int index = 0;
            int indexOfLargest;

            foreach (string item in itemsToHighlight)
            {
                if (item.Length > temp.Length)
                {
                    temp = item;
                    indexOfLargest = index;
                }

                index++;
            }

            itemsToHighlight.Remove(temp);

            orderedItems.Add(temp);
        }

        int overallPos = 0;

        while(overallPos < textBlock.Length)
        {
            int pos = textBlock.Length;
            string leftMostItem = null;
            foreach (string item in orderedItems)
            {
                int newPos = textBlock.ToLower().IndexOf(item, overallPos);

                if (newPos < pos && newPos != -1)
                {
                    pos = newPos;
                    leftMostItem = item;
                }
            }

            if (leftMostItem != null)
            {
                textBlock = textBlock.Insert(pos, startTag);
                textBlock = textBlock.Insert(pos + leftMostItem.Length + startTag.Length, endTag);
                overallPos = pos + (startTag.Length + endTag.Length) + leftMostItem.Length;
            }
            else
            {
                overallPos = textBlock.Length;
            }
        }

        return textBlock;
    }

    To cut a long story short, notice the highlighted line of code above. eg....

    int newPos = textBlock.ToLower().IndexOf(item, overallPos);

    ...after much debugging, performance testing (using Environment.TickCount to test code segment execution times etc), and general soul searching. I eventually discovered that if I replaced that line of code with the following...

    int newPos = textBlock.IndexOf(item, overallPos, StringComparison.OrdinalIgnoreCase);

     ...basically removing the ToLower() method and using StringComparison.OrdinalIgnoreCase enun param for the IndexOf() method instead, the page execution time dropped right down to 32ms, the time it was getting with the HighLight method commented out (32ms I put down to database access), so effectively the method was executing in 0ms :-) Who would have thought using the ToLower() method would effect performance soo much.

     

    So there you have it. If anyone wants a highly efficient method for highlight portions of text in a block of text, by wraping something around it (span tags say), it's there of you want it :) You call it like this...

    text = FormattingUtils.HighlightText(text, searchItems, "<span style=\'background-color: lightgray\'>', '</span>');

    searchItems argument is a Generic string ArrayList of items you want highlighed, you need to get them into a List<String> somehow on your own I'm afraid. Or you could probably modify the function to use a general purpose array if you're stuck with .NET 1.1

     

    Anyway, I've been working on this on and off for about a 1.5 weeks, so I'd though I'd write about it and share my experiences with other people to help them out. Now all I need is a WTF moment from someone telling me my code is fundamentally flawed somehow...or that someone else has already written this method and I've wasted 1.5 weeks :)
     



  • Ever hear of regex?

    I don't know the .NET way, but in Java:

    public String highlight(Collection<String> searchItems, String text) {
        StringBuilder builder = new StringBuilder();
        for (String item: searchItems) {
            if (builder.length() != 0) {
                    builder.append('|');
            }
            builder.append(item); // You might want to escape item.
        }
        return text.replaceAll('(' + builder.toString() + ')', "<span style='background-color: lightgray'>$1</span>');
    }
    


  • [quote user="danielpitts"]

    Ever hear of regex?

    I don't know the .NET way, but in Java:

    public String highlight(Collection<String> searchItems, String text) {
    StringBuilder builder = new StringBuilder();
    for (String item: searchItems) {
    if (builder.length() != 0) {
    builder.append('|');
    }
    builder.append(item); // You might want to escape item.
    }
    return text.replaceAll('(' + builder.toString() + ')', "<span style='background-color: lightgray'>$1</span>');
    }

    [/quote]

     The problem with the replaceAll() method is that it's case sensitive (same with .NET's Replace() method), so if the user searches 'ajax; it wouldn't highlight 'Ajax' or 'aJAx' etc.



  • Every time you called .tolower() you were creating a new string object, then comparing against that object, then throwing it away.  That'll tend to slow things down. 

    -cw



  • [quote user="Sunday Ironfoot"][quote user="danielpitts"]

    Ever hear of regex?

    I don't know the .NET way, but in Java:

    public String highlight(Collection<String> searchItems, String text) {
    StringBuilder builder = new StringBuilder();
    for (String item: searchItems) {
    if (builder.length() != 0) {
    builder.append('|');
    }
    builder.append(item); // You might want to escape item.
    }
    return text.replaceAll('(' + builder.toString() + ')', "<span style='background-color: lightgray'>$1</span>');
    }

    [/quote]

     The problem with the replaceAll() method is that it's case sensitive (same with .NET's Replace() method), so if the user searches 'ajax; it wouldn't highlight 'Ajax' or 'aJAx' etc.

    [/quote]

    Try:

    <font size="2">

    <font size="2">Regex rx </font><font color="#0000ff" size="2">=</font><font size="2"> </font><font color="#0000ff" size="2">New</font><font size="2"> Regex(item, RegexOptions.IgnoreCase);
    str = rx.Replace(text, "<span style='background-color: lightgray'>$&</span>');

     

    </font>

    </font><font size="2">Regex rx </font><font color="#0000ff" size="2">=</font><font size="2"> </font><font color="#0000ff" size="2">New</font><font size="2"> Regex(item, RegexOptions.IgnoreCase);
    str = rx.Replace(text, "<span style='background-color: lightgray'>$&</span>');

     

    </font>



  • [quote user="jsmith"]<font size="2">Regex rx </font><font color="#0000ff" size="2">=</font><font size="2"> </font><font color="#0000ff" size="2">New</font><font size="2"> Regex(item, RegexOptions.IgnoreCase);
    str = rx.Replace(text, "<span style='background-color: lightgray'>$&</span>');</font>[/quote]

     That also creates another problem in that the text block that gets returned has it's upper case letters converted to lowercase on the words that are searched. For instance the sentence is "Ajax performance techniques" the user searches 'ajax' and ' Performance' turns the sentence into "ajax Performance techniques".

     Also I assume Replace() method replaces only a single item at a time so you'd have to iterate through each item in the array of strings to search for, so if you searched for the word 'style' you'd end up with <span> tags wrapped around your style attribute in the already existing span tags (same goes with background, color, light, gray etc etc.).
     



  • [quote user="Sunday Ironfoot"]

    [quote user="jsmith"]<font size="2">Regex rx </font><font color="#0000ff" size="2">=</font><font size="2"> </font><font color="#0000ff" size="2">New</font><font size="2"> Regex(item, RegexOptions.IgnoreCase);
    str = rx.Replace(text, "<span style='background-color: lightgray'>$&</span>');</font>[/quote]

     That also creates another problem in that the text block that gets returned has it's upper case letters converted to lowercase on the words that are searched. For instance the sentence is "Ajax performance techniques" the user searches 'ajax' and ' Performance' turns the sentence into "ajax Performance techniques".

     Also I assume Replace() method replaces only a single item at a time so you'd have to iterate through each item in the array of strings to search for, so if you searched for the word 'style' you'd end up with <span> tags wrapped around your style attribute in the already existing span tags (same goes with background, color, light, gray etc etc.).
     

    [/quote]

    I tried it and it does preserve the case of the values being replaced.  It also does replace all occurrences.  I will admit that if it ends up replacing an attribute of an existing tag, it will create an invalid document and have unexpected results.  I'm sure someone here with a lot of experience with regular expressions could help eliminate that problem.

    I also timed it and for a small string, it took far less than 1ms to to two replacements.



  • [quote user="jsmith"]

    I tried it and it does preserve the case of the values being replaced.  It also does replace all occurrences.  I will admit that if it ends up replacing an attribute of an existing tag, it will create an invalid document and have unexpected results.  I'm sure someone here with a lot of experience with regular expressions could help eliminate that problem.

    I also timed it and for a small string, it took far less than 1ms to to two replacements.

    [/quote]

     

    You're right it does preserve case, sorry, my bad! Anyhow if you try to search any string that is a substring of the wrapping tags you're using to highlight the text, it all goes a bit wonky, for instance try searching for a e i o u as individual items, not as one word. 



  • [quote user="Sunday Ironfoot"][quote user="jsmith"]

    I tried it and it does preserve the case of the values being replaced.  It also does replace all occurrences.  I will admit that if it ends up replacing an attribute of an existing tag, it will create an invalid document and have unexpected results.  I'm sure someone here with a lot of experience with regular expressions could help eliminate that problem.

    I also timed it and for a small string, it took far less than 1ms to to two replacements.

    [/quote]

     

    You're right it does preserve case, sorry, my bad! Anyhow if you try to search any string that is a substring of the wrapping tags you're using to highlight the text, it all goes a bit wonky, for instance try searching for a e i o u as individual items, not as one word. 

    [/quote]

    Actually, if you keep an original document, and only search/replace on the original, but display the replaced text, it will work.  You would search for the "item" "a|e|i|o|u", then it will highlight a,e,i,o, and u. 



  • @danielpitts said:

    [quote user="Sunday Ironfoot"][quote user="jsmith"]

    I tried it and it does preserve the case of the values being replaced.  It also does replace all occurrences.  I will admit that if it ends up replacing an attribute of an existing tag, it will create an invalid document and have unexpected results.  I'm sure someone here with a lot of experience with regular expressions could help eliminate that problem.

    I also timed it and for a small string, it took far less than 1ms to to two replacements.

     

    You're right it does preserve case, sorry, my bad! Anyhow if you try to search any string that is a substring of the wrapping tags you're using to highlight the text, it all goes a bit wonky, for instance try searching for a e i o u as individual items, not as one word. 

    [/quote]

    Actually, if you keep an original document, and only search/replace on the original, but display the replaced text, it will work.  You would search for the "item" "a|e|i|o|u", then it will highlight a,e,i,o, and u. 

    [/quote]

    Actually you'd usually search for [aeiou].



  • [quote user="danielpitts"]

    Actually, if you keep an original document, and only search/replace on the original, but display the replaced text, it will work.  You would search for the "item" "a|e|i|o|u", then it will highlight a,e,i,o, and u. 

    [/quote]

     Cool, thats works upto a point. However, if I search for s smaller string thats a substring of another search term later on, eg. I search 'data' and 'database' (data is a substring of database), and it's in that order. It will highlight data but not database. I know I'm being picky here, but logically it should highlight the whole word database. I could maybe reorder the words in the array so that larger words are at the top, for instance if you search database followed by data they are both highlighted fine. Also what if the user types in characters used by regular expressions | \ + ? * etc. Again I could probably strip those out (Is # a regex token? What if the user searches 'C#.NET' ?).

    Thanks for the help here people. I was thinking maybe my method was overly long and convoluted and that there must be a simpler method.

     



  • [quote user="Sunday Ironfoot"][quote user="danielpitts"]

    Actually, if you keep an original document, and only search/replace on the original, but display the replaced text, it will work.  You would search for the "item" "a|e|i|o|u", then it will highlight a,e,i,o, and u. 

    [/quote]

     Cool, thats works upto a point. However, if I search for s smaller string thats a substring of another search term later on, eg. I search 'data' and 'database' (data is a substring of database), and it's in that order. It will highlight data but not database. I know I'm being picky here, but logically it should highlight the whole word database. I could maybe reorder the words in the array so that larger words are at the top, for instance if you search database followed by data they are both highlighted fine. Also what if the user types in characters used by regular expressions | \ + ? * etc. Again I could probably strip those out (Is # a regex token? What if the user searches 'C#.NET' ?).

    Thanks for the help here people. I was thinking maybe my method was overly long and convoluted and that there must be a simpler method.

     

    [/quote]

    Search for "\s(\witem\w)" and use $1 instead of $& in the replace method.

    As for tokens, there are 11 of them, [^$.|?*+() all can be fixed by inserting a \ before them.  It would be kinda neat to inform the user that they can search by regular expression.  That would add immense power to your search engine without any extra code on your part.



  • How about calling textBlock.toLower() [i]outside[/i] both loops? Because you're basically doing [code](whiles * foreaches - 1)[/code] redundant toLower()s, seeing as how textBlock, judging form this snippet, never changes for as long as the loops loop.



  • [quote user="Sunday Ironfoot"][quote user="danielpitts"]

    Actually, if you keep an original document, and only search/replace on the original, but display the replaced text, it will work.  You would search for the "item" "a|e|i|o|u", then it will highlight a,e,i,o, and u. 

    [/quote]

     Cool, thats works upto a point. However, if I search for s smaller string thats a substring of another search term later on, eg. I search 'data' and 'database' (data is a substring of database), and it's in that order. It will highlight data but not database. I know I'm being picky here, but logically it should highlight the whole word database. I could maybe reorder the words in the array so that larger words are at the top, for instance if you search database followed by data they are both highlighted fine. Also what if the user types in characters used by regular expressions | \ + ? * etc. Again I could probably strip those out (Is # a regex token? What if the user searches 'C#.NET' ?).

    Thanks for the help here people. I was thinking maybe my method was overly long and convoluted and that there must be a simpler method.

     

    [/quote]

    It should work to sort your "items" by length. I know that POSIX regex goes for longest match, but I guess .NET's goes for first match. Maybe there is a flag you can set? If not, sorting by length descending should do it.

    As someone pointed out, you can easily escape the strings for regex.  It also gives you the flexibility to allow users to USE regex to search by. Think "data(base|bank)?" would match database, data, databank.

    Although, depending on the "typical user", you may want to always escape, or give them a checkbox to allow them to use regex.

     



  • [quote user="masklinn"]

    Actually you'd usually search for [aeiou].

    [/quote]

    I think you might have missed the point masklinn. The user enters a list of strings (not known before hand). They may or may not overlap eachother, as well as exist in the resulting markup. 



  • Actually you should use string.ToUpper(), for .NET's string searching functions are optimized for searching upper case strings.

    Also, you could try to use a StringBuilder instead of a string, but that would require you to rewrite the entire method. The StringBuilder was made for cases like this where you do several changes to a string. Each time you modify the string you create a new string and this will eat memory. So the lines

    textBlock = textBlock.Insert(blabla);

    textBlock = textBlock.Insert(blabla);

    will create three new strings.

    If you change the code to use a string builder and append strings to it the code should run faster and take up much less memory. 

     

     


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.