Help with Regex (Matching words longer than 20 characters)



  • Hi, could someone help me with a regular expression? I want to match all words in a block of text that are longer than 20 characters and can contain any character or number except a space or an HTML angle bracket.

     

    For instance http://www.firingsquad.com/media/hirez.asp?file=/games/battlefield_2142_preview/images/04.jpg would match

    ...but <a href="http://www.firingsquad.com/media/hirez.asp?file=/games/battlefield_2142_preview/images/04.jpg">a link</a> wouldn't, because 'a link' is only 6 characters.

     I guess I would have to match the text between a closing and an opening bracket. So far I have..

    Regex longWord = new Regex(@">.{20,}<");

     
    ...but I can't work out how to exclude spaces and angled brackets. Cheers!
     



  • /[a-zA-Z0-9]{20,}/



  • A string of consecutive alphanumeric chars, but nothing else?

    That's /[a-zA-Z0-9]{20,}/ indeed.



  • A string of 20 alphanumeric (etc) characters outside a tag, i.e. anything within angle brackets is disqualified.



  • Oh dur.

    Then I suppose it's best to strip HTML before testing for {20,} length.
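
    A minimal sketch of that strip-then-test idea, assuming a naive tag pattern is good enough for the input (it is not a real HTML parser, and a > inside an attribute value would confuse it). The helper name FindLongWords and the sample string are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: drop anything that looks like a tag, then collect
// runs of 20+ characters that are neither whitespace nor angle brackets.
static string[] FindLongWords(string html)
{
    // Replace each tag with a space so the words on either side stay separate.
    string text = Regex.Replace(html, @"<[^>]*>", " ");
    return Regex.Matches(text, @"[^\s<>]{20,}")
                .Cast<Match>()
                .Select(m => m.Value)
                .ToArray();
}

// Made-up sample: a long alt attribute (ignored) and a long URL in the text.
string sample = "<img alt=\"reallyReallyReallyReallyLongDescription\" src=\"pic.gif\" /> "
              + "see http://www.example.com/a/very/long/path/04.jpg <a href=\"x.htm\">a link</a>";

foreach (string word in FindLongWords(sample))
    Console.WriteLine(word);
```

    Because the tags are removed before the length test, attribute values such as the long alt text never reach the second pattern.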
     



  • [quote user="growse"]/[a-zA-Z0-9]{20,}/[/quote]

     

    I also want to include underscores, slashes, full stops, hyphens etc. etc... basically anything from the ASCII set that's not either a space or an angle bracket. Also, is it possible to make sure the incredibly long word isn't part of an attribute inside HTML elements, so that for instance <img alt="reallyReallyReallyReallyReallyReallyLongDescriptionForThisImage" src="pic.gif" /> is ignored?

    I am stripping out (or parsing) HTML angled brackets entered by the user, however it may still contain HTML due to things like looking for line feeds or carriage returns and replacing them with <br /> tags, also the user can enter arbitrary tags such as [ b ]bold[/ b ] or [ url ="somelink.htm"]click here[/ url ] etc. (spaces added to stop them being parsed) which are converted to bold and click here



  • I think you're asking too much from regex.

    I would search for "[\w\d]{20,}" in between tags.
     



  • Just so you know what I'm trying to do: I'm trying to deal with incredibly long words that can break my HTML layout (by forcing the widths of Ps and DIVs to the width of the long word). From the research I've done, apparently there's nothing in HTML/CSS that can handle this (apart from some proprietary IE-only properties), and I've been told that this needs to be handled server side, so that's what I'm trying to do, by looking for long words via Regex.



  • An admirable goal; the only problem is that a string's length doesn't correspond directly to its width.

    "Four" can be wider than "lili" (unless you are using a fixed-width font).

    I think your problem is more of a layout problem than a regex one.  I would adjust my CSS before trying to programmatically inspect the text.
     



  • [quote user="danielpitts"]

    An admirable goal; the only problem is that a string's length doesn't correspond directly to its width.

    "Four" can be wider than "lili" (unless you are using a fixed-width font).

    I think your problem is more of a layout problem than a regex one.  I would adjust my CSS before trying to programmatically inspect the text.
     

    [/quote]

    I understand that, so I'd choose a word length which is well within the width of the containing element to play it safe. I can use overflow:auto as an alternative to overcome the problem of long words, but it does force an ugly scrollbar on the container element. Otherwise there is no other way if you're using a fixed-width page design.

    IMHO the lack of a CSS property that forcibly breaks long words onto the next line with, say, a hyphen (like in a newspaper column) is a major oversight by the CSS designers and W3C. IE has a property (word-wrap: break-word, I think) but obviously it's IE only. Sorry, rant over!

    I think I'd still like to tackle this, if only for the experience, but perhaps without Regex, maybe by brute-force searching through the entire string letter by letter.
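
    For what it's worth, a letter-by-letter scan along those lines might look like the sketch below. It assumes that everything between < and > is a tag and should be skipped; the helper name ScanLongWords and its minLength parameter are invented for illustration:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: walk the text once, tracking whether we are inside
// a tag, and collect runs of non-space characters at least minLength long.
static List<string> ScanLongWords(string input, int minLength = 20)
{
    var found = new List<string>();
    bool insideTag = false;
    int runStart = -1; // index where the current word began, or -1 if none

    for (int i = 0; i <= input.Length; i++)
    {
        // Treat the end of the input as a delimiter so the last word is flushed.
        bool atEnd = i == input.Length;
        char c = atEnd ? ' ' : input[i];
        bool delimiter = atEnd || insideTag || c == '<' || c == '>' || char.IsWhiteSpace(c);

        if (delimiter)
        {
            // A word just ended: keep it only if it is long enough.
            if (runStart >= 0 && i - runStart >= minLength)
                found.Add(input.Substring(runStart, i - runStart));
            runStart = -1;
        }
        else if (runStart < 0)
        {
            runStart = i;
        }

        // Update the tag state after classifying the current character.
        if (c == '<') insideTag = true;
        else if (c == '>') insideTag = false;
    }
    return found;
}

foreach (string word in ScanLongWords("<a href=\"somelink.htm\">click here</a> ReallyReallyReallyLongWord"))
    Console.WriteLine(word);
```

    The scan is a single pass with no backtracking, so the cost grows linearly with the length of the text.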
     



  • What you are attempting is to validate the string, not search it. So, think what you want to allow in.

    http://www.codeproject.com/csharp/RegexTester.asp

    The article above is a nice little one, with a nifty tool to help build your expression. Start out simple and then keep adding what is allowed.

    The expression below allows an optional ' at the beginning of a word (as in 'deLong), then requires one or more letters or numbers (the +), then my grouping of allowables:

    The ? means optional. The first alternative allows spaces, the pipe | is an or operator, and ' - % $ are all that the user is otherwise allowed to put in.

    ^((')?[A-Za-z0-9]+( *|'|-|%|\$)?)+$

     hope this helps.



  • Hope you don't mind if I conduct a small test, just to see what happens...

    ReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyLooooooooooooongWord

    Sorry if it screws anybody's layout up. Just want to see how other CMSs cope.



  • [quote user="Sunday Ironfoot"]

    IMHO the lack of a CSS property that forcibly breaks long words onto the next line with, say, a hyphen (like in a newspaper column) is a major oversight by the CSS designers and W3C. IE has a property (word-wrap: break-word, I think) but obviously it's IE only. Sorry, rant over![/quote]

    To counter your rant: do you really expect the W3C to spend the effort in CSS to define multilingual word-breaking rules?  The rules would be prohibitively complex.  The concept of word-wrap is nice, but how do you word-wrap in Arabic script?  Traditional Chinese?  Russian?

     



  • [quote user="danielpitts"][quote user="Sunday Ironfoot"]

    IMHO the lack of a CSS property that forcibly breaks long words onto the next line with, say, a hyphen (like in a newspaper column) is a major oversight by the CSS designers and W3C. IE has a property (word-wrap: break-word, I think) but obviously it's IE only. Sorry, rant over![/quote]

    To counter your rant: do you really expect the W3C to spend the effort in CSS to define multilingual word-breaking rules?  The rules would be prohibitively complex.  The concept of word-wrap is nice, but how do you word-wrap in Arabic script?  Traditional Chinese?  Russian?

    [/quote]

    Well, the CSS 3 draft does define properties for line-breaking and hyphenation. It doesn't try to specify how it should be done; it just states that the normal rules for the script in question should be used.

    Of course, this is CSS 3 we're talking about.  Whether any of us will live to see the spec finalised, let alone implemented, is a different matter. ;-)



  • [quote user="Sunday Ironfoot"]Just so you know what I'm trying to do: I'm trying to deal with incredibly long words that can break my HTML layout (by forcing the widths of Ps and DIVs to the width of the long word). From the research I've done, apparently there's nothing in HTML/CSS that can handle this (apart from some proprietary IE-only properties), and I've been told that this needs to be handled server side, so that's what I'm trying to do, by looking for long words via Regex.
    [/quote]

    overflow: scroll
     



  • [quote user="Sunday Ironfoot"]Hi, could someone help me with a regular expression? [/quote]

    Sorry to revive a dead post, but I just now decided to read something besides the Side Bar posts and saw this post.  I thought I would post this anyway, in case someone has a similar problem to the OP.  The following would have worked:

    new Regex( @"(?<!<[^>]*)\s*([^\s><]{20,})" );

    Hopefully, this helps someone else later on down the road.
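
    A short sketch of how that pattern could be exercised, using a shortened variant of the link example from the first post. The sample string and variable names are assumptions for illustration. The negative lookbehind (?<!<[^>]*) is unbounded, which the .NET engine accepts but many other regex engines do not:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The pattern from the post above, unchanged. The lookbehind rejects any
// position that has an unclosed < before it, i.e. any position inside a tag.
var longWord = new Regex(@"(?<!<[^>]*)\s*([^\s><]{20,})");

// Made-up sample: one long URL inside an href, one in plain text.
string sample = "<a href=\"http://www.firingsquad.com/media/hirez.asp?file=04.jpg\">a link</a> "
              + "http://www.firingsquad.com/media/hirez.asp?file=/games/04.jpg";

// Group 1 holds the word itself, without the leading whitespace.
string[] words = longWord.Matches(sample)
                         .Cast<Match>()
                         .Select(m => m.Groups[1].Value)
                         .ToArray();

foreach (string w in words)
    Console.WriteLine(w);
```

    Only the URL in the plain text is reported; the one inside the href is skipped. One caveat: a stray < in ordinary text would also suppress matches until the next >.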
