Those who do not study regular expressions are doomed



  • to write code like this:

    if ((!strcmp($category[0], "a")) || (!strcmp($category[0], "A")) ||
    (!strcmp($category[0], "e")) || (!strcmp($category[0], "E")) ||
    (!strcmp($category[0], "i")) || (!strcmp($category[0], "I")) ||
    (!strcmp($category[0], "o")) || (!strcmp($category[0], "O")) ||
    (!strcmp($category[0], "u")) || (!strcmp($category[0], "U")))
    { echo " an "; }
    else
    { echo " a "; };



  • Why even use regexps?

    function isVowel($c)
    {
        return stripos('aeiouy', $c) !== false;
    }

    echo isVowel($category[0]) ? ' an ' : ' a ';



  • I work with a guy who claims to have been developing for years, but I got a look at his code the other day. He was flat-out lying. The man loves using Select Case statements, he uses If only when absolutely necessary. Regular expressions? Yeah right. To translate his programming style (which is in VB btw) to this particular problem, here would be his solution:

    Dim strChar As String
    Dim strReturn As String

    strChar = category(0)

    Select Case strChar

       Case "a"
          strReturn = " an "

       Case "e"
          strReturn = " an "

       Case "i"
          strReturn = " an "

       Case "o"
          strReturn = " an "

       Case "u"
          strReturn = " an "

       Case "A"
          strReturn = " an "

       Case "E"
          strReturn = " an "

       Case "I"
          strReturn = " an "

       Case "O"
          strReturn = " an "

       Case "U"
          strReturn = " an "

       Case Else
          strReturn = " a "

    End Select

     



  • @Manni said:

    I work with a guy who claims to have been
    developing for years, but I got a look at his code the other day. He
    was flat-out lying....

    He might not be lying.  He sounds like one of those guys who hasn't learned anything new since he started developing.



  • So what's the RegExp solution? Because I think I would rather maintain something like this than try to figure out a RegExp expression that somebody probably copied off the web...

    (All pseudocode, of course)

    switch ($category[0].toUpper())
    {
      case "A", "E", "I", "O", "U":
        echo " an ";
        break;

      default:
        echo " a ";
        break;
    }

    RegExp here is probably like using a chainsaw to sharpen a pencil.

    Though, I don't think this solution would work for all words. I can't think of any, but I'm sure there are many words that don't start with a vowel that would be preceded by "an". Like "historic" if you happen to be in the UK.

     



  • ... and by convention, there are words that start with a vowel that take "a" rather than "an" (think "eu" words like European, eucalyptus).



  • @A Wizard A True Star said:

    So what's the RegExp solution? Because I think I would rather maintain something like this than try to figure out a RegExp expression that somebody probably copied off the web...


    Something similar to md2perpe's stricmp solution, but using

    preg_match('/^[aeiou]/i', $category)

    He's right that using a regexp is excessive.
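
For comparison, here is a minimal sketch of an anchored, case-insensitive check, written in Python rather than the thread's PHP. The helper names `starts_with_vowel_letter` and `naive_article` are hypothetical, and the check looks only at the first letter, not at how the word sounds.

```python
import re

# Anchored at the start, so only the first character is tested.
VOWEL_START = re.compile(r"^[aeiou]", re.IGNORECASE)


def starts_with_vowel_letter(word: str) -> bool:
    """Return True if the first character is a, e, i, o or u (any case)."""
    return VOWEL_START.match(word) is not None


def naive_article(word: str) -> str:
    """Pick the article from the first letter alone.

    This is the letter-based rule from the thread, so it is knowingly
    wrong for words such as "hour" or "European".
    """
    return "an" if starts_with_vowel_letter(word) else "a"
```

Without the `^` anchor the pattern would match a vowel anywhere in the string, which is probably not what was intended.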



  • I think it's more if a word starts with a vowel sound then you use
    "an".  So words like "hour" where the 'h' is silent are preceded by
    "an", and words like "yellow" are preceded by "a".

    I think it's to prevent the speaker from having to speak two vowel sounds in a row, but seeing as English is weird there are undoubtedly exceptions.



  • @TealVeal said:

    I think it's more if a word starts with a vowel sound then you use "an".  So words like "hour" where the 'h' is silent are preceded by "an", and words like "yellow" are preceded by "a".

    I think it's to prevent the speaker from having to speak two vowel sounds in a row, but seeing as English is weird there are undoubtedly exceptions.

    As far as I know there aren't any words starting with Y in my Oxford English dictionary that must be preceded with "an". It wasn't in the OP, either.

    You have a point regarding "hour", though.



  • @welcor said:

    @TealVeal said:

    I think it's more if a word starts with a vowel sound then you use "an".  So words like "hour" where the 'h' is silent are preceded by "an", and words like "yellow" are preceded by "a".

    I think it's to prevent the speaker from having to speak two vowel sounds in a row, but seeing as English is weird there are undoubtedly exceptions.

    As far as I know there aren't any words starting with Y in my Oxford English dictionary that must be preceded with "an". It wasn't in the OP, either.

    You have a point regarding "hour", though.


    One might occasionally wish to speak of a single yttrium-oxide component, might one not? The indefinite article doesn't always immediately precede a "countable" noun.


  • Regex stuff aside, professional applications can't ever do this.



    There are some things that do not have a code solution, and the rules
    of English is one of those things.  The application is going to
    look like it was written by incompetent retards when phrases like this
    show up:



    a honest day's work

    an useful idea

    an euphoric feeling



    Just do the damn work of writing out your sentences and stop trying to be so "clever" by composing them in code.



    Obviously the same goes for forming plurals.  Don't do it in
    code.  Just write the separate strings.  (Java's ChoiceFormat
    class makes this easier.  I'm not sure what other languages offer.)
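
One way to read "just write the separate strings" is a table of complete, hand-written sentences selected by count. The sketch below is in Python; the `MESSAGES` table, the `files_deleted` key and the `render` helper are all made up for illustration and are not Java's ChoiceFormat.

```python
# Hypothetical message table: every variant is a complete, hand-written sentence,
# so no article or plural is ever assembled in code.
MESSAGES = {
    "files_deleted": {
        "zero": "No files were deleted.",
        "one": "One file was deleted.",
        "many": "{count} files were deleted.",
    },
}


def render(key: str, count: int) -> str:
    """Select the pre-written sentence for this count and fill in the number."""
    variants = MESSAGES[key]
    if count == 0:
        form = "zero"
    elif count == 1:
        form = "one"
    else:
        form = "many"
    return variants[form].format(count=count)
```

The code only chooses between strings; the grammar stays in the text a person wrote.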



  • @VGR said:

    The application is going to
    look like it was written by incompetent retards when phrases like this
    show up:



    a honest day's work

    an useful idea

    an euphoric feeling


    In this case, the category was constrained to a specific set of options, all of which did make sense when the article was selected this way. Trying to handle arbitrary words wouldn't be a good idea.



  • @VGR said:

    Regex stuff aside, professional applications can't ever do this.



    There are some things that do not have a code solution, and the rules
    of English is one of those things.


    I think it can be done. Just use a (big ?) dictionary for all known exceptions, and default rules for the rest.
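
The "exceptions first, default rule for the rest" idea can be sketched as below, in Python. The two word sets are tiny illustrative samples drawn from words mentioned in this thread, not a real exception dictionary, and `article_for` is a hypothetical name.

```python
# Illustrative samples only; a usable list would be far larger.
AN_EXCEPTIONS = {"hour", "honest"}  # consonant letter, vowel sound
A_EXCEPTIONS = {"useful", "euphoric", "european", "eucalyptus"}  # vowel letter, "you" sound

VOWEL_LETTERS = ("a", "e", "i", "o", "u")


def article_for(word: str) -> str:
    """Look the word up in the exception sets, then fall back to the letter rule."""
    key = word.lower()
    if key in AN_EXCEPTIONS:
        return "an"
    if key in A_EXCEPTIONS:
        return "a"
    # Default rule: decide by the first letter.
    return "an" if key.startswith(VOWEL_LETTERS) else "a"
```

This handles single words only; it cannot tell the abbreviation "UP" from the word "up", which needs context.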



  • But there are too many exceptions.  And it doesn't help that you need to consider context - you can't just look at the next word.

    For example, write a trivial algorithm that will take sentences like the following and replace the asterisks with the appropriate article:

    UNION PACIFIC CUSTOMER LOGIN: REGISTER AS * UP CUSTOMER HERE
    IS THE ECONOMY ON * UP OR A DOWN?




  • @Iago said:

    But there are too many exceptions.  And it doesn't help that you need to consider context - you can't just look at the next word.

    For example, write a trivial algorithm that will take sentences like the following and replace the asterisks with the appropriate article:

    UNION PACIFIC CUSTOMER LOGIN: REGISTER AS * UP CUSTOMER HERE
    IS THE ECONOMY ON * UP OR A DOWN?




    Since I'm not a native speaker: what's the correct answer? In my naivety, it would replace the asterisk in both cases with "AN", but if it was that easy, you would not have asked?



  • @A Wizard A True Star said:

    So what's the RegExp solution?



    my $word = ($category =~ /^[aeiou]/i) ? 'an' : 'a';
    print " $word ";

    RegExp here is probably like using a chainsaw to sharpen a pencil.

    Though, I don't think this solution would work for all words. I can't think of any, but I'm sure there are many words that don't start with a vowel that would be preceded by "an". Like "historic" if you happen to be in the UK.
     


    Regular expressions are great for simple pattern matching, so I don't agree with the "chainsaw" comment.

    As you and others have pointed out, this algorithm produces broken English for a lot of words, e.g. "euphemism". The word "historic" is a bad example though. The "h" is not silent, and so it should be "a historic" in the UK and elsewhere. A lot of people make this mistake, including some "grammar nazis".



  • @amigan said:

    The word "historic" is a bad example though. The "h" is not silent, and so it should be "a historic" in the UK and elsewhere. A lot of people make this mistake, including some "grammar nazis".


    Can't say I agree with you here.  In the UK, I believe it is better to use "an historic" than "a historic".

    Perhaps this makes me a so-called "grammer nazi".



  • @dharbige said:

    @amigan said:
    The word "historic" is a bad example though. The "h" is not silent, and so it should be "a historic" in the UK and elsewhere. A lot of people make this mistake, including some "grammar nazis".


    Can't say I agree with you here.  In the UK, I believe it is better to use "an historic" than "a historic".

    Perhaps this makes me a so-called "grammer nazi".

    well 2 things - it's "grammar", without an E, and
    technically, it is better to use "a historic", but "an" has become acceptable for whatever reason.
    http://www.wsu.edu/~brians/errors/anhistoric.html



  • @dharbige said:

    @amigan said:
    The word "historic" is a bad example though. The "h" is not silent, and so it should be "a historic" in the UK and elsewhere. A lot of people make this mistake, including some "grammar nazis".


    Can't say I agree with you here.  In the UK, I believe it is better to use "an historic" than "a historic".

    Perhaps this makes me a so-called "grammer nazi".


    If you would normally drop the 'h', then you would use 'an'. That's more of a dialect thing though. For example, a cockney may say "The 'istorical artifact", in which case it would be consistent for them to say "An 'istorical artifact".

    The following page seems to imply that I'm advocating the "traditional rule", but also implies that either is ok:
    By "grammar nazi" I mean people who flame about grammar, so I wouldn't say you were one.



  • @ammoQ said:

    @VGR said:
    Regex stuff aside, professional applications can't ever do this.



    There are some things that do not have a code solution, and the rules
    of English is one of those things.


    I think it can be done. Just use a (big ?) dictionary for all known exceptions, and default rules for the rest.


    Do you have any idea how large that dictionary might be? And if it doesn't exist yet, how many hours would you have to spend putting it together? I have a dictionary file that I use for various purposes, it has 110,000 words. I'm certainly not signing up for that job.



  • how about "hour"



  • @ammoQ said:

    @Iago said:
    But there are too many exceptions.  And it doesn't help that you need to consider context - you can't just look at the next word.

    For example, write a trivial algorithm that will take sentences like the following and replace the asterisks with the appropriate article:

    UNION PACIFIC CUSTOMER LOGIN: REGISTER AS * UP CUSTOMER HERE
    IS THE ECONOMY ON * UP OR A DOWN?




    Since I'm not a native speaker: what's the correct answer? In my naivety, it would replace the asterisk in both cases with "AN", but if it was that easy, you would not have asked?

    The first instance is an abbreviation of the Union Pacific name, and is normally pronounced "you pee" (knock it off with the third-grade chuckles, you guys). The second is the word "up". So the first would take "a" and the second, "an".



  • @Manni said:

    @ammoQ said:
    @VGR said:
    Regex stuff aside, professional applications can't ever do this.



    There are some things that do not have a code solution, and the rules
    of English is one of those things.


    I think it can be done. Just use a (big ?) dictionary for all known exceptions, and default rules for the rest.


    Do you have any idea how large that dictionary might be? And if it doesn't exist yet, how many hours would you have to spend putting it together? I have a dictionary file that I use for various purposes, it has 110,000 words. I'm certainly not signing up for that job.




    Of those 110,000 words, how many are exceptions? Why should e.g. a
    company like Microsoft that needs such a dictionary e.g. for their
    grammar checker in MS Office not be able to do it?



  • @Stan Rogers said:


    The first instance is an abbreviation of
    the Union Pacific name, and is normally pronounced "you pee" (knock it
    off with the third-grade chuckles, you guys). The second is the word
    "up". So the first would take "a" and the second, "an".




    Ah, I see. But IMO a similar problem (how to pronounce UP) must be
    solved for text-to-speech software, so once that is solved, the "a or
    an" question is solved, too.



  • @ammoQ said:

    @Manni said:
    @ammoQ said:
    @VGR said:
    Regex stuff aside, professional applications can't ever do this.



    There are some things that do not have a code solution, and the rules
    of English is one of those things.


    I think it can be done. Just use a (big ?) dictionary for all known exceptions, and default rules for the rest.


    Do you have any idea how large that dictionary might be? And if it doesn't exist yet, how many hours would you have to spend putting it together? I have a dictionary file that I use for various purposes, it has 110,000 words. I'm certainly not signing up for that job.




    Of those 110,000 words, how many are exceptions? Why should e.g. a
    company like Microsoft that needs such a dictionary e.g. for their
    grammar checker in MS Office not be able to do it?


    Perhaps I wasn't explicit enough in my response to get my meaning across. OK let's say there are 12 exceptions total in the list. Does that mean you don't have to look through all the words? Sure you can eliminate chunks of the list, but the resulting number of exceptions has no bearing on how long the search will take.

    You're starting to sound like management material, e.g. your multiple uses of the term "e.g." (I'm using it for comedic effect, btw), the suggestion to just whip up a dictionary list of exceptions to the "a"/"an" article rules, and implying that Microsoft's grammar checker is worth a damn.



  • @Manni said:



    You're starting to sound like management material, e.g. your multiple uses of the term "e.g." (I'm using it for comedic effect, btw), the suggestion to just whip up a dictionary list of exceptions to the "a"/"an" article rules, and implying that Microsoft's grammar checker is worth a damn.


    I believe you misunderstand me, and I probably misunderstand you. Never mind.



  • If we're talking about web pages here, you should probably use the "<abbr>" tag to denote that one of the "UP"s is an abbreviation, and, extending that for screen readers, you should use the CSS:
    abbr { speak: spell-out }

    and then everyone's happy.

    Although, I personally wouldn't fret too much about a/an for some generated text. Users are relatively understanding. If it were for something important though, like a letter to a client or a certificate, you probably wouldn't be using automated language generation anyway.



  • @ammoQ said:

    @Manni said:


    You're starting to sound like management material, e.g. your multiple uses of the term "e.g." (I'm using it for comedic effect, btw), the suggestion to just whip up a dictionary list of exceptions to the "a"/"an" article rules, and implying that Microsoft's grammar checker is worth a damn.


    I believe you misunderstand me, and I probably misunderstand you. Never mind.


    Re-reading my posts from yesterday, I may have been overly-harsh in my responses. I apologize for that, because it does seem like we're just not seeing eye-to-eye. I have no problem agreeing to disagree on the idea.



  • @VGR said:

    Regex stuff aside, professional applications can't ever do this.

    There are some things that do not have a code solution, and the rules of English is one of those things.  The application is going to look like it was written by incompetent retards when phrases like this show up:

    a honest day's work
    an useful idea
    an euphoric feeling

    Just do the damn work of writing out your sentences and stop trying to be so "clever" by composing them in code.

    Obviously the same goes for forming plurals.  Don't do it in code.  Just write the separate strings.  (Java's ChoiceFormat class makes this easier.  I'm not sure what other languages offer.)

    Absolutely right.

    English is a Crazy Language



  • I remember benching a precompiled regex vs a chain of 5 or so string.Replace()'s. I was mildly surprised to see the .replaces absolutely smoke the regex by an order of magnitude. :/



  • @phx said:

    I remember benching a precompiled regex vs a chain of 5 or so string.Replace()'s. I was mildly surprised to see the .replaces absolutely smoke the regex by an order of magnitude. :/


    I suppose that's possible, given a certain setup.

    What was the setup, might I ask?
    And by that I mean: show us the code!



  • This was a while ago... it was quite simple (obviously simple enough for replace to handle it) and was along the lines of stripping certain control characters out of strings for an absolutely shithouse parser a colleague wrote.

    A WTF in itself, it was along the lines of replacing $1, $2 with parameters in a sql query. Kinda like a parameterized query, but difficult to use and buggy. So I had to strip $ signs out of user input, and a few others....

    So you compare a regex that does an output = Regex.Replace(input, "[$ etc]", "") with a output = input.Replace("$", "").Replace().Replace().Replace()......;

    Iterate thru that 10k times each and time it.

    Incidentally, one thing I remember was using Milliseconds instead of TotalMilliseconds. That had me really confused for a while.
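
The trap is that a component property returns only one field of the duration, not the whole duration. A rough Python analogue (an illustration, not the poster's .NET code) is `timedelta.microseconds` versus dividing the whole `timedelta` by one millisecond. The 2.35-second interval is an arbitrary example value.

```python
from datetime import timedelta

# Arbitrary example interval: 2 seconds and 350 milliseconds.
elapsed = timedelta(seconds=2, milliseconds=350)

# Component only: the sub-second part, converted to milliseconds.
component_ms = elapsed.microseconds // 1000

# Whole duration expressed in milliseconds.
total_ms = elapsed / timedelta(milliseconds=1)

print(component_ms, total_ms)
```

For any run longer than a second the two numbers differ, and the component value quietly discards the whole seconds.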



  • Crap.

    Also build a precompiled regex object before you run the benchmark.

    I guess the main difference is Replace looks tardy, but it's just a heap allocation for a new string and a scan-and-copy of blocks. Regex obviously isn't the tool for something that simple - I guess my point is that you don't need an industrial press to open a can of beans. (Unless you are trying to look cool)
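
A comparable experiment can be sketched in Python with `timeit`; this is not the original .NET benchmark. The characters in `STRIP_CHARS` and the sample string are assumptions, since the post only says "$ signs ... and a few others". The pattern is compiled once before timing, as suggested above.

```python
import re
import timeit

STRIP_CHARS = "$;'"  # hypothetical set of characters to remove
PATTERN = re.compile("[" + re.escape(STRIP_CHARS) + "]")  # compiled once, up front


def strip_with_regex(text: str) -> str:
    """Remove every unwanted character in a single regex pass."""
    return PATTERN.sub("", text)


def strip_with_replace(text: str) -> str:
    """Remove the same characters with one chained replace per character."""
    for ch in STRIP_CHARS:
        text = text.replace(ch, "")
    return text


if __name__ == "__main__":
    sample = "SELECT * FROM t WHERE a = '$1'; -- $2"
    for fn in (strip_with_regex, strip_with_replace):
        seconds = timeit.timeit(lambda: fn(sample), number=10_000)
        print(f"{fn.__name__}: {seconds * 1000:.1f} ms for 10,000 runs")
```

The relative timings depend on the runtime, the input length and the number of characters being stripped, so treat any single run as indicative only.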



  • I have an industrial press next to my refrigerator specifically for that purpose. Damn tuna cans.

    Also, I am terribly cool.



  • @dhromed said:

    I have an industrial press next to my refrigerator specifically for that purpose. Damn tuna cans.

    Also, I am terribly cool.


    Looks that way :-)

    I use a Helium-neon laser, but then I am an evil villain.  Or is it 'a evil villain'?  I forget.  Henchman, fetch me another grammer nazi to torture!

    Simon



  • The industrial press has many uses.



  • @VGR said:

    There are some things that do not have a code solution, and the rules of English is one of those things.

    I disagree. You just have to go about it a different way -- you disassemble words into phonemes rather than letters. The use of "a" versus "an" is actually very straightforward under that model. "Useful" and "euphoric" both start with the "you" sound, so they don't get the "an"; "honest" starts with "ah", so it does, while "house" starts with an aspirated "h", so it doesn't. (Edit: To clarify, I'm referring to American English pronunciation rules...) Many of the rules governing the English language are quite a bit simpler if you think in terms of phonemes.

    Of course, doing it that way requires a lot more code than checking for five vowels. I certainly agree that in virtually every case, if possible, sentences and strings should be composed beforehand.
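
A toy version of the phoneme idea might look like the Python below. The `FIRST_PHONEME` table is a hand-made, hypothetical stand-in for a real pronouncing dictionary, using ARPAbet-style symbols and assuming American English pronunciations; `article_by_sound` is likewise a made-up name.

```python
# Hypothetical mini pronunciation table: word -> first phoneme (ARPAbet-style).
# A real system would consult a full pronouncing dictionary instead.
FIRST_PHONEME = {
    "useful": "Y",
    "euphoric": "Y",
    "honest": "AA",
    "hour": "AW",
    "house": "HH",
    "apple": "AE",
}

# ARPAbet vowel symbols; anything else counts as a consonant sound.
VOWEL_PHONEMES = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def article_by_sound(word: str) -> str:
    """Choose the article from the word's first sound rather than its first letter."""
    phoneme = FIRST_PHONEME.get(word.lower())
    if phoneme is None:
        # Unknown words are reported rather than guessed at.
        raise KeyError(f"no pronunciation known for {word!r}")
    return "an" if phoneme in VOWEL_PHONEMES else "a"
```

The rule itself is one line; all the effort sits in obtaining a correct first phoneme for every word, which is the "lot more code" mentioned above.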



  • @triso said:

     Manni wrote:

    I work with a guy who claims to have been developing for years, but I got a look at his code the other day. He was flat-out lying....

    He might not be lying.  He sounds like one of those guys who hasn't learned anything new since he started developing.

    Yup. Plenty of guys like that around. Started on PLCs and haven't learned a new thing since 1985. The whole thing sounds very much like an old PLC programmer.


  •  You replied to a nearly 5 year old thread.



  • @Stan Rogers said:

     ammoQ wrote:
     Iago wrote:
    But there are too many exceptions.  And it doesn't help that you need to consider context - you can't just look at the next word.

    For example, write a trivial algorithm that will take sentences like the following and replace the asterisks with the appropriate article:

    UNION PACIFIC CUSTOMER LOGIN: REGISTER AS * UP CUSTOMER HERE
    IS THE ECONOMY ON * UP OR A DOWN?




    Since I'm not a native speaker: what's the correct answer? In my naivety, it would replace the asterisk in both cases with "AN", but if it was that easy, you would not have asked?

    The first instance is an abbreviation of the Union Pacific name, and is normally pronounced "you pee" (knock it off with the third-grade chuckles, you guys). The second is the word "up". So the first would take "a" and the second, "an".
    I'm not a native speaker either. I thought my grasp of the English language was quite good, but things like this have me throw my hands up in despair.


  • @dhromed said:

     You replied to a nearly 5 year old thread.

    Mommy, Heron did it first!

    But the English language hasn't changed a lot has it? So my comment is still valid. Just assume I posted it 4 years ago.



  •  An hotel is a good one (exception). Because the 'h' is pronounced, whereas it isn't in hour.



  • @RogerWilco said:

    @dhromed said:

     You replied to a nearly 5 year old thread.

    Mommy, Heron did it first!

    But the English language hasn't changed a lot has it? So my comment is still valid. Just assume I posted it 4 years ago.

    That, and it was a spammer that rez'd the thread first anyways.



  • @RogerWilco said:

    Mommy, Heron did it first!
     

    I actually meant Heron, but you slipped a post in there.



  • @dhromed said:

    @RogerWilco said:

    Mommy, Heron did it first!
     

    I actually meant Heron, but you slipped a post in there.

    Interesting trivia:

    When you're viewing posts in this forum, look for the Next/Previous buttons in the upper-right. Being CS, the "Next" button (obviously) takes you to the previous item, and the "Previous" button (natch) takes you to the next item.

    But what's even weirder is that if you hit "Previous" on the most current item, you actually get the *first thread ever*. This one: http://forums.thedailywtf.com/forums/t/1242.aspx from 2004.

    I've made that "mistake" about a hundred times.



  • @dhromed said:

     You replied to a nearly 5 year old thread.

    Hmm. I didn't look at previous post dates, I just assumed it was new because it showed up in my Side Bar RSS feed... I assumed, you know, that it worked like every other RSS feed in the world. Yet another CS WTF, I guess?



  • @Heron said:

    Yet another CS WTF, I guess?

    I dunno. I would think that the necroposting spambot would be more of a WTF. Also, I was going to post pretty much what you did (but less eloquently), except I looked at the dates.


  • @Heron said:

    Hmm. I didn't look at previous post dates, I just assumed it was new because it showed up in my Side Bar RSS feed... I assumed, you know, that it worked like every other RSS feed in the world. Yet another CS WTF, I guess?
    Spam gets posted, RSS feed gets updated, spam gets deleted, you post.



    Which bit is wrong?



  • @PJH said:

    @Heron said:
    Hmm. I didn't look at previous post dates, I just assumed it was new because it showed up in my Side Bar RSS feed... I assumed, you know, that it worked like every other RSS feed in the world. Yet another CS WTF, I guess?
    Spam gets posted, RSS feed gets updated, spam gets deleted, you post.



    Which bit is wrong?

    The bit where normally, the RSS feed doesn't show a post as "new" every time someone replies to it. I guess CS just thinks that if it's really really old, it should show up in the RSS feed again?



  • @blakeyrat said:

    When you're viewing posts in this forum, look for the Next/Previous buttons in the upper-right. Being CS, the "Next" button (obviously) takes you to the previous item, and the "Previous" button (natch) takes you to the next item.

    But what's even weirder is that if you hit "Previous" on the most current item, you actually get the *first thread ever*. This one: http://forums.thedailywtf.com/forums/t/1242.aspx from 2004.

    I've made that "mistake" about a hundred times.

    I see. You can go to the most recent thread by going to "first thread ever" and hitting "Next". That's useful.
