ALGORITHM to generate a realistic initial from a full middle name



  • I have a REAL WORLD database full of REAL WORLD middle names. Our company's sending data to an API that doesn't allow full middle names, but does allow middle initial. So I need to come up with an algorithm to generate middle initials from noisy and confused real-world middle name strings.

    LANGUAGE: C#

    My first idea is to take the string and traverse its characters left-to-right until I come across the first one where Char.IsLetter() == true, then use that as the middle initial, or nothing if one isn't found.

    Is there a better method? SOCK IT TO ME!



  • That's pretty much the definition of "initial." Other than allowing for names in RTL languages and scanning from the right where applicable, I don't really see a way to do it better. I'll be interested to see if anyone else has other ideas.



  • Actually, looking at how this code is written, it'd be REALLY handy if someone could come up with an XPath expression to do the algorithm I just spelled out.

    EDIT: it looks like XPath string handling is too rudimentary for this, which means I'm going to have to make a little WTF in this code, I think.



  • I USED LINQ FOR NO GOOD REASON

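            // Returns the first letter in the name as a one-character string,
            // or null when the name is null or contains no letters.
            private string GetInitial(string name)
            {
                if( name == null )
                {
                    return null;
                }
    
                // FirstOrDefault yields '\0' (default(char)) when nothing matches.
                var initial = name.FirstOrDefault(c => char.IsLetter(c));
    
                if( initial == default(char))
                {
                    return null;
                }
    
                return initial.ToString();
            }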
    

  • @blakeyrat said:

    NO GOOD REASON

    Good Reason #1: It was available
    Good Reason #2: Nobody stopped you



  • why not let the database handle it?

    SELECT
    SUBSTRING(middleName, PATINDEX('%[A-Za-z]%', middleName), 1)
    

    that pattern may need some work on it (PATINDEX takes LIKE-style wildcards, not a regex)...



  • The data is originally sourced from a database, but this particular application doing the work only has access to an XML file.

    And the dev who initially wrote it populated the whole "Person" object from XPath strings. Annoying.
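
    A minimal sketch of one way to live with that setup: let XPath select the raw middle-name string and do the letter scan in C# afterwards. The class name, method name, and the `/Person/MiddleName` path are hypothetical; the real XML layout isn't shown in this thread.

    ```csharp
    using System.Linq;
    using System.Xml.Linq;
    using System.Xml.XPath;

    public static class PersonXml
    {
        // Selects one element with an XPath expression, then returns the first
        // letter of its text as a string, or null if the element is missing or
        // its text contains no letters.
        public static string MiddleInitialFromXml(string xml, string xpath)
        {
            XElement node = XDocument.Parse(xml).XPathSelectElement(xpath);
            if (node == null)
            {
                return null;
            }

            // Same scan as GetInitial: first char that char.IsLetter accepts.
            char initial = node.Value.FirstOrDefault(c => char.IsLetter(c));
            return initial == default(char) ? null : initial.ToString();
        }
    }
    ```

    With `<Person><MiddleName> 'quincy</MiddleName></Person>` and the path `/Person/MiddleName`, this returns "q"; a path that matches nothing returns null.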



  • @blakeyrat said:

    Annoying

    yup.

    in that case, I'd keep the LINQ and call it a day



  • I did. The PR's in.



  • How does it work for names from languages that use other alphabets? Like the single Cyrillic letters that get transliterated to English as "Ya.", "Zh.", "Shch.", etc.?



  • From what I can tell, "Char.IsLetter()" takes care of all languages.
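
    A quick way to check that claim, as a sketch: the helper names below are made up for illustration, and the sample characters are just the ones that come up in this thread.

    ```csharp
    using System.Linq;

    public static class LetterCheck
    {
        // True when every UTF-16 char in the string is classified as a letter.
        public static bool AllLetters(string s) => s.All(c => char.IsLetter(c));

        // True when at least one char in the string is classified as a letter.
        public static bool AnyLetter(string s) => s.Any(c => char.IsLetter(c));
    }
    ```

    `LetterCheck.AllLetters("Q\u00E9Яや闇")` is true, and `LetterCheck.AnyLetter("7-. ")` is false. One caveat: each char is a UTF-16 code unit, so a letter outside the Basic Multilingual Plane arrives as two surrogate chars, and neither one passes this check.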


  • I think he means "what happens if you have a name written in Cyrillic that already got transliterated to multiple Latin characters".

    I'd still go with your solution in that case, honestly. Writing something that can recognize that and replace it with a proper Cyrillic character (or a Latin character outside of the ASCII range) is damned near impossible, and fuck knows what the API you're dealing with would do if you fed it a Cyrillic character (if I understood you correctly you're limited to a single character?).



  • RTL and LTR are display properties. Codepoints are always stored in logical order in Unicode strings, so you always scan from the start of the string. What you might want to deal with are combining codepoints (at least normalize the string to composed form, since real names are unlikely to contain characters with no precomposed form).
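
    A sketch of that normalization step, assuming the same first-letter scan as earlier in the thread; the class and method names are hypothetical.

    ```csharp
    using System.Text;

    public static class NameInitials
    {
        // Normalizes to composed form (NFC) so a base letter followed by a
        // combining mark becomes one precomposed char, then returns the first
        // letter as a string, or null if there is none.
        public static string GetNormalizedInitial(string name)
        {
            if (string.IsNullOrEmpty(name))
            {
                return null;
            }

            string composed = name.Normalize(NormalizationForm.FormC);

            // Always scan from the start: strings are stored in logical order.
            foreach (char c in composed)
            {
                if (char.IsLetter(c))
                {
                    return c.ToString();
                }
            }

            return null;
        }
    }
    ```

    For the input "E\u0301mile" (a plain E followed by a combining acute accent) this returns "\u00C9", i.e. É. Without the Normalize call the scan would return a bare "E" and silently drop the accent.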


  • @Onyx said:

    Writing something that can recognize that and replace it with a proper Cyrillic character (or a Latin character outside of the ASCII range) is damned near impossible,

    If I put down "Yami" as my middle name, do I want my middle initial to be Y or Я? Or や? Or should it just be 闇? Without more information, it's literally impossible to tell. So you'd want to assume if I wanted Я I'd have typed Яmi.


  • Right. Overall, it's not worth the effort. If there's transliteration happening somewhere outside of your control you should just ignore it and work with what you got.

