Deduplication Standards



  • Are there any standards for deduplicating person or company contact information?  I've been looking for some best practices, but so far all I see is a million helper utilities.  Any help would be appreciated.



  • @thejoe said:

    Are there any standards for deduplicating person
    or company contact information?  I've been looking for some best
    practices, but so far all I see is a million helper utilities. 
    Any help would be appreciated.




    What exactly do you mean by "deduplicating?"



  • Do you mean something like:

    CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
    DELETE FROM Contact;
    INSERT INTO Contact SELECT * FROM tmp;
    DROP TABLE tmp;

    (Or '<FONT face="Courier New" size=2>SELECT DISTINCT * INTO tmp FROM Contact;</FONT>', depending on your flavour of SQL)



  • @UncleMidriff said:

    @thejoe said:
    Are there any standards for deduplicating person or company contact information?  I've been looking for some best practices, but so far all I see is a million helper utilities.  Any help would be appreciated.


    What exactly do you mean by "deduplicating?"

    Deduplicating is the process of identifying duplicate records and merging them into a single record.



  • @phelyan said:

    Do you mean something like:

    CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
    DELETE FROM Contact;
    INSERT INTO Contact SELECT * FROM tmp;
    DROP TABLE tmp;

    (Or '<FONT face="Courier New" size=2>SELECT DISTINCT * INTO tmp FROM Contact;</FONT>', depending on your flavour of SQL)

    Sort of.  But that's not a great practice because names could have been typo'd or shortened (e.g., Robert -> Rob).  To catch some of this I'm using soundex.

    What about comparisons of address data?  Soundex doesn't work so well with street names.



  • You could geocode a street address. I think those give you back a
    validated response if the geocode goes through. If it doesn't go
    through, it needs fixing anyways d:



  • @tufty said:

    fuckin' amateurs

    File: 'functions':

    #!/bin/bash --noprofile
    ind=1;
    for arg in "$@";
    do
        if [ $ind -eq 1 ];
        then
            DBASE=$arg;
        elif [ $ind -eq 2 ];
        then
            PASS=$arg;
        elif [ $ind -eq 3 ];
        then
            PARENTTABLE=$arg;
        elif [ $ind -eq 4 ];
        then
            CHILDATTRIBUTE=$arg;
        elif [ $ind -eq 5 ];
        then
            CHILDTABLE=$arg;
        fi
        ind=$(( ind+1 ));
    done
    if [ -z $DBASE ] || [ $DBASE = "" ];
    then
    echo -e "\nSyntax: $(basename $0): <database> <password>.";
    exit 1; # 1 for failure
    fi
    if [ -z $PASS ];
    then
        echo;
        read -p "root@$DBASE's password:" -s PASS;
    fi
    if [ -z $DBASE ] ;
    then
    echo -e "\nSyntax: $(basename $0): <database> <password>.";
    exit 1; # 1 for failure
    fi
    echo;
    ### Variables: ###############################################################
    REM="DELETE " ### You will have to change this to 'DELETE'
    OLD_IFS=$IFS;         ### Save IFS for later
    IFS="
    ";                    ### New IFS has only \r.
    #DBASE="";
    STATEMENT="";
    PREPARED="";
    ### Functions: ###############################################################
    # Echo the argument into mysql running in batch mode saving the
    # result in $STATEMENT.
    function call_db {
        STATEMENT=$(echo "$1;
    quit" | mysql $DBASE -uroot -p$PASS --batch --skip-column-names -r);
    }
    # Turn the results into a comma separated list that can be
    # passed back into mysql.
    function process_result {
    PREPARED="";
        num=0;
        for i in $STATEMENT;
        do
            if [ $i != "to_change" ];
        then
            num=$(( num+1 ));
        if [ $num -eq 1 ];
        then
            PREPARED="'$i'";
        else
            PREPARED="$PREPARED,'$i'";
        fi
            fi
            done
            echo $PREPARED;
    }

    File 'deduplicateCorpStats.sh':

    #!/bin/sh
    . ./functions
    call_db "select ID from
        CorpStats LEFT JOIN CorpToCorpStats ON ID = corpStats
        where corpStats is null";
    process_result;
    echo "DELETE FROM CorpStats WHERE ID IN ($PREPARED)";

    This can be found in our code. I'm not making this up...

     


  • I've never actually heard of a standard for this...

    Also, if there are foreign customers, soundex (which is based on certain letter(comination)s sounding like others) doesn't sound like a good idea.

    We're doing something similar (joining two databases) and we're making a screen where it shows those records that it has detected as being the same, but the final choice is always up to the user..

    Drak



  • The definition of 'duplicate' is not constant, but can change from system to system.



    [code]<font face="Courier New">SELECT DISTINCT * INTO tmp FROM Contact;</font>[/code]



    could delete data.



  • @dhromed said:

    The definition of 'duplicate' is not constant, but can change from system to system.



    [code]<font face="Courier New">SELECT DISTINCT * INTO tmp FROM Contact;</font>[/code]



    could delete data.




    Or leave data the database considers unique but the business layer considers duplicate (same data but different key). Hence the something like.


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.