Deduplication Standards



  • Are there any standards for deduplicating person or company contact information?  I've been looking for some best practices, but so far all I see is a million helper utilities.  Any help would be appreciated.



  • @thejoe said:

    Are there any standards for deduplicating person
    or company contact information?  I've been looking for some best
    practices, but so far all I see is a million helper utilities. 
    Any help would be appreciated.




    What exactly do you mean by "deduplicating?"



  • Do you mean something like:

    CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
    DELETE FROM Contact;
    INSERT INTO Contact SELECT * FROM tmp;
    DROP TABLE tmp;

    (Or '<FONT face="Courier New" size=2>SELECT DISTINCT * INTO tmp FROM Contact;</FONT>', depending on your flavour of SQL)



  • @UncleMidriff said:

    @thejoe said:
    Are there any standards for deduplicating person or company contact information?  I've been looking for some best practices, but so far all I see is a million helper utilities.  Any help would be appreciated.


    What exactly do you mean by "deduplicating?"

    Deduplicating is the process of identifying duplicate records and merging them into a single record.



  • @phelyan said:

    Do you mean something like:

    CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
    DELETE FROM Contact;
    INSERT INTO Contact SELECT * FROM tmp;
    DROP TABLE tmp;

    (Or '<FONT face="Courier New" size=2>SELECT DISTINCT * INTO tmp FROM Contact;</FONT>', depending on your flavour of SQL)

    Sort of.  But that's not a great practice because names could have been typo'd or shortened (e.g., Robert -> Rob).  To catch some of this I'm using soundex.

    What about comparisons of address data?  Soundex doesn't work so well with street names.



  • You could geocode a street address. I think those give you back a
    validated response if the geocode goes through. If it doesn't go
    through, it needs fixing anyways d:



  • @tufty said:

    fuckin' amateurs

    File: 'functions':

    #!/bin/bash --noprofile
    ind=1;
    for arg in "$@";
    do
        if [ $ind -eq 1 ];
        then
            DBASE=$arg;
        elif [ $ind -eq 2 ];
        then
            PASS=$arg;
        elif [ $ind -eq 3 ];
        then
            PARENTTABLE=$arg;
        elif [ $ind -eq 4 ];
        then
            CHILDATTRIBUTE=$arg;
        elif [ $ind -eq 5 ];
        then
            CHILDTABLE=$arg;
        fi
        ind=$(( ind+1 ));
    done
    if [ -z $DBASE ] || [ $DBASE = "" ];
    then
    echo -e "\nSyntax: $(basename $0): <database> <password>.";
    exit 1; # 1 for failure
    fi
    if [ -z $PASS ];
    then
        echo;
        read -p "root@$DBASE's password:" -s PASS;
    fi
    if [ -z $DBASE ] ;
    then
    echo -e "\nSyntax: $(basename $0): <database> <password>.";
    exit 1; # 1 for failure
    fi
    echo;
    ### Variables: ###############################################################
    REM="DELETE " ### You will have to change this to 'DELETE'
    OLD_IFS=$IFS;         ### Save IFS for later
    IFS="
    ";                    ### New IFS has only \r.
    #DBASE="";
    STATEMENT="";
    PREPARED="";
    ### Functions: ###############################################################
    # Echo the argument into mysql running in batch mode saving the
    # result in $STATEMENT.
    function call_db {
        STATEMENT=$(echo "$1;
    quit" | mysql $DBASE -uroot -p$PASS --batch --skip-column-names -r);
    }
    # Turn the results into a comma separated list that can be
    # passed back into mysql.
    function process_result {
    PREPARED="";
        num=0;
        for i in $STATEMENT;
        do
            if [ $i != "to_change" ];
        then
            num=$(( num+1 ));
        if [ $num -eq 1 ];
        then
            PREPARED="'$i'";
        else
            PREPARED="$PREPARED,'$i'";
        fi
            fi
            done
            echo $PREPARED;
    }

    File 'deduplicateCorpStats.sh':

    #!/bin/sh
    . ./functions
    call_db "select ID from
        CorpStats LEFT JOIN CorpToCorpStats ON ID = corpStats
        where corpStats is null";
    process_result;
    echo "DELETE FROM CorpStats WHERE ID IN ($PREPARED)";

    This can be found in our code. I'm not making this up...

     


  • I've never actually heard of a standard for this...

    Also, if there are foreign customers, soundex (which is based on certain letter(comination)s sounding like others) doesn't sound like a good idea.

    We're doing something similar (joining two databases) and we're making a screen where it shows those records that it has detected as being the same, but the final choice is always up to the user..

    Drak



  • The definition of 'duplicate' is not constant, but can change from system to system.



    [code]<font face="Courier New">SELECT DISTINCT * INTO tmp FROM Contact;</font>[/code]



    could delete data.



  • @dhromed said:

    The definition of 'duplicate' is not constant, but can change from system to system.



    [code]<font face="Courier New">SELECT DISTINCT * INTO tmp FROM Contact;</font>[/code]



    could delete data.




    Or leave data the database considers unique but the business layer considers duplicate (same data but different key). Hence the something like.


Log in to reply