Deduplication Standards

thejoe

Are there any standards for deduplicating person or company contact information? I've been looking for some best practices, but so far all I see is a million helper utilities. Any help would be appreciated.

@thejoe said:

Are there any standards for deduplicating person
or company contact information? I've been looking for some best
practices, but so far all I see is a million helper utilities.
Any help would be appreciated.

What exactly do you mean by "deduplicating?"

phelyan

Do you mean something like:

CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
DELETE FROM Contact;
INSERT INTO Contact SELECT * FROM tmp;
DROP TABLE tmp;

(Or 'SELECT DISTINCT * INTO tmp FROM Contact;', depending on your flavour of SQL)

thejoe

@UncleMidriff said:

@thejoe said:
Are there any standards for deduplicating person or company contact information? I've been looking for some best practices, but so far all I see is a million helper utilities. Any help would be appreciated.

What exactly do you mean by "deduplicating?"

Deduplicating is the process of identifying duplicate records and merging them into a single record.

thejoe

@phelyan said:

Do you mean something like:
CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
DELETE FROM Contact;
INSERT INTO Contact SELECT * FROM tmp;
DROP TABLE tmp;
(Or 'SELECT DISTINCT * INTO tmp FROM Contact;', depending on your flavour of SQL)

Sort of. But that's not a great practice because names could have been typo'd or shortened (e.g., Robert -> Rob). To catch some of this I'm using soundex.

What about comparisons of address data? Soundex doesn't work so well with street names.

SurfaceTension

You could geocode a street address. I think those give you back a
validated response if the geocode goes through. If it doesn't go
through, it needs fixing anyways d:

phelyan

@tufty said:

fuckin' amateurs

File: 'functions':

#!/bin/bash --noprofile

ind=1;
for arg in "$@";
do
    if [ $ind -eq 1 ];
    then
        DBASE=$arg;
    elif [ $ind -eq 2 ];
    then
        PASS=$arg;
    elif [ $ind -eq 3 ];
    then
        PARENTTABLE=$arg;
    elif [ $ind -eq 4 ];
    then
        CHILDATTRIBUTE=$arg;
    elif [ $ind -eq 5 ];
    then
        CHILDTABLE=$arg;
    fi
    ind=$(( ind+1 ));
done

if [ -z $DBASE ] || [ $DBASE = "" ];
then
echo -e "\nSyntax: $(basename $0): <database> <password>.";
exit 1; # 1 for failure
fi

if [ -z $PASS ];
then
    echo;
    read -p "root@$DBASE's password:" -s PASS;
fi

if [ -z $DBASE ] ;
then
echo -e "\nSyntax: $(basename $0): <database> <password>.";
exit 1; # 1 for failure
fi

echo;

### Variables: ###############################################################

REM="DELETE " ### You will have to change this to 'DELETE'
OLD_IFS=$IFS;         ### Save IFS for later
IFS="
";                    ### New IFS has only \r.
#DBASE="";
STATEMENT="";
PREPARED="";

### Functions: ###############################################################

# Echo the argument into mysql running in batch mode saving the
# result in $STATEMENT.
function call_db {
    STATEMENT=$(echo "$1;
quit" | mysql $DBASE -uroot -p$PASS --batch --skip-column-names -r);
}

# Turn the results into a comma separated list that can be
# passed back into mysql.
function process_result {
PREPARED="";
    num=0;
    for i in $STATEMENT;
    do
        if [ $i != "to_change" ];
    then
        num=$(( num+1 ));
    if [ $num -eq 1 ];
    then
        PREPARED="'$i'";
    else
        PREPARED="$PREPARED,'$i'";
    fi
        fi
        done
        echo $PREPARED;
}

File 'deduplicateCorpStats.sh':

#!/bin/sh

. ./functions

call_db "select ID from
    CorpStats LEFT JOIN CorpToCorpStats ON ID = corpStats
    where corpStats is null";
process_result;

echo "DELETE FROM CorpStats WHERE ID IN ($PREPARED)";

This can be found in our code. I'm not making this up...

Drak

I've never actually heard of a standard for this...

Also, if there are foreign customers, soundex (which is based on certain letter(comination)s sounding like others) doesn't sound like a good idea.

We're doing something similar (joining two databases) and we're making a screen where it shows those records that it has detected as being the same, but the final choice is always up to the user..

Drak

dhromed

The definition of 'duplicate' is not constant, but can change from system to system.

[code]SELECT DISTINCT * INTO tmp FROM Contact;[/code]

could delete data.

phelyan

@dhromed said:

The definition of 'duplicate' is not constant, but can change from system to system.

[code]SELECT DISTINCT * INTO tmp FROM Contact;[/code]

could delete data.

Or leave data the database considers unique but the business layer considers duplicate (same data but different key). Hence the something like.