Deduplication Standards
-
Are there any standards for deduplicating person or company contact information? I've been looking for some best practices, but so far all I see is a million helper utilities. Any help would be appreciated.
-
@thejoe said:
Are there any standards for deduplicating person
or company contact information? I've been looking for some best
practices, but so far all I see is a million helper utilities.
Any help would be appreciated.
What exactly do you mean by "deduplicating?"
-
Do you mean something like:
CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
DELETE FROM Contact;
INSERT INTO Contact SELECT * FROM tmp;
DROP TABLE tmp;(Or '<FONT face="Courier New" size=2>SELECT DISTINCT * INTO tmp FROM Contact;</FONT>', depending on your flavour of SQL)
-
@UncleMidriff said:
@thejoe said:
Are there any standards for deduplicating person or company contact information? I've been looking for some best practices, but so far all I see is a million helper utilities. Any help would be appreciated.
What exactly do you mean by "deduplicating?"Deduplicating is the process of identifying duplicate records and merging them into a single record.
-
@phelyan said:
Do you mean something like:
CREATE TABLE tmp SELECT DISTINCT * FROM Contact;
DELETE FROM Contact;
INSERT INTO Contact SELECT * FROM tmp;
DROP TABLE tmp;(Or '<FONT face="Courier New" size=2>SELECT DISTINCT * INTO tmp FROM Contact;</FONT>', depending on your flavour of SQL)
Sort of. But that's not a great practice because names could have been typo'd or shortened (e.g., Robert -> Rob). To catch some of this I'm using soundex.
What about comparisons of address data? Soundex doesn't work so well with street names.
-
You could geocode a street address. I think those give you back a
validated response if the geocode goes through. If it doesn't go
through, it needs fixing anyways d:
-
@tufty said:
fuckin' amateurs
File: 'functions':
#!/bin/bash --noprofile
ind=1;
for arg in "$@";
do
if [ $ind -eq 1 ];
then
DBASE=$arg;
elif [ $ind -eq 2 ];
then
PASS=$arg;
elif [ $ind -eq 3 ];
then
PARENTTABLE=$arg;
elif [ $ind -eq 4 ];
then
CHILDATTRIBUTE=$arg;
elif [ $ind -eq 5 ];
then
CHILDTABLE=$arg;
fi
ind=$(( ind+1 ));
doneif [ -z $DBASE ] || [ $DBASE = "" ];
then
echo -e "\nSyntax: $(basename $0): <database> <password>.";
exit 1; # 1 for failure
fiif [ -z $PASS ];
then
echo;
read -p "root@$DBASE's password:" -s PASS;
fiif [ -z $DBASE ] ;
then
echo -e "\nSyntax: $(basename $0): <database> <password>.";
exit 1; # 1 for failure
fiecho;
### Variables: ###############################################################
REM="DELETE " ### You will have to change this to 'DELETE'
OLD_IFS=$IFS; ### Save IFS for later
IFS="
"; ### New IFS has only \r.
#DBASE="";
STATEMENT="";
PREPARED="";### Functions: ###############################################################
# Echo the argument into mysql running in batch mode saving the
# result in $STATEMENT.
function call_db {
STATEMENT=$(echo "$1;
quit" | mysql $DBASE -uroot -p$PASS --batch --skip-column-names -r);
}# Turn the results into a comma separated list that can be
# passed back into mysql.
function process_result {
PREPARED="";
num=0;
for i in $STATEMENT;
do
if [ $i != "to_change" ];
then
num=$(( num+1 ));
if [ $num -eq 1 ];
then
PREPARED="'$i'";
else
PREPARED="$PREPARED,'$i'";
fi
fi
done
echo $PREPARED;
}File 'deduplicateCorpStats.sh':
#!/bin/sh
. ./functions
call_db "select ID from
CorpStats LEFT JOIN CorpToCorpStats ON ID = corpStats
where corpStats is null";
process_result;echo "DELETE FROM CorpStats WHERE ID IN ($PREPARED)";
This can be found in our code. I'm not making this up...
-
I've never actually heard of a standard for this...
Also, if there are foreign customers, soundex (which is based on certain letter(comination)s sounding like others) doesn't sound like a good idea.
We're doing something similar (joining two databases) and we're making a screen where it shows those records that it has detected as being the same, but the final choice is always up to the user..
Drak
-
The definition of 'duplicate' is not constant, but can change from system to system.
[code]<font face="Courier New">SELECT DISTINCT * INTO tmp FROM Contact;</font>[/code]
could delete data.
-
@dhromed said:
The definition of 'duplicate' is not constant, but can change from system to system.
[code]<font face="Courier New">SELECT DISTINCT * INTO tmp FROM Contact;</font>[/code]
could delete data.
Or leave data the database considers unique but the business layer considers duplicate (same data but different key). Hence the something like.