Similarity of short text



  • My current project will allow people to import items from a file, identifying records using a unique item code that they provide. If the item code is found in the database, that record is updated, otherwise a new record is created.

    I need to make the whole thing as simple as possible, but I want to avoid the risk of someone trying to create an item using what they think is a unique code, and inadvertently overwriting an existing record.

    I was thinking of comparing the item name (which must always be provided) against the record in the database to show a warning if the value has changed too much.

    I won't stifle your creativity by specifying how the similarity should be calculated, I'm very open to ideas. It is important that simply shifting a matching section (by adding or removing some text preceding it) should not result in a large change to the similarity. Ideally, the final similarity rating should be normalised, perhaps as a value between 0 and 1, allowing me to choose a value to act as the warning trigger.

    The item name field is constrained to 100 characters and will not include newlines. Also, I'm using PHP*.

    Any suggestions?


    * Yes, TRWTF is PHP, no need to mention that below.



  • I will mention that PHP's similar_text() seems to offer decent results in testing, but I don't know anything about the algorithm that it uses and the results are slightly confusing. You can get different similarity ratings by switching the two terms that are being compared.

    The breakdown on this SO thread is quite interesting.

    It might still be adequate for my needs, but other ideas are welcome.



  • Of course you can because switching the order necessarily makes them less similar...



  • @Arantor said:

    Of course you can because switching the order necessarily makes them less similar...

    Sarcasm, or am I missing something?



  • No, straight up. It's literally part of how the algorithm works.

    Consider: you're ultimately talking about the distance one has to go from A to B, but the route from A to B is not the same route as from B to A.



  • How many items are we talking about here? Something that would help is auto-completing the item name entry (or even providing a combobox if the number is low) and filling-in the item code from the DB based on that.

    It still leaves open the possibility of a typo in the item code, if the user edits it or decides to ignore the auto-fill and retype it themselves, but I'm guessing the error reduction would make this question moot.



  • There will be a thousand or so items, but the people that will be using it will only be happy if they can maintain their own, separate files. They also do most of the work offline.

    Items can also be imported in three separate ways, for stock, for an auction or for a sale on behalf of a client, and the required fields are different for each. The ways people generate these files are very varied and I'd never be able to get them to use a specific spreadsheet.

    It's a nice idea, I'd definitely choose something like that if I had more control over the users' processes.



  • I don't see anything in your description that prevents you from doing it. A "few thousand" records is trivial to shove in a browser's Local Storage (or, if they're working offline, your PHP is already presumably storing something somewhere.)

    But I don't get how the spreadsheet is involved. I thought it was a PHP site?

    Well anyway. Good luck.



  • I'm specifically trying to solve import from a spreadsheet. There's already protection in place to prevent them from duplicating item codes through the web user interface.


  • Discourse touched me in a no-no place

    @Keith said:

    I'm specifically trying to solve import from a spreadsheet. There's already protection in place to prevent them from duplicating item codes through the web user interface.

    It might be possible to reuse the validation code you've already got, assuming it wasn't done in a horrible way.

    But you also need to think about how you report the results of the import. Maybe via another spreadsheet? You don't want to fail the whole thing on a single dupe, so you end up with a row per imported row and can say in that row whether the import of the corresponding row worked or failed (and if it failed, how). We've hit this sort of thing with doing bulk data curation, and make as much work as possible is a useful approach (yes, it's the opposite of how much computing works).

    The really good thing about a spreadsheet is that there are some excellent GUI tools for editing them, and your users probably have at least one installed already…



  • Auto complete for inputs, and binary checksums against existing items has always done well for me.

    You can also use exists/not exists, or in/not in when inserting new reference data.


Log in to reply