Partial ("Fuzzy"?) string matching in Javascript



  • So I want a function that takes an answer and tries to match it against the correct answer. This is easy if I'm looking for an exact match (even case-insensitive and/or stripped of excess whitespace). What if I want to match one or all of an array of possible answers? What I have is something like

    (this one should match if the submitted answer contains at least one of the required words or phrases):

    function matchAny(submitted, requiredTerms) {
        "use strict";
        var output = [],
            input = submitted.split(', ');
        input.forEach(function (x) {output.push(requiredTerms.indexOf(x)); });
        return output.some(function (x) {return x !== -1; });
    }
    

    and for the match-all case:

    function matchAll(submitted, requiredTerms) {
        "use strict";
        var output = [],
            input = requiredTerms.toLowerCase().split(', ');
        input.forEach(function (x) {
            output.push(submitted.toLowerCase().indexOf(x));
            return output.every(function (x) { return x !== -1; });
        });
    }
    

    As is obvious from me asking this question, it doesn't work. At all. Is there a better way of doing this? Is it even possible to do without grungy code? Ideally answers would involve pure javascript and be as cross-browser friendly as possible (I need to support everything past IE 9 and especially mobile safari).



  • @Benjamin-Hall couldn't you use String.prototype.includes() in your matchAll case?

    EDIT: no you can't.



  • @lucas1 said in Partial ("Fuzzy"?) string matching in Javascript:

    @Benjamin-Hall couldn't you use String.prototype.includes() in your matchAll case?

    EDIT: no you can't.

    Oh. I just looked at that. Looks like it should work--why not? All I care about is the functionality, the rest can be modified as needed.


  • :belt_onion:

    Split only works on a single character IIRC, not a set of them.

    input = submitted.split(/\W/);
    

    splits on any non-word character and passes my test case of

    matchAny('some phrase here', ['phrase']);
    

    Though you'll probably want to do some mangling (toLowerCase and such).



  • Your matchAll function doesn't return anything, since your return is inside the function being passed into input.forEach.

    I think conceptually they should almost be the same: Once you've split the submitted and required terms, you just want requiredArray.some(...) for matchAny and requiredArray.every(...) for matchAll, with the ... condition being index > -1


  • sockdevs

    function multiMap(submitted, requiredTerms, reducer, initial) {
        "use strict";
        var input = submitted.split(/,/g).map(function(x) {
            return x.replace(/^\s+|\s+$/g, '').toLowerCase();
        });
        return input.reduce(reducer, initial);
    }
    
    function matchAny(submitted, requiredTerms) {
        return multiMap(submitted, requiredTerms, function(acc, val) {
            return acc || requiredTerms.indexOf(val) >= 0;
        }, false);
    }
    
    function matchAll(submitted, requiredTerms) {
        return multiMap(submitted, requiredTerms, function(acc, val) {
            return acc && requiredTerms.indexOf(val) >= 0;
        }, true);
    }
    
    console.log(matchAny('1,2,3', ['3', '4', '5']) === true);
    console.log(matchAny('1,2,3', ['4', '5', '6']) === false);
    console.log(matchAny('1,2,3', ['1', '2', '3']) === true);
    console.log(matchAny('1,2,3', ['3', '2', '1']) === true);
    console.log(matchAny('C,B,A', ['a', 'b', 'c']) === true);
    console.log(matchAll('1,2,3', ['3', '4', '5']) === false);
    console.log(matchAll('1,2,3', ['3', '4', '1']) === false);
    console.log(matchAll('1,2,3', ['1', '2', '3']) === true);
    console.log(matchAll('1,2,3', ['3', '2', '1']) === true);
    

    If i assume more modern browser features or am okay with dropping IE support and making mobile spottier i can tighten that up, but.... that's still a decent function without to much code smell. It does assume requiredTerms will always be an array and will already have been toLowerCase()'d so you may need to adjust if those requirements aren't met.



  • @Benjamin-Hall where does submitted come from? I'm thinking it might be better (more robust) to use .match(/\w+/g) instead of .split(', ') to extract an array of words from it.

    Other than that, this should basically work...

    function matchAny(submitted, requiredTerms) {
        "use strict";
        var input = submitted.split(', '); //var input = submitted.match(/\w+/g);
        return input.some(function (x) { return requiredTerms.indexOf(x) >= 0; });
    }
    

    For matchAll, the only difference is that you'd use input.every instead of input.some, if the criteria is that all of the submitted words must be in the list of required words. (Or was it that all of the required words must be in the list of submitted words? That wasn't really clear.)

    If you need the search to be case-insensitive, just add .toLowerCase() before you split submitted, and ensure that all of the items in requiredTerms are in lowercase so that they'll match.



  • @hungrier

    I just tried it quickly in the js console

    var requiredArray = ['cat','dog','monkey'];
    var inputArray1 = ['dog','horse'];
    var inputArray2 = ['dog','cat','monkey','horse'];
    requiredArray.some((x)=>inputArray1.indexOf(x) > -1) //true
    requiredArray.every((x)=>inputArray1.indexOf(x) > -1) //false
    requiredArray.every((x)=>inputArray2.indexOf(x) > -1) //true
    requiredArray.some((x)=>inputArray2.indexOf(x) > -1) //true
    var inputArray3 = ['horse','cow'];
    requiredArray.some((x)=>inputArray3.indexOf(x)>-1) //false
    requiredArray.every((x)=>inputArray3.indexOf(x)>-1) //false
    

    Once you have your arrays, you just need .every or .some with that indexOf condition



  • @anotherusername submitted comes from user input (basically the output of an <input> field. Right now it's unprocessed except case mangling. I have full control, so I can do any processing needed before feeding it to the validation function.



  • @Benjamin-Hall you're probably going to need something more flexible than ', ' to split on, then. Users will inevitably leave out commas, leave out spaces, add multiple spaces, ...

    Actually, can any of these terms contain embedded spaces?



  • @anotherusername The input is a single long block of text. Think paragraph/sentence form. I was planning to split the required terms and match one at a time. Both the required terms and the input will probably include spaces.

    Example input:

    The answer to the question is apple pie except when it's pink penguins.

    Example matching terms:

    "apple pie, purple penguins"

    matchAny -> true
    matchAll -> false



  • @Benjamin-Hall and, the input is the value you're calling submitted? So, will submitted and requiredTerms both be entered by hand, and need to be split?

    var submitted = "The answer to the question is apple pie except when it's pink penguins.";
    var requiredTerms = "apple pie, purple penguins";
    matchAny(submitted, requiredTerms); // should return true
    matchAll(submitted, requiredTerms); // should return false
    

    Is that correct?





  • That seems to do a large part of what you need. I think if you just matched word-for-word with a list of keywords using that and a decent enough threshold, it would be enough.

    Something like:

    function match(expectedWords, answer) {
        var parsedAnswer = answer.toLowerCase().split(' ');
        var wordMatches = [];
        expectedWords.forEach(function(expectedWord) {
            wordMatches.push(parsedAnswer.some(function(answerWord) { 
                var fset = new FuzzySet([expectedWord]); 
                var find = fset.get(answerWord); 
                return find && find[0] && find[0][0]>=0.9;  //or just expectedWord === answerWord
            })) 
        });
        return wordMatches.indexOf(false) === -1;
    }
    


  • @Benjamin-Hall so matchAny should return true when any of the required terms are found in the input, and matchAll should return true when all of them are found in it.

    What about partial matches? If requiredTerms contains "apple pie", should it match to "apple pies" also?

    var submitted = "The answer to the question is apple pies except when it's pink penguins.";
    var requiredTerms = "apple pie, purple penguins";
    matchAny(submitted, requiredTerms); // should return what?
    


  • @anotherusername My current idea is that the matched terms themselves should be exact. It's for a question/answer system, so these are vocabulary terms mostly. Otherwise, yeah. matchAny should be true if any of the requiredTerms are found in submitted, matchAny if all are (order doesn't matter). Your example should return False.

    @Maciejasjmj That looks interesting. I'll have to look at it more.



  • @Benjamin-Hall the criteria that terms can be multiple words makes the issue rather more complex, since it means that you can't split submitted on anything concrete. I think you'll have to split your terms apart instead, and check the index of each time each term occurs, testing that the characters before and after that match are not word characters. That will ensure that it doesn't match the term "pie" where submitted contains "pies", or "spied".

    Something like this:

    var submitted = "The answer to the question is apple pie except when it's pink penguins.";
    var requiredTerms = "apple pie, purple penguins";
    
    function matchAny(submitted, requiredTerms) {
      submitted = submitted.toLowerCase().replace(/^|\s+|$/g, ' ');
      requiredTerms = requiredTerms.toLowerCase().replace(/\s*,\s*/g, ',').match(/\w[^,]*/g);
      return requiredTerms.some(function (word) {
        for (var i = -1; (i = submitted.indexOf(word, i + 1)) >= 0; ) {
          // for each match, check that the character before and after the match is a non-word character
          if (/\W/.test(submitted[i - 1]) && /\W/.test(submitted[i + word.length])) return true;
        }
      });
    }
    
    function matchAll(submitted, requiredTerms) {
      submitted = submitted.toLowerCase().replace(/^|\s+|$/g, ' ');
      requiredTerms = requiredTerms.toLowerCase().replace(/\s*,\s*/g, ',').match(/\w[^,]*/g);
      return requiredTerms.every(function (word) {
        for (var i = -1; (i = submitted.indexOf(word, i + 1)) >= 0; ) {
          // for each match, check that the character before and after the match is a non-word character
          if (/\W/.test(submitted[i - 1]) && /\W/.test(submitted[i + word.length])) return true;
        }
      });
    }
    


  • @anotherusername Thanks! that seems to work fine (at least with my test cases). I'm grateful for people who know more than I do and are willing to help random strangers.



  • @Benjamin-Hall note that, if you get into modifying the code at all, you should know that first line in each function:

    submitted = submitted.toLowerCase().replace(/^|\s+|$/g, ' ');
    

    not only replaces sequences of multiple spaces with a single space character, but it also adds a space character at the beginning and end of the string. That is deliberate; otherwise there would need to be special cases for terms found at the very beginning or the very end of submitted. It needs to test the character before and after the match, so adding spaces at both ends of the string ensures that there will be a character there to test.

    You can modify that line to have it also remove punctuation, but make sure it still adds spaces at the beginning and end.


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.