Detecting random typing



  • Hello,

    does anyone know an algorithm that decides whether an input is random typing (like cats on keyboard, head on keyboard, or good old "asdf") or text in a language? It doesn't have to check the grammar or whether the words really exist. Something like inverse Markov chains: "macrodestic" might pass, "naizsubawi" not.

    Thanks in advance,
    derari



  • You'd need to be able to figure out what the keymap is. Once you have that, collect the position of each key hit relative to the previous one. If you have a bunch of key hits that are next to each other, there is a high chance it's random typing; if the key hits are very far away from each other, the chance is really low.
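
A minimal sketch of the key-distance idea, assuming a US QWERTY layout treated as a plain grid (real keyboards stagger the rows). `QWERTY_ROWS`, `KEY_POS` and `adjacent_ratio` are names made up for this illustration; the function returns the fraction of consecutive letter pairs that sit on the same or neighbouring keys.

```python
# Assumption: US QWERTY, letters only, rows modelled as an unstaggered grid.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to its (row, column) position on the grid.
KEY_POS = {
    ch: (row, col)
    for row, keys in enumerate(QWERTY_ROWS)
    for col, ch in enumerate(keys)
}


def adjacent_ratio(text):
    """Fraction of consecutive letter pairs typed on the same or touching keys."""
    # Non-letters (including spaces) are skipped, so pairs may span word breaks.
    letters = [c for c in text.lower() if c in KEY_POS]
    pairs = list(zip(letters, letters[1:]))
    if not pairs:
        return 0.0
    close = 0
    for a, b in pairs:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            close += 1
    return close / len(pairs)


if __name__ == "__main__":
    for sample in ["asdf", "macrodestic"]:
        print(sample, adjacent_ratio(sample))
```

A ratio close to 1.0 points at keys being mashed in a row. Where to put the cut-off is a tuning decision, and short real words whose letters happen to be neighbours will score high as well.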



  • @Lingerance said:

    You'd need to be able to figure out what the keymap is. Once you have that, collect the position of each key hit relative to the previous one. If you have a bunch of key hits that are next to each other, there is a high chance it's random typing; if the key hits are very far away from each other, the chance is really low.
    I want someone to implement this and then see how many common English words are kicked out by this.  I don't think it's a bad idea.  I just want to see what English words could conceivably be typed by a cat walking across my keyboard.



  • @belgariontheking said:

    I just want to see what English words could conceivably be typed by a cat walking across my keyboard.

    Depending on the tolerance, and presuming QWERTY:

    • fads
    • loop
    • kill
    • kilo
    • jut
    • deer
    • sad
    • ass
    • look
    • greet
    • weed
    • seed
    • we
    Also, certain phrases may trigger it:
    • CDE (Common Desktop Environment)
    • lo (Loop-back device on Linux)
    • Re
    • JK (Just Kidding)
    • kj (kilojoule)
    • kl (kilolitre)
    • nm (Never Mind)
    • bg (Background, also a job control command available in bash, and possibly other shells)
    • hg (Hg, the chemical symbol for mercury; also the command for the Mercurial source-control tool)

    This is non-exhaustive, but gives a rough idea. You'll note that most of these are very short words, so the longer the word the better chance it has.



  • @Lingerance said:

    You'd need to be able to figure out what the keymap is. Once you have that, collect the position of each key hit relative to the previous one. If you have a bunch of key hits that are next to each other, there is a high chance it's random typing; if the key hits are very far away from each other, the chance is really low.

    It's more complicated than that, if you are detecting cat-like typing, as they have multiple legs.  Unless the keyboard is very small, or the cat is very large, the farthest keys are very unlikely to be pressed one after the other by the cat.  However, it's most likely going to be a cluster of keys followed by a cluster of keys not too far away, then two clusters in the opposite direction, then clusters much further away in roughly the direction of the initial movement.

    If you were looking for snake-like typing, one can use the algorithm you described to great effect.

    That having been said, knowing the keymap is key.  Without it, you have no chance.



  • @Lingerance said:

    @belgariontheking said:
    I just want to see what English words could conceivably be typed by a cat walking across my keyboard.
    Depending on the tolerance, and presuming QWERTY:
    Good legwork, but not what I had in mind.  I was thinking of words like lyre, in which every letter continues off to the left.  Act would be another one, running to the right.



  • You could check for words without vowels. (Or with long sequences of consonants) Having digits or punctuation in words is also suspicious, unless you consider 13375P34K a language.
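
A rough sketch of the vowel and consonant-run checks, under a few stated assumptions: `y` counts as a vowel, apostrophes and hyphens inside a word are tolerated, and the limit of five consecutive consonants is a guess rather than a measured value. The names `longest_consonant_run` and `looks_suspicious` are invented for this example.

```python
import string

VOWELS = set("aeiouy")  # assumption: treat 'y' as a vowel


def longest_consonant_run(word):
    """Length of the longest unbroken run of consonant letters in word."""
    longest = run = 0
    for c in word.lower():
        if c.isalpha() and c not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def looks_suspicious(word, max_run=5):
    """True if word has digits/symbols, no vowel, or a long consonant run."""
    core = word.strip(string.punctuation)
    if not core:
        return False
    # Allow apostrophes and hyphens inside words such as "don't".
    letters = core.replace("'", "").replace("-", "")
    if not letters.isalpha():
        return True
    if not any(c in VOWELS for c in letters.lower()):
        return True
    return longest_consonant_run(letters) > max_run


if __name__ == "__main__":
    for w in ["hello", "sdfgh", "13375P34K", "don't"]:
        print(w, looks_suspicious(w))
```

Abbreviations and vowel-poor words will be flagged too, so this probably works better as one signal among several than as a verdict on its own.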



  • Well, you could call these guys: http://bitboost.com/pawsense/ :)

    But otherwise I would look at speed of typing as well. I'm pretty sure the pattern of keypresses viewed over time will be different between typing normal text and humbug, although it will probably take a lot of tests and a lot of false positives before you can perfect something like that ;) But maybe it could be combined with the letter-distance stuff above in a sort of point system.

    Also, QWERTY isn't the only kid on the block; some people also use AZERTY or Dvorak, so the letter-distance thing would be keyboard dependent.

     

    Personally, I think you're better off building a Bayesian, spam-filter-like database to distinguish between real text and fake text, and even then it's going to be nigh-impossible.
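
One lightweight way to approximate such a database is a character-bigram model with add-one smoothing, which is closer to the Markov-chain idea from the first post than to a full Bayesian classifier. The sketch below is only an illustration: `normalize`, `train_bigrams` and `avg_log_prob` are hypothetical names, and the model is only as good as the sample text it is trained on.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumption: a-z plus space only


def normalize(text):
    """Lowercase, turn anything outside ALPHABET into a space, squeeze spaces."""
    kept = "".join(c if c in ALPHABET else " " for c in text.lower())
    return " ".join(kept.split())


def train_bigrams(corpus):
    """Count character pairs and how often each character starts a pair."""
    text = normalize(corpus)
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    return pair_counts, context_counts


def avg_log_prob(text, model):
    """Average log-probability per character pair; higher is more language-like."""
    pair_counts, context_counts = model
    text = normalize(text)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")  # too short to judge
    total = 0.0
    for a, b in pairs:
        # Add-one smoothing: unseen pairs get a small non-zero probability.
        p = (pair_counts[(a, b)] + 1) / (context_counts[a] + len(ALPHABET))
        total += math.log(p)
    return total / len(pairs)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog and then the cat sat on the mat"
    model = train_bigrams(sample)
    for s in ["the cat and the dog", "naizsubawi"]:
        print(s, avg_log_prob(s, model))
```

To turn the score into a decision you would still need a threshold, or a second model trained on known gibberish to compare against; both would have to be tuned on real data.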


  • @ammoQ said:

    You could check for words without vowels.
    Hmm - that'll be a false positive for Welsh then...



  • @stratos said:

    some people also use AZERTY or Dvorak
    But, in general, we don't care about them.



  •  @PJH said:

    @ammoQ said:
    You could check for words without vowels.
    Hmm - that'll be a false positive for Welsh then...

    I guess we have a new law then:

    Sufficiently advanced Welsh is indistinguishable from random typing.



  • @ammoQ said:

    Sufficiently advanced Welsh is indistinguishable from random typing.
     

    Like, my cat was looking for the track "Dwr Budr" on isohunt, but he was blocked by Pawsense. He was miffed, I can tell you that!



  • Here are a few thoughts:

     

    Make a list of all 2 letter combinations that are impossible in any language.  QW, YT, PD, etc.  If there are more than a few of these combinations in a good amount of text, then you can assume random typing.

    Check the average length of words in the text.  Random text should have longer sequences of characters between spaces.

    Check the distribution of letters.  I'd assume all languages would have each letter appear at a certain frequency, but random typing would have all letters appear equally often.
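
The three checks above can be sketched roughly as follows. `RARE_PAIRS` holds only the three pairs named in the post and stands in for a real precomputed list; the function names are invented for this example, and the letter-distribution check simply measures how far the text is from a perfectly even spread over a-z.

```python
import string
from collections import Counter

# Placeholder list: just the three pairs from the post, not a vetted set.
RARE_PAIRS = {"qw", "yt", "pd"}


def rare_pair_ratio(text):
    """Share of within-word letter pairs that appear in RARE_PAIRS."""
    pairs = [w[i:i + 2] for w in text.lower().split() for i in range(len(w) - 1)]
    if not pairs:
        return 0.0
    return sum(p in RARE_PAIRS for p in pairs) / len(pairs)


def average_word_length(text):
    """Mean number of characters between whitespace."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def distance_from_uniform(text):
    """Total variation distance between the letter distribution and uniform.

    0.0 means every letter a-z is equally frequent; values near 1.0 mean
    the text is dominated by very few letters.
    """
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    n = len(letters)
    return 0.5 * sum(abs(counts[ch] / n - 1 / 26) for ch in string.ascii_lowercase)


if __name__ == "__main__":
    sample = "qwpdyt asdfasdf"
    print(rare_pair_ratio(sample), average_word_length(sample),
          distance_from_uniform(sample))
```

On short inputs the letter distribution is very noisy, so that measure is likely to be the weakest of the three; all thresholds would need tuning per language.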



  • @Salami said:

    Make a list of all 2 letter combinations that are impossible in any language.  QW, YT, PD, etc.
     

    Ytterbium. (any language)

    Opdonderen (Dutch)

    QW might be correct, though, but I think I have sufficiently demonstrated the infeasibility of your cunning plan.



  • @dhromed said:

    @Salami said:

    Make a list of all 2 letter combinations that are impossible in any language.  QW, YT, PD, etc.
     

    Ytterbium. (any language)

    Opdonderen (Dutch)

    QW might be correct, though, but I think I have sufficiently demonstrated the infeasibility of your cunning plan.

     

    @Salami said:

    If there are more than a few of these combinations in a good amount of text, then you can assume random typing.

    I think this will still work. Thanks for all the comments, you had a lot of good ideas :D



  • @dhromed said:

    QW might be correct, though, but I think I have sufficiently demonstrated the infeasibility of your cunning plan.
    I think there are some stores in America called Qwik Stop.  So if you want to keep your cat from writing about shitty convenience stores, this would work.



  • @dhromed said:

    @Salami said:

    Make a list of all 2 letter combinations that are impossible in any language.  QW, YT, PD, etc.
     

    Ytterbium. (any language)

    Opdonderen (Dutch)

    QW might be correct, though, but I think I have sufficiently demonstrated the infeasibility of your cunning plan.

     

    Qwerty is an English word, according to Merriam-Webster



  • @Lingerance said:

    This is non-exhaustive, but gives a rough idea. You'll note that most of these are very short words, so the longer the word the better chance it has.
     

    "stewardesses"

     Pretty sure that's one of the longest (if not the longest) common English words typed with contiguous letters on a QWERTY keyboard. It's well known as the longest word typed only with the left hand (if you are typing properly). The equivalent words for the right hand aren't quite contiguous.



  • I don't understand why you don't want to use a dictionary or dictionaries. Why is that? Surely if more than 50% of the typing can be found in a dictionary, then the most probable conclusion is that it's not a cat. Sure, there may be other ways, but this seems the simplest.


    You also need some context for your question. Is this in a Word document, an email, or something? What I said is true for those, but while playing games "wwwwddddddssssssswwwwwddddxxxxxxxx" is a pretty likely combination, and in Excel it may be predominantly numbers (which are close together). Neither of those is a cat either.
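
A minimal sketch of the dictionary test with the 50% rule from this post. `KNOWN_WORDS` is a tiny hypothetical word set used only to keep the example self-contained; in practice it would be loaded from a word list for the language in question.

```python
import re

# Hypothetical stand-in for a real dictionary file.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "a", "is", "this", "not"}


def dictionary_hit_rate(text, words=KNOWN_WORDS):
    """Fraction of tokens in text that are found in the word set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def looks_like_language(text, threshold=0.5):
    """True if more than threshold of the tokens are dictionary words."""
    return dictionary_hit_rate(text) > threshold


if __name__ == "__main__":
    for s in ["the cat sat on the mat", "asdf jkl qwer"]:
        print(s, looks_like_language(s))
```

The set lookup itself is cheap; the cost is the size of the word list, which grows with every supported language.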



  •  Okay, a bit more context:

    Given is a string that SHOULD contain a sentence in some language (around 2 - 50 characters, English in most cases; the language is known and precomputed information like a "list of forbidden character pairs" is available).

    The task is to find out whether it is likely that someone did not enter something meaningful but just typed randomly. Analyzing grammar or content is not necessary; the algorithm should be light and fast.

    I thought about using a dictionary, but I figured that would increase the physical size of the algorithm quite a lot (especially if many languages are supported). However, I think using a dictionary only for short words might be a good idea.

