Computer vision system can be fooled by handwritten notes
-
Researchers from machine learning lab OpenAI have discovered that their state-of-the-art computer vision system can be fooled by handwritten notes.
-
Based on that image, should the headline be "Computer vision system can be fooled by completely obscuring the object being identified with a piece of paper"?
-
Can’t wait to try this one out at an ATM
-
“We refer to these attacks as typographic attacks,” write OpenAI’s researchers in a blog post.
How about creators-are-lazy-bums-and-didn't-vet-their-training-set-correctly-so-now-the-model-has-hundreds-of-obvious-vulnerabilities attacks?
Like, just how dumb do you have to be to not figure out that having text on images is going to skew the results of training?
-
So, basically, this is just like the classic eBay scam where you order an iPod for a suspiciously low price and receive a picture of an iPod. Or an empty iPod box.
Which, apparently, does work and does confuse Natural Intelligence - often enough to be worth the effort. Given that the purpose of Artificial Intelligence is to emulate Natural Intelligence... It's not a Bug, it's a Feature.
-
This is funny. But it's really just a mistake of weighting - it's correctly identified the text 'iPod' but it hasn't been properly trained to know that text alone is not a good primary indicator of an object.
If you already know the thing is an apple, and you read the label that says 'Cox', then you can say, yes it's almost certainly a Cox. But if I wrote 'Cox' in marker pen on an orange, it's still an orange. This should be an apple (or piece of paper) first, and 'iPod' isn't in those categories.
I guess there's a wider problem with the training that they won't do negative or malicious anti-training, because researchers are nice optimistic people looking for success, not working out how to abuse their system.
-
@bobjanova said in Computer vision system can be fooled by handwritten notes:
This is funny. But it's really just a mistake of weighting - it's correctly identified the text 'iPod' but it hasn't been properly trained to know that text alone is not a good primary indicator of an object.
Yeah, I was wondering if there's any difference to the AI between "What do I see?" and "What did I read?" But maybe the communication out of it isn't good enough for that to make sense even if it can.
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
Yeah, I was wondering if there's any difference to the AI between "What do I see?" and "What did I read?" But maybe the communication out of it isn't good enough for that to make sense even if it can.
If you want there to be a difference, you train two networks separately to match the two things, and then a third network to combine the results. (You could also use some sort of expert system for the results-combination task.)
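That two-branch, late-fusion idea could be sketched roughly like this. Everything here is invented for illustration (the function names, the score dictionaries, and the weights are all hypothetical, not how CLIP actually works):

```python
# Toy late-fusion sketch: an object-recognition branch and a text-reading
# branch each produce per-label confidence scores, and a third stage
# combines them with a fixed weighting. All scores/weights are made up.

def fuse(object_scores, text_scores, object_weight=0.8, text_weight=0.2):
    """Weighted combination of two per-label confidence dicts."""
    labels = set(object_scores) | set(text_scores)
    return {
        label: object_weight * object_scores.get(label, 0.0)
               + text_weight * text_scores.get(label, 0.0)
        for label in labels
    }

# The "apple with an iPod note" case: the vision branch sees an apple,
# while the reading branch confidently reads the word "iPod".
object_scores = {"apple": 0.9, "iPod": 0.05}
text_scores = {"iPod": 0.95}

fused = fuse(object_scores, text_scores)
best = max(fused, key=fused.get)
# With the text branch down-weighted, the object branch wins: best is "apple".
```

The point of keeping the branches separate is exactly what's discussed above: the combiner can learn (or be told) that text alone is weak evidence for what an object *is*.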
-
“By exploiting the model’s ability to read text robustly, we find that even photographs of hand-written text can often fool the model.”
Apparently the network has a part that is able to "read" text. It just was never trained on data where the text is BS.
Well, this does bode well, however. Just get a cap that says "Good Guy" and the police-state AIs won't be able to find you ever.
-
@dkf said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
Yeah, I was wondering if there's any difference to the AI between "What do I see?" and "What did I read?" But maybe the communication out of it isn't good enough for that to make sense even if it can.
If you want there to be a difference, you train two networks separately to match the two things, and then a third network to combine the results. (You could also use some sort of expert system for the results-combination task.)
Clicking through to their blog, they talk about it having "robust reading capabilities" (paraphrased), so it kind of sounds like they've already done something like that. But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this." But I'm assuming that in the case of the note, the confidence in reading the text overwhelms anything else it recognizes. Which isn't surprising, but it calls into question the nature of said "typographic attack."
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
@dkf said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
Yeah, I was wondering if there's any difference to the AI between "What do I see?" and "What did I read?" But maybe the communication out of it isn't good enough for that to make sense even if it can.
If you want there to be a difference, you train two networks separately to match the two things, and then a third network to combine the results. (You could also use some sort of expert system for the results-combination task.)
Clicking through to their blog, they talk about it having "robust reading capabilities" (paraphrased), so it kind of sounds like they've already done something like that. But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this." But I'm assuming that in the case of the note, the confidence in reading the text overwhelms anything else it recognizes. Which isn't surprising, but it calls into question the nature of said "typographic attack."
All joking aside - I think that the networks themselves make a very clear distinction between PictureClassification<"iPod"> and parsed text Text<"iPod">; it's just that the output is overly simplified to a simple text string. At worst, this loss of information already happens somewhere inside the pipeline - in which case they should learn from their mistakes.
-
As with most things AI, the problem is lack of context. What do you want it to show, and why? We automatically assume apple is correct, because of the stand-alone picture of the apple above it. But would your answer still be apple if asked what you saw and were shown only the second picture? Why is "iPod" wrong? I would argue it's more correct than "apple", since there is almost no apple in the picture. Even "paper" would be more correct.
-
@Kamil-Podlesak said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
@dkf said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
Yeah, I was wondering if there's any difference to the AI between "What do I see?" and "What did I read?" But maybe the communication out of it isn't good enough for that to make sense even if it can.
If you want there to be a difference, you train two networks separately to match the two things, and then a third network to combine the results. (You could also use some sort of expert system for the results-combination task.)
Clicking through to their blog, they talk about it having "robust reading capabilities" (paraphrased), so it kind of sounds like they've already done something like that. But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this." But I'm assuming that in the case of the note, the confidence in reading the text overwhelms anything else it recognizes. Which isn't surprising, but it calls into question the nature of said "typographic attack."
All joking aside - I think that the networks themselves make a very clear distinction between PictureClassification<"iPod"> and parsed text Text<"iPod">; it's just that the output is overly simplified to a simple text string. At worst, this loss of information already happens somewhere inside the pipeline - in which case they should learn from their mistakes.
Yes, that was what I was getting at.
-
So when the robots take over, all you'll need to do to keep safe is wear a T-shirt that says "fellow robot" in big bold letters.
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this."
-
@topspin said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this."
That was actually my first thought when I saw the OP.
-
@bobjanova said in Computer vision system can be fooled by handwritten notes:
This is funny. But it's really just a mistake of weighting - it's correctly identified the text 'iPod' but it hasn't been properly trained to know that text alone is not a good primary indicator of an object.
How was it supposed to know that, if (presumably) every single image in the training data set that had the word "iPod" in it was an iPod?
I guess there's a wider problem with the training that they won't do negative or malicious anti-training, because researchers are nice optimistic people looking for success, not working out how to abuse their system.
A more cynical person would say that researchers avoid hard work so they can publish faster, and to minimize the chance of realizing the project is a failure before publishing (because after publishing, who cares).
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
@dkf said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
Yeah, I was wondering if there's any difference to the AI between "What do I see?" and "What did I read?" But maybe the communication out of it isn't good enough for that to make sense even if it can.
If you want there to be a difference, you train two networks separately to match the two things, and then a third network to combine the results. (You could also use some sort of expert system for the results-combination task.)
Clicking through to their blog, they talk about it having "robust reading capabilities" (paraphrased), so it kind of sounds like they've already done something like that. But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this." But I'm assuming that in the case of the note, the confidence in reading the text overwhelms anything else it recognizes. Which isn't surprising, but it calls into question the nature of said "typographic attack."
I didn't read any sources (hell, I haven't really read the article in the OP), but I suspect the "robust reading capabilities" means some advanced OCR used as part of the basic shape-recognition algorithm, which doesn't run parallel to but is one of the inputs for the category-recognition network. And then the whole network was trained on a data set where every image with the text "iPod" in it was indeed an iPod.
-
@Gąska maybe. But if you put those images in front of someone and told them, tell me what you see, they might do the exact same thing as the AI, when giving simple responses. Also, I don't think their AI is supposed to be "just" vision:
Mission
OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.
And from the blog:
We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
outperform humans at most economically valuable work—benefits all of humanity.
By replacing their jobs and reducing the humans to abject poverty?
-
Yep. In human history, the vast majority of small rectangular things with 'iPod' written on them have been iPods. The AI did right by recognizing that and detecting it as a feature.
There are other features it could consider, as well. Such as: is the text hand written or machine generated? What font is it? Does it match what you'd expect to see on an iPod? In other words, once you start going down this path, you're doing fraud/counterfeit detection.
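The counterfeit-detection angle above could be sketched as a simple rule: only trust the label the text spells out if the text itself looks like a genuine product marking. Every name, font, and value below is invented for illustration:

```python
# Hypothetical sketch: use surface cues about the text (handwritten?
# expected font?) to decide whether to trust the label it spells out.

def trust_read_label(label, handwritten, observed_font, expected_fonts):
    """Return True only if the text looks like a genuine product label."""
    if handwritten:
        return False  # genuine product markings are rarely handwritten
    return observed_font in expected_fonts.get(label, set())

# Made-up expectation: real iPod labels use one particular typeface.
expected_fonts = {"iPod": {"Myriad"}}

# The handwritten "iPod" note fails the check; a printed label in the
# expected font passes.
note_trusted = trust_read_label("iPod", True, "Handwriting", expected_fonts)
print_trusted = trust_read_label("iPod", False, "Myriad", expected_fonts)
```

Real counterfeit detection would of course be learned rather than hard-coded, but the shape of the decision is the same.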
OK, that could be a useful skill. People do some level of it every day. (They're usually pretty bad at this sort of thing though).
Maybe AI will just encode human stupidity at the speed of quantum.
-
@Captain said in Computer vision system can be fooled by handwritten notes:
encode human stupidity at the speed of quantum.
Someday, maybe AI will get beyond this. But much of today's AI, especially game "AI", is really artificial stupidity.
-
@Captain said in Computer vision system can be fooled by handwritten notes:
Maybe AI will just encode human stupidity at the speed of quantum.
With blackjack. And hookers!
-
@HardwareGeek said in Computer vision system can be fooled by handwritten notes:
But much of today's AI, especially game "AI", is really artificial stupidity.
AI and "game AI" are two completely separate concepts. AI almost always incorporates some form of machine learning, and generally strives to always give the best possible results. "Game AI" rarely learns anything and is intentionally programmed to be utterly incompetent at its task, so that players don't get their asses handed to them by NPCs all the time.
-
@Mason_Wheeler said in Computer vision system can be fooled by handwritten notes:
So when the robots take over, all you'll need to do to keep safe is wear a T-shirt that says "fellow robot" in big bold letters.
The documentary Futurama has shown that in the future the detection won't be fooled that simply and will need a more advanced disguise:
-
@hungrier the funny part is that in real life, robots would quickly make a connection that a cheap disguise = a human pretending to be a robot. So no disguise at all could become the most effective disguise.
-
@Gąska Except the conditional probability of being a human if you look human is pretty high, and the conditional probability of being a human if you look like a box is low... and there's a really, really long tail of things humans usually don't look like.
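That conditional-probability argument can be worked through with Bayes' rule. The numbers below are invented to illustrate the point, not measured from anything:

```python
# Toy Bayes-rule illustration: P(human | appearance), assuming the only
# two classes are "human" and "robot". All probabilities are made up.

def posterior_human(p_looks_given_human, p_looks_given_robot, prior_human=0.5):
    """P(human | appearance) via Bayes' rule for a two-class world."""
    prior_robot = 1.0 - prior_human
    evidence = (p_looks_given_human * prior_human
                + p_looks_given_robot * prior_robot)
    return p_looks_given_human * prior_human / evidence

# Humans almost always look human; robots only rarely do (disguises).
p_human_given_humanlike = posterior_human(0.99, 0.05)

# Humans almost never look like a box; some robots genuinely do.
p_human_given_boxlike = posterior_human(0.001, 0.2)
```

With these stand-in numbers, looking human leaves you almost certainly human, while looking like a box makes "human" very unlikely, which is exactly why the cardboard disguise shouldn't fool a well-calibrated robot.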
-
@Captain If it has a long tail, it's unlikely to be human, but also unlikely to be a robot. Unless the tail is a power cord.
-
@Captain said in Computer vision system can be fooled by handwritten notes:
@Gąska Except the conditional probability of being a human if you look human is pretty high, and the conditional probability of being a human if you look like a box is low... and there's a really, really long tail of things humans usually don't look like.
I imagine the conditional probability of a barrel with legs being a human is fairly high.
-
@Captain LOL:
-
@Vault_Dweller said in Computer vision system can be fooled by handwritten notes:
Why is "iPod" wrong? I would argue it's more correct than "apple", since there is almost no apple in the picture. Even "paper" would be more correct.
The picture should score quite high for "wooden table".
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
@Captain LOL:
Things that remind you of this thread is .
-
@boomzilla said in Computer vision system can be fooled by handwritten notes:
@Captain LOL:
This should also be filed under "People Who Deserve Any Abuse We Can Heap On Them" (Fat shaming is )
-
From another OpenAI blog post. "Barrel with legs" isn't given as an option.
-
@Benjamin-Hall said in Computer vision system can be fooled by handwritten notes:
@topspin said in Computer vision system can be fooled by handwritten notes:
@boomzilla said in Computer vision system can be fooled by handwritten notes:
But there doesn't seem to be a way for it to say, "I read $this," vs. "I see $this."
That was actually my first thought when I saw the OP.
My second. First one was this:
https://miro.medium.com/max/700/0*z-f3m8nbHiQ0uqxn.jpg
Those Belgians, amirite?
-
@kazitor said in Computer vision system can be fooled by handwritten notes:
Can’t wait to try this one out at an ATM
Easy now, Benjamin Pranklin.
-
Ah, a good preserved article from the previous century
Did someone mention the solution? It's readable on Distill.
This research is probably the source, of the verge imagery. Includes many more examples and how to counteract this type of attack.
Has many interesting topics. Mostly about the results of their network. Seems it organizes relational images and texts, into 'concept-neurons'. and how useful it's representation is.
ps includes pics and a partial playground for the kids, like me
[edit]
Pff, skipped over boomzilla his post, somehow. But this article is still worth mentioning it, because of it's value.
-
@Flips Goddamn your punctuation is causing a shitton of errors. Is this what "feeling triggered" is?
-
@Tsaukpaetra I see no problems here.
-
@Shoreline said in Computer vision system can be fooled by handwritten notes:
@kazitor said in Computer vision system can be fooled by handwritten notes:
Can’t wait to try this one out at an ATM
Easy now, Benjamin Pranklin.
Is this some sort of David Unaipun?
-
@Tsaukpaetra said in Computer vision system can be fooled by handwritten notes:
@Flips Goddamn your punctuation is causing a shitton of errors. Is this what "feeling triggered" is?
He's flipping your bits in all the right ways.