OCR for badly damaged images
-
Currently we have an employee manually enter data from the MDRs (material data reports) that we get (mostly physical paper sheets attached to the material). We only capture a few fields, and it only takes a few hours for several weeks' worth of material. However, a growing number of customers want the entire MDR, and that would seriously impact this employee's time.
We have contacted all of our vendors, and only about 30% are willing to provide, or already have, digital copies that we can access. We can probably throw our weight around and get another 20%-30% of them to offer something that we can use digitally. But the remaining ones (which of course are also the worst ones to process) we have no real chance of getting digitally.
So we have been looking at alternative solutions for getting this information into our system. Does anyone here have familiarity with OCR systems that can handle badly damaged/faded paper sheets? Or other suggestions on how to gather this data?
We don't have a firm timetable for this so we have time to develop something proper.
edit:
I should note for people not familiar with MDRs that they also vary wildly in look between vendors (the data is the same).
-
@Dragoon And there isn't a broad standard, like the MDS for chemical safety? Those ones have pretty fixed content, but the format varies. And you can usually just look them up from the big vendors and download PDFs.
-
Yeah, an MDR is a little different than an MDS. An MDR covers the composition and processing of that exact batch (heat and lot) of material. So every single heat & lot has its own sheet.
I am not an expert in MDRs, but as I understand it there is only a loose standard on the information that needs to be on the sheets and there is no standard for format.
-
@Dragoon said in OCR for badly damaged images:
Yeah, an MDR is a little different than an MDS. An MDR covers the composition and processing of that exact batch (heat and lot) of material. So every single heat & lot has its own sheet.
I am not an expert in MDRs, but as I understand it there is only a loose standard on the information that needs to be on the sheets and there is no standard for format.
Well, that sucks. Sorry I can't be of more help. I know Google Books has done lots of work with OCR for historical books, many of which had...interesting...variance in quality. Maybe they've published something about that?
-
@Benjamin-Hall said in OCR for badly damaged images:
I know Google Books has done lots of work with OCR for historical books, many of which had...interesting...variance in quality. Maybe they've published something about that?
As far as I've heard they've been using (probably just a few among many other things) Tesseract OCR and OCRopus for their stuff.
And no idea how fit this is for this purpose.
-
@Dragoon said in OCR for badly damaged images:
Does anyone here have familiarity with OCR systems that can handle badly damaged/faded paper sheets?
A bunch of years ago I was on a project that was doing that (among other things; the project was digitising some huge libraries and the quality control requirements were stringent) and… well… OCR is a PITA with a fairly high error rate.
-
Yeah, I have looked a little bit into Tesseract (haven't looked at OCRopus) and it might work with enough training data (and we do have decades of MDRs in storage) but it would probably be me training it and I don't have the time for that.
edit:
I should add that this was discussed at a CEO level, and my time is worth more than a data entry employee's. So if it comes down to it, they would rather hire someone at ~minimum wage to do data entry all day than pay me to get OCR working and keep it working.
-
Yeah, I did some OCR work with the USGS over a decade ago when I was in college, and I was really hoping that the tech had gotten a lot better over that time (and it has in certain respects). But that was my experience back then as well. It was a PITA even with largely well-behaved data.
-
@Dragoon said in OCR for badly damaged images:
Currently we have an employee manually enter data from the MDRs (material data reports) that we get (mostly physical paper sheets attached to the material).
So we have been looking at alternative solutions for getting this information into our system.
Wait, wasn't that a frontpage article once?
-
There might have been one, I don't recall, but if there was it wasn't from me.
-
@Dragoon the ending was that before completing the replacement, they realized the person manually entering data would be made redundant and lose the job, so they decided to lock up the program and never talk about it again.
-
@Gąska That story is called "The Indexer", but it was rather about an abuse of OCR to get directory listings. You are right about the ending though.
-
@Dragoon there are probably easier ways than trying to adapt the algorithms in the research papers I'll link here, but they're interesting:
-
My company's done some work with digitisation providers, and we were asked to add OCR capabilities to our product. When we were investigating we didn't find anything better than Tesseract. But with any handwritten or faded documents you're likely to need significant manual correction, at least.
-
@bobjanova said in OCR for badly damaged images:
But with any handwritten or faded documents you're likely to need significant manual correction,
What I understand from our customers is the manual checking and correcting is still significantly faster than entering everything by hand since most of the checking and additional data entry can be done based on the digital version of the document.
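To make the "check and correct instead of retype" workflow concrete, here is a rough sketch that pulls out only the low-confidence words for a human to look at. It assumes the OCR output is tab-separated text with at least a `conf` column (0-100, with -1 on rows that are not words) and a `text` column, which is how Tesseract's TSV output is laid out. The function name `words_needing_review`, the sample data, and the threshold of 80 are arbitrary assumptions for illustration.

```python
import csv
import io


def words_needing_review(tsv_text, threshold=80.0):
    """Return (word, confidence) pairs whose confidence is below the threshold.

    Assumes tab-separated input with 'conf' and 'text' columns.
    Rows with empty text or a negative confidence are layout rows, not words.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        if not text or conf < 0:
            continue
        if conf < threshold:
            flagged.append((text, conf))
    return flagged


# Made-up sample: two words are clean, two look like misreads.
SAMPLE = "conf\ttext\n-1\t\n96\tHEAT\n41\tA1Z34\n88.5\tLOT\n63\t7T\n"

for word, conf in words_needing_review(SAMPLE):
    print(f"check: {word!r} (confidence {conf:.0f})")
```

The threshold would have to be tuned against real sheets; on badly faded documents most words may fall below it, at which point this saves little over typing the sheet in by hand.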
-
@Gąska Just found this at the front page of hacker news:
-
Thanks, I will take a look.
-
@bobjanova said in OCR for badly damaged images:
My company's done some work with digitisation providers, and we were asked to add OCR capabilities to our product. When we were investigating we didn't find anything better than Tesseract. But with any handwritten or faded documents you're likely to need significant manual correction, at least.
I tried using Tesseract several years ago on scanned handwritten documents and it was in no way reliable or useful. I think you might have to train it with a handwriting font or something of that nature to get useful results.
-
@sockpuppet7 said in OCR for badly damaged images:
@Gąska Just found this at the front page of hacker news:
error_bot feature requests is
-
@sockpuppet7 said in OCR for badly damaged images:
@Gąska Just found this at the front page of hacker news:
An OCR solution that doesn't cost $X0,000 and isn't GPL/AGPL? What is this world coming to?
-
Amazon turk?
-
@slapout1
That's racist!
-
@Luhmann Against Amazons or Turks?
-
@hungrier
Yes!