beamjockey ([info]beamjockey) wrote,
@ 2009-04-29 23:51:00
The Wrong Century for OCR
Optical Character Recognition doesn't work too well on seventeenth-century publications.

The entire scanned volume looks like a CAPTCHA.



(9 comments) - (Post a new comment)


[info]bibliofile
2009-04-30 05:52 am UTC (link)
So true. In Distributed Proofreaders, anything that's not set in standard metal type is automatically a type-in project. OCR is worse than nothing for those projects.



[info]beamjockey
2009-04-30 12:49 pm UTC (link)
Perhaps some correspondent here can tell us why Mr. Houghton's sheets look so wiggly. Something about the printing process, I'm sure. Was the pre-Industrial-Revolution type not standardized enough?



[info]thesaucernews
2009-04-30 03:49 pm UTC (link)
My guess would be that, in part, printed materials were expensive enough that flawed work was used as long as it was still legible.

It also looks to me like the text has been deformed by the paper. The substrate may have warped over the intervening centuries and taken the ink with it.



[info]argonel
2009-04-30 02:04 pm UTC (link)
Hmm, my handheld does a pretty decent job at handwriting recognition. I wonder why no OCR programs seem to have a handwriting mode that optimizes their character recognition for handwriting, even if it reduces the accuracy on typed text.

Of course my handheld does get additional information about stroke direction that you don't get on a scanned document. I don't think this should be an insurmountable barrier though.



[info]kateyule
2009-04-30 03:29 pm UTC (link)
Stroke direction?



[info]anton_p_nym
2009-04-30 04:35 pm UTC (link)
Pen-based computers (aka "tablets") look at how the pen tip moves in order to determine what characters are being written; they look at how the word was written, and not just its shape. OCR, to its disadvantage, loses that data and has to work out the word from the shape of the letters only.

-- Steve thinks it'd be really tough to get good handwriting recognition out of OCR, given that there are so many different styles of handwriting possible. (And that there are enough potential idiosyncrasies that handwriting style can often be traced back to the individual writer.)
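
For anyone who wants the stroke-direction point spelled out, here is a toy sketch in Python. It is not how any real tablet or OCR engine works; the function names and the sample stroke are made up for illustration. It only shows that the same mark drawn in two directions leaves identical ink on the page while the pen motion differs.

```python
# Hypothetical illustration: every name and data point here is invented
# for this sketch and does not come from any real recognition product.

def rasterize(stroke):
    """Collapse an ordered pen stroke into an unordered set of inked pixels."""
    return frozenset(stroke)

def direction(stroke):
    """Net pen movement (dx, dy) from the first sampled point to the last."""
    (x0, y0), (x1, y1) = stroke[0], stroke[-1]
    return (x1 - x0, y1 - y0)

# The same vertical bar, drawn top-to-bottom and then bottom-to-top.
down = [(2, 0), (2, 1), (2, 2), (2, 3)]
up = list(reversed(down))

same_image = rasterize(down) == rasterize(up)   # all a scanner ever gets
same_motion = direction(down) == direction(up)  # extra data a tablet gets
print(same_image, same_motion)  # prints: True False
```

A scanned page is the `rasterize` side only: the order of the points is gone, so the recognizer has nothing but shape to go on.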



[info]bibliofile
2009-04-30 04:54 pm UTC (link)
Sometimes the printing was kinda messy, too.



[info]nellherself
2009-05-15 08:09 pm UTC (link)
Actually there is a product called AddressScript by Parascript used in postal automation which can and does recognize machine print, handprint and cursive. And it does it just from images of addressed material in fractions of a second. No added information about stroke direction or anything else.

It's pretty impressive in operation. Obviously it performs better with machine print (you can process 250 million addresses a day), but its speed and accuracy with handprint and cursive are very respectable. It can also handle oddball or distorted print fonts. It is also Very Expensive.

Addresses, of course, are a limited use case for OCR. They have all kinds of limits that trim down the recognition problem to manageable proportions. Still, it would be interesting to see Parascript technology take on, oh, say, a 17th-century publication.

Paper color can turn out to be more troublesome than bad handwriting or weird irregular fonts. You don't know real grief until you've tried to use OCR to resolve text printed on security paper.


[info]don_fitch
2009-05-14 04:26 am UTC (link)
Early books were generally printed on dampened paper, to help prevent the ink from spreading & blurring, or soaking through to the other side. The large sheets (each with eight pages of text, for quarto volumes [I _think_ I've got that right]) were then hung up for both the moisture and the ink to dry, during which process they tended to shrink unevenly. How unevenly, and in which directions, seems to have depended upon the grain, the cheapness of the paper, and the care taken by the printer. And yes, OCR simply does not (and probably never will) cope with this.


