When you download e-books, particularly in e-reader formats other than PDF (EPUB for Nooks and other generic readers, Mobipocket for Kindles, etc.), they can be especially “dirty” to read. And by “dirty,” I mean they have a bunch of stray and bizarre characters scattered throughout the text.
Those characters are the result of the scan and OCR (optical character recognition) process.
See, a lot of these old books are scanned using regular, albeit pricey, scanners to get them into digital format. Once scanned, the pages are recorded digitally as pictures, not unlike a photograph you would take with a digital camera or your smartphone.
To leave the book as is at this point, as pictures, would make the file huge. I’m talking about many megabytes, which would eat up the storage capacity on your e-reader rather quickly.
And, if left like that, it would be up to your e-reader to display those pages, those pictures, correctly.
E-readers are great for text. Pictures? Nope. They’re terrible. Even tablets have a hard time with them.
And because they’re pictures, you don’t have access to a lot of the goodies that come with e-readers, like word lookups, word and sentence highlighting, and bookmarks, to name a few.
So, whoever is scanning these old books typically runs the “pictures” through an OCR software program. That’s simply a computer application that looks at a picture of text and translates what it sees into actual text, the end result being a “document” just like you would have if you created one in Microsoft Word, Notepad, etc.
And that’s where the stray and bizarre characters come from.
The OCR software reads every mark on a page and tries to match it to its dictionary of letters. So, if the owner of a book wrote on a page, the OCR software is going to try to convert that handwriting.
OCR software is great if the source image, the scanned page of text, is clean and unblemished. If the page has any sort of markings on it, even the faded age of the paper, the result is going to be an interesting mix of weird characters.
All this is to say that if you download an e-book and you find it hard to read because sentences are misplaced, characters are out of place, the spelling is atrocious, then you have yourself a perfect copy of a book that was very, very dirty when it was OCR’d (yup, I needed a verb there).
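If you'd rather not eyeball a whole book, here is a rough sketch of how you might put a number on "dirty." It is only a heuristic of my own, not how any e-reader or OCR program works: the `EXPECTED` character set and the `stray_ratio` name are assumptions made up for illustration, and a book with curly quotes or accented letters would need a wider set.

```python
import string

# Assumption: clean English text is mostly letters, digits, whitespace,
# and a handful of common punctuation marks. Anything else counts as stray.
EXPECTED = set(string.ascii_letters + string.digits + string.whitespace + ".,;:!?'\"()-")


def stray_ratio(text: str) -> float:
    """Return the fraction of characters that fall outside EXPECTED."""
    if not text:
        return 0.0
    stray = sum(1 for ch in text if ch not in EXPECTED)
    return stray / len(text)


clean = "It was the best of times, it was the worst of times."
dirty = "It w^s the b~st of tim|s, it w#s the w*rst of t}mes."

print(f"clean sample: {stray_ratio(clean):.3f}")
print(f"dirty sample: {stray_ratio(dirty):.3f}")
```

A higher ratio on a sample page suggests a rougher scan. It won't catch misspellings made of ordinary letters, so treat it as a first sniff test and nothing more.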
The best you can do is go to the site where you downloaded it, if it was free, and try searching for a different version of the book that someone (or some other company) scanned and converted.
Many times people will download these dirty versions, clean them up in an e-book editor, and re-upload them. I do that often when I come across a book that’s really good but the only copy I can get is a less-than-perfect one.
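For the curious, a first cleanup pass can be sketched in a few lines. This is a simplified illustration, not what any particular e-book editor does: the `NOISE` pattern and the `scrub` function are hypothetical, and the list of "noise" characters is my own guess at what rarely belongs in ordinary prose.

```python
import re

# Assumption: these symbols almost never appear in ordinary prose,
# so they are treated as OCR noise and dropped.
NOISE = re.compile(r"[\^~|#*{}<>\\]")


def scrub(text: str) -> str:
    """Remove assumed noise characters, then collapse runs of spaces or tabs."""
    without_noise = NOISE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_noise)


print(scrub("The ~ quick  brown | fox"))
```

A pass like this only deletes junk. It can't restore a letter the OCR software misread, which is why the real cleanup still takes a human reading along.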
If you’ve bought the book from a reseller, then you’re stuck with it. And if you are buying these books, definitely make it a practice to read the reviews before you buy a book. The reviewers don’t seem to hold back on comments about the format of these books.
I don’t like to slam people or companies, but I will recommend you stay away from books, particularly those found on the Internet Archive or Project Gutenberg websites, that have been scanned by Google. I have found, sadly more often than not, that books scanned by Google are the dirtiest ones out there.
On the Internet Archive site, the books they scan themselves always have an actual human who oversees the OCR process. I don’t know how much interaction these people have with the OCR process, but their books are always cleaner and a joy to read.
So, bottom line, read the reviews before buying a book. And if you’re downloading a free book, keep in mind that there may be a cleaner copy out there somewhere.