
December 20, 2006

Google Book Search, meet Amazon's Mechanical Turk

According to this Washington Post story about Google's ambitious book-scanning initiative, Google is scanning 3,000 books a day, which works out to about a million books annually. This rate is no doubt phenomenal, but it brings its own set of problems.

Scanning a book is easy, but optical character recognition is notoriously hard to get right. Google is said to be currently scanning only those books whose copyright term has expired, which would mean books published before the 1920s. Books published more than 100 years before that decade often have peculiar artifacts like the long s (which looks like an integral sign or an f), illustrations signed with artistic scrawls and manuscripts featuring cursive handwriting. No matter how breathlessly the media waxes about Google's 5,000 Ph.D.s, these are all hard problems in OCR today.

Consider the scale of Google's operation: let's say each book scanned this year had an average of 200 pages. A million books would then make 200 million pages. Even an OCR program that got 99.999% of its pages right (five nines) would spit out 2,000 pages with mistakes, and at 99.9% the count rises to 200,000. One way to achieve dramatically higher accuracy is to ask the millions of avid readers on the Web to double-check the output of Google's OCR program, kind of like the distributed outsourcing in Amazon's Mechanical Turk project. It's really not as bad as it sounds; it's merely about offering the right incentives. Thankfully, Google doesn't have to look far for ideas.
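
For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch in Python. It assumes the accuracy figure is a per-page rate, and it reuses the guesses above (a million books a year, 200 pages each). The function and variable names are made up for illustration and have nothing to do with any real OCR engine.

```python
# Back-of-the-envelope estimate of how many scanned pages contain mistakes.
# Assumption: "accuracy" means the fraction of pages that come out error-free.

BOOKS_PER_YEAR = 1_000_000   # roughly 3,000 books a day
PAGES_PER_BOOK = 200         # guessed average


def pages_with_mistakes(total_pages: int, bad_pages_per_100k: int) -> int:
    """Pages expected to contain a mistake, using integer arithmetic.

    bad_pages_per_100k is the error rate expressed as bad pages per
    100,000 pages, e.g. 1 for 99.999% accuracy or 100 for 99.9%.
    """
    return total_pages * bad_pages_per_100k // 100_000


total_pages = BOOKS_PER_YEAR * PAGES_PER_BOOK

five_nines = pages_with_mistakes(total_pages, 1)     # 99.999% accurate
three_nines = pages_with_mistakes(total_pages, 100)  # 99.9% accurate

print(f"total pages:        {total_pages:,}")   # 200,000,000
print(f"bad at five nines:  {five_nines:,}")    # 2,000
print(f"bad at three nines: {three_nines:,}")   # 200,000
```

If accuracy were measured per character instead of per page, the number of affected pages would be far higher, since a single page holds a couple of thousand characters.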

Google already does quite a bit of revenue sharing with its AdSense partners, in exchange for the chance to display ads based on content on third-party websites. Google's plan to monetize Google Book Search is to show ads beside the content of the books anyway. So how about taking the money gained from ads shown beside, say, Alice in Wonderland, and sharing some of it with whoever helped check the accuracy of Google's OCR program on that book? The revenue sharing infrastructure is already in place and it'll really be just a question of building the right user interface to divvy up the books among people.

I don't have access to Google's vast set of query strings (and neither does the general public), but it doesn't seem like the revenue distribution would be that inequitable across books if structured the right way. There could be a cap of, say, $100 per book to control for books that mention "sex" a lot. Other than that, let human judgement take its course and watch the dollars roll in!
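
To make the idea a little more concrete, here is a rough sketch in Python of how such a payout could be divided. Everything in it is my own assumption: the 50% share, splitting in proportion to pages checked, and the function name are all hypothetical, and none of it reflects how AdSense actually works. The only number taken from above is the $100 cap per book.

```python
# Hypothetical split of one book's ad revenue among volunteer proofreaders.
# Amounts are in cents so that all arithmetic stays in integers.

def split_book_revenue(ad_revenue_cents: int,
                       pages_checked: dict[str, int],
                       share_percent: int = 50,      # assumed share for readers
                       cap_cents: int = 10_000) -> dict[str, int]:
    """Return each proofreader's payout in cents for a single book.

    The readers' pool is share_percent of the ad revenue, capped at
    cap_cents ($100 by default), then divided in proportion to the
    number of pages each person checked. Leftover cents from rounding
    down are simply not paid out in this sketch.
    """
    pool = min(ad_revenue_cents * share_percent // 100, cap_cents)
    total_pages = sum(pages_checked.values())
    if total_pages == 0:
        return {}
    return {reader: pool * pages // total_pages
            for reader, pages in pages_checked.items()}


# A book that earned $500 in ads: the pool would be $250 but is capped at $100.
popular = split_book_revenue(50_000, {"alice": 150, "bob": 50})
print(popular)   # {'alice': 7500, 'bob': 2500}

# A book that earned $40 in ads: the pool is $20 and the cap never kicks in.
obscure = split_book_revenue(4_000, {"alice": 150, "bob": 50})
print(obscure)   # {'alice': 1500, 'bob': 500}
```

The cap only matters for the handful of books that attract outsized ad revenue; for everything else the payout simply tracks how much of the book each person checked.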

Posted by Vishy at December 20, 2006 10:51 PM
