For cataloguing, indexing, and correcting the OCR of digitized documents, libraries have often outsourced work to service providers that rely on low-cost labor in developing countries such as Madagascar, India, or Vietnam. They can now instead turn to the mass of Internet users, that is, to crowdsourcing, to carry out tasks their own staff cannot handle. The development of crowdsourcing in libraries is particularly significant in the domain of OCR correction. Character recognition software, which converts images of digitized book pages into text, does not produce 100% reliable results; depending on the quality of the original document, its digitization, its typography, and the possible presence of handwritten notes, the resulting text may need to be corrected with the help of dictionaries. OCR correction is necessary to enable more effective full-text searching of digitized texts, better indexing of their contents by search engines, the production of e-books in EPUB or MOBI formats for reading on e-readers, data extraction through text mining, and scholarly uses related to culturomics. Libraries, from the very largest to the very smallest, are increasingly asking whether to turn to crowdsourcing. To offer them part of the answer and to make an original conceptual contribution to the study of crowdsourcing in libraries, we have written this state of the art, which is drawn from thesis work. It offers conceptual elements for understanding the phenomenon, a taxonomy and overview of existing initiatives, and analyses from a library and information science perspective.
Source: https://hal.archives-ouvertes.fr/hal-01436766