#descr: Let's pool our computing resources so as to make the software patent documents of the European Patent Office (EPO) digitally accessible. So far the information is available only in the form of single-page graphic documents. Fortunately there is a free OCR program that can produce usable text output from these graphic files. We have already acquired most of the relevant graphical data. You can help us by locally running an OCR script on these CDs. The OCR output will then appear on our website as well as on a ring of cooperating websites, depending on your configuration.
#title: Patent OCR Net Action
#Wai: We currently have a set of 15 CDs with EPO software patent graphic files. Turning all the graphics into text may take a week per CD on a 586 computer.
#Inr: In the future we will have more and more CDs (up to 800), and we may wish to run the recognition software on them over and over again as we adapt the software and/or its input data.
#ToC: Therefore we need the participation of people who each take charge of a few CDs and run the script on them.
#Mde: Moreover, we may also need people with good network bandwidth to help us download more graphic files for this and other projects.
#IjW: In the long run, this volunteer network may be able to solve many worthwhile problems that involve upgrading massive amounts of data which, for whatever reason, were presented to the public in a crippled or otherwise unsatisfactory form.
#Iur: In order to participate, you need the following
#ALs: A POSIX (Unix, Linux) operating system
#wdl: With the Bash shell interpreter and shell environment. Cygwin (a Unix-like subsystem on Microsoft Windows supplied by Cygnus/Red Hat) should also work.
#Hon: Here you can already find many text versions of EPO software patents, some of which contain a section with OCR output from this action.
But we are not providing the PDF graphics online, because we can't afford the fees which we would have to pay when other people download our data.
#Tmm: This collection contains some PDF files ready to be compared with the GOCR output.
#Tnr: This directory contains some more scripts related to this action.
#mtn: maintain the CDs and the list of participants, send the CDs out
#apg: adapt gocr
#msW: make it take b/w inverted graphics as input so as to eliminate the time-consuming pnminvert procedure
#ell: extend GOCR so that it can use existing similar texts (e.g. corresponding patent applications, for which text files already exist) to improve its recognition accuracy.
#wen: write configuration files that specify the structure of patent descriptions so that GOCR handles them better, e.g. recognises layout and text structures better. As needed, develop such configuration formats and/or improve gocr so that it can use them and/or so that they become unnecessary in more and more cases.
#wws: write a frontend that lets gocr interact directly with PDF files and insert the OCR results of PDF images into the files as specified by Adobe's PDF format.
#cpB: create CDs containing non-EP patents, e.g. DE, FR, GB, SE, JP, US etc.
#San: Servers such as DepatisNet and others need to be studied.
#Pan: Perhaps it is also possible to obtain some of this material for a reasonable price from the patent offices.
#AWW: A few years ago the pricing was prohibitive: thousands of EUR/USD for the patents of one year. But this policy has probably changed. With enough perseverance, it should be possible to get everything very cheaply.
#wrW: write an MSWin version of the bash script if possible
#scm: some people have Windows-specific OCR programs that may be worth trying out for comparison.
# Local Variables: ;
# coding: utf-8 ;
# srcfile: /ul/prg/src/mlht/app/swpat/swpatgirzu.el ;
# mailto: mlhtimport@a2e.de ;
# login: swpatgirzu ;
# passwd: YYYYY ;
# feature: swpatdir ;
# dok: swpatfatri ;
# txtlang: en ;
# End: ;