This manual is for GNU Ocrad (version 0.18, 8 May 2009).
GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats. The pbm, pgm and ppm formats are collectively known as pnm.
Ocrad includes a layout analyser able to separate the columns or blocks of text normally found on printed pages.
Copyright © 2003, 2004, 2005, 2006, 2007, 2008, 2009 Antonio Diaz Diaz.
This manual is free documentation: you have unlimited permission to copy, distribute and modify it.
The character set internally used by ocrad is ISO 10646, also known as UCS (Universal Character Set), which can represent over two thousand million characters (2^31).
As it is impractical to try to recognize one among so many different characters, you can tell ocrad which character sets to recognize. You do this with the `--charset' option.
If the input page contains characters from only one character set, say
`ISO-8859-15', you can use the default `byte' output
format. But in a page with `ISO-8859-9' and
`ISO-8859-15' characters, you can't tell if a code of 0xFD
represents a 'latin small letter i dotless' or a 'latin small letter y
with acute'. You should use `--format=utf8' instead.
Of course, you may request UTF-8 output in any case.
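The ambiguity is easy to reproduce outside ocrad. The following Python sketch (Python is used only for illustration and is not part of ocrad) decodes the single byte 0xFD with both character sets, then shows that the two resulting characters stay distinct once encoded as UTF-8.

```python
import unicodedata

raw = b"\xfd"  # one byte, as it would appear in `byte' output

# The same byte names a different character in each 8-bit character set.
as_8859_9 = raw.decode("iso8859_9")     # ISO-8859-9
as_8859_15 = raw.decode("iso8859_15")   # ISO-8859-15

for label, ch in (("ISO-8859-9", as_8859_9), ("ISO-8859-15", as_8859_15)):
    # Print the code point, its Unicode name and its UTF-8 byte sequence.
    print(label, "U+%04X" % ord(ch), unicodedata.name(ch), ch.encode("utf-8"))
```

In `byte' format both characters are written as the same byte, so a reader of the output cannot know which one was meant. In UTF-8 each has its own byte sequence.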
NOTE: 10^9 is a thousand million; a billion is a million million (million^2), a trillion is a million million million (million^3), and so on. Please don't "embrace and extend" the meaning of prefixes; it makes communication among all people difficult. Thanks.
The format for running ocrad is:
ocrad [options] [files]
Ocrad supports the following options:
There are many image formats, but ocrad can decode only three of them: pbm, pgm and ppm. In this chapter you will find command examples and advice on how to convert image files to a format that ocrad can manage.
`.png' files:
pngtopnm filename.png | ocrad
or, to invert the image levels with the `-i' option:
pngtopnm filename.png | ocrad -i

`.ps' files:
gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH -sOutputFile=- -q filename.ps | ocrad
or:
pstopnm -stdout -dpi=300 -pgm filename.ps | ocrad
Some versions of pstopnm don't recognize the `-dpi' option and produce an image too small for OCR.

`.tiff' files:
tifftopnm filename.tiff | ocrad

`.jpg' files:
djpeg -greyscale -pnm filename.jpg | ocrad

`.pnm.gz' files (pnm compressed with gzip):
gzip -cd filename.pnm.gz | ocrad

`.pnm.lz' files (pnm compressed with lzip):
lzip -cd filename.pnm.lz | ocrad
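When many files of mixed types must be processed, the choice of conversion command can be automated. The following Python sketch is only an illustration: the function name `ocrad_pipeline' and the table `DECODERS' are hypothetical, and the command lines are copied from the examples above. It builds the shell pipeline as a string and does not run it.

```python
import os
import shlex

# Decoder command templates taken from the examples in this chapter.
# "{}" is replaced by the (shell-quoted) file name.
DECODERS = {
    ".png": "pngtopnm {}",
    ".ps": "gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH "
           "-sOutputFile=- -q {}",
    ".tiff": "tifftopnm {}",
    ".jpg": "djpeg -greyscale -pnm {}",
    ".gz": "gzip -cd {}",   # expects a gzip-compressed pnm file
    ".lz": "lzip -cd {}",   # expects an lzip-compressed pnm file
}

# Formats ocrad reads directly, so no decoder is needed.
NATIVE = (".pbm", ".pgm", ".ppm", ".pnm")


def ocrad_pipeline(path, ocrad_options=""):
    """Return a shell command line that feeds `path' to ocrad."""
    ext = os.path.splitext(path)[1].lower()
    quoted = shlex.quote(path)
    ocrad = ("ocrad " + ocrad_options).strip()
    if ext in NATIVE:
        return ocrad + " " + quoted
    if ext not in DECODERS:
        raise ValueError("no known conversion for " + path)
    return DECODERS[ext].format(quoted) + " | " + ocrad


print(ocrad_pipeline("filename.png", "-i"))
```

The returned string can be passed to a shell, for example with `subprocess.run(cmd, shell=True)', provided the decoder programs are installed.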
Ocrad is mainly a research project. Many of the algorithms ocrad uses are ad hoc, and will change in successive releases as I myself gain understanding about OCR issues.
The overall working of ocrad may be described as follows:
1) read the image.
2) optionally, perform some transformations (crop, rotate, scale, etc).
3) optionally, perform layout detection.
4) remove frames and images.
5) detect characters and group them in lines.
6) recognize characters (very ad hoc; one algorithm per character).
7) correct some errors (transform l.OOO into 1.000, etc).
8) output result.
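Step 7 can be illustrated with a small Python sketch. This is not ocrad's actual correction code: the look-alike table and the rule used to decide that a word is "number-like" are assumptions made for the example.

```python
import re

# Letters commonly confused with digits (illustrative, not ocrad's table).
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})

# A word counts as number-like when it consists only of digits and
# look-alike letters, with at least one group of three such characters
# after a '.' or ',' separator (as in 1.000 or 100,000).
NUMBER_LIKE = re.compile(r"[0-9lIOo]+(?:[.,][0-9lIOo]{3})+")


def fix_numbers(text):
    """Replace digit look-alikes, but only inside number-like words."""
    words = text.split(" ")
    fixed = [w.translate(LOOKALIKES) if NUMBER_LIKE.fullmatch(w) else w
             for w in words]
    return " ".join(fixed)


print(fix_numbers("Total: l.OOO units"))
```

Restricting the substitution to number-like words keeps ordinary text such as "Hello" or "I.O.U." unchanged.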
Ocrad recognizes characters by their shape. It is fast because it does not compare the shape of every character against some sort of database of shapes and then choose the best match. Instead, ocrad examines only the shape differences that are relevant to choosing between two character categories, much like a binary search.
As there is no such thing as a free lunch, this approach has some drawbacks. It makes ocrad very sensitive to character defects, and makes it difficult to modify ocrad to recognize new characters.
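The idea, and its first drawback, can be sketched in a few lines of Python. The two features used here (`tall' and `hole') and the four-character alphabet are hypothetical; ocrad's real tests are different and far more numerous.

```python
def classify(shape):
    """Pick a character with a chain of two-way decisions.

    `shape' is a dict of hypothetical boolean features:
      tall - the glyph rises above the height of a lowercase 'o'
      hole - the glyph encloses a closed white region
    """
    if shape["tall"]:
        # Tall glyphs: choose between 'b' and 'l'.
        return "b" if shape["hole"] else "l"
    # Short glyphs: choose between 'o' and 'c'.
    return "o" if shape["hole"] else "c"


# Two tests are enough to choose among four characters.
print(classify({"tall": False, "hole": True}))

# A damaged 'o' whose ring is broken no longer encloses a hole,
# so the same two tests now yield 'c'.
print(classify({"tall": False, "hole": False}))
```

Each test discards half of the remaining candidates, so the number of tests grows much more slowly than the number of characters. The price is that a single wrong feature leads straight to the wrong character, and that supporting a new character means adding a new test to the right branch.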
There are probably bugs in ocrad. There are certainly errors and omissions in this manual. If you report them, they will get fixed. If you don't, no one will ever know about them and they will remain unfixed for all eternity, if not longer.
If you find a bug in GNU Ocrad, please send electronic mail to bug-ocrad@gnu.org. Include the version number, which you can find by running `ocrad --version'.