The PARC Book Scanner illustrates the wide range of
early-stage image processing tools needed to support high quality image capture.
Note the importance of image calibration and restoration specialized to the
scanner. Image processing should, ideally, occur quickly enough for the
operator to check each page image visually for consistent quality. Tools are
needed for orienting the page so that text is right side up, deskewing the page,
removing pepper noise, and removing dark artifacts on or near the image edges.
Software support for clerical functions such as page numbering and ordering, and
for the collection of metadata, is also crucial to maintaining high throughput.
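Two of these cleanup steps can be sketched concretely. The fragment below is a
minimal illustration, assuming OpenCV (cv2) and NumPy are available and that
pages arrive as 8-bit grayscale arrays; the median-filter kernel size and the
border margin are placeholder values, not tuned settings.

import cv2
import numpy as np

def clean_page(gray: np.ndarray, border_px: int = 20) -> np.ndarray:
    """Suppress pepper noise and dark artifacts near the image edges."""
    # A small median filter removes isolated dark specks (pepper noise)
    # while largely preserving character stroke edges.
    cleaned = cv2.medianBlur(gray, 3)

    # Dark artifacts on or near the edges (binding shadow, scanner bed)
    # are crudely suppressed by forcing a fixed margin to white.
    cleaned[:border_px, :] = 255
    cleaned[-border_px:, :] = 255
    cleaned[:, :border_px] = 255
    cleaned[:, -border_px:] = 255
    return cleaned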
In addition to these tools, it would be helpful to be able to
check each page image for completeness and consistency. Has any text been
unintentionally cropped? Are basic measures of image consistency — e.g.
brightness, contrast, intensity histograms — stable from page to page, hour
after hour? Are image properties consistent across the full page area for each
image? Are the page numbers (located and read by OCR on the fly) in an
unbroken ascending sequence, and do they correspond to the automatically
generated metadata? Techniques likely to assist in these ways may require
imaging models that are tuned to shapes or statistical properties of printed
characters. Perhaps it will someday be possible to assess both human and
machine legibility on the fly.
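As a rough illustration of such consistency checks, the sketch below (assuming
NumPy and grayscale page arrays; the tolerance values are placeholders) computes
per-page brightness and contrast statistics, flags pages that drift from the
running batch average, and reports breaks in an OCR-read page-number sequence.

import numpy as np

def page_statistics(gray: np.ndarray) -> dict:
    """Basic measures of image consistency for one page."""
    return {"brightness": float(gray.mean()),
            "contrast": float(gray.std()),
            "histogram": np.bincount(gray.ravel(), minlength=256) / gray.size}

def flag_drift(stats_sequence, brightness_tol=10.0, contrast_tol=8.0):
    """Flag pages whose statistics drift from the running average so far."""
    flagged = []
    for i, s in enumerate(stats_sequence[1:], start=1):
        prior = stats_sequence[:i]
        mean_b = np.mean([p["brightness"] for p in prior])
        mean_c = np.mean([p["contrast"] for p in prior])
        if (abs(s["brightness"] - mean_b) > brightness_tol
                or abs(s["contrast"] - mean_c) > contrast_tol):
            flagged.append(i)
    return flagged

def check_page_numbers(page_numbers):
    """Return indices where page numbers read on the fly break the ascending sequence."""
    return [i for i in range(1, len(page_numbers))
            if page_numbers[i] != page_numbers[i - 1] + 1]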
Restoration
The principal purposes of document image restoration are to
assist:
Fast & painless reading
OCR for textual content
DIA for improved human reading (e.g. format preservation)
Characterization of the document (age, source, etc.)
To these ends, methods have been developed for contrast and
sharpness enhancement, rectification (including skew and shear correction),
superresolution, and shape reconstruction.
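For the first two of these, a minimal sketch (assuming OpenCV; the clip limit,
tile size, and unsharp-mask weights are illustrative only) might look as follows.

import cv2
import numpy as np

def enhance(gray: np.ndarray) -> np.ndarray:
    """Contrast enhancement followed by sharpening."""
    # Contrast: CLAHE (adaptive histogram equalization) boosts local contrast
    # while limiting over-amplification in already well-exposed regions.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    contrasted = clahe.apply(gray)

    # Sharpness: unsharp masking, i.e. add back a fraction of the difference
    # between the image and a blurred copy of itself.
    blurred = cv2.GaussianBlur(contrasted, (0, 0), sigmaX=1.5)
    return cv2.addWeighted(contrasted, 1.5, blurred, -0.5, 0)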
Rectification
The DIA community has developed many algorithms for
accurately correcting skew, shear, and other geometric deformations in document
images. It is interesting how inconsistently these have been applied to document
images provided by DL's. Although uncorrected deformations are easily detectable
by eye and cause some users to complain, they do not affect legibility and
reading comfort except in extreme cases (for example, more than 3 degrees of skew).
However, not all DIA toolkits that may later be run on these images will
perform equally well, so it could be a significant contribution to rectify all
document images before posting them on DL's. It is also possible — although it
is seldom discussed in the DIA literature — to “recenter” text blocks
automatically within a standard page area in a consistent manner. Again, it is
not clear that this, although an aesthetic improvement, matters much to either
human or machine reading.
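One widely used rectification technique, projection-profile skew estimation, can
be sketched briefly. The fragment below assumes OpenCV and NumPy; the search
range and angular step are illustrative, and a real deployment would use a
coarse-to-fine search rather than this brute-force scan.

import cv2
import numpy as np

def estimate_skew(binary: np.ndarray, max_deg: float = 3.0, step: float = 0.1) -> float:
    """Return the rotation (degrees) whose horizontal projection profile has
    maximal variance, i.e. the sharpest separation between text lines."""
    h, w = binary.shape
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_deg, max_deg + step, step):
        rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(binary, rot, (w, h), flags=cv2.INTER_NEAREST)
        score = rotated.sum(axis=1).var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(gray: np.ndarray) -> np.ndarray:
    """Binarize, estimate the skew angle, and rotate the grayscale page to correct it."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    angle = estimate_skew(binary)
    h, w = gray.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, rot, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)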
Analysis of Content
The analysis and recognition of the content of document
images requires, of course, the full range of DIA R&D achievements: page
layout analysis, text/non-text separation, printed/handwritten separation, text
recognition, labeling of text blocks by function, automatic indexing and
linking, table and graphics recognition, etc. Most of the DIA literature is
devoted to these topics so I will not attempt a thorough survey in this short
space.
However, it should be noted that images found in DL's, since
they represent many nations, cultures, and historical periods, tend to pose
particularly severe challenges to today’s DIA methods, which are not robust in
the face of multilingual text and non-Western scripts, obsolete typefaces,
old-fashioned page layouts, and low or variable image quality.
Accurate Transcriptions of Text
The central classical task of DIA research has been, for
decades, to extract a full and perfect transcription of the textual content of
document images. Although perfect transcriptions have been known to result, no
existing OCR technology, whether experimental or commercially available, can
guarantee high accuracy across the full range of document images of interest to
users. Even worse, it is rarely possible to predict how badly an OCR system
will fail on a given document.
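One partial, after-the-fact signal is the per-word confidence that most OCR
engines report. The sketch below, assuming the pytesseract wrapper for Tesseract
and an illustrative threshold, flags pages whose mean word confidence is
suspiciously low; this is only a rough proxy and does not solve the prediction
problem.

import pytesseract
from pytesseract import Output

def ocr_with_confidence(image, low_conf_threshold: float = 60.0):
    """Run OCR and return (text, mean word confidence, suspect flag)."""
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words, confs = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        if text.strip() and conf >= 0:  # conf == -1 marks non-word boxes
            words.append(text)
            confs.append(conf)
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return " ".join(words), mean_conf, mean_conf < low_conf_threshold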
Determining Reading Order of Sections
Determining the reading order among blocks of text is, of
course, a DIA capability critically important for DL's; it would allow more
fully automatic navigation through images of text. This, however, remains an
open problem in general, in that a significant residue of ambiguous cases cannot
be resolved through physical layout analysis alone, but seems to require
linguistic or even semantic analysis.
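For the unambiguous majority of layouts, even a very simple physical heuristic
(group text blocks into columns by horizontal position, then read each column top
to bottom) goes a long way. The sketch below assumes block geometry comes from an
upstream layout-analysis step; the column-gap threshold is a placeholder. The
residue of layouts that such heuristics mis-order is what the interactive
correction discussed below would address.

from dataclasses import dataclass

@dataclass
class Block:
    x: int          # left edge of the text block
    y: int          # top edge
    w: int
    h: int
    text: str = ""

def reading_order(blocks, column_gap: int = 50):
    """Return blocks in an approximate reading order: columns left to right,
    and within each column, top to bottom."""
    columns = []
    for b in sorted(blocks, key=lambda b: b.x):
        # Blocks whose left edges are close are assigned to the same column.
        if columns and abs(b.x - columns[-1][0].x) <= column_gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b.y))
    return ordered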
However, the number of ambiguous cases on one page is often
small and might be made manageable in practice by a judiciously designed
interactive GUI presenting ambiguities in a way that invites easy selection (or
even correction). Such capabilities exist in specialized high-throughput
scan-and-conversion service bureaus, but they are not now available to the users
of any DL, who could otherwise correct reading order themselves and so improve
their own navigation.