Image extraction in digital documents

Chee Sun Won

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Images included in documents usually provide information that may not be readily expressible in words. For example, academic articles with similar pictures may be of interest to researchers. We deal with the problem of extracting images from digital documents. Given a digital document, the optimal block size is first determined by finding the best fit of the horizontally projected gray-level pattern to a set of orthogonal basis vectors. Because a block of the optimal size is supposed to contain sufficient information to identify text regions, the proposed method is independent of the font size used in the text lines. The blocks obtained with the optimal block size are each classified as image, text, or background. This block classification result, in turn, serves as the initial configuration for blockwise document segmentation. The blockwise segmentation method is based on the maximum a posteriori (MAP) framework with a deterministic relaxation algorithm. After the blockwise segmentation, each boundary block in the image region is further divided into four subblocks, and the class labels for these subblocks are updated. These subdivision and class-updating processes are executed recursively until a pixel-level segmentation is obtained. Experimental results show that the proposed image extraction method yields an error rate of 2.9% on 232 documents in the Oulu database.
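
The recursive boundary refinement described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give the class-update rule (which in the paper is tied to the MAP framework), so the sketch takes a hypothetical `classify(row, col, size)` callback, and the demo substitutes a toy darkness threshold for it. The names `split_labels`, `is_image_boundary`, `refine`, and `make_toy_classifier` are all assumptions introduced for illustration, and the block size is assumed to be a power of two.

```python
from typing import Callable, List

IMAGE, TEXT, BACKGROUND = 0, 1, 2
Labels = List[List[int]]
# Hypothetical callback: (block row, block col, block size in pixels) -> class label.
Classifier = Callable[[int, int, int], int]


def split_labels(labels: Labels) -> Labels:
    """Divide every block into four subblocks that inherit the parent's label."""
    out: Labels = []
    for row in labels:
        doubled = [lab for lab in row for _ in range(2)]
        out.append(doubled)
        out.append(list(doubled))
    return out


def is_image_boundary(labels: Labels, r: int, c: int) -> bool:
    """True if block (r, c) is an image block with a non-image 4-neighbour."""
    if labels[r][c] != IMAGE:
        return False
    h, w = len(labels), len(labels[0])
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] != IMAGE:
            return True
    return False


def refine(labels: Labels, block_size: int, classify: Classifier) -> Labels:
    """Halve the block size until it reaches one pixel, re-labelling only
    the subblocks whose parent was a boundary block of the image region."""
    while block_size > 1:
        boundary = [
            [is_image_boundary(labels, r, c) for c in range(len(labels[0]))]
            for r in range(len(labels))
        ]
        labels = split_labels(labels)
        block_size //= 2
        for r, row in enumerate(labels):
            for c in range(len(row)):
                if boundary[r // 2][c // 2]:
                    row[c] = classify(r, c, block_size)
    return labels


def make_toy_classifier(pixels: List[List[int]], threshold: int = 128) -> Classifier:
    """Toy stand-in for the class update: a block is IMAGE if it contains
    any pixel darker than `threshold`, otherwise BACKGROUND."""
    def classify(r: int, c: int, size: int) -> int:
        block = [
            pixels[y][x]
            for y in range(r * size, (r + 1) * size)
            for x in range(c * size, (c + 1) * size)
        ]
        return IMAGE if min(block) < threshold else BACKGROUND
    return classify


# Demo: a 16x16 page with one dark rectangle, starting from 4x4-pixel blocks.
pixels = [[0 if 5 <= y < 11 and 6 <= x < 11 else 255 for x in range(16)]
          for y in range(16)]
classify = make_toy_classifier(pixels)
coarse = [[classify(r, c, 4) for c in range(4)] for r in range(4)]
refined = refine(coarse, 4, classify)  # one label per pixel
```

In this toy setup the coarse labels over-cover the rectangle, and each pass trims only the boundary blocks, so interior image blocks are never revisited. A real update rule would also need to handle text blocks and image regions that touch the page border, which this sketch does not.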

Original language: English
Article number: 033016
Journal: Journal of Electronic Imaging
Volume: 17
Issue number: 3
State: Published - 2008
