Glossary of Terms
Acrobat
A software program from Adobe, Inc., the company that invented and popularized the PDF (Portable Document Format) file. Acrobat is the program typically used to create PDF files. The Adobe PDF reader (viewing) software is readily available and free of charge.
ADF (Automatic Document Feeder)
A device attached to a scanner that permits the serial scanning of multiple documents. Modern high-performance scanners have ADFs that automatically adjust for paper size and weight.
Batch Processing
Performing one or more actions on multiple objects in a serial fashion. Batch processing can consist of manual or automated actions. For example, batch processing of documents typically refers to the creation of a “batch” or group of documents during the scanning process, while batch processing of images typically refers to examination and/or manipulation of the images via software.
CD (Compact Disc)
An optically read storage medium developed by Philips and Sony beginning in the 1970s. CDs have a maximum storage capacity of 700 MB or 80 minutes of audio.
COLD (Computer Output to Laser Disk)
A computer process that outputs electronic records and printed reports to laser disk instead of to a printer. It can be used to replace COM (Computer Output to Microfilm) or printed reports.
Compression
The re-encoding of data to reduce file size. There are many compression algorithms, which are generally separated into “lossy” and “lossless” compression (see definitions below). Because image files tend to be large, most image file formats use some type of compression, reducing disk space usage and transmission time over networks.
Conversion
Transformation of data from one format to another, e.g., paper to an electronic file, microfilm to an electronic file, or one electronic file format to another.
Data
A generic term for raw or unprocessed information.
De-skewing
Straightening off-center (skewed) images. De-skewing is important for both human and automated reading and can improve OCR accuracy. Skewing occurs during copying, scanning, and faxing. De-skewing can correct skew introduced by the current process or by an earlier one; for example, a fax that arrived skewed can be de-skewed when it is subsequently scanned.
Document
A unit of data or information. A document is typically a physical (paper or hard copy) object consisting of one or more pages. The term is also used for electronic files (such as a Microsoft Word document); in most cases these electronic objects resemble, or are intended to produce, a hard copy.
Document Imaging
Storing, managing, retrieving, and distributing electronic images via a local computer or a network.
Download
Typically, retrieving a file from a remote computer; the term is often used in the context of the Internet. Downloading can be local or via a network.
Duplex
In scanning terminology, refers to scanning both sides of a page in a single pass.
DPI (Dots Per Inch)
Also see Resolution. A measurement of printer resolution, although it is often misused to describe scanner or monitor resolution. Specifically, DPI refers to the number of dots per linear inch a printer is capable of producing. Up to a point, the higher the DPI, the better the resulting image.
DVD (Digital Versatile Disc or Digital Video Disc)
An optical storage medium similar to CDs but with higher-level specifications for audio and video and an increased storage capacity of up to 4.7 gigabytes per layer. Available as single-sided, single-layer (4.7 GB); single-sided, double-layer (8.5 GB); double-sided, single-layer (9.4 GB); and double-sided, double-layer (17 GB). Two newer formats (HD DVD and Blu-ray) were recently released. These formats increase storage capacity roughly three- to fivefold per layer, but the current cost of devices and blank discs has limited their use for data storage.
Virtually all DVD players and recorders also process CDs.
Electronic Document
A paper document that has been converted (scanned and OCRed) or a document originally created on a computer. The term implies the ability to search for and/or manipulate document content. An electronic image or document image, by contrast, can be searched only by file name and/or by index or key fields/words; it has no text content per se, since the content is an image, essentially a photograph.
Electronic Document Management
A software application or system performing some or all of the following functions: imaging, indexing, tracking, search and retrieval, and viewing of electronic documents.
File
For paper or hard copy, pages or documents held within some type of container. The container is typically also paper, but of a thicker consistency than the documents. For electronic records, a collection of data on computer storage media that is uniquely identifiable and accessible.
Full-text Indexing and Search
Enabling retrieval of documents or portions of documents by any word or phrase content. Every word in the document is indexed along with the locations (pages) for each occurrence of the word.
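As a rough illustration of the idea, the following Python sketch builds a tiny index that maps every word to the pages on which it occurs. The function name build_index and the sample page text are hypothetical, made up for this example; real full-text engines also handle phrases, stemming, and ranking.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the sorted list of page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        # Split the page text into simple alphanumeric words.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}

# Hypothetical three-page document.
pages = [
    "Invoice for scanner maintenance",
    "Scanner warranty and invoice terms",
    "Shipping notice",
]
index = build_index(pages)
print(index["invoice"])   # [1, 2]
print(index["shipping"])  # [3]
```

A search for a word then becomes a dictionary lookup rather than a scan of every page.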
Group 4 Compression
CCITT Fax Group 4 compression algorithm frequently used as a TIFF file option for black and white images. It is also used in Adobe Acrobat (PDF) files.
GIF (Graphics Interchange Format)
An image file format primarily used for web sites and monitor display. It uses LZW compression and is better suited to color and grayscale images than to black-and-white images. LZW is “lossless”, which means it will not compress as well as JPEG but will retain all of the image’s quality.
Grayscale
An image type defined by the use of a range of gray shades to capture an image. More gray shades result in a better-looking image but also dramatically increase file size. Typical grayscale ranges run from 8 to 256 shades.
Image
A digitized picture of a hard copy document or portion of a document.
Image File Format
When a page is scanned, it can be stored in a number of file types. The type should be chosen based on the intended use of the image and the software that will be used. Different file formats commonly use different methods of compression, and some types of images compress better in some formats than in others. The most common formats include TIFF, GIF, JPEG, and PDF.
ICR (Intelligent Character Recognition)
A software process used to convert hand printing to text. Used primarily with structured forms, this process works best when the form contains boxes for hand entry, such as the hand-printed entry of a name or address.
Indexing
A form of data entry that creates a linked database using alphanumeric input. A search of the indexed data will retrieve the relevant scanned document. The ultimate form of indexing is full-text indexing, where every word within a document is indexed.
Index Fields
Database fields used to categorize and organize documents. Often user-defined, these fields can be used for searches.
ISIS and TWAIN Scanner Drivers
Specialized applications used for communication between scanners and computers.
JBIG
A “lossless” image compression format for binary (black and white) images. It compresses up to 25 percent better than Group 4 (G4) and also supports progressive encoding. Licensing issues have slowed its adoption.
JBIG2
An image compression format for binary (black and white) images, commonly used in its “lossy” mode (a lossless mode also exists). A JBIG2 compressor identifies common objects (usually characters) in the image and creates a dictionary with references to those objects. Lossiness is introduced by allowing similar objects to be represented by a single dictionary entry. This format is supported in PDF 1.4 and later.
JPEG
An image file format that is best suited for photographs. Its compression is “lossy”, which means that it will throw away some detail in order to achieve better compression. It does not work well for text.
Lossy Compression
A data compression algorithm that discards (loses) some information during the compression process. Ideally, the lost information has minimal effect on the quality of the image and/or can be approximated by interpolation from the remaining data. Lossy algorithms tend to create smaller files and, depending upon the data (image, video, or audio) and its subsequent use or display, the loss can be minimal or undetectable.
Lossless Compression
A data compression algorithm that ensures no original data is lost during the compression process, often resulting in larger files than lossy compression produces. Lossless compression becomes important when post-processing of data could affect data quality. A prime example is photograph enlargement: if a portion of a photograph is cropped and then enlarged to the same size as the original, the effect of any data loss may become apparent to the naked eye.
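The difference between the two families can be seen in a few lines of Python. This is only an illustrative sketch: the sample values are made up, zlib (DEFLATE) stands in for lossless compression in general, and the "lossy" step is a crude reduction to 16 gray levels rather than any real image codec.

```python
import zlib

# Made-up 8-bit grayscale samples, repeated to mimic a small image strip.
pixels = bytes([12, 13, 14, 200, 201, 202, 90, 91] * 50)

# Lossless: decompressing restores every byte exactly.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
print(restored == pixels)  # True

# Lossy (illustrative): keep only 16 gray levels by dropping the low 4 bits.
quantized = bytes((p >> 4) << 4 for p in pixels)
lossy_packed = zlib.compress(quantized)

print(len(pixels), len(packed), len(lossy_packed))
print(quantized == pixels)  # False: the discarded detail cannot be recovered
```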
LZW
Lempel-Ziv-Welch, a widely used lossless compression algorithm.
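The core idea of LZW is to build a dictionary of byte sequences while reading the data and to emit one code per longest known sequence. The sketch below is a textbook-style version in Python with unbounded integer codes; the function names are hypothetical, and real file formats such as GIF and TIFF add variable-width code packing and dictionary resets that are omitted here.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Return LZW codes, starting from a dictionary of all 256 single bytes."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate            # keep extending the known sequence
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes by recreating the same dictionary."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            entry = previous + previous[:1]  # code defined by the step in progress
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)

text = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
print(len(text), len(codes))          # 24 16
print(lzw_decompress(codes) == text)  # True
```

Because the decompressor rebuilds the dictionary on its own, no dictionary has to be stored with the data.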
OCR (Optical Character Recognition)
A software process that converts all or a portion of a digital image into text. Accuracy has increased dramatically in the past few years but is still dependent upon factors such as the quality of the original document, the size of fonts, and the software application itself. Commercial-grade OCR packages are dramatically more accurate, and more expensive, than those available for home or small business use. While the difference in accuracy (from 94-95% to 98-99%) may seem small for a dramatic cost increase ($150 to thousands), the cost can be justified when one considers volume. 10,000 pages with an average of 400 words per page amounts to 4 million words, so a four-point gain in accuracy is a potential reduction of 160,000 errors. The time saved in proofreading is significant.
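The arithmetic behind the 160,000 figure can be checked with a short Python calculation. It assumes, as the example in the text does, that accuracy is measured per word, and it compares the upper ends of the two quoted ranges (95% and 99%); the variable names are illustrative only.

```python
pages = 10_000
words_per_page = 400
total_words = pages * words_per_page  # 4,000,000 words

def expected_errors(words, accuracy_percent):
    """Words expected to be misread at a given whole-number accuracy percentage."""
    return words * (100 - accuracy_percent) // 100

errors_basic = expected_errors(total_words, 95)       # 200,000
errors_commercial = expected_errors(total_words, 99)  # 40,000
print(errors_basic - errors_commercial)               # 160000
```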
OMR (Optical Mark Recognition)
Used primarily for structured forms, this software process converts marks on paper to information. An example is a manually completed form that includes check boxes; whether each box is marked or not is itself the information.
PDF (Portable Document Format)
A file format developed by Adobe for use with its Acrobat product that has become a standard common file format. A PDF file can contain text, images, and graphics, or a combination thereof. The Adobe Acrobat Reader is readily available free of charge, thereby providing an easy way for anyone to view an electronic file without regard to operating system or application software. The full Adobe Acrobat software suite creates PDF files and PDF forms (allowing on-line completion) and provides security and many other functions.
Pixel (Picture Element)
A single unit in an electronic image. A pixel is not simply a “dot” as in “dots per inch”; it is a single unit or point in an image together with data describing the color of that point. Color depth ranges from 1 bit per pixel (black and white) to the 24 or 32 bits per pixel offered by common PC displays, which allows over 16 million colors per pixel.
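The number of distinct colors a pixel can take follows directly from its bit depth (two raised to the number of color bits). A minimal Python check, with a hypothetical helper name; note that on 32-bit displays typically only 24 bits carry color, which is where the "over 16 million" figure comes from.

```python
def colors_for_depth(bits_per_pixel):
    """Number of distinct values a pixel with this many color bits can hold."""
    return 2 ** bits_per_pixel

for bits in (1, 8, 24):
    print(bits, colors_for_depth(bits))
# 1 2
# 8 256
# 24 16777216
```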
Resolution
When applied to images, resolution is simply the amount of detail in an image. For print output, and often for scanned input, resolution is stated in dots per inch (dpi). Scanned input is more properly defined in samples per inch (SPI), a measure of the number of points sampled when the scan is performed, as opposed to dpi, which describes how the data is stored. PC monitor or display resolution is commonly stated in pixels per inch, or PPI. As a rule, the higher the resolution, the larger the file. Note that file size does not grow linearly with resolution: doubling the resolution roughly quadruples the uncompressed size.
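The relationship between resolution and file size can be worked through in Python. The sketch assumes an uncompressed image of a US Letter page (8.5 by 11 inches) and ignores file headers and compression; the function name is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw image size in bytes for a page scanned at a given dpi."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Black-and-white (1 bit per pixel) scans of a Letter page.
size_300 = uncompressed_bytes(8.5, 11, 300, 1)
size_600 = uncompressed_bytes(8.5, 11, 600, 1)
print(size_300)             # 1051875
print(size_600)             # 4207500
print(size_600 / size_300)  # 4.0
```

Doubling the dpi doubles the sample count in both directions, so the pixel count, and with it the raw size, grows by a factor of four.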
Scanner
A device that applies light to a paper page, sampling and digitally recording the presence or absence of marks. Depending upon the device, sampling and recording can be black and white, grayscale, or color. Modern high-performance scanners are capable of over 300 pages per minute. Most commercial-grade scanners augment their inherent sampling and image processing capability through hardware and software on an attached computer.
Thumbnail
A reproduction of an image, reduced in size and quality to permit quick perusal of one or several images, usually so that a specific image can be selected for closer inspection.
TIFF (Tagged Image File Format)
An industry standard image file format. TIFF is a flexible and adaptable file format. It can handle multiple images and data in a single file through the inclusion of “tags” in the file header. Tags can indicate the basic geometry of the image, such as its size, or define how the image data is arranged and whether various image compression options are used.