Automatic indexing

The popularity of Internet search engines has led many people to think of the process of entering queries to retrieve documents from the Web as automatic indexing. It is not.

Automatic indexing is the process of assigning and arranging index terms for natural-language texts without human intervention. For several decades, there have been many attempts to create such processes, driven both by the intellectual challenge and by the desire to significantly reduce the time and cost of producing indexes. Dozens if not hundreds of computer programs have been written to identify the words in a text and their location, and to alphabetize the words. Typically, definite and indefinite articles, prepositions and other words on a so-called stop list are not included in the program's output. Even some word processors provide this capability. Nevertheless, computer-generated results are often more like concordances (lists of words in a document) than truly usable indexes. There are several reasons for this.
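The kind of program described above can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any particular product: the function names, the very short stop list and the sample text are all assumptions made for the example, and locations are recorded as page numbers.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration; real programs use much longer ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "for", "by"}


def build_concordance(pages):
    """Map each non-stop word to the sorted page numbers on which it occurs.

    `pages` is a list of page texts; page numbers start at 1.
    """
    locations = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        # Identify the words: runs of letters, optionally with an apostrophe.
        for word in re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()):
            if word not in STOP_WORDS:
                locations[word].add(page_number)
    # Alphabetize the words, as such programs typically do.
    return {word: sorted(nums) for word, nums in sorted(locations.items())}


def format_concordance(concordance):
    """Render one 'word, page, page' line per entry."""
    return "\n".join(
        f"{word}, {', '.join(map(str, nums))}"
        for word, nums in concordance.items()
    )


# Made-up sample text, one string per page.
sample_pages = [
    "The index is a guide to the topics in a book.",
    "An index is not a list of words; a concordance is.",
]

print(format_concordance(build_concordance(sample_pages)))
```

The result lists words such as "not" and "list" alongside "index", simply because they occur in the text. Nothing in the program judges whether a word signifies a topic worth a heading, which is why such output resembles a concordance.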

The primary reason computers cannot automatically generate usable indexes is that, in indexing, abstraction is more important than alphabetization. Abstractions result from intellectual processes based on judgments about what to include and what to exclude. Computers are good at algorithmic processes such as alphabetization, but poor at processes that cannot be fully explained, such as abstraction.

Another reason is that headings in an index do not depend solely on terms used in the document; they also depend on the terminology employed by intended users of the index and on their familiarity with the document. For example, in medical indexing, separate entries may need to be provided for brand names of drugs, chemical names, popular names and names used in other countries, even when some of those names are not mentioned in the text.

A third reason is that indexes should not contain headings for topics about which the document provides no information. A typical document includes many terms signifying topics about which it contains no information. Computer programs include those terms in their results because they lack the intelligence required to distinguish terms signifying topics about which information is presented from terms signifying topics about which none is presented.

A fourth reason is that headings and subheadings should be tailored to the needs and viewpoints of anticipated users. Some indexes are aimed at users who are very knowledgeable about the topics addressed in the document; others at users with little knowledge. Some serve as reminders to those who have already read the document; others as enticements to potential readers.

To date, no one has found a way to provide computer programs with the judgment, expertise, intelligence or audience awareness needed to create usable indexes. Until someone does, automatic indexing will remain a pipe dream.

Although automatic indexing is a pipe dream, computers are nevertheless an essential tool for indexers, though not a replacement for them.
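One way to picture that division of labor is the medical indexing example given earlier. In the sketch below, a human indexer supplies the judgment, in the form of a table of alternative drug names that users might look up, and the program does only the mechanical work of turning the table into alphabetized cross-references. The table contents, the function name and the "See" entry format are assumptions chosen for illustration; they are not taken from any particular indexing program.

```python
# Hypothetical synonym table compiled by a human indexer:
# preferred heading -> alternative names users may look up
# (a name used in other countries, a brand name, a chemical name).
SYNONYMS = {
    "acetaminophen": ["paracetamol", "Tylenol", "N-acetyl-p-aminophenol"],
}


def cross_references(synonyms):
    """Return alphabetized 'X. See Y' entries, one per alternative name."""
    entries = [
        f"{alternative}. See {preferred}"
        for preferred, alternatives in synonyms.items()
        for alternative in alternatives
    ]
    # Sort without regard to capitalization, as index entries usually are.
    return sorted(entries, key=str.casefold)


for entry in cross_references(SYNONYMS):
    print(entry)
```

The program cannot decide which names belong in the table or which one should be the preferred heading. Those choices depend on the intended users and remain the indexer's.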

Copyright © 2005 Martin Tulic. All rights reserved.