In any collection, physical objects are related by order. The ordering may be random or according to some characteristic called a key. Such characteristics may be intrinsic properties of the objects (e.g., size, weight, shape, or colour), or they may be assigned from some agreed-upon set, such as object class or date of purchase. The values of the key are arranged in a sorting sequence that is dependent on the type of key involved: alphanumeric key values are usually sorted in alphabetic sequence, while other types may be sorted on the basis of similarity in class, such as books on a particular subject or flora of the same genus.

In most cases, order is imposed on a set of information objects for two reasons: to create their inventory and to facilitate locating specific objects in the set. There also exist other, secondary objectives for selecting a particular ordering, as, for example, conservation of space or economy of effort in fetching objects. Unless the objects in a collection are replicated, any ordering scheme is one-dimensional and unable to meet all the functions of ordering with equal effectiveness. The main approach for overcoming some of the limitations of one-dimensional ordering of recorded information relies on extended description of its content and, for analog-form information, of some features of the physical items. This approach employs various tools of content analysis that subsequently facilitate accessing and searching recorded information.

Description and content analysis of analog-form records

The collections of libraries and archives, the primary repositories of analog-form information, constitute one-dimensional ordering of physical materials in print (documents), in image form (maps and photographs), or in audio-video format (recordings and videotapes). To break away from the confines of one-dimensional ordering, librarianship has developed an extensive set of attributes in terms of which it describes each item in the collection. The rules for assigning these attributes are called cataloging rules. Descriptive cataloging is the extraction of bibliographic elements (author names, title, publisher, date of publication, etc.) from each item; the assignment of subject categories or headings to such items is termed subject cataloging.

Conceptually, the library catalog is a table or matrix in which each row describes a discrete physical item and each column provides the values of one of the assigned attributes. When such a catalog is represented digitally in a computer, any attribute can serve as the ordering key. By sorting the catalog on different keys, it is possible to produce a variety of indexes as well as subject bibliographies. More important, any of the attributes of a computerized catalog becomes a search key (access point) to the collection, surpassing the utility of the traditional card catalog.
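
As a rough illustration of this flexibility, the following sketch (in Python, with sample records) re-sorts a small catalog on different attributes and uses an arbitrary attribute as a search key:

# A hypothetical computerized catalog: each row describes one item, and each
# column (dictionary key) holds the value of one assigned attribute.
catalog = [
    {"author": "Borko, H.", "title": "Abstracting Concepts", "subject": "indexing", "year": 1975},
    {"author": "Cutter, C.", "title": "Rules for a Dictionary Catalog", "subject": "cataloging", "year": 1876},
    {"author": "Ranganathan, S.", "title": "Colon Classification", "subject": "classification", "year": 1933},
]

# Sorting the catalog on different keys produces different indexes.
by_author = sorted(catalog, key=lambda row: row["author"])
by_year = sorted(catalog, key=lambda row: row["year"])

# Any attribute can serve as a search key (access point) to the collection.
def search(rows, attribute, value):
    return [row for row in rows if row[attribute] == value]

print(search(catalog, "subject", "cataloging"))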

The most useful access key to analog-form items is subject. The extensive lists of subject headings of library classification schemes provide, however, only a gross access tool to the content of the items. A technique called indexing provides a refinement over library subject headings. It consists of extracting from the item or assigning to it subject and other “descriptors”—words or phrases denoting significant concepts (topics, names) that occur in or characterize the content of the record. Indexing frequently accompanies abstracting, a technique for condensing the full text of a document into a short summary that contains its main ideas (but invariably incurs an information loss and often introduces a bias). Computer-printed, indexed abstracting journals provide a means of keeping users informed of primary information sources.

Description and content analysis of digital-form information

The description of an electronic document generally follows the principles of bibliographic cataloging if the document is part of a database that is expected to be accessed directly and individually. When the database is an element of a universe of globally distributed database servers that are searchable in parallel, the matter of document naming is considerably more challenging, because several complexities are introduced. The document description must include the name of the database server—i.e., its physical location. Because database servers may delete particular documents, the description must also contain a pointer to the document’s logical address (the generating organization). In contrast to their usefulness in the descriptive cataloging of analog documents, physical attributes such as format and size are highly variable in the milieu of electronic documents and therefore are meaningless in a universal document-naming scheme. On the other hand, the data type of the document (text, sound, etc.) is critical to its transmission and use. Perhaps the most challenging design is the “living document”—a constantly changing pastiche consisting of sections electronically copied from different documents, interspersed with original narrative or graphics or voice comments contributed by persons in distant locations, whose different versions reside on different servers. Efforts are under way to standardize the naming of documents in the universe of electronic networks.
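
By way of illustration only, the following sketch lists the kinds of elements such a description might carry; the field names are hypothetical and do not reflect any actual naming standard:

from dataclasses import dataclass

@dataclass
class NetworkDocumentDescription:
    # Hypothetical fields suggested by the discussion above, not a standard.
    server: str           # physical location: the database server holding this copy
    logical_address: str  # pointer to the generating organization, should the copy be deleted
    data_type: str        # "text", "sound", "image", ... (critical to transmission and use)
    version: str          # distinguishes versions of a "living document" held on different servers

example = NetworkDocumentDescription(
    server="server.example.org",
    logical_address="org:example:annual-report",
    data_type="text",
    version="draft-2",
)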

Machine indexing

The subject analysis of electronic text is accomplished by means of machine indexing, using one of two approaches: the assignment of subject descriptors from an unlimited vocabulary (free indexing) or their assignment from a list of authorized descriptors (controlled indexing). A collection of authorized descriptors is called an authority list or, if it also displays various relationships among descriptors such as hierarchy or synonymy, a thesaurus. The result of the indexing process is a computer file known as an inverted index, which is an alphabetic listing of descriptors and the addresses of their occurrence in the document body.
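
A minimal sketch of controlled indexing, written in Python with an invented miniature thesaurus, shows how variant terms are mapped to authorized descriptors and accumulated into an inverted index:

# Toy thesaurus: variant terms (including synonyms) map to an authorized descriptor.
thesaurus = {
    "car": "automobiles",
    "cars": "automobiles",
    "automobile": "automobiles",
    "smog": "air pollution",
    "pollution": "air pollution",
}

documents = {
    "doc1": "smog from cars is a growing problem",
    "doc2": "the automobile changed city planning",
}

# Inverted index: authorized descriptor -> addresses (here, document identifiers).
inverted_index = {}
for address, text in documents.items():
    for word in text.split():
        descriptor = thesaurus.get(word)   # controlled indexing keeps only authorized terms
        if descriptor is not None:
            inverted_index.setdefault(descriptor, set()).add(address)

for descriptor in sorted(inverted_index):
    print(descriptor, sorted(inverted_index[descriptor]))

Free indexing would simply omit the thesaurus lookup and admit any word or phrase as a descriptor.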

Full-text indexing, the use of every character string (word of a natural language) in the text as an index term, is an extreme case of free-text indexing: each word in the document (except function words such as articles and prepositions) becomes an access point to it. Used earlier for the generation of concordances in literary analysis and other computer applications in the humanities, full-text indexing placed great demands on computer storage because the resulting index is at least as large as the body of the text. With decreasing cost of mass storage, automatic full-text indexing capability has been incorporated routinely into state-of-the-art information-management software.
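
In the same vein, a sketch of full-text indexing: every word except a handful of function words becomes an access point, which is why the resulting index rivals the text itself in size (the stop list and documents below are invented):

# A small stop list of function words that make poor access points.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

documents = {
    "doc1": "the analysis of the content of documents",
    "doc2": "content analysis of images and speech",
}

# Full-text inverted index: every remaining word -> the documents in which it occurs.
index = {}
for doc_id, text in documents.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index.setdefault(word, set()).add(doc_id)

for word in sorted(index):
    print(word, sorted(index[word]))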

Text indexing may be supplemented by other syntactic techniques so as to increase its precision or robustness. One such method, the Standard Generalized Markup Language (SGML), takes advantage of standard text markers used by editors to pinpoint the location and other characteristics of document elements (paragraphs and tables, for example). In indexing spatial data such as maps and astronomical images, the textual index specifies the search areas, each of which is further described by a set of coordinates defining a rectangle or irregular polygon. These digital spatial document attributes are then used to retrieve and display a specific point or a selected region of the document. There are other specialized techniques that may be employed to augment the indexing of specific document types, such as encyclopaedias, electronic mail, catalogs, bulletin boards, tables, and maps.
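
For the spatial case, a brief sketch (with invented coordinates) of how a textual index entry paired with a bounding rectangle can be used to retrieve the named areas covering a given point on a map:

# Each textual index entry names a search area described by the coordinates
# of a rectangle (min_x, min_y, max_x, max_y) in the map's coordinate system.
spatial_index = {
    "harbour": (0.0, 0.0, 4.0, 2.5),
    "old town": (3.0, 2.0, 6.5, 5.0),
}

def areas_containing(x, y):
    """Return the named areas whose rectangles contain the point (x, y)."""
    return [name for name, (x0, y0, x1, y1) in spatial_index.items()
            if x0 <= x <= x1 and y0 <= y <= y1]

print(areas_containing(3.5, 2.2))   # the point lies in both rectangles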

Semantic content analysis

The analysis of digitally recorded natural-language information from the semantic viewpoint is a matter of considerable complexity, and it lies at the foundation of such incipient applications as automatic question answering from a database or retrieval by means of unrestricted natural-language queries. The general approach has been that of computational linguistics: to derive representations of the syntactic and semantic relations between the linguistic elements of sentences and larger parts of the document. Syntactic relations are described by parsing (decomposing) the grammar of sentences (Figure 3). For semantic representation, three related formalisms dominate. In a so-called semantic network, conceptual entities such as objects, actions, or events are represented as a graph of linked nodes (Figure 4). “Frames” represent, in a similar graph network, physical or abstract attributes of objects and in a sense define the objects. In “scripts,” events and actions rather than objects are defined in terms of their attributes.
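
A toy illustration of a semantic network, with invented nodes and link types: conceptual entities are nodes, relations are labelled links, and a query can follow the links, inheriting attributes along the class hierarchy:

# Each triple links a conceptual entity to another node via a named relation.
semantic_network = [
    ("canary", "is-a", "bird"),
    ("bird", "is-a", "animal"),
    ("bird", "has-part", "wings"),
    ("canary", "has-property", "yellow"),
]

def related(node, relation):
    """Follow links of the given type from a node, inheriting them along is-a links."""
    results, visited, frontier = set(), set(), {node}
    while frontier:
        current = frontier.pop()
        visited.add(current)
        for subject, rel, obj in semantic_network:
            if subject != current:
                continue
            if rel == relation:
                results.add(obj)
            elif rel == "is-a" and obj not in visited:
                frontier.add(obj)
    return results

print(related("canary", "has-part"))   # wings, inherited from "bird"

Frames and scripts can be viewed as similar bundles of attribute slots, attached to objects and to events or actions respectively.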

Indexing and linguistic analyses of text generate a relatively gross measure of the semantic relationship, or subject similarity, of documents in a given collection. Subject similarity is, however, a pragmatic phenomenon that varies with the observer and the circumstances of an observation (purpose, time, and so forth). A technique experimented with briefly in the mid-1960s, which assigned to each document one or more “roles” (functions) and one or more “links” (pointers to other documents having the same or a similar role), showed potential for a pragmatic measure of similarity; its use, however, was too unwieldy for the computing environment of the day. Some 20 years later, a similar technique became popular under the name “hypertext.” In this technique, documents that a person or a group of persons consider related (by concept, sequence, hierarchy, experience, motive, or other characteristics) are connected via “hyperlinks,” mimicking the way humans associate ideas. Objects so linked need not be only text; speech and music, graphics and images, and animation and video can all be interlinked into a “hypermedia” database. The objects are stored with their hyperlinks, and a user can easily navigate the network of associations by clicking with a mouse on a series of entries on a computer screen. Another technique that elicits semantic relationships from a body of text is SGML.

Image analysis

The content analysis of images is accomplished by two primary methods: image processing and pattern recognition. Image processing is a set of computational techniques for analyzing, enhancing, compressing, and reconstructing images. Pattern recognition is an information-reduction process: the assignment of visual or logical patterns to classes based on the features of these patterns and their relationships. The stages in pattern recognition involve measurement of the object to identify distinguishing attributes, extraction of features for the defining attributes, and assignment of the object to a class based on these features. Both image processing and pattern recognition have extensive applications in various areas, including astronomy, medicine, industrial robotics, and remote sensing by satellites.
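
A compact sketch of these stages on invented data: a raw measurement is reduced to a feature vector, which is then assigned to the class whose stored prototype it most closely resembles:

import math

# Stored class prototypes: typical feature vectors (all values invented).
prototypes = {
    "bolt": (0.9, 0.2),      # features: (elongation, roundness)
    "washer": (0.1, 0.9),
}

def extract_features(length_mm, width_mm):
    """Stage 2: reduce raw measurements to distinguishing features."""
    elongation = 1.0 - width_mm / length_mm
    roundness = width_mm / length_mm
    return (elongation, roundness)

def classify(features):
    """Stage 3: assign the object to the class with the nearest prototype."""
    return min(prototypes, key=lambda cls: math.dist(features, prototypes[cls]))

measurements = (20.0, 4.0)                           # Stage 1: measured length and width
print(classify(extract_features(*measurements)))     # Stages 2 and 3 -> "bolt"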

Speech analysis

The immediate objective of content analysis of digital speech is the conversion of discrete sound elements into their alphanumeric equivalents. Once so represented, speech can be subjected to the same techniques of content analysis as natural-language text—i.e., indexing and linguistic analysis. Converting speech elements into their alphanumeric counterparts is an intriguing problem because the “shape” of speech sounds embodies a wide range of acoustic characteristics and because the linguistic elements of speech are not clearly distinguishable from one another. The technique used in speech processing is to classify the spectral representations of sound and to match the resulting digital spectrographs against prestored “templates” so as to identify the alphanumeric equivalent of the sound. (The obverse of this technique, the digital-to-analog conversion of such templates into sound, is a relatively straightforward approach to generating synthetic speech.)
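
The matching step can be sketched as follows, assuming the incoming sound has already been reduced to a short spectral vector and that templates for a few sounds are prestored (all values invented):

import math

# Prestored templates: simplified spectral representations of known sounds,
# paired with their alphanumeric equivalents.
templates = {
    "a": [0.9, 0.3, 0.1],
    "i": [0.2, 0.8, 0.4],
    "s": [0.1, 0.2, 0.9],
}

def recognize(spectrum):
    """Match an incoming spectral vector against the templates; return the closest symbol."""
    return min(templates, key=lambda symbol: math.dist(spectrum, templates[symbol]))

print(recognize([0.85, 0.35, 0.15]))   # closest to the template for "a"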

Speech processing is complex as well as expensive in terms of storage capacity and computational requirements. State-of-the-art speech recognition systems can identify limited vocabularies and parts of distinctly spoken speech and can be programmed to recognize tonal idiosyncrasies of individual speakers. When more robust and reliable techniques become available and the process is made computationally tractable (as is expected with parallel computers), humans will be able to interact with computers via spoken commands and queries on a routine basis. In many situations this may make the keyboard obsolete as a data-entry device.

Storage structures for digital-form information

Digital information is stored in complex patterns that make it feasible to address and operate on even the smallest element of symbolic expression, as well as on larger strings such as words or sentences and on images and sound.

From the viewpoint of digital information storage, it is useful to distinguish between “structured” data, such as inventories of objects that can be represented by short symbol strings and numbers, and “unstructured” data, such as the natural-language text of documents or pictorial images. The principal objective of all storage structures is to facilitate the processing of data elements on the basis of their relationships; the structures thus vary with the type of relationship they represent. The choice of a particular storage structure is governed by whether the relationships it can represent match the information-processing requirements of the task or system at hand.

In information systems whose store consists of unstructured databases of natural-language records, the objective is to retrieve records (or portions thereof) on the basis of the presence in the records of words or short phrases that constitute the query. Because the index exists as a separate file that records the locations of words and phrases in the database records, the relationships of interest (e.g., word adjacency) can be calculated from the index. Consequently, the database text itself can be stored as a simple ordered sequential file of records. The majority of the computations use the index, and they access the text file only to pull out the records or those portions that satisfy the result of the computations. The sequential file structure remains popular with document-retrieval software intended for use with personal computers and CD-ROM databases.
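
The division of labour between index and text file can be sketched as follows (with invented records): the adjacency computation uses only the index, and the sequential text file is read only to pull out the record that satisfies the query:

# Sequential file of records (the database text itself).
text_file = {
    "rec1": "content analysis of digital images",
    "rec2": "digital content delivered over networks",
}

# Separate index file: word -> list of (record, position) occurrences.
index = {}
for rec_id, text in text_file.items():
    for pos, word in enumerate(text.split()):
        index.setdefault(word, []).append((rec_id, pos))

def adjacent(word1, word2):
    """Find records in which word2 immediately follows word1, using the index alone."""
    follows = {(rec, pos + 1) for rec, pos in index.get(word1, [])}
    return {rec for rec, pos in index.get(word2, []) if (rec, pos) in follows}

# Only now is the text file accessed, to pull out the records that satisfy the query.
for rec_id in adjacent("digital", "content"):
    print(rec_id, text_file[rec_id])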

When relationships between data elements need to be represented as part of the records so as to make more efficient the desired operations on these records, two types of “chained” structures are commonly used: hierarchical and network. In the hierarchical file structure, records are arranged in a scheme resembling a family tree, with records related to one another from top to bottom. In the network file structure, records are arranged in groupings known as sets; these can be connected in any number of ways, giving rise to considerable flexibility. In both hierarchical and network structures, the relationships are shown by means of “pointers” (i.e., identifiers such as addresses or keys) that become part of the records.
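
A sketch of the two chained structures, with invented record identifiers: in the hierarchical case each record carries a pointer to its parent, while in the network case records are gathered into named sets:

# Hierarchical structure: each record carries a pointer (key) to its parent record.
records = {
    "R1": {"title": "Annual report", "parent": None},
    "R2": {"title": "Chapter 1", "parent": "R1"},
    "R3": {"title": "Table 1.1", "parent": "R2"},
}

def ancestors(record_id):
    """Follow parent pointers from a record up to the root."""
    chain = []
    parent = records[record_id]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = records[parent]["parent"]
    return chain

print(ancestors("R3"))        # ['R2', 'R1']

# Network structure: records are grouped into named sets, and a record may belong
# to any number of sets, which provides the added flexibility mentioned above.
sets = {
    "published-1996": ["R1"],
    "contains-tables": ["R2", "R3"],
}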

Another type of database storage structure, the relational structure, has become increasingly popular since the late 1970s. Its major advantage over the hierarchical and network structures is the ability to handle unanticipated data relationships without pointers. Relational storage structures are two-dimensional tables consisting of rows and columns, much like the conceptual library catalog mentioned above. The elegance of the relational model lies in its conceptual simplicity, the availability of theoretical underpinnings (relational algebra), and the ability of its associated software to handle data relationships without the use of pointers. The relational model was initially used for databases containing highly structured information. In the 1990s it largely replaced the hierarchical and network models, and it also became the model of choice for large-scale information-management applications, both textual and multimedia.
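
A sketch of the relational alternative, with two invented tables: the relationship between them is expressed at query time by matching key values rather than by pointers stored in the records:

# Two relations (tables) of rows and columns, much like the conceptual catalog above.
books = [
    {"book_id": 1, "title": "Information Retrieval", "subject_id": 10},
    {"book_id": 2, "title": "Database Systems", "subject_id": 20},
]
subjects = [
    {"subject_id": 10, "subject": "indexing"},
    {"subject_id": 20, "subject": "data management"},
]

# An unanticipated relationship is handled by joining on matching values, not pointers.
joined = [
    {"title": b["title"], "subject": s["subject"]}
    for b in books
    for s in subjects
    if b["subject_id"] == s["subject_id"]
]
print(joined)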

The feasibility of storing large volumes of full text on an economical medium (the digital optical disc) has renewed interest in the study of storage structures that permit more powerful retrieval and processing techniques to operate on cognitive entities other than words, to facilitate more extensive semantic content and context analysis, and to organize text conceptually into logical units rather than those dictated by printing conventions.

Query languages

The uses of databases are manifold. They provide a means of retrieving records or parts of records and performing various calculations before displaying the results. The interface by which such manipulations are specified is called the query language. Whereas early query languages were so complex that interacting with electronic databases could be done only by specially trained individuals, recent interfaces are more user-friendly, allowing casual users to access database information.

The main types of popular query modes are the menu, the “fill-in-the-blank” technique, and the structured query. Particularly suited for novices, the menu requires a person to choose from several alternatives displayed on the video terminal screen. The fill-in-the-blank technique is one in which the user is prompted to enter key words as search statements. The structured query approach is effective with relational databases. It has a formal, powerful syntax that is in fact a programming language, and it is able to accommodate logical operators. One implementation of this approach, the Structured Query Language (SQL), has the form

select [field Fa, Fb, . . ., Fn]
from [database Da, Db, . . ., Dn]
where [field Fa = abc] and [field Fb = def].

Structured query languages support database searching and other operations by using commands such as “find,” “delete,” “print,” “sum,” and so forth. The sentencelike structure of an SQL query resembles natural language except that its syntax is limited and fixed. Instead of using an SQL statement, it is possible to represent queries in tabular form. The technique, referred to as query-by-example (or QBE), displays an empty tabular form and expects the searcher to enter the search specifications into appropriate columns. The program then constructs an SQL-type query from the table and executes it.
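
As a concrete illustration of the select-from-where form, the following sketch uses Python's built-in sqlite3 module with an invented table (in SQL proper the from clause names tables held within a database):

import sqlite3

conn = sqlite3.connect(":memory:")              # throwaway in-memory database
conn.execute("CREATE TABLE catalog (author TEXT, title TEXT, subject TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        ("Borko", "Abstracting Concepts", "indexing", 1975),
        ("Codd", "A Relational Model of Data", "databases", 1970),
    ],
)

# select [fields] from [table] where [conditions], joined by a logical operator.
rows = conn.execute(
    "SELECT author, title FROM catalog WHERE subject = ? AND year < ?",
    ("databases", 1980),
).fetchall()
print(rows)          # [('Codd', 'A Relational Model of Data')]
conn.close()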

The most flexible query language is of course natural language. The use of natural-language sentences in a constrained form to search databases is allowed by some commercial database management software. These programs parse the syntax of the query; recognize its action words and their synonyms; identify the names of files, records, and fields; and perform the logical operations required. Experimental systems that accept such natural-language queries in spoken voice have been developed; however, the ability to employ unrestricted natural language to query unstructured information will require further advances in machine understanding of natural language, particularly in techniques of representing the semantic and pragmatic context of ideas. The prospect of an intelligent conversation between humans and a large store of digitally encoded knowledge is not imminent.