Written (group) versus oral (individual) tests
The oral test is administered to one person at a time, but written tests can be given simultaneously to a number of subjects. Oral tests of achievement, being uneconomical and prone to content and scorer unreliability, have been supplanted by written tests; notable exceptions include the testing of illiterates and the anachronistic oral examinations to which candidates for graduate degrees are liable.
Proponents of individually administered intelligence tests (e.g., the Stanford-Binet) state that such face-to-face testing optimizes rapport and motivation, even among literate adult subjects. Oral tests of general aptitude remain popular, though numerous written group tests have been designed for the same purpose.
The interview may serve as a measure of personality and, especially when the wording and order of its questions are standardized and a key is provided for coding answers, may amount to an individual oral test. Interviews used in public opinion surveys are carefully designed to avoid the effects of interviewer bias and to be comprehensible to a highly heterogeneous sample of respondents.
Appraisal by others versus self-appraisal
In responding to personality inventories and rating scales, a person presumably reveals what he thinks he is like; that is, he appraises himself. Other instruments may reflect what one person thinks of another. Because self-appraisal often lacks objectivity, appraisal by another individual is common in such things as ratings for promotions. Ordinary tests of ability clearly involve evaluation of one person by another, although the subject’s self-evaluation may intrude; for example, he may lack confidence to the point where he does not try to do his best.
Projective tests
The stimuli (e.g., inkblots) in a projective test are intentionally made ambiguous and open to different interpretations in the expectation that each subject will project his own unique (idiosyncratic) reactions in his answers. Techniques for evaluating such responses range from the intuitive impressions of the rater to complex, coded schemes for scoring and interpretation that require extensive manuals; some projective tests are objectively scorable.
Speed tests versus power tests
A pure speed test is homogeneous in content (e.g., a simple clerical checking test), the tasks being so easy that with unlimited time all but the most incompetent of subjects could deal with them successfully. The time allowed for testing is so short, however, that even the ablest subject is not expected to finish. A useful score is the number of correct answers made in a fixed time. In contrast, a power test (e.g., a general vocabulary test) contains items that vary in difficulty to the point that no subject is expected to get all items right even with unlimited time. In practice, a definite but ample time is set for power tests.
Speed tests are suitable for testing visual perception, numerical facility, and other abilities related to vocational success. Tests of psychomotor abilities (e.g., eye–hand coordination) often involve speed. Power tests tend to be more relevant to such purposes as the evaluation of academic achievement, for which the highest level of difficulty at which a person can succeed is of greater interest than his speed on easy tasks.
In general, tests reflect unknown combinations of the effects of speed and power; many consist of items that vary considerably in difficulty, and the time limit is too short for a large proportion of subjects to attempt every item.
Teacher-made versus standardized tests
A distinction between teacher-made tests and standardized tests is often drawn for tests used to assess academic achievement. Ordinarily, teachers do not attempt to construct tests of general or special aptitude or of personality traits; teacher-made tests tend instead to be geared to narrow segments of curricular content (e.g., a sixth-grade geography test). Standardized tests, with carefully defined procedures for administration and scoring that ensure uniformity, can serve broader goals. General principles of test construction and such considerations as reliability and validity apply to both types of test.
Special measurement techniques
Sociodrama and psychodrama were originally developed as psychotherapeutic techniques. In sociodrama, group members participate in unrehearsed drama to illuminate a general problem. Psychodrama centres on one individual in the group whose unique personal problem provides the theme. Related research techniques (e.g., the sociometric test) can offer insight into interpersonal relationships. Individuals may be asked to specify members of a group whom they prefer as leader, playmate, or coworker. The choices made can then be charted in a sociogram, from which cliques or socially isolated individuals may be identified at a glance.
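As a rough illustration of how sociometric choices can be tallied, the Python sketch below builds a choice count from a hypothetical set of peer nominations, flags members whom no one chooses (the isolates), and lists reciprocated choices of the kind that often mark a clique. The names and nominations are invented, not drawn from any study.

```python
# Illustrative sketch: tallying sociometric choices and spotting isolates.
# The group members and their nominations below are hypothetical.
from collections import defaultdict

# Each person names the coworkers they would prefer (sociometric choices).
choices = {
    "Ann":  ["Ben", "Cara"],
    "Ben":  ["Ann", "Cara"],
    "Cara": ["Ann", "Ben"],
    "Dev":  ["Cara", "Ben"],
    "Eli":  ["Cara", "Ann"],
}

# Count how often each member is chosen (in-degree in the sociogram).
times_chosen = defaultdict(int)
for chooser, chosen in choices.items():
    for person in chosen:
        times_chosen[person] += 1

# Members never chosen by anyone are the social isolates.
isolates = [p for p in choices if times_chosen[p] == 0]

# Mutual (reciprocated) choices often mark the core of a clique.
mutual_pairs = {
    tuple(sorted((a, b)))
    for a, named in choices.items()
    for b in named
    if a in choices.get(b, [])
}

print("times chosen:", dict(times_chosen))
print("isolates:", isolates)
print("mutual pairs:", sorted(mutual_pairs))
```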
Research psychologists have seized upon the sociometric approach as a means of measuring group cohesiveness and studying individual reactions to groups. The degree to which any group member chooses or is chosen beyond chance expectation may be calculated, and mathematical techniques may be used to trace the complex links among group members. Sociogram-choice scores have been useful in predicting such criteria as individual productivity in factory work and combat effectiveness.
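The "beyond chance expectation" calculation can be sketched as follows. If each of the other n − 1 group members names d peers purely at random, the number of times a given member is chosen follows a binomial distribution, so the probability of being chosen at least as often as observed can be computed directly. The baseline of purely random choosing, the group size, and the figures below are all assumptions made for illustration.

```python
# Sketch of a "chosen beyond chance" check; the figures are hypothetical.
# Assume a group of n members in which everyone names d peers at random.
from math import comb

def prob_chosen_at_least(k: int, n: int, d: int) -> float:
    """Probability of receiving at least k choices when each of the
    other n-1 members picks d of their n-1 peers purely at random."""
    p = d / (n - 1)                    # chance any one peer names you
    trials = n - 1                     # number of peers who could name you
    return sum(
        comb(trials, j) * p**j * (1 - p)**(trials - j)
        for j in range(k, trials + 1)
    )

# A member chosen 7 times in a 20-person group where each member names
# 2 peers is chosen far more often than chance alone would suggest.
print(round(prob_chosen_at_least(7, n=20, d=2), 4))
```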
Development of standardized tests
Test content
Item development
Once the need for a test has been established, a plan to define its content may be prepared. For achievement tests, the test plan may also indicate thinking skills to be evaluated. Detailed content headings can be immediately suggestive of test items. It is helpful if the plan specifies weights to be allotted to different topics, as well as the desired average score and the spread of item difficulties. Whether or not such an outline is made, the test constructor clearly must understand the purpose of the test, the universe of content to be sampled, and the forms of the items to be used.
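Such a plan can be thought of as a simple table of specifications. The brief sketch below, with invented topics, weights, and test length, merely shows how topic weights might be translated into item counts for a test of fixed length.

```python
# A minimal sketch of a test plan ("table of specifications"); the topics,
# weights, and test length are invented for illustration.
blueprint = {
    # topic: relative weight in the final test
    "map reading":         0.30,
    "climate and terrain": 0.25,
    "world regions":       0.30,
    "economic geography":  0.15,
}
total_items = 40

# Translate each topic weight into a number of items to be written.
items_per_topic = {topic: round(w * total_items) for topic, w in blueprint.items()}
print(items_per_topic)
```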
Tryouts and item analysis
A set of test questions is first administered to a small group of people deemed representative of the population for which the final test is intended. This trial run provides a check on the instructions for administering and taking the test and on the intended time allowances, and it can also reveal ambiguities in the test content. After adjustments, the surviving items are administered to a larger, ostensibly representative group. The resulting data permit computation of a difficulty index for each item (often taken as the percentage of subjects who answer it correctly) and of an item-test or item-subtest discrimination index (e.g., a coefficient of correlation expressing the relationship of each item to the total test score or a subtest score).
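As a hedged illustration of these two indices, the sketch below computes, for an invented matrix of right/wrong responses, each item's difficulty (the proportion of subjects answering it correctly) and an item-total discrimination index (the correlation of item scores with total scores).

```python
# Sketch of an item analysis: difficulty as the proportion answering
# correctly, discrimination as each item's correlation with the total score.
# The 0/1 response matrix (rows = subjects, columns = items) is invented.
from statistics import correlation  # available in Python 3.10+

responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
n_subjects, n_items = len(responses), len(responses[0])
totals = [sum(row) for row in responses]

for item in range(n_items):
    scores = [row[item] for row in responses]
    difficulty = sum(scores) / n_subjects           # proportion correct
    discrimination = correlation(scores, totals)    # item-total correlation
    print(f"item {item + 1}: difficulty={difficulty:.2f}, "
          f"discrimination={discrimination:.2f}")
```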
If it is feasible to do so, measures of the relation of each item to independent criteria (e.g., grades earned in school) are obtained to provide item validation. Items that are too easy or too difficult are discarded; those within a desired range of difficulty are identified. If internal consistency is sought, items that are found to be unrelated to either a total score or an appropriate subtest score are ruled out, and items that are related to available external criterion measures are identified. Those items that show the most efficiency in predicting an external criterion (highest validity) usually are preferred over those that contribute only to internal consistency (reliability).
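One way to picture such an item-selection pass is sketched below; the acceptable difficulty band, the validity floor, and the item statistics are all assumed purely for illustration.

```python
# Sketch of an item-selection pass under assumed cut-offs: keep items whose
# difficulty falls in a middling band and whose correlation with an external
# criterion (say, school grades) is high enough. All statistics are invented.
item_stats = [
    # (item id, difficulty, item-criterion correlation)
    ("q1", 0.95, 0.10),   # too easy           -> discard
    ("q2", 0.55, 0.35),   # acceptable         -> keep
    ("q3", 0.40, 0.05),   # poor validity      -> discard
    ("q4", 0.60, 0.28),   # acceptable         -> keep
    ("q5", 0.10, 0.22),   # too difficult      -> discard
]

DIFFICULTY_RANGE = (0.25, 0.85)   # assumed acceptable band
MIN_CRITERION_R = 0.20            # assumed validity floor

kept = [
    item for item, p, r in item_stats
    if DIFFICULTY_RANGE[0] <= p <= DIFFICULTY_RANGE[1] and r >= MIN_CRITERION_R
]
print(kept)   # ['q2', 'q4']
```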
Estimates of reliability for the entire set of items, as well as for those to be retained, commonly are calculated. If the reliability estimate is deemed to be too low, items may be added. Each alternative in multiple-choice items also may be examined statistically. Weak incorrect alternatives can be replaced, and those that are unduly attractive to higher scoring subjects may be modified.
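One common internal-consistency estimate for dichotomously scored items is the Kuder-Richardson formula 20 (KR-20), offered here only as an example of such a calculation; the sketch below applies it to an invented response matrix.

```python
# A sketch of one internal-consistency estimate (KR-20) for a set of
# dichotomously scored items; the response matrix is invented.
responses = [          # rows = subjects, columns = items (1 = right, 0 = wrong)
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

def kr20(matrix):
    n_subjects = len(matrix)
    k = len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean_total = sum(totals) / n_subjects
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_subjects
    # Sum of item variances, each p * (1 - p) for a dichotomous item.
    pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in matrix) / n_subjects
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

print(f"KR-20 estimate: {kr20(responses):.2f}")
```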
Cross validation
Item-selection procedures are subject to chance errors in the sampling of test subjects, so statistical values obtained in pretesting are usually checked (cross validated) with one or more additional samples of subjects. Typically, the cross-validation values shrink for many of the items that emerged as best in the original data, and further items may prove to warrant discarding. Test users also often seek measures of correlation between total test score and scores on other, better-known tests.
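A cross-validation check might look like the following sketch, in which item-criterion correlations estimated in the tryout sample are compared with values recomputed in a fresh sample; all of the correlations and the cut-off are invented.

```python
# Sketch of a cross-validation check: item statistics from the tryout sample
# are recomputed in a fresh sample, and items whose values shrink below the
# cut-off are dropped. All the correlations shown are invented.
tryout_r  = {"q1": 0.41, "q2": 0.38, "q3": 0.35, "q4": 0.33}
holdout_r = {"q1": 0.36, "q2": 0.12, "q3": 0.31, "q4": 0.08}

MIN_R = 0.20   # assumed minimum acceptable item-criterion correlation

for item in tryout_r:
    shrinkage = tryout_r[item] - holdout_r[item]
    verdict = "keep" if holdout_r[item] >= MIN_R else "discard"
    print(f"{item}: tryout={tryout_r[item]:.2f} holdout={holdout_r[item]:.2f} "
          f"shrinkage={shrinkage:.2f} -> {verdict}")
```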
Differential weighting
Some test items may appear to deserve extra positive weight, and some answers in multiple-choice items, though keyed as wrong, seem better than others in that they attract people who earn high scores generally. The bulk of theoretical logic and empirical evidence nonetheless suggests that unit weights for selected items, zero weights for discarded items, and dichotomous (right-versus-wrong) scoring of multiple-choice items serve almost as effectively as more complicated scoring schemes. Painstaking efforts to weight items generally are not worth the trouble.
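The point can be pictured by comparing the two kinds of total score directly, as in the sketch below; the responses and the differential weights are invented, and in examples of this kind the two sets of totals tend to correlate very highly.

```python
# Illustrative sketch: total scores from simple unit weights usually track
# scores from more elaborate item weights closely. The responses and the
# differential weights below are invented.
from statistics import correlation

responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
fancy_weights = [1.3, 0.8, 1.1, 0.6, 1.2]   # hypothetical differential weights

unit_scores = [sum(row) for row in responses]
weighted_scores = [sum(w * x for w, x in zip(fancy_weights, row))
                   for row in responses]

print(f"r between unit-weighted and differentially weighted totals: "
      f"{correlation(unit_scores, weighted_scores):.2f}")
```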
Negative weight for wrong answers is usually avoided as presenting undue complication. In multiple-choice items, the number of answers a subject knows, in contrast to the number he gets right (which will include some lucky guesses), can be estimated by formula. But such an average correction overpenalizes the unlucky and underpenalizes the lucky. If the instruction is not to guess, it is variously interpreted by persons of different temperament; those who decide to guess despite the ban are often helped by partial knowledge and tend to do better.
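The usual correction is of the form R − W/(k − 1), where R is the number of right answers, W the number of wrong answers, and k the number of options per item; the small sketch below applies it to invented figures.

```python
# Sketch of the usual correction-for-guessing formula: with k options per
# item, estimated "known" answers = rights - wrongs / (k - 1). The figures
# in the example are invented.
def corrected_score(rights: int, wrongs: int, options_per_item: int) -> float:
    """Estimate the number of items actually known, assuming every wrong
    answer came from blind guessing among all the options."""
    return rights - wrongs / (options_per_item - 1)

# 40 right and 12 wrong on a five-option multiple-choice test:
print(corrected_score(40, 12, options_per_item=5))   # 37.0
```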
A responsible tactic is to try to reduce these differences by directing subjects to respond to every question, even if they must guess. Such instructions, however, are inappropriate for some competitive speed tests, since candidates who mark items very rapidly and with no attention to accuracy excel if speed is the only basis for scoring; that is, if wrong answers are not penalized.