Computer-adaptive test - AbsoluteAstronomy.com

Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing.

How CAT works

CAT successively selects questions so as to maximize the precision of the exam based on what is known about the examinee from previous questions. From the examinee's perspective, the difficulty of the exam seems to tailor itself to his or her level of ability. For example, if an examinee performs well on an item of intermediate difficulty, he or she will then be presented with a more difficult question; if the examinee performs poorly, he or she will be presented with a simpler question. Compared to the static multiple-choice tests that nearly everyone has experienced, with a fixed set of items administered to all examinees, computer-adaptive tests require fewer test items to arrive at equally accurate scores. (Nothing about the CAT methodology requires the items to be multiple-choice, but just as most exams are multiple-choice, most CAT exams also use this format.)

The basic computer-adaptive testing method is an iterative algorithm with the following steps:

1. The pool of available items is searched for the optimal item, based on the current estimate of the examinee's ability.
2. The chosen item is presented to the examinee, who then answers it correctly or incorrectly.
3. The ability estimate is updated, based upon all prior answers.
4. Steps 1–3 are repeated until a termination criterion is met.
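The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the article: a Rasch (one-parameter) response model, a grid-based Bayesian estimate with a standard normal prior, and hypothetical names such as `run_cat`, `se_target` and `max_items`.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability of answering an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_info(theta, b):
    """Rasch item information at theta: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

GRID = [-4.0 + 0.1 * k for k in range(81)]  # candidate ability values

def eap(responses):
    """Posterior mean and SD of ability, assuming a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for b, correct in responses:
            p = p_correct(t, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(pool, answer, se_target=0.4, max_items=30):
    """pool: item difficulties; answer(b) -> bool stands in for the examinee."""
    remaining = list(pool)
    responses = []
    theta, se = 0.0, 1.0  # nothing is known yet, so start at average ability
    while remaining and len(responses) < max_items and se > se_target:
        # Step 1: pick the most informative remaining item for the current estimate
        item = max(remaining, key=lambda b: item_info(theta, b))
        remaining.remove(item)
        # Step 2: administer it and record a correct/incorrect response
        responses.append((item, answer(item)))
        # Step 3: update the ability estimate from all responses so far
        theta, se = eap(responses)
    # Step 4 is the loop condition: stop on precision, length, or an empty pool
    return theta, se, len(responses)
```

Calling `run_cat([-2, -1, 0, 1, 2], lambda b: True)` simulates an examinee who answers every item correctly; the estimate then moves above the starting value of 0.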

Nothing is known about the examinee prior to the administration of the first item, so the algorithm is generally started by selecting an item of medium, or medium-easy, difficulty as the first item.

As a result of adaptive administration, different examinees receive quite different tests. The psychometric technology that allows equitable scores to be computed across different sets of items is item response theory (IRT). IRT is also the preferred methodology for selecting optimal items, which are typically selected on the basis of information rather than difficulty per se.

In the USA, the GRE General Test and the Graduate Management Admission Test are currently primarily administered as computer-adaptive tests. A list of active CAT programs is found at CAT Central, along with a list of current CAT research programs and a near-inclusive bibliography of all published CAT research.

A related methodology called multistage testing (MST), or computer-adaptive sequential testing (CAST), is used in the Uniform Certified Public Accountant Examination. MST avoids or reduces some of the disadvantages of CAT as described below. See the 2006 special issue of Applied Measurement in Education for more information on MST.

Advantages

Adaptive tests can provide uniformly precise scores for most test-takers. In contrast, standard fixed tests almost always provide the best precision for test-takers of medium ability and increasingly poorer precision for test-takers with more extreme test scores.

An adaptive test can typically be shortened by 50% and still maintain a higher level of precision than a fixed version. This translates into time savings for the test-taker, who does not waste time attempting items that are too hard or trivially easy. The testing organization also benefits, because the cost of examinee seat time is substantially reduced. However, because the development of a CAT involves much more expense than a standard fixed-form test, a large population is necessary for a CAT testing program to be financially viable.

Like any computer-based test, adaptive tests may show results immediately after testing.

Adaptive testing, depending on the item selection algorithm, may reduce exposure of some items because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others (namely the medium or medium/easy items presented to most examinees at the beginning of the test).

Disadvantages

The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items (e.g., to pick the optimal item), all the items of the test must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam, with the responses recorded but not contributing to the test-takers' scores. This practice is called "pilot testing," "pre-testing," or "seeding," and it presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items; all items must be pretested with a large enough sample to obtain stable item statistics. This sample may be required to be as large as 1,000 examinees. Each program must decide what percentage of the test can reasonably be composed of unscored pilot test items.

Although adaptive tests have exposure control algorithms to prevent overuse of a few items, the exposure conditioned upon ability is often not controlled and can easily become close to 1. That is, some items often appear on nearly every test given to examinees of the same ability. This is a serious security concern because groups sharing items may well have a similar functional ability level. In fact, a completely randomized exam is the most secure (but also the least efficient).

Review of past items is generally disallowed. Adaptive tests tend to administer easier items after a person answers incorrectly. Supposedly, an astute test-taker could use such clues to detect incorrect answers and correct them. Or, test-takers could be coached to deliberately pick wrong answers, leading to an increasingly easier test. After tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctly, possibly achieving a very high score. Test-takers frequently complain about the inability to review (see http://edres.org/scripts/cat/catdemo.htm).

Because of its sophistication, the development of a CAT has a number of prerequisites (see http://www.fasttestweb.com/ftw-docs/CAT_Requirements.pdf). The large sample sizes (typically hundreds of examinees) required by IRT calibrations must be present. Items must be scorable in real time if a new item is to be selected instantaneously. Psychometricians experienced with IRT calibrations and CAT simulation research are necessary to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.

CAT components

There are five technical components in building a CAT (the following is adapted from Weiss & Kingsbury, 1984). This list does not include practical issues, such as item pretesting or live field release.

Calibrated item pool
Starting point or entry level
Item selection algorithm
Scoring procedure
Termination criterion

Calibrated Item Pool

A pool of items must be available for the CAT to choose from. The pool must be calibrated with a psychometric model, which is used as a basis for the remaining four components. Typically, item response theory is employed as the psychometric model. One reason item response theory is popular is that it places persons and items on the same metric (denoted by the Greek letter theta), which is helpful for issues in item selection (see below).

Starting Point

In CAT, items are selected based on the examinee's performance up to a given point in the test. However, the CAT is obviously not able to make any specific estimate of examinee ability when no items have been administered. So some other initial estimate of examinee ability is necessary. If some previous information regarding the examinee is known, it can be used, but often the CAT just assumes that the examinee is of average ability - hence the first item often being of medium difficulty.

Item Selection Algorithm

As mentioned previously, item response theory places examinees and items on the same metric. Therefore, if the CAT has an estimate of examinee ability, it is able to select an item that is most appropriate for that estimate. Technically, this is done by selecting the item with the greatest information at that point. Information is a function of the discrimination parameter of the item, as well as the conditional variance and pseudoguessing parameter (if used).
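To make the dependence on these parameters concrete, the following sketch computes item information under a three-parameter logistic (3PL) model and picks the most informative item. The 3PL form, the logistic metric without a scaling constant, and the names `info_3pl` and `select_item` are assumptions made for illustration; the article does not specify a particular model.

```python
import math

def prob_3pl(theta, a, b, c):
    """3PL response probability: a = discrimination, b = difficulty,
    c = pseudoguessing (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c=0.0):
    """Item information at theta under the 3PL model.
    With c = 0 this reduces to a**2 * P * (1 - P), the discrimination
    squared times the conditional variance of the response."""
    p = prob_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta, items):
    """items: dict mapping an item id to its (a, b, c) parameters.
    Returns the id of the item with the greatest information at theta."""
    return max(items, key=lambda i: info_3pl(theta, *items[i]))
```

A higher discrimination raises the information, while a nonzero pseudoguessing parameter lowers it, which is why the most informative item is not always the one whose difficulty is closest to the ability estimate.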

Scoring Procedure

After an item is administered, the CAT updates its estimate of the examinee's ability level. If the examinee answered the item correctly, the CAT will likely estimate their ability to be somewhat higher, and vice versa. This is done by using the item response function from item response theory to obtain a likelihood function of the examinee's ability. Two methods for this are called maximum likelihood estimation and Bayesian estimation. The latter assumes an a priori distribution of examinee ability, and has two commonly used estimators: expectation a posteriori and maximum a posteriori. Maximum likelihood is equivalent to a Bayes maximum a posteriori estimate if a uniform (f(x)=1) prior is assumed. Maximum likelihood is asymptotically unbiased, but cannot provide a theta estimate for a nonmixed (all correct or all incorrect) response vector, in which case a Bayesian method may have to be used temporarily.
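The difficulty with nonmixed response vectors can be seen in a small sketch. It assumes a two-parameter logistic model, a normal prior for the Bayesian estimate, and a simple grid search in place of a proper optimizer; `mle` and `map_estimate` are hypothetical names used only for this illustration.

```python
import math

def log_lik(theta, responses):
    """Log-likelihood of ability theta; responses is a list of
    (a, b, correct) tuples under an assumed 2PL model."""
    total = 0.0
    for a, b, correct in responses:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += math.log(p if correct else 1.0 - p)
    return total

def grid_argmax(f, lo=-6.0, hi=6.0, steps=241):
    """Return the grid point in [lo, hi] at which f is largest."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=f)

def mle(responses):
    """Maximum likelihood estimate, restricted to the search grid."""
    return grid_argmax(lambda t: log_lik(t, responses))

def map_estimate(responses, prior_sd=1.0):
    """Maximum a posteriori estimate with a normal(0, prior_sd) prior."""
    return grid_argmax(lambda t: log_lik(t, responses) - 0.5 * (t / prior_sd) ** 2)
```

For an all-correct vector the likelihood keeps rising with theta, so `mle` simply returns the upper edge of the grid, which signals that no finite maximum exists. The prior pulls `map_estimate` back to an interior value, which is why a Bayesian estimator can serve until the response vector becomes mixed.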

Termination Criterion

The CAT algorithm is designed to repeatedly administer items and update the estimate of examinee ability. This will continue until the item pool is exhausted unless a termination criterion is incorporated into the CAT. Often, the test is terminated when the examinee's standard error of measurement falls below a certain user-specified value, hence the statement above that an advantage is that examinee scores will be uniformly precise or "equiprecise." Other termination criteria exist for different purposes of the test, such as if the test is designed only to determine if the examinee should "Pass" or "Fail" the test, rather than obtaining a precise estimate of their ability.
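As a rough worked illustration of the standard-error stopping rule, assume the usual large-sample approximation in which the standard error is the inverse square root of the accumulated item information. The symbols below are notation introduced here, not taken from the article.

```latex
\mathrm{SE}(\hat\theta) \approx \frac{1}{\sqrt{\sum_{i=1}^{n} I_i(\hat\theta)}},
\qquad \text{stop when } \mathrm{SE}(\hat\theta) \le \varepsilon .
% Example: under a Rasch model each item contributes at most I_i = 0.25,
% so reaching \varepsilon = 0.3 requires \sum_i I_i \ge 1/0.3^2 \approx 11.1,
% i.e. at least 45 well-targeted items.
```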

Pass-Fail CAT

In many situations, the purpose of the test is to classify examinees into two or more mutually exclusive and exhaustive categories. This includes the common "mastery test" where the two classifications are "pass" and "fail," but also includes situations where there are three or more classifications, such as "Insufficient," "Basic," and "Advanced" levels of knowledge or competency. (The kind of "item-level adaptive" CAT described in this article is most appropriate for tests that are not "pass/fail" or for pass/fail tests where providing good feedback is extremely important.) Some modifications are necessary for a pass/fail CAT, also known as a computerized classification test (CCT). For examinees with true scores very close to the passing score, computerized classification tests will result in long tests, while those with true scores far above or below the passing score will have the shortest exams.

For example, a new termination criterion and scoring algorithm must be applied that classifies the examinee into a category rather than providing a point estimate of ability. There are two primary methodologies available for this. The more prominent of the two is the sequential probability ratio test (SPRT). This formulates the examinee classification problem as a hypothesis test that the examinee's ability is equal to either some specified point above the cutscore or another specified point below the cutscore. Note that this is a point hypothesis formulation rather than a composite hypothesis formulation that is more conceptually appropriate. A composite hypothesis formulation would be that the examinee's ability is in the region above the cutscore or the region below the cutscore.
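A minimal sketch of the point-hypothesis SPRT described above follows. It assumes a two-parameter logistic model, Wald's standard decision bounds, and a symmetric half-width `delta` around the cutscore; the function name and default values are illustrative rather than taken from any particular testing program.

```python
import math

def sprt_decision(responses, cutscore, delta=0.3, alpha=0.05, beta=0.05):
    """Classify from responses, a list of (a, b, correct) tuples.
    Tests theta = cutscore - delta against theta = cutscore + delta."""
    theta_lo, theta_hi = cutscore - delta, cutscore + delta
    log_lr = 0.0  # log of L(theta_hi) / L(theta_lo)
    for a, b, correct in responses:
        p_hi = 1.0 / (1.0 + math.exp(-a * (theta_hi - b)))
        p_lo = 1.0 / (1.0 + math.exp(-a * (theta_lo - b)))
        if correct:
            log_lr += math.log(p_hi / p_lo)
        else:
            log_lr += math.log((1.0 - p_hi) / (1.0 - p_lo))
    upper = math.log((1.0 - beta) / alpha)  # Wald's upper bound
    lower = math.log(beta / (1.0 - alpha))  # Wald's lower bound
    if log_lr >= upper:
        return "pass"
    if log_lr <= lower:
        return "fail"
    return "continue"
```

Each correct answer nudges the log ratio toward the upper bound and each incorrect answer toward the lower one, so examinees near the cutscore tend to stay in the "continue" zone longest.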

A confidence interval approach is also used, where after each item is administered, the algorithm determines the probability that the examinee's true score is above or below the passing score. For example, the algorithm may continue until the 95% confidence interval for the true score no longer contains the passing score. At that point, no further items are needed because the pass-fail decision is already 95% accurate, assuming that the psychometric models underlying the adaptive testing fit the examinee and test. This approach was originally called "adaptive mastery testing" but it can be applied to non-adaptive item selection and classification situations of two or more cutscores (the typical mastery test has a single cutscore).

As a practical matter, the algorithm is generally programmed to have a minimum and a maximum test length (or a minimum and maximum administration time). Otherwise, it would be possible for an examinee with ability very close to the cutscore to be administered every item in the bank without the algorithm making a decision.
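The confidence-interval rule combined with length limits might look like the sketch below. The normal-theory interval (z = 1.96 for 95%), the default limits, and the choice to fall back on the point estimate once the maximum length is reached are all assumptions made for illustration.

```python
def ci_decision(theta_hat, se, cutscore, n_items,
                min_items=10, max_items=50, z=1.96):
    """Return "pass", "fail" or "continue" after n_items have been given.
    theta_hat and se are the current ability estimate and its standard error."""
    if n_items < min_items:
        return "continue"  # too early to classify, whatever the interval says
    if theta_hat - z * se > cutscore:
        return "pass"      # the whole interval lies above the cutscore
    if theta_hat + z * se < cutscore:
        return "fail"      # the whole interval lies below the cutscore
    if n_items >= max_items:
        # Assumed fallback: force a decision from the point estimate
        return "pass" if theta_hat >= cutscore else "fail"
    return "continue"
```

For example, `ci_decision(0.1, 0.3, 0.0, 20)` returns "continue" because the interval still straddles the cutscore, whereas the same estimate at 50 items is forced to a decision.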

The item selection algorithm utilized depends on the termination criterion. Maximizing information at the cutscore is more appropriate for the SPRT because it maximizes the difference in the probabilities used in the likelihood ratio. Maximizing information at the ability estimate is more appropriate for the confidence interval approach because it minimizes the conditional standard error of measurement, which decreases the width of the confidence interval needed to make a classification.

Practical Constraints of Adaptivity

ETS researcher Martha Stocking has quipped that most adaptive tests are actually barely adaptive tests (BATs) because, in practice, many constraints are imposed upon item choice. For example, CAT exams must usually meet content specifications; a verbal exam may need to be composed of equal numbers of analogies, fill-in-the-blank and synonym item types. CATs typically have some form of item exposure constraints, to prevent the most informative items from being over-exposed. Also, on some tests, an attempt is made to balance surface characteristics of the items such as the gender of the people in the items or the ethnicities implied by their names. Thus CAT exams are frequently constrained in which items they may choose, and for some exams the constraints may be substantial and require complex search strategies (e.g., linear programming) to find suitable items.

A simple method for controlling item exposure is the "randomesque" or strata method. Rather than selecting the most informative item at each point in the test, the algorithm randomly selects the next item from the next five or ten most informative items. This can be used throughout the test, or only at the beginning. Another method is the Sympson-Hetter method, in which a random number is drawn from U(0,1), and compared to a k_i parameter determined for each item by the test user. If the random number is greater than k_i, the next most informative item is considered.
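Both exposure-control methods can be sketched briefly. The sketch assumes the candidate items have already been ranked from most to least informative; the function names, the `exposure_k` mapping, and the fallback used when every item is rejected are illustrative assumptions.

```python
import random

def randomesque(ranked_items, rng, k=5):
    """Pick uniformly at random among the k most informative items."""
    return rng.choice(ranked_items[:k])

def sympson_hetter(ranked_items, exposure_k, rng):
    """Walk down the ranking; administer an item only if a U(0,1) draw
    does not exceed that item's exposure parameter k_i."""
    for item in ranked_items:
        if rng.random() <= exposure_k[item]:
            return item
    # Assumed fallback: if every item is rejected, use the last one considered
    return ranked_items[-1]

rng = random.Random(0)  # seeded only to make the example reproducible
ranked = ["A", "B", "C", "D", "E", "F"]
first = randomesque(ranked, rng, k=3)
second = sympson_hetter(ranked, {i: 1.0 for i in ranked}, rng)
```

With every exposure parameter set to 1.0, the Sympson-Hetter filter never rejects an item and so reduces to plain maximum-information selection; lowering k_i for a popular item makes it more likely to be skipped.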

Wim van der Linden and colleagues have advanced an alternative approach called shadow testing which involves creating entire shadow tests as part of selecting items. Selecting items from shadow tests helps adaptive tests meet selection criteria by focusing on globally optimal choices (as opposed to choices that are optimal for a given item).

External links

http://www.iacat.org International Association for Computerized Adaptive Testing
Concerto: Open-source CAT Platform
CAT Central by David J. Weiss
Frequently Asked Questions about Computer-Adaptive Testing (CAT). Retrieved April 15, 2005.
An On-line, Interactive, Computer Adaptive Testing Tutorial by Lawrence L. Rudner. November 1998. Retrieved April 15, 2005.
Special issue: An introduction to multistage testing. Applied Measurement in Education, 19(3).
Computerized Adaptive Tests - from the Education Resources Information Center (ERIC) Clearinghouse on Tests Measurement and Evaluation, Washington, DC

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
