Second Language Acquisition - Theory and Pedagogy: Proceedings of the 6th Annual JALT Pan-SIG Conference.
May. 12 - 13, 2007. Sendai, Japan: Tohoku Bunka Gakuen University. (pp. 65 - 74)

Criterion-referenced test administration designs and analyses
by Takaaki Kumazawa (Kanto Gakuin University)


This paper outlines the differences between norm-referenced and criterion-referenced tests and introduces one possible criterion-referenced test administration design. Two forms of a 25-item multiple-choice criterion-referenced vocabulary test were developed and administered to two groups of Japanese university EFL students (n=87) for diagnostic and achievement purposes in a counterbalanced pretest/posttest design. The dependability indexes for these tests were low to moderate, and an item analysis of the criterion-referenced tests suggests there was a slight score gain after 13 weeks of instruction. This suggests that most of the students mastered a modest amount of the target vocabulary.


Keywords: criterion-referenced tests, test analyses, intervention construct validity study

Norm-referenced and criterion-referenced testing within EFL curriculum

Glaser (1963) is credited with distinguishing between norm-referenced tests (NRTs) and criterion-referenced tests (CRTs) in order to draw attention to the need for a different family of tests for use in classroom settings. Popham and Husek (1969) elucidated the differences: NRTs are tests "used to ascertain an individual's performance against the performances of other individuals using the same measuring device" (p. 2). This definition suggests that NRTs can serve as psychological measurements that interpret an examinee's score relative to those of other examinees in the distribution, using statistical procedures such as z-scores or T-scores. In other words, in NRTs, an examinee's score is interpreted on the basis of where it is located within the distribution of other examinees' scores. In this sense, it is a "norm" or group-referenced test, since the interpretation of the test score is based on where each score lies within a normal distribution.
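As a rough illustration of this norm-referenced interpretation, the sketch below converts raw scores into z-scores and T-scores (T = 50 + 10z). The raw scores and the function name are hypothetical and are not taken from this study.

```python
from statistics import mean, pstdev

def standardize(raw_scores):
    """Return (z, T) pairs locating each raw score within its own group."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD of the group that took the test
    pairs = []
    for x in raw_scores:
        z = (x - m) / sd                 # distance from the group mean in SD units
        pairs.append((z, 50 + 10 * z))   # T-score rescales z to mean 50, SD 10
    return pairs

# Hypothetical raw scores for five examinees on a 25-item test
scores = [8, 10, 12, 14, 16]
standardized = standardize(scores)
```

The same raw score would receive a different z-score in a stronger or weaker group, which is what makes the interpretation norm-referenced.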

[ p. 65 ]

Popham and Husek (1969) defined CRTs as a psychological measurement device "used to ascertain an individual's status with respect to some criterion, i.e., performance standard. It is because the individual is compared with some established criterion, rather than other individuals, that these measures are described as criterion-referenced" (p. 2). In CRTs, decisions are made on the basis of examinees' test scores with reference to a certain criterion. The term "criterion" has two connotations. First, it refers to the domain or construct that is being measured by a test. Second, it implies a set cut-off point. Therefore, decisions are made based on the extent to which students master a domain, whether or not they exceed the set cut-off point, or a combination of both.
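The two senses of "criterion" can be sketched as follows: the proportion of the domain answered correctly, and a pass/fail decision at a cut-off point. The examinee, the 70% cut-off, and the function name are hypothetical, chosen only for illustration.

```python
def mastery_decision(raw_score, n_items, cut_proportion):
    """Classify one examinee against a fixed criterion, ignoring other examinees."""
    proportion = raw_score / n_items  # share of the tested domain answered correctly
    return proportion, proportion >= cut_proportion

# Hypothetical examinee: 18 of 25 items correct, hypothetical cut-off of 70%
proportion, passed = mastery_decision(18, 25, 0.70)
```

Unlike a norm-referenced score, this decision does not change when the performance of other examinees changes.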
In order to define the role(s) of testing in a given curriculum, it is helpful to conceptualize curricula in terms of a model. Brown (1995, p. 20) proposed a systematic curriculum development model (see Figure 1). Within that model, testing is the third phase and falls between the (2) goal/objective setting and (4) materials development phases.

Figure 1. A systematic curriculum development model proposed by Brown (1995, p. 20)

Generally, within a language program, tests are used for making four types of decisions related to: (a) proficiency, (b) placement, (c) diagnosis, and (d) achievement (Bachman, 1990; Brown, 1995). While NRTs are used for making proficiency and placement decisions, CRTs are used for making diagnostic and achievement decisions.
Administrators can conduct needs analyses and use the information obtained to formulate curriculum policies on the types of students they accept into their program and to determine the levels of proficiency at which students are placed. Proficiency and placement tests help decide which students should be accepted into a school and placed at a certain level. Teachers can also conduct a needs analysis and use the information obtained to design sound instructional objectives. Diagnostic and achievement tests should be used to evaluate how effectively the teaching and classroom materials address the objectives.

[ p. 66 ]

CRT administration designs

Most CRTs include both diagnostic and achievement tests that are administered before and after instruction. Within Brown's model (1995), CRTs should be developed in the testing phase and administered as diagnostic and achievement tests in the teaching phase to facilitate instruction. Diagnostic tests and achievement tests generally occur in a pretest/posttest format. There are essentially four possible administration designs: (a) posttest only, (b) pretest/posttest with one form, (c) pretest/posttest with two forms, and (d) pretest/posttest with two counterbalanced forms (adapted from Popham, 2003).

It is all too common to administer only a posttest as an achievement test in order to calculate students' final grades in the sequence described in Figure 2. This design has two limitations: (a) no evidence of student score gain or reputed learning is measured and (b) no teach-to-test instruction occurs. In other words, we do not really know whether students learned anything from a course or whether the information that was taught matched the test content. If no diagnostic test is administered, teachers have no information on what students can do before instruction. An achievement test should show the extent to which students understand a designated content area by the end of a class. By comparing students' scores on pretests and posttests, teachers should be able to get some picture of what students have learned in class.
An unfortunate but frequent practice is to design an achievement test just before the day of the final examination. In addition, all too often, there is a lack of congruence between what is taught and what is tested. As Figure 1 suggests, CRTs should be developed before actual teaching so that teachers can teach the test content in class as part of the material they cover. Effective teaching-to-test instruction can occur in this way (Popham, 2003). It is difficult enough just to design dependable and valid CRTs. It is even more difficult to make them the night before the final exam.

Figure 2. A flawed educational model with a posttest only design

The pretest/posttest design using one form, as in Figure 3, solves the problems arising from posttest-only designs. By comparing students' test scores before and after instruction, teachers can at least partly determine what students have learned in class. If teachers set a cut-off point, the B-index can be calculated to see how each item is contributing to the pass/fail decisions that are often made with CRTs (Brown, 2003, p. 15). In addition, the most suitable indicator of sound criterion-referenced items, the difference index (DI), can be easily calculated with this design in order to determine the extent to which students may have learned the item contents as a result of instruction. DI is defined as the item facility (IF) on a particular item for the posttest minus the item facility for that same item on the pretest (Brown, 2003, p. 14). In Griffee's (1995) study of 50 Japanese university students, DI was reported to show the extent to which students learned the items over 10 months. However, this design suffers from a drawback known as pretest reactivity (Popham, 2003, p. 152). If students know that a test administered as a diagnostic test is also going to be used as an achievement test, they may only study the parts of the class content that are on the test. In addition, given that there is a limit to what can be tested with one form, teachers cannot test a wide range of class content.
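The DI definition above (posttest item facility minus pretest item facility) can be sketched as follows. The response matrices are hypothetical, each row being one examinee's 0/1 item scores, and the function names are illustrative only.

```python
def item_facility(responses):
    """Proportion of examinees answering each item correctly.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    """
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(n_items)]

def difference_index(pre, post):
    """DI per item = posttest IF minus pretest IF (Brown, 2003, p. 14)."""
    return [b - a for a, b in zip(item_facility(pre), item_facility(post))]

# Hypothetical data: 4 examinees x 3 items
pre = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
post = [[1, 1, 1], [1, 1, 1], [0, 1, 1], [1, 0, 0]]
di = difference_index(pre, post)  # [0.5, 0.5, -0.25]
```

In this made-up example the third item was already known by everyone at the pretest, so it cannot show a gain and its DI is negative.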

Figure 3. A single form pretest/posttest design

[ p. 67 ]

Using a design with different pretest/posttest forms can minimize pretest reactivity effects, and teachers can test a wide range of class content with dual CRT forms. However, this design also entails a pitfall. If the difficulties of the two CRT forms differ, then it becomes difficult to estimate students' achievement simply by subtracting their test scores on the diagnostic test from their test scores on the achievement test.

Figure 4. A pretest (Form A) / posttest (Form B) design

One feasible solution to this problem is to adopt a counterbalanced pretest/posttest design. In other words, the class is divided into two groups, each of which takes a different form as a pretest; in the posttest the forms are switched so that no students are tested on the same material twice. Although this design does not solve all the problems that have been mentioned above, it does minimize them. In addition, intervention construct validity studies can be carried out with this design. Although it is possible to conduct a study with a pretest (Form A)/posttest (Form A) design, this results in undesirable reactivity effects.
It is therefore best to do a study with a counterbalanced pretest/posttest design. If the CRTs measure the desired construct and instruction was effective, students' scores should increase significantly between the pretest and the posttest. The score gain can be used as a basis for one of the validity arguments for the construct validity of a given CRT (Brown & Hudson, 2003, p. 225). Another advantage of this design is that the results obtained from diagnostic tests can be used to revise the CRTs so that more refined CRTs can be administered as achievement tests. Only two studies published in English have adopted this design and reported the DI (Brown, 1993, 2001). Both reported only a slight increase in the DI value over a one-semester period.
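The form assignment in a counterbalanced design can be sketched as a simple schedule. The function name and the group labels passed to it are illustrative; how groups are formed (for example, by intact class) is left to the teacher.

```python
def counterbalanced_schedule(groups, forms=("A", "B")):
    """Give each group one form as a pretest and the other form as a posttest."""
    schedule = {}
    for index, group in enumerate(groups):
        pre = forms[index % 2]          # alternate the starting form across groups
        post = forms[(index + 1) % 2]   # switch to the other form for the posttest
        schedule[group] = {"pretest": pre, "posttest": post}
    return schedule

plan = counterbalanced_schedule(["Group A", "Group B"])
```

Every form is then taken once as a pretest and once as a posttest, while no student sees the same form twice.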

Figure 5. A counterbalanced pretest/posttest design with two forms

[ p. 68 ]

Research questions
The purpose of this study is to conduct an intervention construct validity study using a pretest/posttest design with two counterbalanced forms. Thus, two CRT forms were developed and administered to two groups of students both as pretests and posttests in a proficiency-based curriculum. To this end, the following research questions were formulated:
  1. To what extent were the two CRT forms dependable in both administrations?
  2. To what extent did the students master the vocabulary items on the two forms of the CRTs?



This study involved 87 first-year Japanese university students at a high-ranking private university in the Kanto area (N=87). They took a general English class that focused on reading and listening skills. A placement test was administered to make decisions about a proficiency-based curriculum in the program, streaming the students into two course levels. The students were divided into two groups to carry out a counterbalanced design. Since the test involved a listening component and microphones were not used, it was important that students in the same room took the same form of the test. The lower proficiency group, majoring in tourism, was designated Group A (n=44) and the midrange group, majoring in law, was designated Group B (n=37). One student from the second group dropped the class mid-semester.


Two teachers set the semester objectives by referring to the class goals that had already been set by the administrators. One part of these goals included learning more academic English vocabulary. Before instruction, two CRT forms were developed to assess the students' mastery of some of the vocabulary items that appeared in the assigned textbook.
The two teachers worked together to design lesson plans. Each test form consisted of 25 multiple-choice items. Because the target skills for this class were the two receptive skills, the teachers judged multiple-choice items to be suitable for testing them. Six items were included in both forms to help "anchor" the scores. A typical sample item appears below:

1. A linguist studied how parents talked to their young children.

Exactly the same sentence as in the textbook was given, with the target vocabulary item underlined. Students were instructed to select the option (A, B, C, or D) closest in meaning to the underlined target word.

[ p. 69 ]

Testing procedure

Form A and Form B were administered to Group A and Group B, respectively, as a pretest. The teachers informed the students that their pretest scores would have no effect on their final grades and explained the purpose of the diagnostic test administration. Although test score sheets were returned to students, the question sheets were all collected to avoid information leakage. With respect to the classroom instruction concerning the target vocabulary items, the teachers used the same lesson plans and provided the students with the corresponding Japanese translations and English synonyms. The students were asked to study the vocabulary items included in the lists provided in class because the words would be tested on the day of the final examination. At the end of the semester, the test form that students did not take at the beginning of the class was administered. Their test scores were used to decide 15% of their final grades. Students were given 15 minutes to complete each test.


The responses of the students were dichotomously scored (converted to correct or incorrect responses) and then processed in spreadsheets. Blank responses were treated as incorrect. Descriptive statistics for all the items were then calculated.
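The scoring rule described above can be sketched as follows. The study used spreadsheets, so this Python version is only an illustration, and the answer key and responses are hypothetical.

```python
def score_dichotomously(answers, key):
    """Convert selected options into 1 (correct) or 0 (incorrect).

    A blank response, represented here as None, never matches the key
    and is therefore scored 0.
    """
    return [1 if a == k else 0 for a, k in zip(answers, key)]

key = ["B", "D", "A", "C"]        # hypothetical answer key
answers = ["B", None, "A", "A"]   # hypothetical responses; None marks a blank
item_scores = score_dichotomously(answers, key)
total = sum(item_scores)
```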
A norm-referenced reliability statistic known as the KR-20 (Brown, 2005a) was first employed. Norm-referenced reliability was used to estimate how much error contributed to the examinees' scores. Brown's (1990) short-cut formula for estimating the index of dependability was then used to estimate the consistency of the CRTs.
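For readers unfamiliar with the KR-20, a minimal sketch is given below using the standard formula KR-20 = k/(k-1) x (1 - sum of p(1-p) / total score variance). The response matrix is hypothetical, and the use of the population variance (dividing by n) is an assumption; the paper does not state which variance was used.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    Assumes the total scores are not all identical (non-zero variance).
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores (assumption: divide by n, not n - 1)
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # item facility
        sum_pq += p * (1 - p)                     # item variance
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 4 examinees x 3 items
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(data)  # 0.75 for this made-up matrix
```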
Dependability differs from reliability in that it concerns the consistency of absolute decisions, not relative decisions. The coefficient obtained from Brown's short-cut formula is exactly equivalent to the generalizability coefficient for absolute decisions obtained from a decision study in generalizability theory (see Brown, 2005b for generalizability theory). This point is further described in Brown's (1990) study of criterion-referenced test consistency. In the tradition of item response theory, fit is another term for test consistency. Two criterion-referenced item statistics were considered especially important: the DI and the B-index.


Table 1 displays the descriptive statistics. Group A (n=44), the less proficient group, had a mean of 10.14 on the 25-item test, whereas Group B (n=37), the more proficient group, obtained a slightly higher mean of 13.19, just over half of the test items. The KR-20, a norm-referenced reliability coefficient, was .06 for Form A and .40 for Form B on the pretests. The dependability indexes were estimated from the KR-20 coefficients. These indexes, .05 and .37, were lower than the KR-20 coefficients.
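The step from KR-20 to the phi dependability index can be sketched from summary statistics alone. The sketch assumes the short-cut estimator takes the form phi = A / (A + B), where A = [n x Sp^2 / (n - 1)] x KR-20 and B = [Mp(1 - Mp) - Sp^2] / (k - 1), with Mp and Sp the mean and standard deviation of the proportion scores; readers should verify this against Brown (1990). It also assumes the reported standard deviation was computed with n in the denominator, which the paper does not state.

```python
def phi_shortcut(n_persons, k_items, mean_raw, sd_raw, kr20):
    """Approximate the phi dependability index from summary statistics.

    Assumed form of the short-cut estimator (check against Brown, 1990).
    """
    mp = mean_raw / k_items        # mean of proportion scores
    sp2 = (sd_raw / k_items) ** 2  # variance of proportion scores
    signal = n_persons * sp2 / (n_persons - 1) * kr20
    error = (mp * (1 - mp) - sp2) / (k_items - 1)
    return signal / (signal + error)

# Pretest Form B (Group B): n = 37, k = 25, M = 13.19, SD = 2.98, KR-20 = .40
phi_pre_b = phi_shortcut(37, 25, 13.19, 2.98, 0.40)
```

Because the inputs are rounded to two decimals, the result only approximates the reported value of .37; under this formula phi cannot exceed the KR-20 it is derived from.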

[ p. 70 ]

At the end of the semester, Group B, in which 36 students took Form A as a posttest, obtained a mean of 17.75, averaging 71% of the items correct. One student in that group obtained a perfect score of 25. Group A took Form B as a posttest and obtained a mean of 12.48, averaging just under half of the items correct. The standard deviation for this group had the largest value of 3.47, indicating that some students studied for this test but others did not. The KR-20 coefficients were .49 and .57 and the dependability indexes were .46 and .53 for Form A and Form B, respectively. The data obtained from Form A and Form B were pooled separately for the pretests and for the posttests, and the combined means for the pretests and the posttests were 11.53 and 14.85. Thus, a slight increase in the mean scores was observed. With the exception of one kurtosis value of -1.04, normality was not a problem.
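The combined means can be checked as n-weighted averages of the group means reported above. The function name is illustrative only.

```python
def pooled_mean(groups):
    """groups: list of (n, mean) pairs; returns the n-weighted mean."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

pre_mean = pooled_mean([(44, 10.14), (37, 13.19)])   # pretests, about 11.53
post_mean = pooled_mean([(36, 17.75), (44, 12.48)])  # posttests, about 14.85
gain = post_mean - pre_mean                          # about 3.3 points out of 25
```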

Table 1. Descriptive statistics for two forms of a 25-item English vocabulary test administered to two groups of Japanese university students in 2006.

Administration (group)            n   Min  Max      M    SD  Skewness  Kurtosis  KR-20    φ
Pre Form A (Group A)             44     6   14  10.14  2.42      0.16     -1.04    .06  .05
Pre Form B (Group B)             37     7   20  13.19  2.98      0.31     -0.11    .40  .37
Pre Forms A & B (Groups A & B)   81     6   20  11.53  3.08      0.42     -0.06
Post Form A (Group B)            36    11   25  17.75  2.91      0.05      0.53    .49  .46
Post Form B (Group A)            44     6   20  12.48  3.47      0.16     -0.17    .57  .53
Post Forms A & B (Groups A & B)  80     6   25  14.85  4.16     -0.09     -0.42

Note. φ = phi dependability index

Table 2 summarizes the criterion-referenced item statistics for both forms. Ideally, the item facility (IF) should be close to .00 on pretests and close to 1.00 on posttests so that DI values can be maximized. The IF values for items 10 and 16 in Form A and for items 5, 8, 12, 16, 18, and 24 in Form B, used as pretests, were excessively high, indicating that the students knew the vocabulary items before instruction.

[ p. 71 ]

The IF values for items 21 and 22 in Form A and for items 4, 10, 11, 13, 14, 15, 20, and 21, in posttest-Form B, were unreasonably low, indicating that the students did not learn the vocabulary items even after instruction.
The B-index was calculated by subtracting the IF for the bottom 70% of the students from the IF for the top 30% of the students. The B-index is sensitive to the location of the test cut-off point. The cut-off point for this test was arbitrarily set at 70%. In other words, the students who scored higher than 15 out of 25 points passed the posttest. B-index values close to 1 indicate that an item discriminates strongly between the two groups, while values close to -1 signify the opposite. Note that the B-index values for all the items in Form A taken by Group A as a pretest were negative because none of the students exceeded the cut-off point of 15. In the other administrations, only a few items had negative B-index values.
In this study DI was considered the most important criterion-referenced item statistic because it was an indicator of the items that the students had learned since the pretest, presumably as a result of instruction. Ideally, DI should be close to 1, showing that the students learned the item. However, it should be noted that 14 items in Form A had negative DI values. These unwanted results were probably due to the difference in the proficiency levels between Group A and Group B.
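A minimal sketch of the B-index is given below, assuming, as in Brown (2003), that the two groups compared are the students who passed and those who failed at the cut-off. Two further assumptions are made for illustration only: a total at or above the cut score counts as a pass, and the IF of an empty group is treated as 0. The response matrix is hypothetical.

```python
def b_index(responses, cut_score):
    """B-index per item = IF among passers minus IF among failers.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    """
    passers = [r for r in responses if sum(r) >= cut_score]  # assumption: >= passes
    failers = [r for r in responses if sum(r) < cut_score]

    def facility(rows, i):
        # Assumption: an empty group contributes an IF of 0
        return sum(r[i] for r in rows) / len(rows) if rows else 0.0

    n_items = len(responses[0])
    return [facility(passers, i) - facility(failers, i) for i in range(n_items)]

# Hypothetical data: 4 examinees x 3 items
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
b_with_passers = b_index(data, cut_score=2)  # [0.5, 1.0, 0.5]
b_no_passers = b_index(data, cut_score=4)    # nobody passes
```

Under these assumptions, when no student reaches the cut-off, every item answered correctly by at least one student has a negative B-index, which is consistent with the pattern described for the Form A pretest.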

Table 2. Criterion-referenced item statistics for two forms of a 25-item English vocabulary test administered to two groups of Japanese university students in 2006.

Note. The asterisk * denotes vocabulary items which appeared in a parallel form.


Now let us reflect on the research questions in terms of the study results.
1. To what extent were the two CRT forms dependable in the two administrations?

Based on the phi dependability indexes, with the exception of the form taken by Group A as a pretest, the test forms were found to have moderate dependability values from .37 to .53. Since most of the students in Group A scored low on the pretest, it was ideal as a diagnostic test because it revealed that most students had not yet learned the items. However, because there was not much variance observed in the test scores, the dependability of the test was probably low. Statistics can be an indicator for deciding the quality of items; nevertheless, especially when the sample size and the number of criterion-referenced items are both small, teachers should examine the content carefully in order to decide whether or not items are really measuring the target objectives of the class.
2. To what extent did the students master the vocabulary items on the two forms of the CRT?

The pretest/posttest design with two counterbalanced forms enables teachers to determine to some degree the effectiveness of their instruction. Such designs focus on two indicators: DI and score gain. To calculate DI, the same items have to be administered as pretests and posttests. Recall that DI for the items in the posttest given to Group B had negative values. Because the proficiency level of Group A and Group B differed, this was not surprising. Ideally the DI statistic should be used when the proficiency levels of two groups are almost equal. To resolve the problem in this study, each class should have been split into halves.
The other indicator of student learning is score gain. This is a simple but useful method for getting some picture of the effectiveness of curriculum. Recall that the means for the combined pretests and posttests were 11.53 and 14.85, respectively. This suggests that some degree of learning may have occurred in the interval between the tests. That also can be one of the arguments for the construct validity of the CRTs.

[ p. 72 ]


CRT development is a crucial part of curriculum development because it offers a snapshot of what is being learned by students. It is recommended that teachers make CRTs before instruction so that successful teach-to-test instruction can be accomplished. It is also recommended that two forms of any CRT be developed in order to test a wider range of content in a counterbalanced design. When interpreting test scores, it is also hoped that teachers will examine DI and score gains by comparing pre- and post-data to evaluate the effectiveness of their teaching.
Two possible limitations of this study concern validity and test format. The phi dependability indexes of the CRTs were reported and one validity argument was provided. However, the issue of validity itself was not fully investigated. Ways to validate CRTs are discussed in Nasca (1988) and Haertel (1985). For classroom tests, teachers often use a variety of test formats, but this test relied solely on multiple-choice items. It would be interesting to apply generalizability theory to investigate how much test format, as a facet, contributes to the total variance. Since teachers use CRTs frequently in class, more studies on CRTs need to be done in the future.

[ p. 73 ]


Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, J. D. (1990). Short-cut estimators of criterion-referenced test consistency. Language Testing, 7(1), 77-97.

Brown, J. D. (1993). A comprehensive criterion-referenced language testing project. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing: Collaboration and cooperation (pp. 163-184). Ann Arbor, MI: University of Michigan.

Brown, J. D. (1995). The elements of language curriculum: A systematic approach to program development. Boston, MA: Heinle & Heinle.

Brown, J. D. (2001). Developing and revising criterion-referenced achievement tests for a textbook series. In T. Hudson & J. D. Brown (Eds.), A focus on language test development. (Technical Report #21, pp. 205-228). Honolulu: University of Hawai'i, Second Language Teaching and Curriculum Center.

Brown, J. D. (2003). Criterion-referenced item analysis (The difference index and B-index). Shiken: JALT Testing & Evaluation SIG Newsletter, 7 (3), 13-17. Retrieved November 6, 2007 from

Brown, J. D. (2005a). Testing in language programs: A comprehensive guide to English language assessment. New York: McGraw-Hill College.

Brown, J. D. (2005b). Generalizability and decision studies. Shiken: JALT Testing & Evaluation SIG Newsletter, 9 (1), 12-16. Retrieved November 6, 2007 from

Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519-521.

Griffee, D. (1995). Criterion-referenced test construction and evaluation. In J. D. Brown & S. O. Yamashita (Eds.), Language testing in Japan (pp. 20-28). Tokyo: Japanese Association of Language Teaching.

Haertel, E. (1985). Construct validity and criterion-referenced testing. Review of Educational Research, 55 (1), 23-46.

Messick, S. (1989). Validity. In R. L. Linn (Ed.) Educational measurement (3rd ed.) (pp. 13-103). New York: American Council on Education & Macmillan.

Nasca, D. (1988, March 17). An educators' field guide to CRT development and use in objectives based programs. ERIC Document #ED293878. Retrieved November 6, 2007 from

Popham, W. J. (2003). Test better, teach better: The instructional role of assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

Popham, W. J. & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6 (1), 1-9.


[ p. 74 ]