Curriculum Innovation, Testing and Evaluation: Proceedings of the 1st Annual JALT Pan-SIG Conference.
May 11-12, 2002. Kyoto, Japan: Kyoto Institute of Technology.

Assessing speaking in junior high school:
Issues for a senior high school entrance examination

by Tomoyasu Akiyama (The University of Melbourne)

This paper has three purposes. Firstly, it discusses three assessment contexts in relation to Bachman and Palmer's (1996) notion of 'usefulness': (1) the 1999 Tokyo Senior High School entrance examination English test, (2) a proposal to include speaking tests in entrance examinations, and (3) the assessment of speaking skills in junior high schools.
Secondly, it identifies issues in school-based assessment by junior high school English teachers through a questionnaire. It also reports the results of a Rasch analysis using empirical data derived from test trials undertaken by junior high school students.
Finally, this paper argues for the need to build up a 'task bank' for future versions of the senior high school exam. It also emphasizes the importance of introducing speaking tests in entrance examinations for senior high schools.

The 'usefulness' of the current English entrance examination

Nearly 80% of the 1999 version of the Tokyo Senior High School English Test focused on reading skills and grammar knowledge, and it was in a multiple-choice format. To its credit, the test scores were reliable. However, since the entrance examination did not include any assessment of speaking skills, it lacked authenticity.

Figure 1. The proportion of skills tested in the 1999 version of the English section of the Tokyo senior high school entrance examination.

[ p. 107 ]

The 'indirect speaking tests' in Section 2 of this test were low on interactiveness because students were only required to select the English sentence which suited a given scenario most appropriately. In terms of test impact, this test does not encourage students or teachers to focus on oral/aural skills. In terms of practicality, however, the current paper-and-pencil English examination is highly practical. Hence the 1999 version of the test has two strong points (reliability and practicality) and four weak points (construct validity, impact, authenticity and interactiveness), as depicted in Figure 2.
Figure 2. The author's assessment of the usefulness of the 1999 English test.

Figure 3 represents the author's evaluation of a hypothetical test with a speaking component. Though it may have less formal reliability than the current English test, it would be superior in terms of authenticity. McNamara (1996) has noted that speaking tests tend to have many inherent variables which reduce reliability, such as rater behaviour and variation among interlocutors. In terms of authenticity, however, the inclusion of speaking tests could be a genuine boon, since the test would reflect the curricular content. As speaking tests could engage students in completing tasks interactively, such tests would be more interactive as well. Introducing speaking tests in the entrance examination would have a great impact on teachers and students, as several studies (e.g. Shohamy, Donitsa-Schmidt, and Ferman, 1996; Cheng, 1997) suggest. As speaking tests require greater resources to administer, the inclusion of a speaking component may be low in terms of practicality.
Figure 3. The author's assessment of the usefulness of the proposed English test.

[ p. 108 ]

As studies by Brindley (1999) indicate, the reliability of school-based assessment tends to be low. Its construct validity could potentially be high, as Hamp-Lyons (1996) claims: she argues that portfolio assessment tends to have more task validity than traditional tests. Authenticity and interactiveness could also be high because school-based assessment can provide ample opportunity to speak. However, these judgments need to be made with caution because results may vary significantly depending on teachers and teaching styles. Practicality seems to be the main reason that tests do not currently have a speaking component.

Research questions
  1. How do junior high school teachers assess their students' speaking skills?
  2. What impact would the introduction of a speaking test in the entrance exams have on teaching?
  3. To what extent are tasks (speech, role-play, description and interview) different in terms of difficulty?
  4. To what extent do speaking test items and tasks correlate in terms of Rasch measurement?
  5. How well did the test population's performances fit in terms of Rasch measurement?

Data collection methods

Data collection 1: A questionnaire survey

A questionnaire survey was designed to address research questions 1 and 2. Approximately 600 questionnaires were distributed to junior high school English teachers in Tokyo. The questionnaire was completed by 199 teachers, representing a response rate of approximately 33%.

Data collection 2: Test trials

Four of the five most popular tasks (speech, role-play, description, and oral interview) were used for a trial test; the information gap task was excluded. Each task had a 5-minute completion time, including explanations of the test procedures.


Test-takers were Japanese junior high school students ranging from age 14 (second-year students) to age 15 (third-year students). A total of 219 students at twelve schools participated in this test trial. Each student undertook two of the four tasks, giving a total of 438 student performances.


The 13 interlocutors (12 Japanese English teachers at the participants' schools and the researcher) administered different tasks to the students.

[ p. 109 ]

Raters and scoring criteria

Five independent Japanese senior high school English teachers, each with more than 10 years' teaching experience, rated the students' performances from tape recordings. The scoring criteria consisted of five items (fluency, vocabulary, grammar, intelligibility and overall task fulfillment). Each item was rated on a 0-to-5 point scale according to the levels of performance described for that item.


Questionnaire survey

Research Question 1 ascertained what percentage of English teachers assessed students' speaking ability using 'direct speaking tests'. Those who conducted 'direct speaking tests' amounted to 57.3% (n=114), while 42.7% (n=85) chose not to administer speaking tests. However, further analysis showed that combinations of other assessment methods, such as class observation and pencil-and-paper tests, were frequently used. The results revealed that the majority of English teachers assessed students' speaking skills based on classroom observation in combination with other methods.
Research Question 2 investigated what impact the introduction of speaking tests would have on Japanese English teachers. More than 75% of the teachers reported that speaking tests would have an impact on them, while 20% stated that there would be little or no impact on their teaching. Responses to this question showed that the introduction of speaking tests in entrance examinations would have a positive impact on teachers and their teaching activities, in that the majority of teachers would change their teaching styles towards improving students' communicative skills.
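The percentages quoted for the questionnaire can be checked against the raw counts. The following is a minimal sketch of that arithmetic only; the variable and function names are illustrative, and because the paper says "approximately 600" questionnaires were distributed, the response rate it yields is itself approximate.

```python
# Counts quoted in the paper; names are illustrative, not the author's.
distributed = 600        # "approximately 600", so the rate below is approximate
returned = 199           # completed questionnaires
used_direct_tests = 114  # teachers who conducted 'direct speaking tests'
no_direct_tests = 85     # teachers who did not


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


response_rate = percent(returned, distributed)
share_direct = percent(used_direct_tests, returned)
share_no_direct = percent(no_direct_tests, returned)

print(f"Response rate: {response_rate}%")
print(f"Direct speaking tests: {share_direct}%")
print(f"No direct speaking tests: {share_no_direct}%")
```

The two groups sum to the 199 respondents, and the computed shares agree with the 57.3% and 42.7% reported for Research Question 1.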
Now let us look at the test scores from a Rasch perspective.

Difficulty of items and tasks

Research Question 3 investigated the difficulty of the four tasks and of the items on each task. The difference between the most difficult and the easiest tasks was approximately 1.5 logits.
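To give a rough sense of what a gap of 1.5 logits means, the sketch below uses the simplest (dichotomous) form of the Rasch model, in which the probability of success depends only on the difference between person ability and task difficulty. This is an illustrative simplification: the study scored performances on a 0 to 5 scale, which would call for a polytomous model, and the ability and difficulty values below are hypothetical. Only the 1.5-logit gap is taken from the paper.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical placement on the logit scale; only the 1.5 gap is from the paper.
easiest_task = 0.0
hardest_task = easiest_task + 1.5
student_ability = 0.0  # a student exactly matched to the easiest task

p_easy = rasch_probability(student_ability, easiest_task)
p_hard = rasch_probability(student_ability, hardest_task)

# A difference of d logits multiplies the odds of success by exp(d).
odds_ratio = math.exp(hardest_task - easiest_task)

print(f"P(success on easiest task) = {p_easy:.2f}")
print(f"P(success on hardest task) = {p_hard:.2f}")
print(f"Odds ratio between tasks   = {odds_ratio:.2f}")
```

Under these assumptions, a student with an even chance on the easiest task would have a chance of roughly 0.18 on the hardest one, which illustrates why the choice of task could matter for an individual student's result.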

Fit indexes across four tasks

Research Question 4 examines the quality of the items, that is, the extent to which the response patterns predicted by the Rasch model differ from those in the actual data. Items whose observed patterns depart from the model's expectations are called either 'misfit' or 'overfit' items. The acceptable range of IMS here is from 0.70 to 1.30. Only one item, Item 15, was identified as 'misfit', with values larger than the acceptable range of IMS in the sixth and seventh columns. This shows that the actual data patterns for Item 15 vary unacceptably in comparison with the data patterns estimated by Rasch measurement. The remaining items on the four tasks appeared to produce relatively similar response patterns, suggesting that the items across tasks assessed a similar construct.
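The fit check described above can be sketched in code. The example assumes that IMS stands for the infit mean square, computed as the sum of squared residuals divided by the sum of model variances, and it uses the dichotomous Rasch model for brevity, whereas the study used 0 to 5 ratings. The abilities, difficulty and response patterns are invented for illustration and are treated as known, although in a real analysis they would be estimated from the data. Only the 0.70 to 1.30 range and the 'misfit'/'overfit' labels come from the paper.

```python
import math

ACCEPTABLE_RANGE = (0.70, 1.30)  # range quoted in the paper


def rasch_probability(ability: float, difficulty: float) -> float:
    """Expected score on a dichotomous item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def infit_mean_square(responses, abilities, difficulty):
    """Information-weighted fit: squared residuals over model variances."""
    squared_residuals = 0.0
    total_variance = 0.0
    for observed, ability in zip(responses, abilities):
        expected = rasch_probability(ability, difficulty)
        squared_residuals += (observed - expected) ** 2
        total_variance += expected * (1.0 - expected)
    return squared_residuals / total_variance


def classify_fit(ims, acceptable=ACCEPTABLE_RANGE):
    """Label an item as 'misfit', 'overfit' or 'acceptable'."""
    low, high = acceptable
    if ims > high:
        return "misfit"   # responses less predictable than the model expects
    if ims < low:
        return "overfit"  # responses more predictable than the model expects
    return "acceptable"


# Invented data: five students from low to high ability, one item at 0 logits.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
patterns = {
    "mostly as expected": [0, 1, 0, 1, 1],
    "too predictable": [0, 0, 1, 1, 1],
    "reversed": [1, 1, 0, 0, 0],
}

for label, responses in patterns.items():
    ims = infit_mean_square(responses, abilities, difficulty=0.0)
    print(f"{label}: IMS = {ims:.2f} ({classify_fit(ims)})")
```

With these invented patterns, the first is classified as acceptable, the second as overfit and the third as misfit, which mirrors the distinction the paper draws for Item 15.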

[ p. 110 ]

Person fit indexes

Research Question 5 focuses on students' scores across the four tasks. This is particularly important, since it bears on accountability to students. Of the students, 5.4% were identified as misfitting, which exceeds the acceptable percentage of misfit students. It is therefore important to investigate why this happened.
The task combination that produced misfit students most frequently was speech and interview, followed by description and interview. Other task combinations produced fewer misfit students than these two. One possible explanation is that differences in task difficulty within a combination increase the number of misfit students.


Results of the questionnaire survey revealed that teachers' assessment methods varied, suggesting that it would be difficult to compare students' speaking ability across schools. The introduction of speaking tests might have a positive impact on approximately 80% of teachers, and most teachers maintained that they would change to a more communicative style of teaching. Thus, it can be argued that the inclusion of the speaking tests would have the potential to assist in bridging a gap between skills taught in classes and skills tested in entrance examinations, and between goals of the guidelines and assessment policy.
Results from the test trials undertaken by junior high school students showed that all items except one fitted Rasch measurement, indicating that the items on each task were effective in assessing the target construct. However, results also showed that the four tasks frequently used by English teachers differed in difficulty. This means that students who undertake tasks of differing difficulty might not be assessed appropriately. Given that variables such as rater behavior and interlocutors are inherent in performance tests, task difficulty needs to be relatively equal in order to reduce the number of such variables. The concept of a 'task bank', presented by Brindley (2001), could have important implications for the introduction of formal speaking tests in entrance examinations.


The implications of this study are that speaking tasks used in the classroom need to be trialled and investigated with Rasch measurement, given that school-based assessment represents half of the selection procedure for students who wish to enter senior high schools. In junior high school contexts, a role-play task bank covering situations such as shopping, inviting friends to a party, or giving directions to a stranger could be developed. In order not only to administer speaking tests in a high-stakes context, but also to make teacher-implemented assessment comparable across schools, it would be necessary to investigate tasks with Rasch techniques, based on empirical data, and to build up a 'task bank' of tasks of relatively consistent quality.

[ p. 111 ]


References

Akiyama, T. (2001). The application of G-theory and IRT in the analysis of data from speaking tests administered in a classroom context. Melbourne Papers in Language Testing, 10 (1), 1-22.

Alderson, J. C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.

Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.

Brindley, G. (2001). Outcome-based assessment in practice: some examples and emerging insights. Language Testing, 18 (4), 393-407.

Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language and Education, 11 (1), 38-54.

McNamara, T. F. (1996). Measuring second language performance. London and New York: Addison-Wesley Longman.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13 (3), 239-256.

Shohamy, E., Donitsa-Schmidt, S., & Ferman, I. (1996). Test impact revisited: washback effect over time. Language Testing, 13 (3), 298-317.


[ p. 112 ]