Curriculum Innovation, Testing and Evaluation: Proceedings of the 1st Annual JALT Pan-SIG Conference.
May 11-12, 2002. Kyoto, Japan: Kyoto Institute of Technology.

Developing an EAP test for undergraduates at a national university in Malaysia: Meeting the challenges

by Mohd. Sallehhudin Abd Aziz (Universiti Kebangsaan Malaysia)


This is an account of the design and development of a test for use in a single institution, Universiti Kebangsaan Malaysia. The main purpose of the test was to assess the language proficiency of undergraduates in order to establish whether they had sufficient English to undertake study in an academic programme in which some subjects were taught in English. This paper discusses the test development activities, including a description of the purpose of the test and a definition of its construct. It also gives an overview of the tools of inquiry adopted in the quantitative and qualitative validation procedures used for the test.

Keywords: EAP test development, test validation, English for Academic Purposes, language proficiency testing.


The development of this test was prompted by three reasons. The first was the perceived English language inadequacy among Universiti Kebangsaan Malaysia (UKM) law students. Some students were having difficulty in following their subject areas, despite having passed the English section of the Sijil Pelajaran Malaysia (SPM) examination, which is equivalent to an O level paper in the UK.
The second reason for developing this test was that the existing SPM English Language Test seemed inadequate for assessing the language ability of prospective university students who decided to study locally, especially in English-intensive faculties such as the Faculty of Law. This is because the SPM test was an achievement test based on a particular syllabus. That syllabus does not coincide with the actual needs of the students or prepare them for the language demands of university-level study.

Thirdly, the positive developments in language testing (especially the advent of communicative language testing) do not seem to have had a profound effect on testing in Malaysia. Language testing in the local setting is still somewhat traditional in nature. The Malaysia Examinations Syndicate, the centralized examination body in charge of designing and developing standardized tests nationwide, seems to be rooted in a traditional testing paradigm, and the tests it develops do not appear to measure certain features of genuine language use. It seems to have more faith in outdated and all-embracing models such as the Psychometric-Structuralist model. According to Test Development in ASEAN Countries (1986), assessment in Malaysia is still based on Classical Test Theory. Raatz (1985) indicates that the focus of that theory has always been on reliability rather than validity.

Purpose of the study

The purpose of the study was to design, construct and validate an English language test for incoming law students at the Faculty of Law of the Universiti Kebangsaan Malaysia.


The main aim of the test was to provide an accurate assessment of the language ability of the students in order to determine whether they had attained adequate ability to undertake studies in the faculty. The specific aims of the study were twofold: (1) to investigate the validity of the test in terms of its content validity, construct validity, concurrent validity and predictive validity, and (2) to investigate the reliability of the test in terms of its inter-rater reliability.


This test was developed according to procedures recommended by Carroll (1980), Carroll and Hall (1985), and Weir (1990). These test development stages were adopted because they are currently accepted as the "best practice" in test development and validation. However, certain specific steps unique to this project were also adopted in developing this test. Table 1 represents the three main steps used in developing the test for this study.

Table 1: Test development stages/activities adopted in the study.

Let's highlight each stage briefly.

Stage 1: Design

At this stage, the main task was to obtain information on the purpose of the test and the prospective test takers. In this study, the test takers were the incoming law students at the Faculty of Law, Universiti Kebangsaan Malaysia (UKM). The purpose of the test was to provide a proficiency scale of the students' ability to cope with law studies in English. The target language use (TLU) domain was the first year academic setting of the Faculty of Law, UKM.
In view of this, the first step in test development was the identification of language needs. There are a number of ways in which these communicative needs could be identified, but for the purpose of this investigation needs analysis was used. Bachman and Palmer (1996) have recommended the use of needs analysis in identifying the language tasks in the relevant domain. Needs analysis is, as a matter of fact, the essence of the development of an EAP test. Robinson (1991, p. 7) maintains, "Needs analysis is generally regarded as criterial to ESP".

The construct

The construct for the proposed test was based on a framework of language use contexts in an academic setting at the Faculty of Law, Universiti Kebangsaan Malaysia. This approach to test construct has been adopted by researchers such as Candlin, Burton and Coleman (1980) in the design of tests to assess the proficiency of overseas-trained dentists in Britain. It has also been used by Low and Lee (1985) in an attempt to investigate the relationship between academic performance and second language problems, and by Hughes (1988b) in assessing language proficiency for academic purposes at a university in Turkey. Testing organizations such as the Associated Examining Board (AEB) have also used this approach to test development in tests such as the TEEP.

One fundamental reason why this construct or model of language ability was adopted in the present study is that the proposed test was basically an English for Academic Purposes (EAP) test. An EAP test essentially deals with a tightly-defined situation or setting (Douglas 2000). Our test was a needs-related test meant for a specific group, law students. Since the academic setting in the study was very specific, it made more sense to use a well-defined framework. If the proposed test had been meant for general students from various faculties and disciplines, it would have been much better to use a general theoretical framework such as Bachman's (1990) or Bachman and Palmer's (1996) instead.
Another reason for adopting this particular construct was based on the notion that each linguistic situation is unique. Porter (1983, p. 192) argues, "There can't be single test of communicative proficiency for all comers. We must test to an analysis of the particular needs of a particular group". He expresses the view that the definition of language ability should be fluid because language tests are used for different purposes. Porter (1991, p. 33) adds, "different needs of different learners may call for different types of language ability" and, furthermore, that the notion of language proficiency is different in each case.
The third reason why this particular framework of language ability was adopted is related to the question of whether the present theoretical models can be generalized beyond the specific testing situation. The models have become rather general and abstract and are increasingly elaborate and complex as our understanding of language ability deepens. McNamara states:
. . . attempts to apply a complex framework for modeling communicative language ability directly in test design have not always proved easy, mainly because of the complexity of the framework. This has sometimes resulted in a rather tokenistic acknowledgement of the framework and then disregard for it at the stage of practical test design. (p. 20)

A fourth reason why a particular theoretical model was not entirely adopted in the study is that there is still no consensus with regard to the definition of the constructs to be measured. There is still no overall proficiency model that is universally accepted.
According to Lantolf and Frawley (1988, p. 186), "A review of the recent literature on proficiency and communicative competence demonstrates quite clearly that there is nothing even approaching a reasonable and unified theory of proficiency".
In short, it is clear that there is a strong and valid argument for the construct of the proposed test to be context based. In the study, the identification of the contexts or the tasks in the target language use domain was done through the use of needs analysis.


Methods of data collection

Generally, needs analysis involves systematic gathering of information about the language needs of the learners. The needs analysis adopted for the study was based on Brown's (1995) interpretation that the systematic collection and analysis of relevant information is necessary to satisfy the language learning needs of the students within the context of the particular institution.

Figure 1. Data-gathering methods for the proposed test.

The main purpose of the needs analysis was to gather and identify the linguistic demands of the target language use situation of the first year law students in UKM. The results were then analyzed and used to draw up the proposed test specifications. The data accumulated were grouped into modalities or skills. This was done because of the vast amount of information gathered and in order to arrive at a more manageable test design. Describing the test specifications is the last activity of the first stage of test development. In the study, the tasks for the specifications were chosen based on their importance, as decided largely by the law students and subject informants via the needs analysis. These specifications later served as the test 'blueprint'.


Test specifications for the proposed test

Based on the needs analysis (via questionnaire, interview, observation and document analysis), it was decided that, in general, reading was mainly for reading textbooks and other written sources in law. Writing was most important for taking notes in lectures and tutorials and for writing project papers. Speaking, as exemplified in the needs analysis, was mainly for face-to-face interviews, asking and answering questions, and presentations. Listening, on the other hand, was primarily for following lectures and tutorials. In short, the broad aims of the test were:

Table 2: Table of test objectives.

Subtest     Objectives
---------   ----------
Listening   To assess candidates' ability to listen and understand law lectures and tutorials
Reading     To assess candidates' ability to read and understand law textbooks and other written sources
Writing     To assess candidates' written English for academic writing tasks in Law
Speaking    To assess candidates' ability to speak English to take part in academic tasks in lectures and tutorials

Construction stage

The second stage of test development is the construction stage. The researcher constructed the instrument based on the test specifications (see Appendix 1 for test specifications). Clark (1975, p. 11) has always argued in favor of exact specification of tasks in terms of language and content and confidently speaks of replicating reality in a test's setting and operation. He adds, "A major requirement of direct proficiency tests is that they must provide a very close facsimile or 'work sample' of the real-life language situations in question, with respect to both the setting and operation of the tests and the linguistic areas and content which they embody".
Being a direct performance-based test, the proposed test tasks reflected as closely as possible the kinds of tasks needed in the target language use situation. For practical purposes, the test was designed based on the modality approach comprising listening, reading, writing and speaking subtests. However, it must be stressed that the tasks were not specific to the skills. They were tested integratively. The subtests and the rationale are described in detail below.

Moderation of the test by subject specialists

After the first draft of the test was generated, the researcher consulted four subject specialists. These were the law lecturers whose opinions had been sought to ascertain the target language use situations. The specialists had between 5 and 8 years of experience teaching first year law courses. All of them had at least a master's degree in law; three graduated from universities in the United Kingdom and one from the International Islamic University, Malaysia. The test was given to these specialists for moderation purposes. A validation check by specialists is an important requirement for any test development. One of the main tasks of the subject specialists in this study was to ascertain whether the test content represented the kinds of tasks that the first year students had to undertake. They also had to consider the clarity of the instructions and whether the time given to the students to complete the test was sufficient. Weir (1990) has recommended that a test undergo a validation check by inviting professionals in the field (namely language and subject specialists) to comment on the suitability of texts, format and items.


On whether the characteristics of the test tasks reflected those of the target language use situations, the specialists held the view that the test focused on the important aspects of the target settings. With regard to the clarity of the instructions for the test, all experts agreed that they were very clear. Nevertheless, the same vote of confidence could not be given to the time allowed for students to complete the test, particularly the writing section. Two of the specialists maintained that more time was needed to complete the test.
The initial feedback from the specialists provided invaluable information to the researcher pertaining to several aspects of validity [a priori validation of the test]. The responses from the content specialists guided the researcher to review the test. This resulted in some amendments to the test, especially with regard to the reading passages and the time allowed for test tasks to be carried out.

First pilot test

After the moderation of the test by the specialists and some modifications to the instrument, a pilot test was conducted with 17 law students. Immediately after this, a simple questionnaire was given to the test takers to assess the difficulty of the questions, the adequacy of the time allotted, the appropriateness of the passages and the clarity of the instructions.
In the first piloting, 41.2% of the students believed that the listening subtest was 'Very difficult'. The same percentage thought that the subtest was 'Moderately difficult' and 17.6% maintained it was 'Somewhat difficult'. Generally, the majority of the students felt that the time allotment for the test was quite reasonable. The students indicated that the instructions for the other subtests were reasonably good. The clearest set of instructions appeared to be those for the reading subtest. A great majority of the pilot examinees (82.3%) felt that the instructions for the reading subtest were 'Highly clear', and 76.5% of the respondents maintained that the instructions for the writing and the speaking subtests were 'Highly clear'.

Test revision

The feedback gathered from the students resulted in some changes made to the test. Based on their feedback, the instrument was once again revised and improved upon.
The next section discusses the activities in the third stage of the test development and highlights the validation procedures of the proposed test.

Validation stage

Stage three of the test development began with a second piloting, which was a full-scale application of the test. The subjects for the second piloting were 85 first year students from the Faculty of Law, UKM. They represented 90% of the first year students. The test was also given to four subject specialists and two language specialists to be evaluated for validation purposes. The subject specialists were law lecturers from the Faculty of Law, UKM. The language specialists were English language teachers with at least ten years of experience teaching English to first year law students in UKM; they had also served as coordinators of the English For Law courses for the Faculty of Law. A proper a posteriori validation of the test, establishing the test measurement characteristics, was also done at this stage.


Establishment of the test measurement characteristics

Based on the students' performance in the test and the students' and specialists' responses to the questionnaire, the measurement characteristics of the proposed test were then established to ascertain validity and reliability.

Content validity

To ascertain content validity Weir (1988, p. 26) recommends:
  1. Close scrutiny of the test items by experienced professionals; and
  2. The relating of the specification for the test to the final form of the test.
The content validation of this study was through the following means:
  1. Through the use of a systematic and empirical analysis of the target language use setting;
  2. Through the qualitative judgment by the experts on the content of the test; and
  3. Through the moderation of the test by the subject informants.
Let us discuss each of them briefly.

Systematic analysis of the target language use situation

The language requirements of the first year law students in UKM were systematically and empirically identified and analyzed by the researcher. Establishing the test content, or the linguistic demands, in the study involved not only the use of questionnaires and interviews but also document analysis and observation. At the a priori stage of the test development, questionnaires were given to those familiar with the target settings. Those asked to provide input included the subject informants and the students themselves. All the law lecturers of the first year law program and 54 second year students were involved in the study.

Expert judgment on test tasks and test specifications by subject and language specialists

In addition to the systematic gathering of information pertaining to the students' language tasks in the target setting, the second method adopted in ascertaining the content validity of the test was to involve the subject specialists. Hudson and Lynch (1984b, p. 182) have suggested that judgments on whether a test covers a representative sample of the target language use situations are usually obtained from experts in the field. "That is, the test is examined to discern whether or not it includes all the sub-skills and elements of the domain and whether or not it is measuring those sub-skills properly."
In view of this, the researcher consulted subject and language specialists to give their opinions and to evaluate the content of the test vis-à-vis the test specifications. On looking at the test, these specialists had to be satisfied that it really measured what it claimed to measure. Their judgments of test content were based on a comparison of the test tasks with the test specifications. In evaluating the content of the test, the experts were given the complete test specifications document, which clearly and precisely stated the purpose of the test, the language skills and the areas to be tested.


Table 3: Experts' judgments on the extent to which the test tasks covered the areas being measured as stipulated in the test specifications.

Subtest     To a great extent   To some extent   To a limited extent   Not at all
---------   -----------------   --------------   -------------------   ----------
Listening   67% (n = 4)         33% (n = 2)      -                     -
Reading     83% (n = 5)         17% (n = 1)      -                     -
Writing     83% (n = 5)         17% (n = 1)      -                     -
Speaking    83% (n = 5)         17% (n = 1)      -                     -

Based on the table above, it can be deduced that a great majority of the specialists agreed that the test tasks covered the areas being measured as stipulated in the test specifications. A total of 83% of them agreed 'To a great extent' that the test tasks in the reading, writing and speaking subtests covered the areas specified, and the remaining 17% chose 'To some extent'. For the listening subtest, the corresponding figures were 67% and 33%. None of the respondents chose either 'To a limited extent' or 'Not at all'. In short, the responses from the specialists were highly positive: the majority judged the test content to be highly reflective of the specifications, which augurs well for the content validity of the test.
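The percentages in Table 3 can be reproduced from the reported counts. The sketch below is only an illustration: the dictionary and function names are hypothetical, and it assumes that each row covers the same six specialists (four subject and two language specialists) and that percentages were rounded to whole numbers.

```python
# Response counts as reported in Table 3 (six specialists per subtest).
# The names below are illustrative and do not come from the study.
TABLE3_COUNTS = {
    "Listening": {"To a great extent": 4, "To some extent": 2},
    "Reading": {"To a great extent": 5, "To some extent": 1},
    "Writing": {"To a great extent": 5, "To some extent": 1},
    "Speaking": {"To a great extent": 5, "To some extent": 1},
}


def to_percentages(row):
    """Convert one row of response counts to whole-number percentages."""
    total = sum(row.values())
    return {label: round(100 * count / total) for label, count in row.items()}


percentages = {
    subtest: to_percentages(row) for subtest, row in TABLE3_COUNTS.items()
}

for subtest, row in percentages.items():
    print(subtest, row)
```

With only six raters, a single change of rating moves a percentage by about 17 points, so the counts are arguably more informative than the percentages.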

Moderation of the test

In addition to the systematic needs analysis and the evaluation of the test by subject and language specialists to ascertain content validity, the subject specialists were also asked to moderate the test at an early stage of test development. They were asked to scrutinize the test to make sure that its content consisted of tasks that properly sampled the TLU situation. Their feedback was important in improving the test. In the moderation process, the specialists' reactions to the test were very positive. They generally believed that the test tasks covered the areas stipulated in the test specifications and that the test tasks reflected the target language use domain. Overall, they regarded the test as a good test.

Construct validity

According to Davidson et al. (1985), construct validation usually involves:
  1. a clear statement of theory,
  2. an a priori prediction of how the test(s) should behave given the theory; and
  3. following administration of the test(s), a check of the fit of the test to the theory.

If all three of these facets work well, the test can be said to have construct validity.

It can be inferred from Davidson et al. (1985) that there are basically two views with regard to establishing construct validity. The first view is concerned with external empirical data: construct validity is viewed from a purely statistical aspect and is seen as a matter of a posteriori statistical validation. The other view gives more attention to non-statistical aspects of construct validity (qualitative analysis). It focuses on a priori validation of the test and stresses the importance of construct validation at the a priori stage of test development.
The view adopted in the study was that external empirical data are necessary but not sufficient to establish construct validity. There was an equally important need for construct validation at the a priori stage. As such, both a priori and a posteriori construct validation should be carried out.

Quantitative analysis

According to Weir (1990, p. 23), "To establish the construct validity of a test statistically, it is necessary to show that it correlates highly with the indices of behavior that one might theoretically expect it to correlate with and also that it does not correlate significantly with variable that one would not expect it to correlate with". One of the quantitative methods used in the study to ascertain construct validity was to correlate the different subtest scores with each other using the Pearson Product Moment Correlation formula. The main aim was to find out whether the different subtests really tested different skills. One reason for having different test components is that they measure something different, so if the subtests do not correlate highly, this suggests that they are testing different skills. The correlations were expected to be fairly low, possibly in the order of +.3 to +.5. In the study by Fok (1981), the correlations between the students' self-assessments and test were found to be at 0.3. A high correlation between subtests (+.9) would show that the subtests are testing the same facet. The second statistical procedure used in the study to ascertain construct validity was to correlate each subtest with the overall test scores. The third quantitative method employed was to correlate each subtest with the total test score minus that subtest's own score ("total minus self").
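The three correlational procedures described above can be sketched in a few lines of code. This is a minimal illustration, not the analysis used in the study: the scores are invented for six hypothetical candidates, the variable names are assumptions, and the total is taken to be the unweighted sum of the four subtest scores, which the paper does not state.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six candidates (NOT data from the study).
scores = {
    "listening": [12, 15, 9, 18, 14, 11],
    "reading": [14, 16, 10, 17, 12, 13],
    "writing": [10, 9, 13, 12, 15, 8],
    "speaking": [11, 13, 12, 16, 15, 9],
}
subtests = list(scores)
n_candidates = len(scores["listening"])

# Assumption: the overall score is the unweighted sum of the subtests.
totals = [sum(scores[s][i] for s in subtests) for i in range(n_candidates)]

# Procedure 1: each subtest against every other subtest.
inter = {
    (a, b): pearson(scores[a], scores[b])
    for i, a in enumerate(subtests)
    for b in subtests[i + 1:]
}

# Procedure 2: each subtest against the overall total.
with_total = {s: pearson(scores[s], totals) for s in subtests}

# Procedure 3: each subtest against the total minus its own score.
minus_self = {
    s: pearson(scores[s], [t - v for t, v in zip(totals, scores[s])])
    for s in subtests
}

for pair, r in inter.items():
    print(pair, round(r, 2))
for s in subtests:
    print(s, round(with_total[s], 2), round(minus_self[s], 2))
```

The "total minus self" variant matters because a subtest's own score is part of the total, which inflates its correlation with the total; removing it gives a fairer picture of how the subtest relates to the rest of the test.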

Correlating each subtest with other subtests

Table 4: Inter-correlation coefficients among the subtests.

Subtest     Listening   Reading   Writing   Speaking
---------   ---------   -------   -------   --------
Listening   -           0.56**    0.10      0.34**
Reading     0.56**      -         0.15      0.02
Writing     0.10        0.15      -         0.48**
Speaking    0.34**      0.02      0.48**    -

** Correlation significant at the 0.01 level.

The data from Table 4 indicate that the writing subtest had its lowest correlation (0.10) with the listening subtest. A correlation of 0.02 between the speaking subtest and the reading subtest also suggested that these two subtests operated differently. It is apparent that the correlations between the subtests were generally low. These low correlations indicate that the subtests were measuring different kinds of demands. The low correlations seen in Table 4 augur well for the construct validity of the proposed test.
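The significance flags in Table 4 can be sanity-checked with the usual t test for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below rests on several assumptions that the paper does not state: that every coefficient is based on all 85 second-pilot candidates, that the test was two-tailed, and that the critical value is roughly 2.64 (an approximate table value for 83 degrees of freedom at the 0.01 level). The function and variable names are illustrative.

```python
from math import sqrt


def t_statistic(r, n):
    """t value for testing a zero population correlation, given r from n pairs."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


N = 85             # assumption: all 85 candidates sat every subtest
T_CRITICAL = 2.64  # assumption: approximate two-tailed 0.01 value, 83 df

# Coefficients as reported in Table 4.
table4 = {
    ("listening", "reading"): 0.56,
    ("listening", "writing"): 0.10,
    ("listening", "speaking"): 0.34,
    ("reading", "writing"): 0.15,
    ("reading", "speaking"): 0.02,
    ("writing", "speaking"): 0.48,
}

significant = {
    pair: abs(t_statistic(r, N)) > T_CRITICAL for pair, r in table4.items()
}

for pair, flag in significant.items():
    print(pair, round(t_statistic(table4[pair], N), 2), flag)
```

Under these assumptions, the listening-reading, listening-speaking and writing-speaking coefficients exceed the critical value and the other three do not.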

Continue to Part 2

