Review and Evaluation of the Development of a

Listening Comprehension (LC) Section of the LSAT

 

 

Robert Bostrom

University of Kentucky

Robert French

Educational Testing Service

Philip Johnson-Laird

Princeton University

Cynthia Parshall

Measurement Consultant

                                                                                               
Review and Evaluation of the Development of a

Listening Comprehension (LC) Section of the LSAT

 

Table of Contents

 

Executive Summary

Introduction

Previous Tests of Listening Comprehension

The Test Development Documents and Materials

Theoretical Background on Listening

                Advantages of Listening

                Disadvantages of Listening

                Working Memory

                Listening and Reading

                Dialogic Listening

                “Lecture” Listening

                Memory

Review of the Field Tests of LSAC’s LC Test

Evaluation of LSAC’s LC Test – Construct Definition and Test Specifications

                Definition of the Construct

                Possible Modifications of the Construct

                Test Specifications

Evaluation of LSAC’s LC Test – Item Illustrations

                Spoken Language vs. Written Language

                Anticipating the Direction of Discourse

                Stimulus Structure and Points Tested

                Pragmatics

Review of the Statistical Analyses of LSAC’s LC Test

                Soundness

                Construct Validity

                Effects on Examinee Sub-Groups

Evaluative Questions About LSAC’s LC Test

What would a listening comprehension assessment add to the LSAT?

                Interpretive Listening

                Sight vs. Sound

                Emotion and Academics

What makes a listening comprehension test item easy or difficult?

Is there a difference between listening skills and reading skills for our test-taking population?

Is there a way to test listening skills that does not correlate highly with a test of reading skills?

How should requests for test accommodations be handled?

Are there reasons to prefer either a paper-and-pencil or computer-based format for the delivery of a listening comprehension test?

                Innovative Item Types

                Comparability

                Other CBT Considerations

Next Steps

                Test Development Recommendations

                Research Study Recommendations

Conclusion

References

 


Review and Evaluation of the Development of a

Listening Comprehension (LC) Section of the LSAT

 

Executive Summary

 

Listening is one of the fundamental sources of information, and is the most frequent of all the communicative acts – writing, speaking, reading, and listening.   This fact was established in Paul Rankin’s (1929) classic study, and Klemmer and Snyder (1972) provided corroboratory evidence by surreptitiously videotaping engineers in their offices.   The ability to listen, to understand what you have heard, and to reason from your understanding is also likely to be more critical to performance in law school than in many other disciplines. Hence, LSAC is in the process of developing a test that measures listening ability, which could be included in the LSAT.  The current LSAT contains four sections that contribute to an examinee’s score: one reading comprehension (RC) section, one analytical reasoning (AR) section, and two logical reasoning (LR) sections.  The project to develop a listening comprehension (LC) test began several years ago.  It has proceeded in several concurrent steps, including Ken Olson’s (2003) paper, LSAT Listening Assessment: Theoretical Background and Specifications, a Skills Analysis Survey, and both in-house and field testing of LC items.  To help LSAC with the development of the test, the Council convened a panel to advise them, and here we summarize the panel’s report.

 

The researchers’ aim was to develop, not a test of language proficiency, but a test of listening ability.  It was impossible for them to work in a logical sequence of steps – the synthesis of the background theory, the formulation of the construct to be measured, the writing of test items, and the testing of items in the field.  Instead, they had to work on these different tasks concurrently.  In our report, we make radical suggestions and we are as critical as possible of the current LC test – we have applied criteria that are far more stringent than those normally applied in the day-to-day creation of test items.  Nonetheless, we believe the test LSAC has produced is quite innovative and in many ways of very high quality.

 

In our report, we discuss five other existing tests of listening comprehension and summarize the theoretical background on listening itself.  We review and evaluate the development of LSAC’s LC test, criticizing in detail some of the test items.  We assess the statistical analyses conducted on the LC field test data, and we suggest some additional tests.     

 

We consider the results obtained thus far in terms of a series of evaluative questions, suggested in our remit, and we summarize our answers here:

 

1) What would a listening comprehension assessment add to the LSAT?

 

An assessment of listening comprehension would be useful because ability in listening is likely to be important in law school and may not correlate with other abilities currently assessed in the LSAT.  However, there appear to be different sorts of listening, and not all of them may add information to the LSAT, e.g., the ability to listen to lectures.

 

2) What makes a listening comprehension test item easy or difficult?

 

This question calls for research.  Previous research on “logical reasoning” items, however, suggests that all aspects of an item contribute to its difficulty. One factor that certainly made some LC items difficult was that they failed to make clear what the listener’s purpose should be in listening to the stimulus.  The question, however, is a little premature.  LSAC’s primary goals should be to define different sorts of listening more precisely, and to get a better fit between these skills and test items.  

 

3) Is there a difference between listening skills and reading skills for our test-taking population?

 

No one knows.  Anecdotal observation suggests that some people who are good conversationalists, and probably good short-term listeners, are not necessarily good readers, and vice versa.  When LSAC researchers started their work, a plausible hypothesis was that reasoning from what you have heard differs from reasoning from what you have read.  They have made the major discovery that there is no such difference (at least of a sort that can be detected in the LSAT).  That is a striking result, even though it undermines the practical value of their current LC test.

 

4) Is there a way to test listening skills that does not correlate highly with a test of reading skills?

 

We suspect that tests less focused on reasoning of the sort that the current LC test assesses may yield differences in listening and reading skills, and that these differences may matter in law school.  We reiterate that it may be helpful to give up the tacit assumption that listening is a single unitary skill.  We suggest a variety of possible topics for test items to examine, ranging from the awareness of a shift in topic to the role of intonation in disambiguation.

 

5) How should requests for test accommodations be handled?

 

We support waiving the LC for deaf examinees at least in the short term.  The use of amplification for hearing-impaired candidates may be helpful, but is probably of limited value.  For examinees with auditory processing disorders, the accommodation of additional time seems reasonable, but may call for research into its validity.

 

6) Are there reasons to prefer either a paper-and-pencil format or a computer-based format for the delivery of a listening comprehension test?

 

LSAC could administer the LC test in a paper-and-pencil format using a CD to deliver the audio files, i.e., using the same procedure as the field testing.  Computer-based testing, however, has several advantages from the management of sound files to a much greater flexibility in allowing innovative sorts of test items.  

 

We conclude our report by providing a set of recommendations, including a series of additional research activities.  Our overall judgment is that the present LC test that LSAC researchers have developed is viable.  That is, with revision it could be incorporated within the LSAT.  But we have questions about the construct being measured, and, as the developers themselves have shown, the results correlate too highly with both the RC and LR tests.  Hence, given the cost of creating the LC items, which call for audio recordings of monologues and dialogues and for their presentation in the test itself, it would not be sensible to include the current LC test in the LSAT.  Comparable information about the candidates is already provided by the RC and LR tests.

 

The developers have made a promising start. We believe that their LC test does not yet meet all the theoretical and practical desiderata, but we believe that it is both worthwhile and feasible to continue the project of developing such a test.

 

 


Review and Evaluation of the Development of a

Listening Comprehension (LC) Section of the LSAT

 

Introduction

 

The ability to listen, to understand what you have heard, and to reason from your understanding is probably more crucial to performance in law school than to performance in many other disciplines. In addition, listening is probably the most important of the major communicative skills (reading, writing, speaking, and listening).

 

Even though today most of us are heavy consumers of media, listening is still one of the most fundamental sources of information.  It is certainly the most used. In one of the first studies of its type, Paul Rankin (1929) asked persons to report how much of their communicative activity was devoted to differing communication types.  His respondents reported that they listened 45% of the time, spoke 30% of the time, read 16% of the time, and wrote 9% of the time.  In a work situation without media present, Klemmer and Snyder (1972) studied the communicative activity of technical persons by surreptitiously videotaping engineering offices.  They found that 68% of the day was spent in communicative activity of some kind, and of that, 62% was “face to face.”  Klemmer and Snyder did not distinguish between listening and speaking, but it seems safe to say that at least half of the “face to face” time would have been spent in listening.
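As a rough back-of-the-envelope illustration (our own calculation, not a figure reported by Klemmer and Snyder), combining their two percentages with this conservative assumption gives

0.68 (share of the day spent communicating) × 0.62 (face-to-face share) × 0.5 (listening share) ≈ 0.21,

that is, roughly a fifth of the working day spent listening.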

 

LSAC is in the process of developing test items that measure this ability and that might be included in the LSAT.   The current LSAT contains four sections that contribute to an examinee’s score: one reading comprehension (RC) section, one analytical reasoning (AR) section, and two logical reasoning (LR) sections.  The project to develop a listening comprehension (LC) test began several years ago.  It has proceeded in several concurrent steps, including Ken Olson’s (2003) paper LSAT Listening Assessment: Theoretical Background and Specifications, a Skills Analysis Survey, and both in-house and field testing of LC items.  To help LSAC with the development of the test, the Council convened a panel to advise them, and what follows is the panel’s report.  The panel’s remit was outlined in two places: Exhibit 1 of the consulting agreement and on pp. 10-11 of the document, Update on new item type research and development, dated 3-18-04.  We have used these two accounts to organize this report.

 

Previous Tests of Listening Comprehension

 

There are at least five previous tests of listening comprehension.  We briefly describe each of them below.

 

1. The BCC Listening Test (1995) is a revision of the Brown-Carlsen Listening Comprehension Test (1955).  It is normed for adults of all ages and college students, and tests the following sorts of abilities: recall of items, recognition of word meanings, following instructions, and comprehension of lectures including making inferences from them. The test takes about 45 minutes to administer using an audio tape, and contains 76 items.

 

         The “BCC” test differs from the original Brown-Carlsen instrument only in that material is presented on an audio tape rather than being read by the examiner.  All of the normative data are based on the earlier version of the test.  Both of these tests and the research following from them seemed to demonstrate that individuals do indeed vary in their ability to retain information from spoken messages, and that instructional efforts to improve this ability were often successful.  This research assumed that listening had occurred if a person retained information from spoken discourse, that a person was a better listener if he or she had a higher score on a test of retention, and that listening was a unique skill that was not related to other cognitive skills.  

 

2. The Sequential Tests of Educational Progress: Listening Comprehension (STEP) test was published in 1957 by ETS.  This test was designed for grade school students, and for college freshmen and sophomores.  It tests the following abilities: comprehension of main ideas, memory for sequences, recognition of word meanings, grasp of implications and relations between ideas, detection of inconsistencies, and more pragmatic notions such as the rhetorical effectiveness of a speaker.  The examiner reads each passage and the multiple-choice questions, and the examinees write down their responses on separate answer sheets.

 

The STEP test forms the basis of subsequent listening tests that are part of the National Teacher Examination, which is used as an instrument to assist in teacher certification.  This test originally consisted of four parts: knowledge of subject matter, knowledge of educational principles, writing, and listening.  ETS now calls the sequence the Praxis examination and the listening portion is optional in some states.  Because these tests are much the same as the STEP test, we consider them to be one measure.

 

         3.  The Kentucky Comprehensive Listening Test (KCLT), from the Kentucky Listening Research Center (1985), is based on the simple linear model of memory proposed by Loftus and Loftus (1976), which postulates that listening depends on three memory mechanisms: a sensory register that briefly holds information from the senses, a working memory for maintaining information in the short term, and a long-term memory.  These are assumed to be the primary mechanisms of retention in short-term listening (STL), short-term listening with rehearsal (STL-R), and long-term or so-called “lecture” listening.  The Loftus and Loftus model assumes that if individuals do well in the first two sorts of listening, then they are likely to do well in the third sort.  This assumption turns out to be false, however.  Short-term listening, as research has shown, is different from the other sorts of listening.

 

         Norms for the test were based on an initial sample of 10,000 college students and adults.  Subsequent results established differences among several adult groups, including students at the US Army War College (regular army colonels) and trainees at the Connecticut General Insurance Company. The test is currently in use at the University of Maryland’s “extended learning” program as a part of a basic course in listening.  Their experience has indicated that the test’s scales are well-adapted to a computer-based presentation.  

 

The KCLT contains a crude measure of interpretive listening in which a dialogue with emotive content is evaluated by respondents.  Much more remains to be done in the area of affective listening, however (Thomas and Levine, 1994).  The most serious fault with the KCLT is that it attempts to do too much in a short administration time (45 minutes).

 

The most pertinent finding from the KCLT for present purposes is that it too has shown correlations between listening and reading.  These correlations occurred between lecture listening and ACT social studies scores, a test that is a good indicator of reading skills.

 

4. The Watson-Barker Listening Test (1984) is intended for college students and business professionals.  It is designed to assess five abilities, including grasp of content, listening to dialogues, listening to lectures, and listening for emotional connotations.  The test has multiple-choice items based on spoken or written stimuli.  Watson and Barker designed this test as a demonstrative assessment technique, primarily for use in teaching seminars for organizations.  They cite the Kentucky test as validation for the inclusion of dialogic and interpretive listening items.

 

5. The Steinbrecher-Willmington Listening Test (1993) was devised by the eponymous members of the Department of Communication, University of Wisconsin-Oshkosh.  The test, which is for university students of communication, takes about 45 minutes to administer.  It consists of 55 items presented on video tape, and the examinees respond on answer sheets.  It is designed to test critical listening skills, and the ability to detect empathic cues.  The items include speeches, conversations, and directions.

 

The Test Development Documents and Materials

 

LSAC made available to us various documents and materials, including CDs of the tests.   We here outline the main documents, and in subsequent sections we review and evaluate them.  Our sequential description of these materials documenting the development of the LC test should not obscure the fact that the theory, item types, and items had to be formulated concurrently.

 

The theoretical development of the LC exam section is primarily given in LSAT Listening Assessment: Theoretical Background and Specifications (Olson, 2003).  This paper provides background information on the general nature of listening; models of listening; listening as compared to reading; listening in the academic context; and listening in law schools.  The background material is then used as a framework for an initial set of LC test specifications and item types.

 

LSAC also conducted a Skills Analysis Survey in which law school students and faculty were asked about the importance of certain tasks in law school courses.  Analysis of the survey results showed that listening skills were generally rated as important or highly important, placing them in the second highest of four categories of importance for success in law school courses.

 

Issues of diversity and accommodation, as they pertain to listening and to the LSAT, are addressed in two other documents.  The first of these documents is A Report on the Inclusion of Diverse Speakers on LSAT Listening Comprehension Assessment: A Survey of the Literature with Recommendations (Strassel, 2003).  The second document is Accommodations for Listening Comprehension.

 

LSAC Listening Comprehension Reviewer Guide (2002) was used at both the item writing and the item reviewing stages of development.  (Item reviewers were also expected to look for conformity to the theory in LSAT Listening Assessment: Theoretical Background and Specifications as well as other quality criteria typically used at LSAC).  The Reviewer Guide includes general guidelines related to the LC items along with specific guidelines for advance organizers, stimuli, and items. 

 

Further background discussion of the LC development is provided in LSAC Listening Comprehension Item Type Development and Field Testing, 1998 – 2003 (2004) along with a description of the actual item types developed and field tested.

 

Theoretical Background on Listening

 

The main theoretical rationale for the LC test is stated in Olson’s paper (2003), in which he analyzed the literature on listening comprehension.  His paper emphasizes the role of dialogue in legal education in contrast to passive listening to lectures.  LSAC’s Skills Analysis Survey of 41 law schools corroborated this role.  Olson’s paper makes the central point that listening occurs in “real time”.  Unlike readers, listeners cannot go back to check a point that they may have missed or forgotten.  The paper also outlines an informal list of what a listening comprehension test should measure, and considers various formats for tests.  It proposes three main sorts of test item: those that measure understanding and recall of content; those that call for examinees to have constructed a mental model of the discourse (in the sense of Johnson-Laird, 1983) and thus to be able to make inferences from what they have heard; and those that measure the understanding of context, such as the ability to make inferences about the speaker.   Olson’s analysis is scholarly and technically accomplished, and we add only a few points and amplify others.  We could consider listening in general, but we will focus on listening in legal education and as measurable by conventional testing techniques.

 

Advantages of Listening

 

Listening as opposed to reading has advantages. The ability to listen and to understand your native language appears to unfold largely under the control of a genetic program.    You do not need to be taught how to understand your spoken language.  Speech tends to use words that are more frequent in usage than those in writing, and to use simpler grammatical constructions than those in writing.  And intonation contour, which depends chiefly on the pitch of the fundamental frequency at which a speaker’s vocal cords vibrate and on timing, provides valuable cues to the syntactic analysis of sentences.  For example, the following written assertion is ambiguous in its syntax, and hence its interpretation:

                        After the crash the bus landed with its front two feet up in the air.

 

Readers may parse it to mean that buses have front feet, but its spoken intonation contour makes clear that it was the bus’s front that was two feet up in the air.  Intonation readily disambiguates such phrases, but poor listeners may be less likely to make use of these cues. 

 

Spontaneous speech conveys a speaker’s attitude and emotional state of mind in a way that writing does not.  Ekman (1998) makes a convincing argument that six facial expressions are universal across all cultures: happiness, disgust, surprise, sadness, anger, and fear.  This universality of human facial expressions leads us to believe that specific vocal elements might also be universal.  The speaker’s emotion is all that survives if the speech wave-form is cut into small segments and then played back in a random order.

 

Disadvantages of Listening

 

Yet speech also has disadvantages compared with writing as a mode of communication.  Everyone speaks their native tongue with an accent, and uses a particular vocabulary.  These dialects vary more in speech than in writing.  The acoustic conditions of listening probably vary more than the conditions of reading.  Likewise, defects in hearing probably go uncorrected more often than defects in vision (compare the frequency with which individuals wear spectacles and hearing aids).  As Olson (2003) emphasizes, if you fail to hear something correctly, misparse it, or retrieve the wrong meaning of a word, then you may have no opportunity to go back to listen again or to question the speaker.  The “surface form” of speech is normally forgotten in a matter of seconds.  Moreover, intonation contour is a double-edged sword.  It controls which segments of speech listeners are likely to focus on, just as the speed of comprehension is dictated by the speed of speech.

 

Working Memory

 

            Although the two terms are almost interchangeable, “working memory” has tended to supersede “short-term memory” in the literature.  This is a result of the development of theories of how we hold information in memory for the short term in an independent system now known as “working memory” (see, e.g., Baddeley, 1981, 1986, 1996).   A major source of differences from one individual to another is now known to be the processing capacity of their working memories.  And a major role for working memory is in the comprehension of language, e.g., in the parsing of sentences, and in establishing co-reference between different noun phrases.  These differences are likely to be more apparent in understanding speech (and in generating it) than in reading, because of the “real time” nature of speech.   Hence, embedded structures such as restrictive relative clauses impose a load on working memory, e.g., “The man the dog that Mary owned bit … died from rabies” (see, e.g., Lewis, 1999).  Likewise, structures with gaps in them call for listeners to recall the surface structure of previous clauses, e.g., “The dogs were bitten by fleas. The cats were too.”  Individuals differ in their ability to cope with such sentences (see, e.g., Garnham, 2001), but it is probably impossible to measure their performance in the LSAT.  A notation, such as writing, can act as a substitute for working memory.

 

Listening and Reading

 

            Listening and reading are likely to be on a par in making inferences about the content of the discourse.  Once individuals have understood discourse, they are likely to represent it in the same way whether they heard it or read it.  There is no reason to suppose that the original medium matters.  The only evidence to the contrary is that bilinguals are affected by the language in which arithmetical problems are couched (Spelke and Tsivkin, 2001).    Hence, with the benefit of hindsight, the focus of the LC tests on inferential tasks is likely to yield correlations with the tests of logical reasoning from written descriptions.

 

Dialogic Listening

 

To participate in dialogue is different from passive listening.   Dialogue calls for you to understand what is said to you, and at the same time to formulate a response.  This dual task imposes a load on working memory, and so the process can be demanding, especially in a legal setting.  Considerable differences in ability to conduct cogent dialogues exist from one person to another.  There are many examples of speakers who have gone wrong because they lack the capacity to process the implications of a person’s answers whilst at the same time formulating their own responses.

 

Dialogic listening is probably more dependent on working memory than is “lecture,” or passive listening.  Evidence for this difference is found in research using scales that assess memory in the short-term rather than in the long-term.  This research has shown that short-term listening scales are better than long-term ones for discriminating between good and poor bank managers (Alexander, Penley and Jernigan, 1992).  In a comprehensive investigation in a large insurance company, good short-term listening proved to be an excellent predictor of upward mobility (Sypher, Bostrom, and Seibert, 1989).  Those who are good short-term listeners excel in making oral presentations, whereas good lecture listeners do not (Spitzberg and Hurt, 1983).  Good short-term listeners ask more questions in interviews than poor short-term listeners (Bussey, 1991). And short-term listening is a fundamental skill.   As a study using accelerated speech showed, it is qualitatively different from long-term listening (King and Behnke, 1989).

 

"Lecture" Listening

 

When researchers first began to measure listening skills, they operated in an academic setting in which information is transmitted through reading textbooks and listening to lectures.  So it is not surprising that the first listening tests focused on lectures.   The Brown-Carlsen and the STEP tests do test other abilities, such as knowledge of vocabulary, but they are essentially tests of the ability to listen to an oral presentation and to answer questions about it.  Researchers assumed that listening is a unique and measurable characteristic.  This assumption was sharply attacked in the mid-1960s by Charles Kelly.  He argued that if listening tests measured a single ability, then the Brown‑Carlsen and the STEP tests should be more highly correlated with one another than either of them should be with other measures of cognitive ability.   Yet, the tests of listening were not highly correlated with one another, but were highly correlated with tests of intelligence, which Gardner (1983) would call "verbal processing".  It followed, as Kelly argued, that what had previously been labeled “listening ability” was an aspect of intelligence, and that the Brown‑Carlsen and the STEP tests were different kinds of intelligence tests (Kelly, 1965, 1967).  In other words, the tests differed in form and measured different aspects of the same underlying ability.  This argument may also account for the correlation between the LC scores and the other tests in the LSAT.  Listening, and especially reasoning from spoken materials, may not be a separate ability from reading, and especially reasoning from written materials.

 

         Like the Brown-Carlsen and the STEP tests, the LC test seems to have focused on listening in an academic setting, and on retention and reasoning.  Retention of detail and supporting material is important, but inferences about the speaker’s purpose and overall plans are also crucial, and the LSAT has included these elements.  The LC test allows individuals to take notes, but the role of notetaking is problematic.  Some authorities hold that notetaking decreases retention because the listener relies on the availability of notes as a substitute for memory.  Others contend that notetaking is an active response and is therefore likely to improve retention.  We could find only one study in which notetakers were compared to non-notetakers, and this study found no differences between the two (Waldhart and Bostrom, 1981).

 

The evidence demonstrates that there are different sorts of listening.  Listening to a person in everyday conversation differs both in underlying processes and in its results from listening to a lecture (Bostrom, 1990; Bostrom and Waldhart, 1980).  Listening is therefore not a unitary skill.  Similarly, reading may not be a unitary skill.  To read a P.D. James detective story may engage different processes from those in reading The Economist.   There is some evidence that reading and writing skills are not closely associated (Bracewell et al., 1982).  The lack of a strong association leads us to expect a similar dissociation between listening and speaking.

 

In retrospect, the results of the Brown-Carlsen test, the STEP test, and the current LC test suggest that listening, reading, and intelligence share common components, and that a test of reasoning can predict reading and listening abilities.  A closer examination of the data yields a more refined conception.  Although tests of listening, reading, and intelligence (which, following Gardner, 1983, we define here as "verbal processing") have elements in common, these abilities do differ.  They depend on different underlying processes, and, as we mentioned earlier, listening itself calls for different skills depending on whether the listener is engaged in dialogue or listening to a lecture.

 

Memory

 

         LSAC’s background theory allows for the existence of working memory and long-term memory.   Within long-term memory, however, there appears to be a distinction between “semantic” memory, which represents general and linguistic knowledge, and “episodic” memory, which represents specific autobiographical events (Squire, 1986).  The two systems interact (Chang, 1986), but semantic memory is grounded in general and probabilistic information, and is a component of long-term memory (Baddeley and Dale, 1968; Kintsch and Buschke, 1969; Squire, 1986).  Various proposals have been made about the structure of semantic memory (see, e.g., Collins and Quillian, 1972; Kintsch, 1980; McCloskey, 1980; Johnson-Laird et al., 1984).  Listening to lectures chiefly concerns laying down new elements in semantic memory, whereas listening to a conversational partner chiefly concerns laying down new experiences in episodic memory.  This distinction may underlie some of the differences in test results (see the previous subsection).

 

The linear view of memory (see Loftus and Loftus, 1976) postulates that the sensory register holds the speech wave-form long enough for the construction of a phonological representation and even perhaps for some morphological processing.  Its output passes to working memory for syntactic and semantic processing (see, e.g., Schulman, 1972), and can be rehearsed by passing through the so-called “phonological loop” (see, e.g., Baddeley, 1986).   Certain contents can be represented in long-term memory, either semantic or episodic memory.  It follows from this account that a failure to lay down a proper record in the sensory register has adverse consequences for syntactic and semantic processing, and similarly a failure in syntactic and semantic processing has adverse consequences for long-term memory.  It also follows that listening can be classified into at least three categories: short-term listening, short-term listening with rehearsal, and long-term listening (so-called “lecture” listening).  Studies have indeed discriminated amongst these three sorts of listening (Bostrom and Waldhart, 1980), and shown that short-term listening seems to have little relation to the cognitive abilities measured in intelligence tests.  We consider the implications of these results in a subsequent section of this report.

 

Review of the Field Tests of LSAC’s LC Test

 

In this section we review the series of field tests conducted by LSAC on the LC test.  This work occurred in a progression of phases and is described in several additional LSAC documents.  These documents are also discussed below.

 

Preliminary research into the LC exam was conducted using small samples and a prototype CBT interface.  This work was followed by three stages of paper-and-pencil testing, conducted using full-length LC sections.  LSAC Listening Comprehension Item Type Development and Field Testing, 1998 – 2003 (2004) describes the actual item types developed and field tested.  (LSAT Listening Assessment: Theoretical Background and Specifications, 2003, provides a set of proposed item types.)  This document also includes detailed, practical discussion of matters related to item content (including difficulty), issues relating to the recording of audio prompts, and logistical elements of the delivery formats.  Furthermore, the progression of item type field testing across multiple phases is described.

 

In 2000, LC items were usability tested in two different CBT formats.  The results of this research are presented in Development and Testing of an Innovative Listening Comprehension (LC) Interface (Swygert and Contreras, in press).  For both CBT formats the prompt, or selection, was presented as a spoken sound file.  In one of the formats the item stem was additionally provided as a sound file, but in the other format the stem was instead printed on the screen.  As expressed in the report, the underlying approach to screen design and the methods used for conducting usability tests both followed good design principles from the field of Human-Computer Interaction (HCI) research (e.g., Gould, Boies, and Ukelson, 1997; Landauer, 1995; Nielsen, 1994; Tullis, 1997).  In addition to the overall goal of estimating the usability of an LC interface, this research was specifically concerned with determining whether one of the two prototypes was clearly better than the other.  Evidence strongly indicated that use of a printed stem, rather than a spoken stem, was preferable.  Further LC development thus followed this approach.  The paper concludes with a note that additional usability testing is needed to assess the effectiveness of the recommended changes once they have been made.

 

The first of the paper-and-pencil field tests (termed Phase 0) used in-house LSAC volunteers as examinees and reactors.  While preliminary timing and item response data were collected, a primary emphasis of this stage was the collection of detailed reactions and feedback from the examinees on many aspects of the LC section.

 

The next paper-and-pencil field test (Phase 1) was conducted in April and November of 2002.  The examinees in this case were primarily advanced undergraduates.  At this stage moderate-sized groups (e.g., 20 examinees) were used and some preliminary statistics were computed.  All four LC sections (A – D) were field tested.  A questionnaire was used to collect examinees’ reactions to aspects of the LC exam, rather than the direct feedback comments elicited in Phase 0.  The earlier, small-sample approach tended to yield item-level comments while the questionnaire produced general reactions.

 

The final stage of field testing utilized large samples of examinees who were similar to actual LSAT examinees.   No qualitative feedback was obtained in this stage, but an extensive set of statistical analyses was conducted on these data and reported in a variety of documents.  The details about this field test phase are provided in Potential New LSAT Item Types: Next Steps (2003).  The LC sections were incorporated into a larger innovative items field test endeavor (other item types investigated included Innovative Analytical Reasoning and Comparative Reading).  Two of the four test forms developed included LC sections.  Form 3 included sections A and B and was administered to 1,433 examinees across two administrations.  Form 4 included C and D and was administered to 2,174 examinees.  Some of the statistical analyses were conducted by LSAC and some by their contractor ACT.  Some were conducted on the full set of examinees, while others were conducted on a more demographically representative sub-set of examinees.  In one report (Potential New LSAT Item Types: Next Steps, 2003) the analyses were conducted on a single test form (Form 3) and a single test administration date (October, 2002).  This statistical analysis is evaluated in a later section.

 

The succession of data collection phases followed by LSAC in pilot testing and field testing the LC items was overall quite sound.  With each later phase the design of the data collection moved from smaller samples and more qualitative feedback to increasingly larger samples and more quantitative information.  In addition, the lessons learned at each phase were incorporated into the LC items, producing a series of iterative improvements.   However, the methods used to increase the difficulty of Test 4 were not fully documented, and may not have been entirely systematic. 

 

Evaluation of LSAC’s LC Test – Construct Definition and Test Specifications

 

In this section, we evaluate four field test forms of the LC test.  We discuss the definition of the construct, the explicitness of test specifications, the appropriateness of some of the elements of the theory to testing situations, and the approach to operationalizing those elements.  We offer suggestions about how the test construct might be modified, about how the specifications might be revised, and about how item development might be more systematically controlled so as to be more in line with the theory of listening adopted.  Our comments are informed by hindsight, and in no way impugn the research that has been carried out, which is highly competent by any standards.

 

Definition of the Construct

 

Test developers investigated definitions of the listening construct and did not find any in the literature that were satisfactory for their purposes. They thus approached the task of defining the construct by means of describing a range of sub-skills that listening seems to entail. The approach is described in Olson (2003, pp. 4-5): “the search for a general definition of listening is unlikely to prove fruitful. For our purposes, it will be more valuable to determine…what factors ought to be included in a hypothetical construct called ‘listening in the context of law school’.” For the purposes of assessment, “all that is needed is some way of identifying the paradigm cases of listening, or of the particular type of listening one is interested in assessing.”

 

The factors ultimately included were based primarily on three sources: a list of nine listening abilities identified by Powers (1986) as very important to academic success (referred to in Olson, 2003); a list of four skills rated as moderately to highly important to success in law school, based on the Skills Analysis Survey conducted by LSAC; and observations of law school classes by LSAC staff.

 

The nine important listening abilities identified by Powers (1986) were:

  1. Identifying major themes or ideas
  2. Identifying relationships among major ideas
  3. Identifying the topic of a lecture
  4. Retaining information through notetaking
  5. Retrieving information from notes
  6. Inferring information from notes
  7. Comprehending key vocabulary
  8. Following the spoken mode of lectures
  9. Identifying supporting ideas and examples

The four listening tasks identified as important in the law school context were:

  1. Identifying key points of lectures and class discussions
  2. Distinguishing precisely what a person has said and not said
  3. Identifying what is implicit in what a person has said
  4. Raising important questions and arguments in response to what others have said.

These abilities and tasks, supplemented by observations of law school classes, served as the basis for the 12 item types used in LSAC’s LC measure, each associated with a specific listening sub-skill. The item types were divided into three main categories:

 

I – Understanding Content

1.      Recalling information
Skill tested: distinguishing precisely what a person has said or has not said

2.      Identifying main point of a discourse
Skill tested: distinguishing main point from subordinate points

3.      Identifying argumentative or rhetorical structure of a discourse
Skill tested: identifying relationships among major ideas

4.      Identifying points of agreement or disagreement (dialogue)
Skill tested: basic understanding of a dialogue

5.      Identifying a turning point
Skill tested: same as in (4)

II – Understanding Implications

6.      Drawing inferences from facts presented
Skill tested: identifying what is implicit in what a person has said

7.      Extending content
Skill tested: finding alternative models satisfying a proposition

8.      Evaluating arguments
Skill tested: recognizing strengths and weaknesses in an argument

III – Understanding Context

9.      Drawing inferences about speaker
Skill tested: understanding speaker attitude, speaker purpose, and speaker beliefs

10.   Identifying an underlying dynamic (dialogue)
Skill tested: identifying similarities and differences between speakers

11.  Replying to a question posed by a speaker
Skill tested: identifying appropriate replies to a question

12.  Identifying an appropriate response to a speaker
Skill tested: same as in (11)

 

This list constitutes the construct being measured and specifies the item types on the LC test. The three broad categories identify important elements of listening and reasoning, and the final list has much to recommend it. Yet, the list and its connection with the field test items raise a number of issues that might have ramifications for future revisions to the test and test specifications.

 

First, although the sub-skills associated with the item types may well be crucial to success in law school, only some of them were identified as important specifically for listening in law school; others were identified as important in academic contexts in general. This point should be acknowledged in future documentation.

 

A related point is that the abilities identified by Powers (1986) (as Olson (2003) describes it) and in LSAC’s Skills Analysis Survey were characterized as lecture-related or course-related. Yet, the majority of field test items are not presented as portions of lectures, and many do not have qualities of authentic academic lectures, much less of lectures one might encounter in law school. Thus, to claim that the field test items provide evidence about lecture-related skills is to make certain assumptions. For example, it seems to assume that inferencing is the same whether the inference is about a relation between ideas in a lecture or about a relation between ideas in a casual conversation, and that even if different kinds of real world knowledge are relevant to the two situations, such differences do not make the tasks significantly different. We believe such assumptions should be made explicit in the documentation.

 

Second, the list of 12 item types omits explicit reference to other skills described in the documentation as central to listening. Olson hypothesizes that one of the skills identified by Powers (1986), “following the spoken mode of lectures,” actually “means keeping up with the lecture ‘in real time’ and not getting lost.” This skill “is arguably what distinguishes someone who is good at listening to extended spoken discourse from someone who is not.” It is also apparent that the notion of keeping up plays a crucial role in the field test items, yet it does not seem to be reflected in the 12 item types/construct definition. It may be that it is viewed as a superordinate skill that is presupposed by the various item types, but if so, that needs to be made explicit.

 

Since “keeping up” is so closely related to memory, memory is also a significant ability being tested, though that is not reflected by the list of item types. (“Recalling information” is associated on the list only with the ability to distinguish what a person has said and has not said, but memory plays a part in a far wider range of field test items than this.) In fact, the main skill tested by some items appears to be memory for details and very specific propositional relationships (taking notes was nearly impossible given the type of information being presented and the speed with which it was presented). Although memory is certainly a crucial element of listening, LSAC’s own statements about the construct suggest that they do not wish memory to be the dominant ability being tested. We wish the documentation made it clearer what the assumed relationship is between “keeping up” and memory, and what the standard is for keeping up. We return to this below.

 

Other listening abilities not referred to in the list include the understanding of the organization of information, the understanding of metalinguistic questions and comments, the understanding of irony and sarcasm, and the recognition of topic shift.  Yet, these are generally agreed to be features of speech that listeners need to be aware of in academic and general listening situations. Moreover, a number of these features play a more significant role in listening than in reading, so increasing these features in the LC measure could potentially affect correlations between the LC and RC measures.

 

(We recognize that Olson (2003) claims that item type 3, “Identifying the argumentative or rhetorical structure of a discourse,” “corresponds to ‘identifying relationships among major ideas.’” But identifying relationships is only one element among many of the ability to understand the organization of information.)

 

Third, the presentation of three categories (Understanding Content, Understanding Implications and Understanding Context) suggests that each category is to inform the same construct, namely, listening ability.  In our view, however, the field test items seem to fall into two distinct categories: those that test listening ability and those that test reasoning on the basis of an aural stimulus.

 

Of the 60 items in Field Test 3, approximately one third of the items focus on reasoning, even excluding items that ask what can be inferred. That is, about one third of the items in Field Test 3 ask how a position or argument can be strengthened or weakened, what assumption is being made, or what evidence would support a position, etc. (See in particular items 3.1.6 (Field Test 3, section 1, item 6), 7, 10-12, 14, 16-18, 25-27 and 3.4.10, 12-14, 23, 25-27, 29-30.  A smaller proportion of items in Field Test 4 are of this type.) Yet, weakening an argument and strengthening an argument appear nowhere on Powers’ list of listening abilities or in the Skills Analysis Survey (these items are very different from raising questions or giving supporting examples).

 

LSAC motivates the inclusion of such items on the basis of observation of law school classes.  It is claimed that “evaluation of oral arguments is ubiquitous in legal settings. It is a fundamental task of judges and juries... A number of situations in law school are calculated to develop this skill...” (Olson, 2003, p. 20).  Yet, we wonder whether dedicating close to one third of the LC field tests to such items is appropriate, since they tap only one of 12 skills.  More important, though, we wonder to what extent the field test items designed to test this ability can be said to test listening.

 

It is our feeling that the items that fall into category II (Understanding Implications) provide evidence about reasoning ability, but it is hard to see what inferences could be made about an examinee’s listening ability on the basis of such items.  While one can make inferences about listening ability on the basis of a correct answer to an item that asks which of five written options most weakens an argument that was heard, it is not clear what can be inferred about the listening skills of an examinee from an incorrect answer.  An examinee might have understood the stimulus perfectly, but reasoned incorrectly.  Thus, even ignoring the potentially confounding effect of reading, which is especially noticeable in many of the LC items with their complex and lengthy options, the field tests seem to provide evidence for two separate constructs, not one. (As a reviewer of this report notes, this point applies equally well to RC, but it is possible the phenomenon could be mitigated somewhat in LC, as discussed below.)

 

Even some items that on the surface test listening seem on closer inspection to focus on reasoning and reading. For example, 3.1.12 asks listeners to identify a lack of clarity in what a speaker says, which is an appropriate task for a listening assessment. Yet, to key the item an examinee must not only remember the two specific alternatives mentioned (once, and not reinforced) but then process the logical relationships in the written option “Do you mean that you’ll need my support for your proposal even if you do get your supervisor’s approval first, or only if you don’t?” and apply these to what is remembered about the stimulus. Reasoning and reading ability are clearly crucial skills involved here, possibly more so than listening.

 

Possible Modifications of the Construct

 

The dual nature of the construct seems to us problematic, since it conflicts with our expectations about what the test would be. The authors of this report are not in perfect agreement about how best to deal with this issue, so in what follows we present a number of possible approaches LSAC might take.

 

Certainly the easiest remedy would be for LSAC to make no changes to their construct assumptions but only to their claims about what they are measuring.  Perhaps a test of listening and reasoning rather than just listening would be of value if correlations with reading could be lowered by taking some of the steps recommended below. 

 

LSAC could go further in the direction of reasoning: change the focus of the LC section from listening to aurally based argumentation, name the section something other than “Listening Comprehension”, and define the construct as “ability to reason on the basis of an aural stimulus”.

 

A different remedy would be to avoid testing reasoning at all in the listening section, aside perhaps from basic inferences about information and speaker intention. Yet, the idea of testing a prospective law school student’s ability to deal with oral argumentation seems consistent with the abilities valued in the Skills Analysis Survey. Thus another possibility is to develop more items that require reasoning skills but whose central focus and purpose is to provide information about listening ability. The difficulty of course would be in justifying claims that items test listening more so than reasoning.

 

Yet, a number of field test items seem successful in focusing on listening while also testing reasoning (though we would perhaps revise some of them for authenticity of language, structuring, etc.).  For example, the advance organizer to item 4.4.11 tells examinees that the speaker is drawing a comparison between coffee and wine, thus preparing them for the kind of information structure to expect. The stimulus presents the comparison and the item asks how the speaker’s reasoning is flawed. This is a reasoning  task that many listeners might have undertaken while they were listening to the stimulus, even without being asked by the item prompt – much of the reasoning takes place during listening and many test takers might have been able to key the item before reading any of the options.

 

Similarly, 4.4.20 seems a good candidate as an item type. The item itself was flagged and may not be a particularly good exemplar, but as a type it does the same sort of thing as 4.4.11 – as listeners process what they are hearing, many are likely to reach the same conclusion or summary statement that is reflected by the item key.  Item 4.1.21 might also be a good type. It is a conversation between two people about a lost briefcase. The item essentially tests whether listeners recognize what has been said and not said and what would be important to know that was not said.  

 

Obviously it is a subjective matter to judge the degree to which an item focuses on listening or reasoning. We would recommend, if it is decided to pursue this approach, that LSAC develop as explicit a set of guidelines as possible that describe the principles underlying items that test reasoning while remaining focused on listening. For example, one principle might be, “The reasoning needed to answer the question depends only on the stimulus information” (or, “the stimulus provides enough information for the examinee to arrive at the key without seeing the options”); that is, the reasoning relies only (or mainly, if only the stems are presented in writing) on what was heard.  Items such as 4.4.11, 4.4.20 and 4.4.21 seem to be of this kind.  While an incorrect answer to these items still allows the possibility that a test taker has reasoned incorrectly, the relevant reasoning takes place during listening – it is primarily what the test takers hear (and the prompt) that is relevant to reasoning.  In contrast, in items that (we claim) provide little evidence about listening, a test taker must read all of the options closely and integrate that information with what was heard and remembered. What test takers heard is important for the reasoning task, but what they read is just as important.

 

Items in which test takers are presented with two separate aural stimuli could provide another item type that would require listeners to reason while listening, provided the items force test takers to recognize some contrast in the information in the two stimuli. (The set of items 4.4.25-30 has something of this character, where two versions of an accident are described.  This is a very interesting item type, and although we might take issue with the particular execution of the items and with some of the points tested, the basic structure of the stimulus seems to us to present a very good opportunity to test listening comprehension.)

 

Such an approach, where information in the written options is not part of the reasoning, would not address the claim that reasoning based on what was heard is no different from reasoning based on what was read. Yet, it might allow more valid inferences about listening ability than many of the field test items seem to allow.

 

 

Test Specifications

 

The documentation describing what LSAC did is very thorough and reflects a great deal of thought and consideration. Almost all of the decisions that were made about the test design seem sensible, whether about delivery mode, timing, numbers of items, or number of times a stimulus would be heard.

 

It would have been helpful, however, to have seen the full classification of items. We would also recommend that in the future developers produce a test rubric precisely laying out in a single place the form of the test, details about the number of items of a particular type that may/must appear on a test, features of item types that are not variable, the specific nature of stimulus material, the nature of instructions for test takers, and so on. Bachman (1990) provides a description of such a test rubric for language tests for ESL/EFL. Although much of this kind of information is available in the documents we reviewed, particularly LSAT Listening Comprehension Item Type Development and Field Testing, several elements are not explicitly provided that we feel ought to be included in future versions.

 

For example, no explicit guidelines are given about the language used in stimuli, the structuring and complexity of stimuli, or what degree of information density was considered appropriate in stimuli. We discuss the issues involved in more detail below.

 

Analogous questions arise when looking at item difficulty. An effort was made to make items on Field Test 4 more challenging than earlier items, but the documentation does not make clear how item developers went about the task, or if the methods they used were theory based.

 

Evaluation of LSAC’s LC Test – Item Illustrations

 

We discuss specific items here as a way of illustrating some of our claims above about construct definition and test specifications, as well as our sense that a number of items do not fully reflect the theory underlying them.

 

While we spend considerable space below discussing and criticising aspects of the stimuli in LC items, perhaps to excess, we wish to acknowledge that aside from the issues we raise, most items are extremely well crafted and a vast majority of them performed well statistically. Criticism is easy in comparison to creation, and even well established tests are open to quite similar types of criticisms. We believe the test LSAC has produced is quite innovative and in many ways of very high quality. We view all of the issues below as relatively easy to deal with; they do not seem major impediments to the operationalization of an LC measure.

 

Spoken Language vs. Written Language

 

Olson (2003) discusses key differences between spoken language and written language, and many of the stimuli nicely incorporate some of the features thought to be characteristic of spoken language. It is clear that great efforts were made to make stimuli sound authentic. However, it was our perception that while many stimuli do have natural sounding speech (e.g., 3.1.1-6 and 4.1.25-30), some sound overly formal and lack the flavor of spoken language in diction and development. We believe that greater consistency could be achieved if test specifications include a fuller array of information about oral language.

 

LSAC tried to ensure that stimuli had natural sounding language (what we refer to here as “diction”, i.e., articulation, word and phrase choice, informality of tone, interjections, etc.) by requesting that item writers record item stimuli from notes rather than from fully scripted material. Thus many stimuli have, at least in part, the diction of authentic spoken language – idiomatic phrases, contractions, sentence fragments, etc.  Professional actors later rerecorded the stimuli for the administration of the field test items. The actors frequently delivered their lines in rather flat tones, however, which may have intensified the perception that some stimuli sound like written English being read aloud.

 

It seems fairly clear, however, that not all items were produced so naturally (see, for example, 4.1.20, which is simply a list of three rules). Moreover, it is not clear that when LSAC reviewed items produced by ACT they had specifications available to them about the kind of language that was appropriate in stimuli. This might have contributed to the written feel of some of the field test stimuli.

 

In other cases some attempt was clearly made, but (we assume) the need to formulate the stimulus precisely so as to fit an associated reasoning task resulted in an unnatural sounding stimulus. In 4.1.7, for example, the speaker says the following: “Our local vegetable market, Greens and Things, sells some vegetables that are organically grown, but I like to buy vegetables that are organically grown, fresh and not expensive. The problem is, Greens and Things doesn’t sell any vegetables that are all three” [italics added]. Examinees are then asked, “What can one conclude from this?” The answer is “If there are any fresh, organically grown vegetables sold in Greens and Things, they are expensive.”

 

Authenticity of speech involves more than diction. The stimulus in 4.1.7 above expresses propositions and combinations of propositions that few people would produce in casual conversation. An utterance such as “The organic vegetables at Greens and Things are so expensive” seems a more natural way of expressing the idea. The problem is that authentic spoken language does not easily lend itself to the kind of precise formulations needed to test reasoning, at least as reasoning has tended to be tested with reading passages. Perhaps the kind of reasoning being tested here ought to be reserved for longer monologues, where precision can be gained without sacrificing naturalness. See also 4.4.24.

 

Authenticity also involves elements of discourse structure: the density, sequencing and pace of information, and elaboration of points. We return to these below. Other features of authentic speech that might be increased in future versions include false starts, self-corrections, backtracking, repetition, and hesitation.

 

We believe the issue is important because it relates to the construct and possibly to correlations between listening and reading. If the language of the test items is very different from that used in a classroom, it might undermine the construct and face validity of the test. If the language and structure of the listening stimuli are closer to that of written texts than classroom speech, that might be partly responsible for the high correlations between the listening and reading items.

 

Anticipating the Direction of Discourse

 

It is stated in Olson (2003, p. 17) that a key feature of listening is the ability to anticipate “the direction in which a discourse is headed”.  We comment on two ways in which improvements might be made in helping listeners anticipate the direction in which a speaker is headed.

 

The first relates to advance organizers, statements that precede stimuli and provide listeners with contextual information about the speakers in a stimulus. We believe it was a very sound decision to use such advance organizers, and in some cases (e.g., 4.4.7) they did provide suitable information about the context and background of the speakers. Often, however, they provided too little information. In normal academic situations, listeners have at least some passing familiarity with the people they are listening to (or speaking with), and some idea of the nature of what is being discussed and the purpose for the discussion. Yet, in many of the items, the advance organizers underspecify the nature of what examinees will hear and the nature of the task they will be asked to do.

 

In a number of items it is not clear that we are going to listen to an argument (e.g., 3.1.7, 3.1.9, 3.1.25, 4.1.10, 4.4.8). Often, it is not clear until the prompt informs us that that is what we heard.  Yet, a listener’s purpose for listening might well determine what strategies and processes are used during listening. If a speaker’s purpose is not clear at the outset, it is arguable whether the item provides a fair tool for assessing listening proficiency or reasoning ability, since the listener might have implemented a different listening strategy than the one most likely to lead to success on items asking about the stimulus. Trying to remember specific facts and details (and trying to take notes on such) can be very different from listening for contrasting points in an argument.

 

This need for the listener to have a clear expectation about the direction of the discourse is particularly important for short, discrete items in which a word very early in the stimulus is critical for identifying speaker purpose and in keying the item. For example, hearing the word “ascribe” in 3.1.25 is crucial for keying the item. The word occurs at the very beginning of the stimulus and would be easy for a test taker to miss, making the item difficult to key correctly. A more informative advance organizer would alert examinees to pay closer attention to precise wording—and precise wording is often critical to arguments (again counter to the theoretical claim).

 

Another key in helping listeners anticipate where a speaker is headed is the use of rhetorical and discourse markers (which is connected to the presentation of information discussed above). Research has pointed to the importance of rhetorical organizers and discourse markers for following lectures. (See several papers in Flowerdew (1994). These consider listening from the perspective of the ESL/EFL student, but many of the findings apply to general listening situations.) Although the field test items focus on non-lecture type stimuli, many of the stimuli involve argumentation, which is a common information structure in classrooms. The use of rhetorical organizers and markers in a number of such stimuli seems insufficient, even though it was recognized in the documentation that anticipation is crucial.

 

Consider the stimulus to 3.1.13-18 in this vein. Listeners are told that the speaker is “giving a lecture about diet and nutrition to a general audience.” The advance organizer does not tell listeners to expect an argument, nor do they know what the context or purpose of the lecture is. The first sentence is “I don’t think there’s really any doubt anymore that a lot of noninfectious health problems are related to diet.” This would seem to be the thesis statement, and perhaps the point that will be elaborated upon and supported with evidence. The speaker continues, however, by describing recommendations that nutritionists make nowadays, followed by the claim that there is evidence that those recommendations may be the wrong strategy. At this point it would appear the lecturer is about to cite facts or studies. Instead, the lecturer launches into a description of the diet humans had 10,000 years ago, before farming. While ultimately it all ties together, listeners must hold off from assuming they know the structure of information and cannot have confidence that they know where things are headed. Nor are any indicators given of what parts are most important for listeners to take away with them. The passage would work as a reading passage since readers can backtrack and reread, but to us it is not an exemplar of a typical lecture.

 

Academic lectures are given within the context of a semester-long course, so students know, if only in a broad sense, what it is they are expected to learn, what sort of information they will be responsible for, why they are listening, etc. None of that is possible on a test, but various accommodations can, and we maintain should, be made in a test of listening ability. Such accommodations include the use of discourse markers, repetition, reformulation, etc. Without this, the nature of the task seems unlike any task a student would encounter in a law school context.

 

Consider the stimulus to 4.4.1-6, a lecture on medical genetics.  It begins with two claims: (i) it is easier to study genetic disease in dogs than in humans, and (ii) to understand why it is easier, it is necessary to think about how dogs came into existence. A substantial amount of information about dogs and their offspring follows. It is not until about half way through the talk that humans are again mentioned, almost as an aside. The focus remains on dogs until the last two paragraphs when analogues are given between diseases in dogs and diseases in humans. The lecture concludes with a statement about how studying disease-related genes in dogs will make it easier to study gene-related diseases in humans.

 

The key to an item asking about the speaker’s main point focuses on human disease. This is at odds with what test takers are led to expect by much of the lecture content, the lecture structure, its rhetorical markers, and its most salient information. Listeners are not helped along the way in anticipating where the speaker is headed, and there is so much information to process and try to remember that one is likely to finish listening without a good sense of what the point of the whole thing was.

 

In a real academic lecture it is likely that the structure would have been more transparent, the main points would have been stated explicitly up front, and numerous rhetorical and discourse markers would have been used to help students keep up with the lecturer and anticipate where the lecture was headed. Speakers generally want to be understood, and those making an argument want to be persuasive and will try to assist their listeners in this.

 

Stimulus Structure and Points Tested

 

What makes the lack of discourse and rhetorical markers especially noticeable is that some monologues, like 4.4.1-6, are fairly dense in information. There are a lot of propositions expressed. While the long monologues are totally coherent, the connections between propositions expressed in them often seem more typical of a written text than a spoken passage. Similarly, there is often a lack of elaboration of points. Since the longer stimuli contain so much information, they impose a heavy memory burden on listeners and make it difficult to anticipate the direction in which a speaker is headed.  In authentic lectures, lecturers stay with a single point for a while; they reinforce ideas, express them in different ways, give illustrative examples, and so on. Much of that seems lacking in the long field test monologues.

 

The documentation does not suggest that conscious consideration was given to what degree of information density was considered appropriate in stimuli, what degree of elaboration or reinforcement was appropriate, what degree of detail was appropriate to include and/or test, or how pragmatic elements were to be tested. It is impractical and probably not necessary to precisely specify a range for each of these features, and for many tests such information might reside in the common culture and practices of test developers. Yet, LSAC’s Listening Comprehension measure is new, so there is no history on such matters. Some more explicit information on these questions thus seems important for generating items that are comparable to one another and consistent with the construct.

 

Moreover, in the field test items speakers often do not distinguish more important information from less important information. Test takers are forced to treat all information as equally important. This dictates the kind of listening strategies used; for some test takers the strategy might be to try to remember exact wording, which would be problematic on a practical level and would run counter to a theoretical claim underlying the test. It also likely affects item performance. In a number of items, keying an item relies on having attended to some very specific portion of the stimulus that was not marked as important (no rhetorical cues were given of its significance, for example).

 

As noted earlier, LSAC’s theory of listening places more emphasis “on keeping up rather than on recall of detail.” There may be much virtue in this idea, but implementing it in a test would seem to require a clear standard for “keeping up” and/or a link to classroom tasks—does it mean keeping up with very dense information structures, or with a higher-level flow of reasoning? What qualifies in the classroom as a satisfactory level of “keeping up,” and how can it be tested? And if a stimulus makes no concession to listeners in helping them anticipate what is coming, it makes it more difficult for a listener to keep up, but in a way that seems inauthentic relative to a law school context.

 

The importance given to “keeping up” is connected to the lack of importance given to the ability to recall details: in following a lecture, “the ability to recollect details…is a relatively unimportant ability…in a context like law school, where notetaking is common” and “recall of detail [is unimportant], both because detail does not seem to be what is most important in a dialogue and because the comparatively loose organization of a dialogue may make such recall even harder.” (p. 7)

 

The field test items can be viewed as consistent with this: the items do not generally test recall of details of fact, and the details that are tested are unlikely to be written down in notes since the stimuli typically do not lend themselves to notetaking (information comes too quickly and is not marked as important/unimportant, and most students have no experience taking notes on conversations and would probably see little reason for doing so). Yet, the field test items certainly do require test takers to recall very specific points that were not reinforced or elaborated upon (and about which they are not likely to have notes). See 3.1.5, 3.1.13-18, 3.4.19-24 and 4.1.13-18. 

 

Consider 4.1.16, based on a stimulus in which two candidates for mayor give campaign speeches.  The stimulus has an interesting structure that would seem to have great potential for testing listening comprehension. With respect to unemployment, Smith says that during his term “employment has increased, hundreds of new jobs have been created in the private sector, many more people are employed in this city now than [before]”. In his rebuttal, Chen says “let’s look at the record: A gain in the unemployment rate, which actually increased one whole percentage point.” Chen then goes on to discuss increases in price for city services, an increase in city spending, and so on.  The item presents analogies to the difference between Smith’s and Chen’s views on unemployment.

 

The item requires that test takers remember a single line of Chen’s, buried in the middle of the first paragraph of a three-paragraph speech. The line is not reinforced or marked as important, and it is not returned to. It passes so quickly that a test taker would have to be very quick to have made a note of this point without missing what came next. Surely this is a detail. Perhaps not a detail of fact, and it is consistent with Chen’s overall tone and message, but it is a detail nonetheless. Yet the particular execution of the items seems flawed in a number of respects and inconsistent with some claims in the documentation.

 

Even some short items require recall of very specific information, 4.4.24 for example. Although the item was flagged, it serves well to illustrate the general point. The stimulus is a short passage spoken by a company president about whether the company should build a day-care center or a parking garage. There are three crucial propositions in the stimulus: (a) letters received by the president supported a day-care center, (b) a parking garage would benefit more employees, and (c) since most employees support a day-care center (evidence for which is the letters received), that is what the company should do. The question then asks which of five options would NOT be relevant to determining whether the president has reason for her conclusion.  It is necessary then to consider each option against the three propositions, which requires a fairly precise formulation of the propositions. To us this is a good item to test reasoning ability, but a poor item for testing listening ability; moreover, it clouds what exactly is considered a detail worthy of testing.

 

One of the theoretical claims made in the documentation is that there is a 60 second time lag between the presentation of a stimulus and activation of long-term memory, “which may depend on rehearsal … and organization schemes” (Olson 2003,  p. 10). Yet, stimuli often provide insufficient opportunity for rehearsal, hypothesis formation or conscious activation of organization schemes, since information is coming too rapidly and there is no reinforcement or paraphrasing or repetition. It is a common feature of authentic speech, especially in classrooms, that teachers reinforce information by rephrasing key ideas, using examples to reinforce points, and using other pedagogical tools. The lack of such reinforcement in many items reduces the authenticity of the items and increases questions about exactly what is being measured.

 

This notion of authentic classroom speech is relevant to the discussion even though the items on the field tests do not in the main have academic content. The purpose of the test is to provide evidence for inferences about potential students’ ability to follow lectures in law school. Yet, omission in test items of crucial elements that we assume are characteristic of the target situation (law school classes), such as rephrasing, repetition, reinforcement, or alternative presentation of important points, raises questions about whether the test items consistently demand the same type of listening as that called for in the target situation.


This is related to the perception noted above that, for a number of field test items, the point tested was not marked as significant in the stimulus, so might have seemed an unexpected point for testing. In authentic lectures, professors are likely to let students know what the important points are. There are numerous methods for conveying importance, including loudness, slowing the speed of delivery, repetition, rephrasing, giving multiple examples (with perhaps some variation of detail or focus), querying students on their understanding, and even making explicit statements of importance. Yet, none of the field test stimuli exhibit such features.

 

It is claimed (Olson 2003, p. 14) that “we do not typically recall the exact propositions making up the passage or talk...”. While research would seem to support this claim, we contend that a testing situation makes special demands on listeners that might to some extent mitigate this phenomenon.

 

We believe that a listener’s purpose for listening is likely to determine what strategies and processes are used during listening. So test takers ought to be informed in advance that exact wording will not be tested.

 

On the other hand, the principle should be realized in a consistent manner. A number of items appear intuitively to rely on close to the exact wording of the stimulus (for example: 3.1.5, 3.1.12, 3.1.27, 4.1.7, 4.1.11, and 4.1.20). For example, in item 4.1.11 the last line spoken by the library employee is “Yes, I do have a record of all three being returned ten days ago, but they were all new books, so they could only be checked out for a seven-day period.” Correctly answering the item depends on understanding the propositions “all three [books] were returned ten days ago” and “all three books could only be checked out for seven days.” Successful performance on such items would seem to demand that listeners retain fairly precise thematic and quantificational relationships.  Even if precise wording is a demand only of short dialogues and not of longer passages, test takers ought to be made aware of the demands that will be placed upon them.

 

We also note here that certain stems used in the field test items are not transparent in what skill is being assessed, or what can be inferred from an incorrect answer. For example, one stem type that recurs asks which of the following questions would be most relevant/appropriate for someone to ask.  What seems inconsistent across tokens of this stem type is the standard for relevance/appropriateness. On occasion it seems what is being asked is “which of the following is a question that was not addressed in the stimulus” (e.g., 3.1.9). In other cases, though, test takers must evaluate the kind of information that answers to the question options would provide, and whether that information would undermine or support stimulus statements (e.g., 4.1.14).

 

Pragmatics

 

Pragmatics is given prominence in the documentation, and a number of items nicely test elements of pragmatic understanding, for example 3.1.1 and 3.1.30. However, a number of pragmatically driven elements such as tone of voice, sarcasm, digression, emphasis, and attitude are used too little in the items, even though all are crucial elements of listening and potentially might help distinguish listening comprehension from reading comprehension. In this we take issue with the claim of Powers (1986) (as cited by Olson, 2003) that attitudinal signals such as “tone of voice, sarcasm and humour” are among the least important listening activities in an academic context.

 

Review of the Statistical Analyses of LSAC’s LC Test

 

In this section we review the statistical analyses conducted on the data from the final stage of field testing.  LSAC documents related to statistical analyses include Potential New LSAT Item Types: Next Steps (Plumer, 2003) and Update on New Item Type Research and Development (2004), both of which summarize aspects of the field test design and their statistical results.  Further statistical analyses are provided in Data Analysis of LSAT 2003 Innovative Item Field Test Study – Field Test 3 (2003) and Data Analysis of LSAT 2003 Innovative Item Field Test Study – Field Test 4 (2003) and in Ethnicity by Gender Impact Analyses (2004), as well as the set of original tables and reports.

 

When an exam program elects to greatly revise a test, or to develop a new test as a component of the exam program, the statistical analyses to be conducted are often crafted around the following critical questions: (1) Does the test display overall soundness? (2) Does the test appear to display construct validity, i.e., does it measure what it is intended to measure? (3) Does the test have negative consequences for any examinee sub-groups, i.e., does it have problems with DIF or impact?  The statistical analyses and reports compiled by LSAC on the LC exam can all be seen as addressing these concerns.

 

Soundness

 

The question of overall soundness can be considered through the classical test statistics that were computed on the LC items.  Accuracy and reliability of the LC test, as measured through SEM and KR-20 statistics, show a test that is performing comparably to the other LSAT sections administered in the field test.  The discrimination indexes are relatively similar to those found in the other LSAT sections and in other high-stakes standardized exams, with most values falling between about .25 and .65.  Both the distribution of p-values and the examinee score distribution reflect a test that is a little easier than the other LSAT sections (particularly in Form 3).  However, they are again reasonably similar to those of other norm-referenced testing programs.  Further information about item-level performance, including distractor performance, is given in the Strip Tables and Fifths Tables.  These results also suggest that the LC section as developed thus far is reasonably sound.
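
To make these classical indices concrete, the following is a minimal sketch (ours, written in Python, and not drawn from LSAC's analysis programs) of how KR-20, the associated SEM, and a corrected item-total discrimination index can be computed from a dichotomously scored response matrix; all function and variable names are ours.

    import numpy as np

    def kr20(responses):
        # responses: examinees-by-items matrix of 0/1 scores (numpy array).
        k = responses.shape[1]
        p = responses.mean(axis=0)                      # item p-values (difficulty)
        item_variance = (p * (1 - p)).sum()
        total_variance = responses.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variance / total_variance)

    def sem(responses):
        # Standard error of measurement implied by the KR-20 estimate.
        totals = responses.sum(axis=1)
        return totals.std(ddof=1) * np.sqrt(1 - kr20(responses))

    def discrimination(responses, item):
        # Corrected item-total correlation (item score vs. rest-of-test score).
        rest = responses.sum(axis=1) - responses[:, item]
        return np.corrcoef(responses[:, item], rest)[0, 1]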

 

Soundness can also be considered through IRT analyses.  The IRT analyses of the LC items address overall model-data fit, as well as a consideration of the item parameter estimates.  An examination of the item characteristic curves (ICCs) suggests that the items have good model-data fit, within the ability range of -2.4 to +2.4 where the great majority of the examinees fall.  Overall, the items also display acceptable c (pseudo-guessing) parameters, reasonable a (discrimination) parameters, and an appropriate range of b (difficulty) parameters.  Thus, the IRT analyses conducted on the LC exam also support the overall soundness of the LC items.
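
For readers less familiar with the IRT terminology used above, a 3PL item characteristic curve has the form sketched below; the scaling constant D = 1.7 is a conventional choice, and the parameter values in the example are hypothetical rather than estimates from the field test data.

    import numpy as np

    def icc_3pl(theta, a, b, c, D=1.7):
        # Probability of a correct response at ability theta under the 3PL model:
        # c is the pseudo-guessing lower asymptote, a the discrimination, and
        # b the difficulty (the location of the curve's inflection point).
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    # Hypothetical item (a = 1.0, b = 0.5, c = 0.2) evaluated across the ability
    # range of -2.4 to +2.4 within which most field test examinees fell.
    thetas = np.linspace(-2.4, 2.4, 9)
    print(np.round(icc_3pl(thetas, a=1.0, b=0.5, c=0.2), 2))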

 

Construct Validity

 

Whenever innovative items, or an innovative test, are added to an existing exam, it is worthwhile to investigate what the new items are contributing.  The question of construct validity is much harder to answer than the question of test soundness, however.  If the new items show a high degree of similarity to previously existing items and item types, it may seem that the change is unnecessary.  But if the new items show very little in common with existing exam components, it may appear that the new items are measuring a construct that does not belong within the exam battery.  For these and other reasons, the analyses that bear on this question must be interpreted in the light of the overarching exam program goals, not merely in terms of statistical criteria.

 

Two statistical analyses were conducted which provide preliminary information about the construct validity of the LC test.  First, several confirmatory factor analyses (CFA) were conducted to determine the relation between LC items and other items administered in the field test.  Results of the CFA indicated appropriate model-fit for the analyses containing LC set-based items, LC independent items, and RC items.  No model was run that consisted of just LC and LR items. 

 

Second, the construct of the LC section was examined by considering the relation between LC items and other LSAT items.  Correlations of the new LC items with the existing LR and RC items showed close but not identical relations.  Raw correlations for the LC set and independent items ranged from about .65 to .72 with both the RC and LR items, while the RC and LR items correlated about .69 to .72 with each other.

 

The LC items, both set and independent combined, were also considered in terms of their relations to other LSAT sections using both raw correlations and correlations corrected for reliability.  As shown in Table 1, the strength of the relation between LC and RC, and between LC and LR was quite similar to the relation between RC and LR, across sections and forms.  (Correlations between the AR and other tests were lower in all cases.)

 

Table 1

Range of Raw Correlations and Correlations Corrected for Reliability

Between LC, RC, and LR

 

LSAT Sections     Raw Correlations     Correlations Corrected for Reliability
LC and RC         .68 - .70            .89 - .91
LC and LR         .66 - .70            .88 - .96
RC and LR         .69 - .72            .91 - .92
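
Correcting an observed correlation for unreliability is ordinarily done with the classical disattenuation formula: the raw correlation is divided by the square root of the product of the two reliabilities. A minimal sketch follows; the section reliabilities of roughly .76 are hypothetical values chosen only to show how a raw correlation near .68 maps onto a corrected value near .89.

    import math

    def disattenuate(r_xy, rel_x, rel_y):
        # Classical correction for attenuation.
        return r_xy / math.sqrt(rel_x * rel_y)

    # Hypothetical illustration only: a raw LC-RC correlation of .68 with
    # section reliabilities of about .76 each yields a corrected value near .89.
    print(round(disattenuate(0.68, 0.76, 0.76), 2))   # 0.89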

 

In Potential New LSAT Item Types: Next Steps the goal of the LC section is described as measuring “listening and reasoning abilities at the graduate level”, so perhaps this pattern of relations indicates a measure of success.  However, it may be possible, as we suggest elsewhere in this paper, to separate the LC test from both the RC and the LR tests to a greater degree.

 

To evaluate whether the test is measuring a unique and important construct, it is probably necessary to conduct additional validity studies.  Ideally, this would include a predictive validity analysis, correlating performance on the LC, along with existing LSAT sections, with later performance in law school courses.  An analysis that built on the findings of the Skills Analysis Survey could address those courses in which listening skills were most highly ranked in terms of their perceived importance for success in law school courses.

 

Effects on Examinee Sub-Groups

 

The third question to be addressed for a new test or test section is whether new items or new item types have negative consequences for particular examinee sub-groups, even if the items appear to be functioning appropriately at the total-group level. 

 

A number of DIF and impact studies were conducted on the LC field test data.

 

DIF analyses, using the Mantel-Haenszel (MH) procedure, were conducted on all items in the field test.  These analyses were conducted by gender, ethnicity, and country.  For the most part, the percentage of items flagged as potentially displaying DIF on the LC sections was comparable to the percentages found on other field test sections and was within the acceptable pretesting bounds.  The only LC items to display C-level DIF (i.e., the more serious level of DIF) were two items in the comparison of Caucasian with Asian/Pacific Islander examinees, both favoring Asian/Pacific Islander examinees; this comparison also showed the highest percentage of B-level DIF flags.  This comparison will be discussed further below. 
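
For readers unfamiliar with the procedure, the sketch below shows the core of a Mantel-Haenszel DIF computation for a single dichotomous item, expressed on the ETS delta (D-DIF) scale that underlies the A/B/C classifications; the function and variable names are ours, and the sketch omits the statistical significance test that the full ETS classification rules also require.

    import numpy as np

    def mh_ddif(item, total, group):
        # item:  0/1 responses to the studied item (numpy array)
        # total: matching-criterion scores, e.g., total test score
        # group: array of labels, "ref" (reference) or "focal"
        num = den = 0.0
        for k in np.unique(total):
            stratum = total == k
            ref = stratum & (group == "ref")
            foc = stratum & (group == "focal")
            A, B = item[ref].sum(), (1 - item[ref]).sum()   # reference right / wrong
            C, D = item[foc].sum(), (1 - item[foc]).sum()   # focal right / wrong
            T = A + B + C + D
            if T > 0:
                num += A * D / T
                den += B * C / T
        alpha_mh = num / den                 # common odds ratio across score levels
        return -2.35 * np.log(alpha_mh)      # D-DIF; |D-DIF| of 1.5 or more suggests C-level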

 

Impact studies were conducted by gender and ethnicity.  These studies included interaction plots of the mean percent correct on the LC and other test sections as well as significance testing using ANOVAs and MANOVAs.  Gender comparisons showed no impact, as males and females had similar mean percent correct scores on all LC sections and forms.  For the ethnicity studies, the primary finding appears to be an effect for Asian/Pacific Islander examinees.  (A greater effect was found on Field Test 3 than on Field Test 4.)
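
A minimal sketch of the kind of impact analysis described here (a two-way ANOVA on percent-correct scores with an interaction term) is given below; the data frame, column names, and values are invented for illustration and do not reproduce LSAC's analysis.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Invented stand-in data: one row per examinee.
    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "gender": rng.choice(["F", "M"], n),
        "ethnicity": rng.choice(["Caucasian", "Asian/Pacific Islander"], n),
        "lc_pct_correct": rng.normal(70, 10, n),
    })

    # Main effects of gender and ethnicity plus their interaction.
    model = smf.ols("lc_pct_correct ~ C(gender) * C(ethnicity)", data=df).fit()
    print(anova_lm(model, typ=2))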

 

Additional analyses related to the impact seen for Asian/Pacific Islander examinees included gender-by-ethnicity plots of mean percent correct test scores.  These plots and analyses showed that while female Asian/Pacific Islander examinees showed an unexpected decrease in scores on the LC section, a far greater decrease was displayed by male Asian/Pacific Islander examinees.  In addition, analyses were conducted on examinees’ relative performance based on whether they had indicated that English was or was not their dominant language.  The kinds of test preparation activities examinees engaged in were also analyzed to determine whether differences on this variable might explain Asian/Pacific Islander performance differences.  While the topic of this report is the LC exam, it is pertinent to mention that the impact findings for Asian/Pacific Islander examinees were also noted for the Comparative Reading items investigated on the field test.

 

Given that these analyses have not explained the performance difference found for this examinee subgroup, additional studies are warranted.  Further investigation of language background as an explanatory factor ought to be considered.  Examinees from non-dominant language and cultural groups have often been known to misrepresent their language skills on forms such as those used for the LSAT.  If examinees feel that their language skills are adequate, they may elect not to indicate “non-native speaker of English” status, as this claim often leads to bureaucratic tracking efforts and additional obligations on the part of the student.  Furthermore, such examinees may be members of cultural groups whose language norms are very different from those of members of the majority, English-speaking culture.  Turn-taking, logical flow, and patterns of silence are cultural characteristics of language that may be sufficiently different to affect examinees even if their written communication is comparable to the dominant cultural standard.  To the extent that the LC items are producing this kind of impact, a construct other than RC clearly appears to be tapped. 

 

Studies of the examinees’ language and cultural backgrounds could help identify those Asian/Pacific Islander examinees whose language skills are target-like and those whose skills are not.  With this better distinction between groups, impact analyses may be more meaningful.  Researchers could also consult studies that have identified aspects of spoken language that differ between the dominant American culture and Asian/Pacific Islander cultures.  These differences could then be used to analyze the LC items to see whether such cultural differences are the source of the impact. 

 

These analyses may fit well into other construct refinements of the LC exam.  When LSAC is ready to produce operational LC sections, these differences could also be used in the informational and marketing materials for examinee preparation. 

 

Evaluative Questions About LSAC’s LC Test

 

In this section of the paper we consider the results obtained so far on LSAC’s LC test through a series of six evaluative questions. 

 

What would a listening comprehension assessment add to the LSAT?

 

An assessment of listening comprehension in the LSAT would be useful because it would give admissions officers more information about applicants.  Listening is probably the primary source of information for all of us, but listening skills are of different types.  Lecture listening assessment probably adds little to the LSAT, but short-term listening and interpretive listening are of great importance not only in law school but in organizational life in general.  These points are discussed in greater detail below.

  

         We are all processors of information, and it is axiomatic that good decisions (which lead to good actions) are those based on as much information as possible.  It is also axiomatic that all of us vary in our abilities to receive and store information.  We gain information by sensing our environment, listening to others, reading, and consuming media.

 

Interpretive Listening

 

         One of the prime advantages of listening is that speech conveys information about the source as well as the message.  Individuals display attitudes, involvement, and basic comfort through tone of voice.  Dialogic and other interactions are typically affect-based.  In legal practice, attitudes are also important.  For example, Olson refers to the “ability to identify the other party’s settling point” (p. 10), a concept that is strongly related to affective states.  Later, on the same page, he refers to the fact that one often has to keep two or more different “positions or agendas” in mind while listening.  These also concern both affective and cognitive components. 

 

         Sight vs. Sound

 

One problem with measuring the nonverbal aspects of listening stems from relying on Wolvin and Coakley’s definition of listening as a process involving “aural and visual stimuli” (Olson, 2003, p. 8).  This is indeed the exact wording used by Wolvin and Coakley, but their intention was to include both verbal and nonverbal communication, rather than to distinguish between sight and sound.  Two basic confusions are inherent in this interpretation: one, that nonverbal signals are entirely visual, and two, that nonverbal signals are primarily “emotional.” 

 

         Emotion and Academics

 

Affect, or emotion, is difficult to rule out of our daily life, even in the classroom.  Most speakers (even professors) find it difficult to present information without betraying whether they agree or disagree, approve or disapprove, or hold any kind of attitude about the subject matter under discussion.  The ability to know what the speaker thinks about the subject is a vital communication skill, and most often this decoding is derived from interpretation of the nonverbal messages sent by the speaker.  Popular publications that purport to teach us to “read body language” have been a great hindrance to the study of nonverbal communication, but the truth is that no one functions today in modern culture without a basic knowledge and understanding of the information involved in nonverbal signals.

 

“Nonverbal communication”, as we use the expression, is not confined to the visual channel.  The manner in which the voice is used is a powerful element in communication, and calls for what is usually termed “interpretive” listening.  “Interpretive” listening is identical to vocalic decoding: the processing of emotional or affective content from a message, primarily from “tone” of voice, inflection, and other variations of the voice.  Table 2 illustrates many of the possible aspects of vocal variation and the perceptions associated with them (Addington, 1966). 

 


Table 2

Vocal Types and Personality Perceptions

 

---------------------------------------------------------------------------
Vocal Types         Speakers     Perceptions
---------------------------------------------------------------------------
Breathiness         Males        Younger, more artistic
                    Females      More feminine, prettier, more petite, more
                                 effervescent, more highly strung, and
                                 shallower

Thinness            Males        Did not alter listener's image of the
                                 speaker; no significant correlations
                    Females      Increased social, physical, emotional, and
                                 mental immaturity; increased sense of humor
                                 and sensitivity

Flatness            Males        More masculine, more sluggish, colder, more
                                 withdrawn
                    Females      More masculine, more sluggish, colder, more
                                 withdrawn

Nasality            Males        A wide array of socially undesirable
                                 characteristics
                    Females      A wide array of socially undesirable
                                 characteristics

Tenseness           Males        Older, more cantankerous
                    Females      Younger, more emotional, feminine, high
                                 strung, less intelligent

Throatiness         Males        Older, more realistic, mature
                    Females      Less intelligent, masculine, lazier, more
                                 boorish, unemotional, ugly, sickly, careless,
                                 inartistic, naive, humble, neurotic, quiet,
                                 uninteresting, apathetic

Orotundity          Males        More energetic, healthy, artistic,
                                 sophisticated, proud, interesting,
                                 enthusiastic
                    Females      Increased liveliness, gregariousness,
                                 aesthetic sensitivity

Increased rate      Males        More animated and extroverted
                    Females      More animated and extroverted

Increased variety   Males        More dynamic, feminine, aesthetically
                                 inclined
                    Females      More dynamic and extroverted
---------------------------------------------------------------------------

 

 


         A substantive body of research clearly indicates that interpretive listening is also strongly affected by an individual’s ability to decode the nonverbal cues present in the exchange (Burgoon, 1994).  A graphic demonstration of the role of vocal interpretation is an early study by Mehrabian and Weiner (1987).  In this study, individuals heard a word from a recording and were asked whether they could tell if the speaker had a positive or negative attitude.  The researchers prepared three different sets of words describing the same event which, in and of themselves, were positive, neutral, or negative.  The sets looked like this:  

        

                                     Positive word:       error

                                     Neutral word:       mistake

                            Negative word:     lie

 

Then the readers inflected the word with a positive, neutral, or negative tone of voice.  For example, a positive reading would be done with a rising inflection, a negative one with a falling inflection, and a neutral one with no inflection.  The listeners were divided into three groups.  The first was told to use content only, the second to use tone only, and the third to use both content and tone in their judgments.  The results for the first speaker are presented in Table 3 (the researchers used a second speaker as well; those results, reported in a second table, are substantially the same as for the first).  Respondents used a seven-point scale, ranging from -3 to +3. 

 

The table illustrates a general effect: vocal tone is more likely than content to determine the evaluation.  For example, in the condition where the respondents were asked to use both tone and content, judgments of the word that had positive content and negative tone resulted in a negative judgment of -.87.  In that same condition, negative content and positive tone produced a positive judgment of 1.21.  Mehrabian and Weiner’s respondents were able to use either tone or content, but when they were instructed to use both, tone generally overruled content.   

 


Table 3

Degree of Inferred Positive Attitude for Speaker A as a Function of the

Nine Content by Tone Stimulus Conditions and Instructions

 

 

                                              Tone
Instructions            Contents      Negative    Neutral    Positive

Use content only        Negative       -1.33       -1.00      -0.67
                        Neutral        -0.47       -0.17       0.35
                        Positive        1.03        1.30       1.70

Use tone only           Negative       -2.47       -0.03       1.40
                        Neutral        -2.07       -0.67       1.73
                        Positive       -1.37        0.17       1.63

Use tone and content    Negative       -1.77        0.30       1.21
                        Neutral        -1.67       -0.40       1.10
                        Positive       -0.87        0.40       1.10

 

Mehrabian and Weiner’s study, together with many more recent investigations, illustrates the importance of “interpretive” listening, which is primarily a listening skill but is not visual in nature.  Often when nonverbal signals contradict the verbal ones, individuals typically accept the nonverbal signals as a more valid expression of the true feelings of the interactant (Burgoon, 1994; Leathers, 1979).  Most investigations of nonverbal cues center on visual displays, such as facial expression, posture, and the like.  Others have investigated “vocalic” messages, such as pitch, intonation, and inflection.  Visual cues have typically been shown to be of greater influence than vocalic ones in most situations.  However, some studies show that vocalic cues are of more use in detecting deception than visual ones (Littlepage and Pineault, 1981; Streeter et al., 1977).

 

However, Keely-Dyreson et al. (1991) examined decoding differences in isolation.  They compared the ability of respondents to decode visual cues with their ability to decode vocal cues, and found that visual cues were more accurately perceived than vocal ones.  Some gender differences were also observed.  In short, the division of messages into “verbal” and “nonverbal” categories may be too simple.  Visual and vocal cues, both of which have been categorized as nonverbal messages, would seem to differ in important ways.  Comparisons of decoding abilities are rare.  What the relations might be among visual/nonverbal decoding ability, vocal/nonverbal decoding ability, and verbal decoding ability is not known. 

 


What makes a listening comprehension test item easy or difficult?

 

         This question calls for empirical research.  But, granted the similarity between the LC stems and those in the RC and LR tests, we can answer in part from research examining the difficulty of LR items in the GRE (see, e.g., Yang and Johnson-Laird, 1999, 2001, 2002).  This research has demonstrated experimentally three factors that affect the difficulty of LR items: the nature of the inferential task, the logical formulation of the stimulus and the correct option, and the nature of the lures.  It is easier, for example, to state what conclusion follows from a discourse than to provide a missing premise for a given conclusion.  The latter can be shown to be theoretically a more difficult task.  A given stimulus is easier to understand when it is couched in the form of conditional assertions (using “if”) than when the same content is couched in the form of logically equivalent disjunctions (using “or”).  And a given problem can be made easier (or harder) by the choice of lures.  In this way, experiments eliminated the difference in difficulty between a set of easy and a set of hard items (selected at random from the LR test).  The three variables are also likely to affect the difficulty of the current sorts of LC tests.  Doubtless there are other variables that would also do so.

 

However, item difficulty ought not to be a primary focus in the initial stage of item development. In fact, several of the changes we suggest might, if implemented, make items less difficult. For example, clarifying the purpose of listening to a stimulus by making advance organizers more elaborate or explicit could well make items easier. Yet, we maintain that it would increase the construct validity of the test. It is our view that the primary aims for LSAC researchers should be to gain better control over the construction of stimuli, to define different listening skills more narrowly, and to get a tighter fit between these skills and item types.  Once they have solved these problems, they could investigate specific hypotheses about the causes of difficulty.

 

Is there a difference between listening skills and reading skills

for our test-taking population?

 

         No one knows the answer to this question.  Anecdotal observation suggests that some people who are good conversationalists, and probably good short-term listeners, are not necessarily good readers, and vice versa.  But, it is harder to distinguish those people who, though not good talkers, are good at listening.  We suspect that amongst LSAC’s test-taking population there are differences in listening and reading skills, and that these differences may matter in law school. 

 

Can these differences be detected in the LSAT?   Again no one knows for certain the answer to this question.  The panel can at best offer only suggestions that might lead to tests of listening that are independent of performance on the reading comprehension test.  The assumption that “listening skill” is a unidimensional construct is part of the problem.  When listening tasks are separated into different types, correspondence with reading scores occurs with some and not with others.  Lecture listening (which may be the same as the LC scales) has corresponded closely with general academic skills and verbal definitions of intelligence.  We have examined some data from an early version of a listening test that was part of ETS’s National Teacher Examination (Educational Testing Service, 1984) and was derived from the STEP test.  ETS compared “statements and questions,” “dialogues,” and “talks,” which roughly correspond to short statements and answers, interpretive listening, and “lecture” listening.  Table 4 presents the intercorrelations of these three listening measures and other measures used in the overall examination:  a reading test, a “usage” test that measured grammatical choices, a sentence-completion test of comprehension, and an essay-writing test (Educational Testing Service, 1984).  “Usage” is thought to be a measure of writing skill in that it demonstrates knowledge of usage, such as grammatical rules.  Sentence comprehension is considered to be a subset of overall reading ability, in that it measures the intake of information in a single sentence rather than a longer passage.

 

The listening measures each seem to have a different relation to the reading scores.  This difference suggests that any claim about the relation between reading and listening should be tempered by the questions “What kind of listening?” and “What kind of reading?”

 

Table 4

Intercorrelations of Various Sections of the National Teacher Examination

________________________________________________________________________
Test Section     Immediate  Dialogue  Lecture  Reading  Usage  Sent. Comp.
________________________________________________________________________
Immediate          1.00
Dialogue            .56       1.00
Lecture             .59        .48      1.00
Reading             .72        .58       .68     1.00
Usage               .63        .49       .58      .71    1.00
Sent. Comp.         .56        .44       .52      .65     .68      1.00
Essay               .46        .39       .41      .52     .53       .50
________________________________________________________________________

 

The brief listening items correlated .56 with the sentence comprehension items and .72 with the reading scores.  The dialogue items correlated .44 with the sentence comprehension items and .58 with the reading test.  The lecture items correlated .52 with the sentence comprehension items and .68 with the reading score.  It is safe to say that no single, clear relation between the listening scores and the reading scores emerges from this table.

 

However, a correlation table does not always present a comprehensive view of the interrelations in the data.  Factor analysis yields a better idea of the relations among these scales.  Table 5 presents a simple factor analysis of the variables in Table 4 (Bostrom, 1996a).  It shows that the three types of listening measurement seem to differ from one another, as do the reading-related measures.  In this five-factor solution, sentence comprehension and usage make up one factor (I).  The brief items form a factor of their own (II), as does the dialogue measure (IV).  “Talks,” that is, lecture listening, forms another factor (III).  The essay scores form their own factor (V), clearly different from all of the others.

 

 

Table 5

Factors Generated by Intercorrelations of ETS "Communication Skills" Assessments

        

                                    I      II     III     IV      V

Brief Items                       .27     .85    .23     .24    .17
Dialogue                          .20     .24    .19     .91    .14
Talks                             .26     .25    .89     .18    .15
Reading                           .45     .52    .44     .27    .22
Usage                             .68     .39    .26     .17    .25
Sentence Comprehension            .88     .17    .19     .11    .19
Essay                             .25     .17    .15     .14    .92

                  

            This simple analysis suggests that “listening” and “reading” are broad terms that may not produce meaningful comparisons in and of themselves.
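            For readers who wish to probe these data further, the following is a minimal sketch, not a reproduction of Bostrom’s (1996a) analysis: it assumes Python with numpy and simply extracts unrotated principal components from the Table 4 correlation matrix, a common first step before the kind of rotated solution shown in Table 5.

import numpy as np

# Correlation matrix assembled from Table 4 (symmetric, unit diagonal).
labels = ["Immediate", "Dialogue", "Lecture", "Reading",
          "Usage", "Sent. Comp.", "Essay"]
R = np.array([
    [1.00, 0.56, 0.59, 0.72, 0.63, 0.56, 0.46],
    [0.56, 1.00, 0.48, 0.58, 0.49, 0.44, 0.39],
    [0.59, 0.48, 1.00, 0.68, 0.58, 0.52, 0.41],
    [0.72, 0.58, 0.68, 1.00, 0.71, 0.65, 0.52],
    [0.63, 0.49, 0.58, 0.71, 1.00, 0.68, 0.53],
    [0.56, 0.44, 0.52, 0.65, 0.68, 1.00, 0.50],
    [0.46, 0.39, 0.41, 0.52, 0.53, 0.50, 1.00],
])

# Unrotated principal components: loadings are eigenvectors scaled by the
# square roots of their eigenvalues, largest components first.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = eigenvalues.argsort()[::-1]
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])

for label, row in zip(labels, loadings[:, :5]):
    print(f"{label:12s}", np.round(row, 2))

            The loadings this sketch prints will differ from those in Table 5, which reflects a rotated solution; it is meant only to show how the correlation matrix itself can be interrogated.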

            It seems reasonable to suggest that one way to separate listening from reading would be to focus on dialogue and, ideally, on its dual nature.  Can examinees choose the appropriate formulation of a “yes/no” question?  Can they keep track of who said what?  The items could also examine sensitivity to speakers’ attitudes and feelings, implicatures conveyed by the informal nature of speech, and the abilities that listeners spontaneously use, such as inferring the meaning of a word from its use in context (see Johnson-Laird and Wykes, 1977).  For instance, when Ali G (the comedian) asked Pat Buchanan whether the US had found any BLTs in Iraq, Buchanan did not query the question.  He presumably inferred that “BLTs” was a reference to some sort of WMD.

 

            The next suggestion is to consider what other tests of listening ability have discovered, and to try to capitalize on them for items that might predict performance in law school. As Plumer (2003) wrote: “… it seems likely that we have not yet explored the full range of listening comprehension assessment.”  

 

Is there a way to test listening skills that does not correlate highly with a test of reading skills?

 

On the assumption that the high correlation between the LC and RC tests is because they test abilities in common, we recommend that certain modifications be made to the LC test.  As we noted earlier, the field test items do not take advantage of some of the critical differences between listening and reading.  

 

The modifications involve:

            Recognition of conversational cooperation

            Recognition of misunderstanding in dialogue

            Awareness of whether an answer addresses the question asked

            Understanding of the use of pauses

            Understanding of the use of tone of voice, including intonation to convey an attitude to a proposition, such as irony and sarcasm

            Understanding of the role of intonation in syntactic disambiguation

 

 

How should requests for test accommodations be handled?

 

LSAC has considered three potential types of accommodation requests and possible accommodations that might be provided for each of them (see the document, Accommodations for Listening Comprehension).  The first potential type of request would come from examinees who are deaf; the possible accommodation suggested is to waive the LC section of the exam.  The second potential type of request would come from examinees with hearing impairments.  The suggested accommodation is the provision of amplification as an assistive technology.  The third type of request would come from examinees with auditory processing disorders.  The suggested accommodation is the provision of additional testing time.

 

At least two representatives of the deaf/hearing-impaired community have confirmed that waiving the LC section of the exam as a required portion of the LSAT would be an acceptable and appropriate accommodation.  One caveat was offered: if over time the LC section of the LSAT became highly valued by law school admissions offices, so that examinees without scores on this section were at a disadvantage, then serious consideration might need to be given to an interpreted version of the section.

 

For examinees with hearing impairments, the possible accommodation of amplification may provide some help, but it is potentially of limited value.  Amplification makes an audio signal louder; however, if an examinee hears a poor representation of the original sound, making that poor representation louder will still provide an imperfect listening experience.  For at least some hearing impairments, waiving the LC section may be preferable to amplification.

 

For examinees with auditory processing disorders, the accommodation of additional time seems reasonable and appropriate, although studies may be necessary to determine an appropriate amount of additional time.  LSAC might also consider investigating an accommodation for these examinees (and perhaps for hearing-impaired examinees as well) in which examinees have the option of playing an audio prompt more than once.  This accommodation could have some real-world merit, but research into its validity and impact would probably be necessary.

 

Are there reasons to prefer either a paper-and-pencil format or a computer-based format for the delivery of a listening comprehension test?

 

If LSAC wished to administer the LC test in a paper-and-pencil format using a CD to deliver the audio files, it would be logistically feasible.  In fact, with only a moderate amount of effort an LC exam could be included in the present paper-and-pencil administration format, by continuing the approach used in field testing the LC items.  In this procedure, a single CD player is used for an entire room of examinees with fixed item-level timings.  (Format options are considered in Section 4.1 of LSAT Listening Assessment: Theoretical Background and Specifications, 2003.)

 

 An adaptation of this approach would be to make individual CD players and headsets available to each examinee, along with the paper-and-pencil test booklets.  One potential problem with this second approach is security.  It would be difficult for a test proctor to monitor a large group of examinees and to ensure that no examinee turned a page in the booklet before the corresponding audio prompt had played.  It would be preferable to allow examinees to play each audio prompt on their own schedule, but maintaining group-level control over the timing of the spoken prompts would still be reasonably authentic: in everyday life, the rate of speech is usually under the speaker’s control.  CBT, however, removes the need for group-level timing, and in that mode the examinees could control the timing of the test items.

 

A computer-based LC exam has several other advantages over a paper-and-pencil exam. It incorporates the management of the sound files and offers individual control over the volume of the speech.  It also offers the flexibility of incorporating a number of item types that might be impossible to administer using paper-and-pencil.   These alternative item types are often termed “innovative”.

 

Innovative Item Types

 

One major advantage of administering an LC test only in a computer-based format is that it makes possible quite different formats for tests, including innovative item types.  The multiple-choice format, of course, is inimical to the real-time nature of listening, and so it would be important to explore reliable methods of alternative assessment. 

 

A variety of innovative item types have been used in high-stakes, standardized tests that are administered in the CBT format.  One framework for categorizing innovative item types arranges them along five dimensions.  These are: 1) item format, 2) response action, 3) media inclusion, 4) level of interactivity, and 5) scoring method or algorithm (Parshall, Davey, and Pashley, 2000).  Item format defines the sort of response collected from the examinee.  Two major categories of item formats are selected response and constructed response, with multiple choice being the most common example of the former and written essays an example of the latter.  Response action refers to the means by which examinees provide their responses.  Keyboard entry, mouse clicks, and touch screens are common.  Media inclusion covers the use of elements such as sound or video in an item.  Level of interactivity describes the extent to which an item type reacts or responds to examinee input.  And, scoring method addresses how examinee responses are translated into quantitative scores.
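As a concrete illustration of this framework (our own sketch; the item and its attribute values are hypothetical, not drawn from the field test), an LC item could be catalogued along the five dimensions as follows:

from dataclasses import dataclass

@dataclass
class ItemClassification:
    # The five dimensions of Parshall, Davey, and Pashley (2000).
    item_format: str       # e.g., selected response vs. constructed response
    response_action: str   # e.g., mouse click, keyboard entry, touch screen
    media_inclusion: str   # e.g., audio, video, none
    interactivity: str     # how the item reacts to examinee input
    scoring_method: str    # how responses become quantitative scores

# A hypothetical dialogue-based LC item with a typed, objectively scored answer.
example_item = ItemClassification(
    item_format="constructed response",
    response_action="keyboard entry",
    media_inclusion="audio (two-speaker dialogue)",
    interactivity="prompt plays once; no branching",
    scoring_method="exact-match key with accepted variants",
)
print(example_item)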

 

For a test of listening, interactive items that would allow the examinees to engage in real-time dialogues, in which they listen to a speaker and type their responses, could be of great value.  Additionally, items formatted so that examinees either respond to “Yes/No” questions or make a single response that can be scored objectively could be appropriate.  These and other targeted methods of assessment could open the door to more veridical tests of listening abilities that do not correlate with reading skills.  One concern that devotees of multiple-choice items often voice is the loss of information if this format is replaced by a “Yes/No” test.  A difficult “Yes/No” item yields 1 bit of information.  (A bit is a measure of information: an equiprobable choice between two alternatives transmits 1 bit; see, e.g., Shannon and Weaver, 1949.)  A choice between five options often yields less information.  Consider, for example, an item for which one option is chosen by 85% of the examinees, a second option is chosen by 10% of the examinees, and a third option is chosen by 5% of the examinees.  Is it more or less informative than the difficult “Yes/No” item?  Readers may be surprised to learn that this multiple-choice item yields less information than the “Yes/No” item.  It conveys only about 0.75 bits.
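The figures quoted above are easy to verify.  The sketch below is a minimal illustration (assuming only the response proportions given in the text): it computes the Shannon entropy, in bits, of an item’s response distribution.

import math

def entropy_bits(proportions):
    """Information, in bits, transmitted by a distribution of responses."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# A maximally difficult Yes/No item: the two responses are equally probable.
print(round(entropy_bits([0.50, 0.50]), 2))         # prints 1.0

# The five-option item described in the text: only three options attract
# responses, chosen by 85%, 10%, and 5% of examinees.
print(round(entropy_bits([0.85, 0.10, 0.05]), 2))   # prints 0.75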

 

Comparability

 

When an exam program is administered on two platforms, the comparability of scores in both modes must be verified (APA, 1986; ATP, 2000).  Studies have found that items may not function identically when they are administered in both the paper-and-pencil and the CBT mode.  An unpublished experiment carried out by one of us, in which examinees attempted LR problems in both formats (in a counterbalanced order), showed that the LR items were harder in the CBT format.

 

Other circumstances in which the comparability of test scores needs to be established include exam programs that, for security reasons, use more than one item pool.  Even the maintenance of a single CBT item pool may produce substantive changes in the pool over time, as items are retired and new items are added to replenish it.  In all of these cases, it may be important to investigate test comparability.

 

Other CBT Considerations

 

Given that most of the LC item field testing conducted to date used a paper-and-pencil administration mode with a CD for the audio, several issues will need additional consideration for operational CBT delivery (Parshall, Spray, Kalohn, and Davey, 2002).

 

One of these issues has already been discussed: item timing.  The preliminary work conducted by LSAC suggests that the group-level timings assigned to the existing LC item types worked fairly well, with most examinees able to respond to most items without often having to wait for the next item to begin.  However, item response times are often somewhat different in CBT than in paper-and-pencil testing, even with text-based items, so adjustments for the different mode may be appropriate.  In addition, timing changes may arise because the examinees will control item-level timings.

 

A few logistical matters will also need to be arranged.  For example, individual headsets will need to be available along with the computers at all CBT administration sites.  In addition, the greater electronic file size necessitated by audio, as compared to text, will mean that data transmission, storage, and delivery may need to be adjusted. 

 

         One practical concern would be the construction of taped dialogues in which voices conveyed different emotional content.  This construction would call for written scripts, the coaching of actors, and the recording of their interpretations of the scripts.   The results would need comprehensive validation, first with experts (probably from groups such as theater professors, professional actors, and directors) and then with a large group of diverse individuals.  Mehrabian and Weiner’s simple technique shows how different the dimensions can be. 

 

LSAC may also elect to conduct a few small-scale studies into the quality of the audio obtained under various sampling rates and data resolution rates.  An audio file’s sampling rate is the number of times per second that an analog audio wave is digitally captured.  For professional music CDs, the standard sampling rate is 44.1 kHz.  More frequent sampling yields higher quality sound, along with larger sound files.  An audio file that consists entirely of human speech, rather than music, could be reproduced quite well at a much lower sampling rate than would be necessary for a musical selection.  The minimum acceptable quality for speech recognition is probably 24.0 kHz (Aikin, 1985; Finelli, 1989; Huber, 1996; Moog, 1985; Li, 1997; Lombardi, 1997; Pan, 1993; Wiener, 1996).

 

The second parameter for storing digital audio data is the data resolution (technically, the quantization rate).  The data resolution is determined by the amount of computer memory allotted to storing each discrete amplitude value.  Greater precision is obtained with larger values and, as with higher sampling rates, results in larger files.  Typically, the data are stored as 8-bit, 12-bit, or 16-bit values.  For example, sound files digitized with 8 bits have a 48 dB range (i.e., the difference in decibels from the softest to the loudest volumes) and sound something like typical cassette tapes.  For 16-bit files, the range is 96 dB, and the sound quality is comparable to that found on professional music CDs (Aikin, 1985; Finelli, 1989; Moog, 1985).
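The trade-offs described in the last two paragraphs can be made concrete with a rough calculation.  The sketch below is illustrative only; the one-minute, single-channel prompt is an assumed example, not an LSAC specification.

def uncompressed_size_mb(sample_rate_hz, bits_per_sample, seconds, channels=1):
    """Size of an uncompressed (linear PCM) sound file, in megabytes."""
    return sample_rate_hz * (bits_per_sample / 8) * seconds * channels / 1_000_000

def dynamic_range_db(bits_per_sample):
    """Approximate dynamic range of linear PCM: roughly 6 dB per bit."""
    return 6.02 * bits_per_sample

# A hypothetical one-minute, mono speech prompt at three quality settings.
for rate, bits in [(44_100, 16), (24_000, 16), (24_000, 8)]:
    size = uncompressed_size_mb(rate, bits, seconds=60)
    print(f"{rate} Hz, {bits}-bit: {size:.1f} MB, {dynamic_range_db(bits):.0f} dB")

Even before compression, speech recorded at 24.0 kHz and 16 bits occupies roughly half the space of CD-quality audio, which bears directly on the data transmission, storage, and delivery issues raised above.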

 

For LSAC’s purposes, the optimal values for these two parameters will be the lowest values that still produce fully comprehensible sound files.  This will yield good quality speech reproduction with the most manageable file sizes.

 

The CBT interface, for the LC section as well as the rest of the computerized LSAT, will need to be fully tested.  Good preliminary work in this regard was conducted by LSAC on the initial prototype CBT.  This work should be followed up, incorporating any new item types or other changes that LSAC may elect to make in the LC exam.

 

Next Steps

 

Our overall assessment is that the developers have made a promising start in their work on LSAC’s LC test.  However, they have yet to develop an LC test that meets all the theoretical and practical desiderata.  Nevertheless, we believe that it is both worthwhile and feasible to continue the project of developing such a test.  We have made a number of test development recommendations in earlier sections of this paper (particularly Evaluation of LSAC’s LC Test – Construct Definition and Test Specifications, Evaluation of LSAC’s LC Test – Item Illustrations, and Is there a way to test listening skills that does not correlate highly with a test of reading skills?).  In this section we provide further recommendations for the test development process.  In addition, we explicate a set of recommended research studies.

 

Test Development Recommendations

 

Detailed suggestions about test development are listed here. 

 

 

Research Study Recommendations

 

To undergird the test development recommendations provided above, we strongly recommend further work in developing and defining the underlying construct.  A variety of research suggestions toward this goal are provided below.

 

Because LC items of the sort used in the field test are more expensive to create than, e.g., RC items, they are unlikely to be included in the LSAT unless they can be shown to measure aspects of listening that do not correlate with RC performance.  Hence, the major need is for items that measure aspects of listening (and dialogue) that do not correlate with RC or LR (see the section, Is there a way to test listening skills that does not correlate highly with a test of reading skills?).  We recommend exploring items based on dialogue, inferences about speakers’ attitudes, real-time generation of responses, and inferences of word meaning from context.

 

We also recommend that items be tested in two formats: one using real-time listening and the other using real-time written presentation, i.e., with the sentences presented one at a time on the computer screen.  Such items might measure aspects of ability unique to listening (even though they are visually presented).
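A crude mock-up of such real-time written presentation is sketched below (our illustration only; the sentences and the fixed display interval are assumptions, and an operational version would control the display within the CBT interface rather than a console):

import time

# Assumed stimulus sentences and pacing for the mock-up.
sentences = [
    "The witness claimed she had never seen the defendant.",
    "Under cross-examination, she conceded that they had met once.",
    "The jury was asked to weigh the two statements.",
]
SECONDS_PER_SENTENCE = 4  # fixed interval; could instead scale with length

for sentence in sentences:
    # In an operational interface, each sentence would replace the previous one.
    print(sentence)
    time.sleep(SECONDS_PER_SENTENCE)

print("Question: What did the witness concede under cross-examination?")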

 

We further recommend research into the feasibility of using innovative items in a CBT format (see the section, Innovative Item Types).  Such items might make feasible the study of the real-time aspects of dialogue.  Examples of investigations into innovative item types that might be advantageous include:

·   Richer contexts – asking several items about the same situation

·   Items in which the stimulus contains a written passage followed by a related audio passage

·   Source monitoring – asking which person in a dialogue said what

·   The impact of “bad lectures” – presenting a flawed stimulus and asking where it went wrong

·   Post-dialogue questions such as “Why did one speaker repeat the question?”

·   The correlation between performance on monologues and dialogues

 

We recommend additional research to support the construct validity of the LC test, including studies that correlate LC scores with specific law school course grades (see the section, Construct Validity).  One highly valuable construct validity study would begin with the derivation of a list of important abilities that is more specific than the one provided by the skills analysis (perhaps through analysis of specific law school tasks).  LSAC might also ask law school professors to examine the field test items and to indicate what abilities the items tap, to what extent they tap them, and how important those abilities are.

 

Additional research into the construct as it applies to non-native speakers of English may also be of value.  There is a need for in-depth impact analyses, particularly for Asian/Pacific Islander examinees (see the section, Effects on Examinee Sub-groups).

 

Given that item response times are often somewhat different in CBT than in paper-and-pencil testing, even with text-based items, an investigation into item-level timing should be conducted (see the section, Other CBT Considerations).  In addition, research should be conducted on features associated with item difficulty, for example, the location of the point tested, the density and structure of the stimulus, and the level of vocabulary.

 

A discourse analysis of the language used in first year law school classes is also recommended. This analysis could be used as a baseline for the kind of language in item stimuli as well as the nature of lectures (discourse structure, rhetorical features, information structures). This kind of information might help to differentiate listening from reading stimuli, again with the aim of lowering the correlation between the two. It might also increase understanding of item difficulty (if certain features correlate with greater item difficulty).

Research into the effectiveness of notetaking may also be useful.

 

Finally, we suggest conducting studies on the effect of regional accents; these may be more relevant than dialect studies.

 

 

Conclusion

 

         LSAC has made a promising start on the development of a listening comprehension test that would be part of the LSAT.  We recommend that the research be continued with the main aim of developing theoretically motivated items that assess skills relevant to success in law school.  These items should test abilities unique to various sorts of listening, including the listening that occurs when individuals participate in dialogue.  A successful test of such abilities is likely to yield scores that are not highly correlated with the tests of reading comprehension and logical reasoning that are already part of the LSAT, and such scores may help to improve predictions about success in law school.

                                              

                                                                                             Robert Bostrom,

                                                                                             University of Kentucky

                                                                                             Robert French,

                                                                                             Educational Testing Service

                                                                                             Philip Johnson-Laird,

                                                                                             Princeton University

                                                                                             Cynthia Parshall,

                                                                                             Measurement Consultant

                                                                                             November 30th, 2004


References

 

Addington, D. W. (1968). The relationship of selected vocal characteristics to personality perception. Speech Monographs, 35, 492-505.

Aikin, J. (1985). Digital sampling keyboards: What’s available, how they work, why they’re hot. Keyboard, 32-41.

Alexander, E. R., Penley, L. E., and Jernigan, I. E. (1992). The relationship of basic decoding skills to managerial effectiveness. Management Communication Quarterly, 6, 58‑73.

American Psychological Association Committee on Professional Standards and Committee on Psychological Tests and Assessment. (APA). (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author

Association of Test Publishers (ATP). (2000). Computer-Based Testing Guidelines.

Bachman, L. (1990) Fundamental Considerations in Language Testing. New York: Oxford University Press.

Baddeley, A. and Dale, H. (1968). The effect of semantic similarity on retroactive interference in long‑ and short‑term memory. Journal of Verbal Learning and Verbal Behavior, 5, 417-420.

Baddeley, A.D. (1981). The concept of working memory: a view of its current state and probable future development. Cognition, 10, 17-23.

Baddeley, A.D. (1986). Working Memory. Oxford: Oxford University Press.

Baddeley, A.D. (1996). Exploring the central executive. The Quarterly Journal of Experimental Psychology, 49A, 5-28.

Bostrom, R. N. (1990). Listening behavior: Measurement and applications. New York: Guilford.

Bostrom, R. N. (1996a). Aspects of listening behavior. In O. Hargie (Ed.), Handbook of communication skills (2nd ed., pp. 236-259). London: Routledge.

Bostrom, R. N. and Bryant, C. (1980). Factors in the retention of information presented orally: the role of short‑term memory. Western Speech Communication Journal, 44, 137‑145.

Bostrom, R. N. and Waldhart, E. S. (1980). Components in listening behavior: the role of short‑term memory. Human Communication Research, 6, 211‑227.

Bostrom, R. and Waldhart, E. (1988). Memory models in the measurement of listening. Communication Education, 37, 1‑12.

Brown, J., and Carlsen, R. (1955). Brown‑Carlsen listening comprehension test. New York: Harcourt, Brace and World.

Burgoon, J. (1985). Nonverbal signals. In Knapp, M., and Miller, G. (Eds.) Handbook of interpersonal communication (2nd ed.) (pp. 344‑393). Beverly Hills, CA: Sage.

Bussey, J. (1991, April). Question asking in an interview and varying listening skills. Paper delivered at the Annual Meeting of the Southern Communication Association, Tampa, Florida.

Chang, T. (1986). Semantic memory: facts and models. Psychological Bulletin, 99, 199‑220.

Collins, A., and Quillian, M. (1972). Experiments on semantic memory and language comprehension. In L. Gregg, (Ed.) Cognition in learning and memory (pp. 117‑137). New York: Wiley.

Educational Testing Service. (1984, February). Test Analysis: Core Battery. Unpublished statistical report. Princeton, NJ: Educational Testing Service.

Ekman, P. (1998). Afterword. In Darwin, C. (1872/1998), Expression of emotions in man and animals (3rd ed.; ed. P. Ekman), pp. 363‑395. New York: Oxford.

Ekman, P., and Friesen, W. (1969). Non‑verbal leakage and clues to deception. Psychiatry, 32, 88‑106.

Finelli, P. M. (1989).  MIDI, sampling, and computers.  In Sound for the Stage: A Technical Handbook (pp. 121-131). New York: Drama Book Publishers.

Flowerdew, J. (1994) Academic Listening: Research Perspectives. Cambridge: Cambridge University Press.

Gardner, H. (1983). Frames of mind: The theory of multiple intelligences. New York: Basic Books.

Garnham, A. (2001). Mental Models and the Interpretation of Anaphora.  Hove, East Sussex: Psychology Press.

Gould, J.D., Boies, F.J., and Ukelson, J. (1997). How to design usable systems. In Helander, M., Landauer, T.K., and Prabhu, P. (Eds.), Handbook of human-computer interaction (2nd, completely revised edition, pp. 231-254). New York: Elsevier Science Publishers.

Huber, D. M. (1996). Get real: Using RealAudio to hear in real time from the Internet. EQ: The Project Recording and Sound Magazine, 6, 88-90.

Johnson-Laird, P.N. (1983) Mental Models.  Cambridge MA: Harvard University Press.

Johnson-Laird, P.N., Herrmann, D.J., and Chaffin, R.  (1984) Only connections:  a critique of semantic networks.  Psychological Bulletin, 96, 292-315.

Keely‑Dyreson, M. Burgoon, J. K., and Bailey, W. (1991). The effects of stress and gender on nonverbal decoding accuracy in kinesic and vocalic channels. Human Communication Research, 17, 589‑605.

Kelly, C. (1965). An investigation of the construct validity of two commercially published listening tests. Speech Monographs, 32, 139‑143.

Kelly, C. (1967). Listening: a complex of activities or a unitary skill? Speech Monographs, 34, 455‑466.

King, P. E. and Behnke, R. R. (1989). The effect of compressed speech on comprehensive, interpretive, and short‑term listening. Human Communication Research, 15, 428‑443.

Kintsch, W. (1980). Semantic memory: a tutorial. In R.S. Nickerson (Ed.), Attention and performance VIII (pp. 595‑620),  Hillsdale, New Jersey: Erlbaum.

Kintsch, W., and Busche, H. (1969). Homophones and synonyms in short‑term memory. Journal of Experimental Psychology, 80, 403‑407.

Klemmer, E. and Snyder, F. (1972). Measurement of the time spent communicating. Journal of Communication, 22, 142-158.

Landauer, T.K. (1995). The trouble with computers: usefulness, usability, and productivity.  Cambridge, Mass.: MIT Press.

Lewis, R.L. (1999). Accounting for the fine structure of syntactic working memory: Similarity-based interference as a unifying principle. Behavioral and Brain Sciences, 22, 105-106.

Li, Z-N. (Course Instructor) (1997). CMPT 365: Multimedia Systems [web page, summarizing course content]. School of Computing Science, Simon Fraser University.

Lombardi, V. (1997). Audio on the Internet [web page, updated version of Graduate Thesis]. Music Technology Program, NYU Graduate School of Education.

Leathers, D. and Emigh, T. (1980). Decoding facial expressions: a new test with decoding norms. Quarterly Journal of Speech, 66, 418‑36.

Loftus, G., and Loftus, E. (1976). Human memory: the processing of information. New York: Wiley.

Mislevy, R., Almond, R., and Lukas, J. (2003). A Brief Introduction to Evidence-centered Design. Research Report RR-03-16. Princeton, NJ: Educational Testing Service.

Moog, B. (1985). Sound sampling instruments, part 2: Technical specs and sound quality in digital sampling. Keyboard, 97.

Nielsen, J.  (1994).  Guerrilla HCI:  Using discount usability engineering to penetrate the intimidation barrier. In Bias, R.G. and Mayhew, D.J.  (Eds.). Cost-Justifying Usability.  Boston:  Academic Press.

Olson, K.  (2003).  LSAT Listening Assessment: Theoretical Background and Specifications.  LSAC Research Report 03-02, in press.

Pan, D. Y. (1993). Digital audio compression. Digital Technical Journal, 5, 1-14.

Parshall, C. G., Davey, T., and Pashley, P. (2000). Innovative item types for computerized testing. In W. J. van der Linden and C. A. W. Glas (Eds.), Computerized Adaptive Testing: Theory and Practice.  pp. 129-148. Norwell, MA: Kluwer Academic Publishers.

Parshall, C. G., Spray, J. A., Kalohn, J. C., and Davey, T. (2002). Practical Considerations in Computer-Based testing. New York: Springer-Verlag.

Plumer, G. (2003) Potential New LSAT Item Types: Next Steps.  LSAC (10/8/2003).

Powers, Donald E. (1986) Academic Demands Related to Listening Skills. Language Testing, 1-38.

Rankin, P.  (1929).  Listening ability.  Proceedings of the Ohio State Educational conference.  Columbus, OH: Ohio State University Press.

Schulman, H. (1972). Semantic confusion errors in short‑term memory. Journal of Verbal Learning and Verbal Behavior, 11, 221‑227.

Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. Urbana: University of Illinois Press.

Spelke, E.S., and Tsivkin, S. (2001) Language and number: A bilingual training study. Cognition, 78, 45-88.

Spitzberg, B., and Hurt, T. (1983). Essays on Human Communication. Lexington, MA: Ginn and Company.

Squire, L. (1986). Mechanisms of memory. Science, 232, 1612‑1619.

Streeter, L.A., Krauss, R.M., Geller, V., Olson, C., and Apple, W., (1977). Pitch changes during attempted deception. Journal of Personality and Social Psychology 35, 345‑350.

Sypher, B. D., Bostrom, R. N. and Seibert, J. H. (1989). Listening, communication abilities, and success at work. Journal of Business Communication, 26, 293‑303.

Thomas, L. T., and Levine, T. R. (1994). Disentangling listening and verbal recall: related but separate constructs? Human Communication Research, 21, 103‑127.

Tullis, T. (1997).  Screen design.  In Helander, M., (Ed.). Handbook of human-computer interaction.  (pp. 377-407).  New York:  Elsevier Science Publishers.

Waldhart, E. S. and Bostrom, R. N. (1981, March). Notetaking and listening skills. Paper presented at the annual meeting of the International Listening Association, Washington, DC.

Wiener, G. (Ed.) (1996). Frequently Asked Questions for rec.audio.pro [web page, found at: www.cis.ohio-state.edu/hypertext/faq/usenet/AudioFAQ/pro-audio-faq/faq.html].  PGM Early Music Recordings, Quintessential Sound, Inc.

Yang, Y., and Johnson-Laird, P.N. (1999) A study of complex reasoning: The case of GRE “logical” problems. Proceedings of the Twenty First Annual Conference of the Cognitive Science Society, 767-771.

Yang, Y., and Johnson-Laird, P.N. (2001). Mental models and logical reasoning problems in the GRE. Journal of Experimental  Psychology: Applied, 7, No. 4, 308-316.

Yang, Y., and Johnson-Laird, P.N. (2002) "If" is easier than "or" in the GRE.  In Gray, W.,and Schunn, C.D., (Eds) Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, Fairfax, VA.  Mahwah, NJ: Erlbaum, 1055.