Douglas Fuchs and Lynn S. Fuchs, Peabody College of Vanderbilt University; Patricia G. Mathes, University of Texas--Houston Health Science Center; Mark W. Lipsey and P. Holley Roberts, Peabody College of Vanderbilt University
Learning Disabilities Summit: Building a Foundation for the Future White Papers
Since Morgan's and Hinshelwood's pioneering work at the turn of the last century, there has been disagreement about the nature of learning disabilities (LD). In the past two decades, as its prevalence and the associated costs of special education to local and state governments have escalated, these discussions have taken on a high-stakes tone. Many interested parties are now openly questioning the meaningfulness (and usefulness) of the LD construct. Researchers have played an important role in this discourse. Using researcher-identified samples, the NICHD group has repeatedly demonstrated that poor readers with and without an IQ-achievement discrepancy have more in common (e.g., phonological deficits) than not. On this basis, the NICHD group and others argue that the IQ-achievement discrepancy should not be a criterion in LD identification.
According to Gottlieb, MacMillan, Shepard, and Ysseldyke and their respective colleagues, however, many school districts deliberately disregard discrepancy information. In contrast to the NICHD group's research, Ysseldyke and his associates used practitioner-identified samples to explore whether low achievers with and without the label are different from each other. Across a series of studies, they reported no educationally important differences between the two groups. This provocative claim inspired many others to try to replicate their work. Findings have been inconsistent, and for good reason: Investigators have explored different performance domains (e.g., reading achievement vs. classroom behavior); chosen dissimilar measures within a given domain (e.g., reading comprehension vs. phonemic awareness); used contrasting definitions of LD (e.g., IQ greater than or equal to 90 vs. IQ greater than or equal to 70) and low achievement (e.g., teacher judgment vs. cutoff scores); involved demographically different student groups (e.g., low vs. middle vs. high socioeconomic status; urban vs. suburban vs. rural); and based their statistical comparisons on different metrics (e.g., degree of overlap vs. mean performance). Bottom line: There is no consensus as to whether the two groups of low achievers--those whom the schools have labeled and those who remain unlabeled--are distinguishable.
If a comprehensive review of the empirical evidence shows that students with the LD label cannot be distinguished from their low-achieving (LA), nonlabeled classmates, then it would seem only reasonable to support the abolition of this disability category. After all, the logical alternative would be to declare all LA students learning disabled, an assertion that we believe would make little economic, political, or legal sense. On the other hand, if a systematic review of research shows that the school-identified LD group performs more poorly, in both a statistically significant and educationally meaningful sense, then we can assume that the two groups represent different populations of students. Such a result may lend weight to the view that students with the LD label have different educational needs, in degree or kind, which might be addressed only within special education (e.g., Mather & Roberts, 1994; National Joint Committee on Learning Disabilities, 1994).
With these and other questions in mind, we have identified and quantitatively synthesized the extant literature in the domain of reading. We have chosen this domain for several reasons. First, a majority of studies comparing LA students with and without the LD label focus on reading. Second, most children with LD are identified as such because of chronic reading problems. Third, reading difficulty strongly affects overall school achievement (e.g., Stanovich, 1986).
We searched the scientific literature on reading, coded each study that met our inclusion criteria, and analyzed the resulting data base. In the following sections, we summarize these methods and our results. We provide detailed information on the development of our coding system. For a thorough description of the literature search and data analysis, see Fuchs, Fuchs, Mathes, Lipsey, and Roberts (2001).
Our goal was to identify all published and unpublished studies in which the reading achievement of LD and LA nondisabled students could be compared. A study was defined by its participants: If two or more studies were conducted on the same students, the studies were counted as one. In a similar way, a single article could report more than one study if it included different samples of students with LD.
For inclusion, a study had to meet five criteria:
To identify studies that met these criteria, we undertook a comprehensive search of journal articles, Educational Resources Information Center (ERIC) documents, and dissertations in Dissertations Abstracts International (DAI) produced between January 1975 and December 1996. This search comprised three phases: a manual search of journals, two computerized database searches (ERIC and DAI), and an ancestral search of titles in the references of identified investigations. Eighty-six studies met our inclusion criteria.
To systematically derive information from the studies, we developed a coding form in two phases. As we initially read the studies, it was unclear which study characteristics would eventually prove worthy of coding. Therefore, in Phase 1, we described many study features, knowing some would later be discarded. We began by reading a considerable portion of the research and becoming familiar with the typical range of study features described. We then developed a first-draft coding form with which we independently coded a sample study--Shinn, Ysseldyke, Deno, and Tindal (1986). After debriefing, we developed a second draft and accompanying code book. Then, we independently coded four studies, including Shinn et al. (1986) for a second time. After coding each study, we again discussed each item on the coding form. Throughout this process, definitions of codes were refined and decision rules about handling ambiguous situations were determined.
At this point, we began coding studies. However, within a couple of weeks, unacceptably low levels of interrater agreement indicated a need for more precise definitions, so the coding form was revised again. As a result, 30 articles that had already been coded with the second draft had to be recoded. A 16-page coding form emerged from Phase 1 (contact the first author for the final coding form). Using this iteration of the form, five studies were coded with interrater agreement of 90% or better on each study.
Then, the remaining journal articles and ERIC documents were coded independently. During this coding process, to check whether the raters were continuing to code in the same way, they independently coded the same set of 13 studies. Agreement on each exceeded 85%.
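The agreement checks described above can be illustrated with a short sketch. It assumes that agreement was computed as simple percent agreement (matching codes divided by total codes); the text does not state the formula, so this is an assumption. The function name, item codes, and values are hypothetical.

```python
def percent_agreement(codes_a, codes_b):
    """Return the percentage of items on which two raters gave the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both raters must code the same set of items")
    if not codes_a:
        raise ValueError("At least one coded item is required")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


# Hypothetical codes from two raters for one study (illustrative values only).
rater_1 = ["descriptive", "grade 3", "urban", "WISC-R", "decoding"]
rater_2 = ["descriptive", "grade 3", "suburban", "WISC-R", "decoding"]

agreement = percent_agreement(rater_1, rater_2)
print(f"{agreement:.0f}% agreement")  # 4 of 5 codes match
```

Percent agreement does not correct for chance agreement; a chance-corrected index would require a different computation than the one sketched here.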
Recognizing the temptation to make reasonable inferences about information not clearly presented in studies, we instituted a no-guessing rule: If uncertainty arose about how to code an item, it was left blank. Later, an author determined the code. If questions still remained, the codes were discussed until consensus was achieved.
As we approached data entry, it became apparent that the 16-page coding form was too detailed; it contained codes inappropriate or irrelevant for many studies. Therefore, in Phase 2, the form was reduced to 45 codes that would be entered into the computer. During this scaling-down process, we added one code, "reading," which was redefined by various subdomains (e.g., phonological awareness, lexical retrieval, reading readiness).
The final coding form differed in appearance from the 16-page version because it was briefer and designed to match the computer spreadsheet. So, for example, both coding form and spreadsheet now displayed one line of data for every reading measure in a study.
Selected study codes were then transferred from the 16-page coding form to the briefer, final form. Before beginning this process, two coders independently transferred the codes of five studies from one form to the other, immediately checking accuracy. One coder then transferred the codes of all previously coded studies to the final coding form. An independent coder then checked this transfer of codes for every study.
Codes for 86 studies were entered into an electronic spreadsheet. To ensure accuracy, two checkers examined the spreadsheet item by item. As one person read the data base entry, the second person checked the information on the coding form.
Typically, effect size (ES) was computed as the standardized mean difference (d index): the difference between the means of the comparison groups divided by the pooled standard deviation (Hedges & Olkin, 1985). This formula represents LD-LA differences scaled in the uniform metric of standard deviation units. A positive ES reflects higher performance by the LA group. As recommended by Hedges (1981), this ES formula was adapted to yield an unbiased estimate of the underlying population effect. Whereas a majority of studies presented the information necessary to compute ES using the basic formula, some studies presented other comparison statistics. In such cases, ES was estimated from those other statistics.
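A minimal sketch of the basic computation follows. It assumes the usual pooled standard deviation with n1 + n2 - 2 degrees of freedom and the common approximation to the small-sample correction, 1 - 3 / (4 * df - 1); the text names Hedges (1981) but does not print the exact formula, so the correction shown is an assumption. Function names and the example scores are hypothetical.

```python
import math


def pooled_sd(sd_la, n_la, sd_ld, n_ld):
    """Pooled standard deviation of the two comparison groups."""
    df = n_la + n_ld - 2
    return math.sqrt(((n_la - 1) * sd_la**2 + (n_ld - 1) * sd_ld**2) / df)


def effect_size(mean_la, sd_la, n_la, mean_ld, sd_ld, n_ld):
    """Standardized mean difference; positive when the LA group scores higher."""
    return (mean_la - mean_ld) / pooled_sd(sd_la, n_la, sd_ld, n_ld)


def unbiased_effect_size(d, n_la, n_ld):
    """Shrink d slightly to offset small-sample bias (assumed approximation)."""
    df = n_la + n_ld - 2
    return d * (1 - 3 / (4 * df - 1))


# Hypothetical reading scores: LA group M = 100, LD group M = 90,
# both with SD = 15 and n = 30 (illustrative values only).
d = effect_size(100, 15, 30, 90, 15, 30)
g = unbiased_effect_size(d, 30, 30)
print(f"d = {d:.3f}, corrected ES = {g:.3f}")  # d = 0.667, corrected ES = 0.658
```

With samples of this size the correction changes the ES only in the second decimal place; it matters more for the very small groups found in some studies.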
We aggregated two or more ESs in the same study if those ESs were identical on eight variables: reading subdomain, research design, sample size for LD, sample size for LA, grade level, and IQ (Full Scale, Verbal, and Performance). Thus, any two ESs in the same study that did not match exactly on these eight dimensions were judged to be independent, with one important exception. In a few instances, subgroups of a sample differed in size but were identical with respect to the remaining seven variables. In these cases, ESs associated with these subsamples were eliminated. Also eliminated at this point were seven studies in which LD and LA students were matched on reading achievement or reading achievement and IQ. ESs from the remaining 79 studies were included in the meta-analysis.
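The matching rule can be sketched as a grouping step. The sketch assumes that "aggregated" means averaging the matching ESs, which the text does not state explicitly, and it does not implement the exception for subsamples that differ only in size. The field names and the example records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The eight matching variables named in the text; the field names are hypothetical.
MATCH_KEYS = (
    "subdomain", "design", "n_ld", "n_la",
    "grade", "iq_full", "iq_verbal", "iq_performance",
)


def aggregate_matching(effect_sizes):
    """Average ESs from one study that agree on all eight matching variables."""
    groups = defaultdict(list)
    for record in effect_sizes:
        key = tuple(record[k] for k in MATCH_KEYS)
        groups[key].append(record["es"])
    # One output record per distinct combination of the eight variables.
    return [dict(zip(MATCH_KEYS, key), es=mean(values))
            for key, values in groups.items()]


# Illustrative records from a single hypothetical study.
base = {"design": "descriptive", "n_ld": 25, "n_la": 25, "grade": 4,
        "iq_full": 95, "iq_verbal": 94, "iq_performance": 97}
study = [
    {**base, "subdomain": "decoding", "es": 0.50},
    {**base, "subdomain": "decoding", "es": 1.00},       # matches the first record
    {**base, "subdomain": "comprehension", "es": 0.30},  # differs on subdomain
]

for row in aggregate_matching(study):
    print(row["subdomain"], row["es"])
```

Here the two decoding ESs collapse into one record, while the comprehension ES is treated as independent because it differs on one of the eight variables.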
We undertook four preliminary analyses to formulate decisions about which data, in what form, should be incorporated into the major analyses. First, we examined the effect of four types of study designs: (a) descriptive/one point in time, (b) descriptive/change over time, (c) intervention/posttest only, and (d) intervention/change over time. We decided to conduct analyses on only one type, which had the vast majority of ESs: the descriptive/one-point-in-time studies (n = 202).
Second, we examined whether, and if so how, to consolidate data across the reading subdomains. We found that five reading subdomains (decoding isolated words, reading connected text, reading comprehension, overall reading, and vocabulary) yielded ES values sufficiently similar, as indexed by their central tendencies, to be considered comparable. However, the remaining subdomains (phonological awareness, rapid automatized naming, and reading readiness) were comparable neither with the other five subdomains listed previously nor with each other. The mean covariate-adjusted ESs for these three subdomains, respectively, were 0.05, 0.26, and -0.40. Thus, we did not combine these three subdomains with the remaining five subdomains or with each other. This left 172 ESs.
Third, with this smaller data base, we identified independent samples that contributed more than one ES. These records were aggregated by averaging all variables (except the reading subdomain). Because all other variable values in the averaged records were identical, a single record was produced for each independent sample. This resulted in a data file of 112 records, each representing an independent sample with an ES in one of the five reading subdomains or a mean ES averaged over two or more of the five subdomains.
Finally, the distribution of the 112 ESs revealed outliers at both ends. To reduce the possibly distorting effect of these outliers, we winsorized them. Two ESs less than -1.00 were increased to -1.00; five ESs greater than 1.75 were reduced to 1.75. Doing so had a minimal effect on the overall mean ES.
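The outlier adjustment amounts to pulling extreme values in to fixed bounds. The sketch below uses the two bounds reported above (-1.00 and 1.75); the function name and the example ESs are hypothetical, and the values are not taken from the data set.

```python
def winsorize(values, lower=-1.00, upper=1.75):
    """Pull values outside [lower, upper] in to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


# Illustrative ESs only; two of them fall outside the bounds.
effect_sizes = [-1.40, -0.20, 0.60, 1.10, 2.30]
print(winsorize(effect_sizes))  # [-1.0, -0.2, 0.6, 1.1, 1.75]
```

Unlike trimming, this keeps every record in the analysis and changes only the magnitude of the extreme values.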