Bullock (1975)

(page numbers in brackets)


Preliminary pages (i-xxxvi)
Foreword, Membership, Contents, Introduction

Part 1 Attitudes and Standards
Chapter 1 (3-9)
Attitudes to the teaching of English
Chapter 2 (10-35)
Standards of reading
Chapter 3 (36-44)
Monitoring

Part 2 Language in the Early Years
Chapter 4 (47-50)
Language and learning
Chapter 5 (51-74)
Language in the early years

Part 3 Reading
Chapter 6 (77-96)
The reading process
Chapter 7 (97-114)
Reading in the early years
Chapter 8 (115-123)
Reading: the later stages
Chapter 9 (124-138)
Literature

Part 4 Language in the Middle and Secondary Years
Chapter 10 (141-161)
Oral language
Chapter 11 (162-187)
Written language
Chapter 12 (188-193)
Language across the curriculum

Part 5 Organisation
Chapter 13 (197-212)
The primary and middle years
Chapter 14 (213-219)
Continuity between schools
Chapter 15 (220-237)
The secondary school
Chapter 16 (238-242)
LEA advisory services

Part 6 Reading and Language Difficulties
Chapter 17 (245-265)
Screening, diagnosis and recording
Chapter 18 (266-276)
Children with reading difficulties
Chapter 19 (277-283)
Adult literacy
Chapter 20 (284-295)
Children from families of overseas origin

Part 7 Resources
Chapter 21 (299-313)
Books
Chapter 22 (314-327)
Technological aids and broadcasting

Part 8 Teacher Education and Training
Chapter 23 (331-346)
Initial training
Chapter 24 (347-356)
In-service education

Part 9 The Survey
Chapter 25
I Introduction (359-365)
II Primary Commentary (365-402)
III Secondary Commentary (402-443)
IV Questionnaire Tables (444-502)
V Technical Notes (502-510)

Part 10 Summary of Conclusions and Recommendations
Chapter 26 (513-560)
Conclusions and recommendations

Appendix A (561-576)
Witnesses and sources of evidence
Appendix B (577-584)
Visits made

Glossary (585-595)
Index (596-609)


The Bullock Report (1975)
A language for life

Report of the Committee of Enquiry appointed by the Secretary of State for Education and Science under the Chairmanship of Sir Alan Bullock FBA

London: Her Majesty's Stationery Office 1975
© Crown copyright material is reproduced with the permission of the Controller of HMSO and the Queen's Printer for Scotland.


[page 36]

CHAPTER 3

Monitoring

3.1 Before we go on to consider how improvement can be secured in reading and the use of language we think it appropriate to complete this section on standards by setting out our conclusions on how these can be more effectively monitored. The national sample surveys carried out periodically by the National Foundation for Educational Research have provided useful indicators of progress in basic reading ability since 1948, but they have become increasingly difficult to interpret. In describing some of their shortcomings we have pointed out that the items represent only a limited sample of reading skills. The narrowness of the tests certainly ensured a high degree of precision in measurement, but it meant that their relevance was bound to be questioned. Tests which are limited to only one facet of a complex intermingling of skills clearly cannot supply information of a right quality. Such a conclusion poses an obvious question. Should some form of national survey be continued, and if so should it not be operated more systematically and according to more ambitious principles?

3.2 We are in no doubt of the importance of monitoring standards of achievement in literacy, and of doing so by using the most sophisticated methods possible. There will always be keen interest in the movement of standards, and it is perfectly natural that there should be. Where there is no information there will be speculation, and the absence of facts makes room for prejudice. We began the Report by pointing out how difficult it is to make reliable statements on standards of English today in comparison with those in the past. Opinion on this issue tends to polarise, and the lack of objective data is a serious handicap to rational discussion. Information of the right quality will be of value to teachers and researchers and will be a reference point for policy decisions at the level of central and local government.

3.3 We have also suggested that a wider and more demanding definition of literacy should be adopted. The existing criterion is determined by the reading standards of seven and nine year old children of many years ago on tests whose limitations are acknowledged. It should be replaced by a criterion capable of showing whether the reading and writing abilities of children are adequate to the demands made upon them in school and likely to face them in adult life. What we are proposing, then, is an entirely new approach. We are suggesting that monitoring should be extended beyond the limit of a single dimension to give more information than has ever been available before.

3.4 Obviously, no system of monitoring can encompass all the various objectives in English promoted by a wide variety of schools, let alone all the individual teachers within them. An ideal system would apply measurement continuously to the whole range of learning activity and weight the resulting indices according to importance. This is clearly far too ambitious. Nevertheless, our proposal is that the procedure should assess a wider range of


[page 37]

attainments than has been attempted in the past. What is required, therefore, is an instrument that combines practicability with a more comprehensive and therefore more realistic sampling of the skills.

3.5 Monitoring should employ an array of techniques of a kind that will make assessment both reliable and valid. Ideally, it should not set up 'backwash' effects of any kind, and by design it should rule out the possibility of specific teaching to achieve good test results. Assessment is possible only by examining the explicit products of school activity, and the instruments of assessment should therefore include samples of performance considered to be important and representative of attainments. They should also be responsive to developments in the curriculum. This suggests that the instruments should incorporate the means for discarding out of date procedures and materials and for introducing and validating experimental methods. Monitoring should embrace teaching objectives for the entire ability range, since only by measuring the lowest and highest attainments is it possible to obtain sound general indices. We believe the device that would best answer these needs is the item pool, which is described at greater length later in the chapter. It entails the collection of a large stock of test items, wide enough in range to cover as many aspects of the various abilities as it is felt appropriate to assess. Selection from this pool would be made each time the monitoring instrument was applied. We recommend that monitoring should be administered on the basis of light sampling and frequent occurrence, and that the results should be published annually. This method, described later, would have many advantages, not the least of which would be that a continually accumulating body of information would replace the practice of sudden disclosures at four-yearly intervals. The responsibility for monitoring should lie with a national research organisation, such as the National Foundation for Educational Research, and the process should involve teachers and other educationists at all points, from the definition of objectives to the compilation of results. 
The nature of this involvement is outlined in paragraph 3.16, and it will be seen that an adequate period of preparation would be necessary. We recommend that 1977 should be the target date for the first application of the new monitoring procedure.

3.6 Before discussing the process in greater detail we must consider the question of the most appropriate age points at which it should be applied. Clearly, the pupils must be able to work at a task without support or advice. Their capacity to use English for themselves is in itself an important aspect of enquiry. Moreover, it is one of our proposals that a selection of different assessment tasks should be distributed between the pupils in any one class. Eleven is still the age of transfer for most children, and the point where their education becomes more specialised. At this age pupils with reading and language difficulties face a situation where their deficiencies will be under still greater pressure. It is clearly a sensitive age point and one where objective information would be of particular value.

By the end of his school life a pupil should have reached certain levels of achievement in reading and producing language. The statutory leaving age might therefore seem a natural point to assess what proportion of young people have succeeded or failed in this objective. However, there are obvious arguments against choosing this as the point at which to apply the second


[page 38]

stage of monitoring, not least the incidence of external examinations. We have therefore concluded that fifteen would be the most suitable age for the second application of the monitoring procedure. Eleven and fifteen, the ages at which previous surveys have been carried out, have obvious advantages and should continue to be the points at which tests are administered.

3.7 A criticism of many methods of assessment is that they are applied only to attainments that can be directly and objectively measured. Other attributes, arguably of greater importance, are excluded because the marking is felt to be too subjective or likely to be too cumbersome and costly. There is an undoubted logistical appeal in multiple choice items which can be machine scored, especially when the reliability of the test and the precision of the results are thereby increased. The limitations of this technique are obvious, but its alternatives would have to prove themselves valid, reliable, and logistically efficient on the scale required. The feasibility of such alternatives to multiple choice testing is considered below.

3.8 In agreeing that the present means of measuring standards is too narrow in concept we concluded that the reading test should assess a wider variety of reading skills. At the most obvious level the test should determine whether the child is able to extract meaning from the page. It should then assess whether he can discern implied as well as explicit meaning, evaluate the material in terms of its own internal logic and of other evidence, and reorganise it in terms of other frames of reference. Passages would be selected for readability and calibrated for difficulty, and span a range wide enough to encompass a number of functions. These would include the descriptive, the narrative, and the expository, all within the range typically encountered by children in their school experience. Chapter 8 contains recommendations about higher order skills and reading in the curriculum areas, and we see these activities as contributing to the item pool upon which the monitoring instrument will draw. The information resulting from all this would indicate far more effectively than earlier data the extent to which reading proficiency had been developed to serve personal and social needs. The survey instruments would include a balanced mixture, with multiple choice questions for the simpler items and open-ended questions for the more complex and evocative material. The first are attractive on the grounds of economy, objectivity, and ease of scoring. The second can be framed in such a way that the pupil's responses to a sentence or paragraph might be reliably scored by impression markings. These responses can provide a wealth of data to assist researchers and teachers alike in interpreting the empirical analyses. Answers to the open-ended questions will need the controlled subjectivity of multiple marking. 
This implies that skilled and experienced markers will be required and their performance assessed for consistency and accuracy, and that scoring rubrics will have to be developed through a series of trials. The establishment of item pools (see 3.15) will permit a far greater range of test materials to be collected and used in a survey than could be incorporated in a single test designed to be completed by pupils in about half an hour. As a temporary expedient the NS6 test should remain in operation to ensure a continuing baseline until a new datum can be established. This would be achieved by linking the test to items in the pool to relate the future results of all such items with the old data. We emphasise, however, that the existing tests should be dispensed with at the earliest opportunity.


[page 39]

3.9 So far there has been no attempt to monitor standards of achievement in writing, and we recommend that the practice should now be introduced. The reasons advanced for periodical measurement of reading apply with equal force to writing, and the two sets of results would be mutually illuminating. It has to be acknowledged that to test writing on this scale is not a simple matter. The first questions to be answered are: what features of writing should be tested? by what criteria is one to measure them? how are reliability and validity to be ensured? Writing is a highly complex activity, and no test would be adequate that measured a narrow segment of its spectrum. This constraint has to be reconciled with the need for as economical a marking system as possible, and the difficulties are at once plain. At first sight the most obvious prerequisite for assessing writing would seem to be an agreement upon what can be expected of a child at a given age. There are, however, so many variables at work that it soon becomes clear that this agreement is not possible. There have been several attempts to establish criteria for maturity in writing: mean length of composition, sentence length, the subordination index, the minimum terminable unit, etc. We can take the last named as an example. The minimum terminable unit, or T-unit, is 'roughly any sentence or part of a sentence that is an independent clause, possibly containing, however, one or more dependent clauses'. (1) The average length of the T-unit, it has been argued, indicates 'syntactic maturity', and a child is seen to make slow but consistent progress as measured by this index. But a piece of writing might well have syntactic maturity and yet be wanting in organisation and content. Conversely, writing of high quality can employ a simple style that would not necessarily yield a high score as measured by the T-unit. 
Indeed, it is the mark of a mature writer to recognise the demands of his subject and construct his prose accordingly. Equally, it would be possible to work to a simple measure of correctness in grammar, usage, punctuation, and spelling; but important as these are, no one would suppose that they are the principal criteria by which the material should be judged. The conclusion is that writing can be adequately tested only by the scrutiny of a number of examples in which the child has had to cope with a variety of demands. An important measure of success in writing is to differentiate between the styles appropriate for particular purposes. To present tasks which call upon this ability would give more complete information than could be obtained from a single assignment. For example, at 11 the monitoring procedure might include writing that is autobiographical and narrative, explanatory and descriptive. At 15 it should be extended to involve higher levels of abstraction and greater complexity, and to include writing that answers the needs of various areas of the curriculum.
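
The T-unit index discussed above reduces to simple arithmetic once a script has been divided into T-units. The following is a minimal sketch in Python, assuming the division has already been done by hand; the function name `mean_t_unit_length` and the sample sentences are hypothetical and are not drawn from Hunt's study. It illustrates only the index itself, which, as the paragraph notes, says nothing about organisation or content.

```python
def mean_t_unit_length(t_units):
    """Return the mean number of words per T-unit.

    `t_units` is a list of strings, each one T-unit that a marker has
    already separated by hand (an assumption of this sketch).
    """
    lengths = [len(unit.split()) for unit in t_units]
    return sum(lengths) / len(lengths)


# Invented example: one short T-unit and one with a dependent clause.
sample_t_units = [
    "The dog barked",
    "and the cat ran away when it heard the noise",
]
print(mean_t_unit_length(sample_t_units))
```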

3.10 We therefore envisage the monitoring procedure in this area as consisting of a variety of tasks requiring different kinds of writing. Assessment of the scripts would involve 'impression marking' by small teams of markers, and in addition coding schemes would be applied for accuracy in spelling, punctuation, grammar and such other features as might be specified. There is convincing evidence to show that teams can achieve a good standard of consistency while dealing with large numbers of scripts. (2) It has been found, for example, that the averages of two sets of three persons marking by impression agree more closely than the impression marking of two individuals randomly chosen. It would, of course, be necessary to obtain a degree


[page 40]

of consistency which would allow comparison with previous years, a feature essential to monitoring. This could be achieved by including in each batch of scripts a proportion from the first year's test. It is an essential principle of the item pooling system that assessment can reflect changes in the use of language and stylistic differences over the years.
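
The finding that team averages agree more closely than individual marks can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the study cited: it supposes that each marker's impression mark is a script's underlying quality plus independent random error, and the figures chosen (2,000 scripts, an error spread of 1 mark) are invented for the purpose.

```python
import random
import statistics


def simulate_marking(n_scripts=2000, noise_sd=1.0, team_size=3, seed=0):
    """Compare the agreement of two individuals with that of two teams.

    Assumed model: mark = underlying quality + independent random error.
    Returns (mean gap between two individuals, mean gap between two teams).
    """
    rng = random.Random(seed)
    individual_gaps = []
    team_gaps = []
    for _ in range(n_scripts):
        quality = rng.gauss(10.0, 2.0)

        def mark():
            # One marker's impression mark for this script.
            return quality + rng.gauss(0.0, noise_sd)

        # Two individual markers chosen at random.
        individual_gaps.append(abs(mark() - mark()))
        # Two teams, each averaging the marks of `team_size` markers.
        team_a = statistics.mean(mark() for _ in range(team_size))
        team_b = statistics.mean(mark() for _ in range(team_size))
        team_gaps.append(abs(team_a - team_b))
    return statistics.mean(individual_gaps), statistics.mean(team_gaps)


individual_gap, team_gap = simulate_marking()
print(f"mean gap between two individuals: {individual_gap:.2f}")
print(f"mean gap between two teams of three: {team_gap:.2f}")
```

Under this model the errors of the three markers in a team partly cancel, so the gap between team averages is smaller. Real markers may share biases, in which case the gain would be less than the simulation suggests.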

3.11 There is no doubt that multiple marking of this kind is far more difficult to operate than mechanical marking, not to say more costly. However, the light sampling we advocate would mean that the number of scripts to be handled at any one time would be comparatively small. Moreover, we believe there is no substitute for specimens of children's actual writing as material for assessing standards. It has been argued that multiple marking of continuous writing adds very little to what can be gained from interlinear tests, in which the child is required to correct errors which have been deliberately introduced into a passage of prose. Nevertheless, we believe that the assessment should involve the generation of continuous language, not merely a response to it. Many teachers would feel as we do that a child's ability to write cannot be judged without studying what he has actually written. A test that relied solely on the child's ability to detect and correct errors in what someone else had written would be unlikely to command general confidence.

3.12 There is very much more to producing writing of quality than avoiding breaches of the accepted norms of standard English. Nevertheless, this aspect of the task is by common consent held to be important, and it should feature appropriately in any monitoring procedure. By applying the coding schemes the markers would be able to measure competence in it. Furthermore, the researcher responsible for the survey could take the scripts at any particular level and conduct an analysis of errors occurring in the writing. This would provide a descriptive comment on the standards obtained and a qualitative comment on the report itself. In addition to the application of coding schemes to the scripts, the monitoring process could include such objective measures as multiple choice and interlinear tests. These structure the situation in which a child is asked to demonstrate aspects of his mastery of written language, and we recommend that they be included in the pool of items from which tests are made up. The impression marking of scripts, the application of coded schemes, and the inclusion in the pool of objective items would together give a comprehensive assessment of standards of writing.

3.13 We discussed at length the feasibility of monitoring standards of spoken English, which is complicated by the increase in the number of variables and in the element of subjectivity. There has been a certain amount of research into the viability of testing speaking and listening, and there is the experience of examination boards to draw upon. These would be helpful sources of information if monitoring were to be extended into this field. In the course of our discussions we reviewed existing tests of the skills involved. We also considered such techniques as teacher-led group discussion, pupil to pupil conversation, response to taped speech and questions, and assessment of group production in contrast with individual contribution. It would not be difficult to devise 'test' situations which would call upon the use of different kinds of language. The logistics of the widespread involvement of teachers and use of apparatus would be challenging, but not as formidable


[page 41]

as might at first appear. However, there are in our view certain fundamental obstacles. The nature of the activity is such that in testing it there is a danger of distorting it, and the problem of artificiality is a real one. Moreover, there is no doubt that many technical matters would have to be explored before tests of oral ability on this scale could be considered viable. The biggest problem would be that of comparing standards on a national basis and across the years. It would be necessary to store tapes in quantity to enable comparisons to be drawn, and the additional variables make this a less dependable device than the corresponding procedure in writing. We do not believe that in the present state of development it is practicable to introduce the monitoring of spoken English. This recommendation emerges from a consideration of the balance between gain and the difficulties of operation. The balance may shift if some of the latter can be removed, notably that of artificiality. Some useful research, both here and in the USA, has already pointed the way, and we recommend that further research be conducted into the development of suitable monitoring instruments and economical procedures.

3.14 In the monitoring of reading and writing we recommend a new style of assessment which will allow for an extensive coverage of attainments without imposing a heavy testing load on individual pupils. The principle we suggest is the sharing of a selection of assessment tasks between a number of groups of pupils. At any one phase of assessment each group attempts a different set of exercises or items from that of every other group. The performance of the population is thus estimated by the performance of the separate groups taken together. The levels of attainment in a single test will be represented as the mean score obtained by each sample of pupils.

3.15 On every occasion when monitoring assessment takes place the test material presented to the pupils will be drawn from a large pool stocked with carefully developed items. The variety of sources from which this stock draws will ensure an extensive coverage of the area to be assessed, in contrast to the inevitably narrow forms of measurement afforded by a single test. From this central pool selections of different question types will be constituted into tests, i.e. concentrations of items following a predetermined pattern of characteristics. The sets of questions will be compiled in such a way as to be of roughly equal standard, each set containing items from the simple to the difficult. At the latter end of the scale the test can make considerable demands and thereby avoid a weakness of existing survey tests; namely their fixing of a 'ceiling' too low to measure the real capacity of the most able children. The approximate equivalence of the sets of questions will be assured statistically, by means of performance norms established over all the schools and pupils tested. The exact equivalence of each test is not essential, since they are being used to assess group performance, not to award marks to individuals.
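
One way of compiling sets of roughly equal standard from a pool is sketched below in Python. It is an illustration only: the report does not prescribe a method, and the names `build_forms` and `mean_difficulty`, the pool of 40 items, and the difficulty values are all hypothetical. Items are sorted by calibrated difficulty and dealt to the tests in alternating directions, so that every test runs from the simple to the difficult.

```python
def build_forms(pool, n_forms):
    """Deal items into `n_forms` tests in serpentine order of difficulty.

    `pool` is a list of (item_id, difficulty) pairs; the difficulty values
    are assumed to come from pre-testing.
    """
    ordered = sorted(pool, key=lambda item: item[1])
    forms = [[] for _ in range(n_forms)]
    for index, item in enumerate(ordered):
        round_number, position = divmod(index, n_forms)
        # Reverse direction on alternate rounds so that no test always
        # receives the easiest (or hardest) item of a round.
        if round_number % 2 == 0:
            target = position
        else:
            target = n_forms - 1 - position
        forms[target].append(item)
    return forms


def mean_difficulty(form):
    """Average difficulty of the items in one test."""
    return sum(difficulty for _, difficulty in form) / len(form)


# Invented pool: 40 items with evenly spaced difficulty values.
pool = [(f"item{i:02d}", 0.10 + 0.02 * i) for i in range(40)]
forms = build_forms(pool, n_forms=4)
for number, form in enumerate(forms, start=1):
    print(number, len(form), round(mean_difficulty(form), 3))
```

As the paragraph observes, exact equivalence is not essential, since the tests assess group performance; a dealing scheme of this kind aims only at approximate balance, which would then be checked against performance norms.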

3.16 In constructing the question pool the first task will be to specify the nature of the content and the objectives which it is hoped the pupils are achieving. This specification should be drawn up by the research officers in accordance with the advice of a consultative panel of teachers, LEA advisers, and other educationists. The Department of Education and Science should be represented on the panel by HM Inspectorate. The result of this


[page 42]

process will be a 'blueprint' to guide question writers, who will generally be teachers trained for the purpose. The questions thus prepared for the pool will be examined in the first instance by a review group of teachers to eliminate unsatisfactory items. The agreed questions will then be reviewed by expert test constructors to eliminate or amend questions which show technical faults. A period of development will be necessary for the items to be pre-tested. The characteristics of each item have to be known before it can be decided finally whether to accept or reject it, and the relative difficulties of collections of items in each part of the pool will have to be determined empirically. When this has been done separate tests, consisting of items calibrated into scales, can be compiled by drawing from the pool.

3.17 The pools could be augmented with single items which are not comparable with the main body in terms of the content or task. As such these would not be calibrated in relation to other scaled tests but could nevertheless be included in a monitoring 'sweep' to indicate trends, try out new ideas for measurement, or simply function as survey material. In this latter case the purpose would not be to assess standards but to gather information about a specific ability. The percentage performing adequately at the task would be reported, but the information would not form part of the monitoring data. The tasks could be varied at will according to the kind of ability it was felt revealing to explore at any given time.

3.18 A signal advantage of question pooling is that it offers a degree of flexibility the single test can never provide. When the monitoring surveys are in train new exercises can be tried out alongside the calibrated items and thus 'chained in' at the appropriate point. Out of date material or examples found to be unsatisfactory can be discarded, while the repeated inclusion of an item will provide data which can be used to improve the accuracy of calibration. The major benefit of this flexibility is that it will be possible for the monitoring system to keep abreast of changes in the use of language and in teaching emphases in schools.

3.19 Until now, national assessment has involved administering one or two tests of reading of a 'large' sample and repeating the procedure with the same tests at roughly four-yearly intervals. We recommend that this form of assessment be replaced by one of light sampling, where the instruments are applied relatively frequently to a succession of 'small' samples. The principle is that monitoring should be applied once in every term, but to only 16 secondary or 32 primary schools on each occasion. 1,600 pupils would be required at one time, and eight of the tests from the pool would be divided among them, so that each test was completed by 200 children. By covering eight features of attainment in this manner it would be possible to gain a great deal of information without increasing the demand upon any one school or pupil. As a general rule a school would be selected only once in several decades, and a child would be unlikely to be involved more than once in his school life. Indeed, many children would complete their school days without ever encountering the monitoring process.
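
The arithmetic of this illustration can be set out in a short Python sketch. The figures of 1,600 pupils, eight tests, and 16 secondary or 32 primary schools come from the paragraph above; the assumptions that every school contributes the same number of pupils and that tests are handed out in strict rotation are made here purely for illustration, and the function name `allocate_forms` is hypothetical.

```python
from collections import Counter

N_TESTS = 8
PUPILS_PER_TEST = 200
TOTAL_PUPILS = N_TESTS * PUPILS_PER_TEST  # 1,600 pupils on each occasion


def allocate_forms(n_schools):
    """Assign tests to pupils in rotation, school by school.

    Assumes equal numbers of pupils per school and pupils listed in
    seating order (both are simplifications for this sketch).
    Returns a list of (school, seat, test) tuples.
    """
    pupils_per_school = TOTAL_PUPILS // n_schools
    allocation = []
    next_test = 0
    for school in range(n_schools):
        for seat in range(pupils_per_school):
            allocation.append((school, seat, next_test))
            next_test = (next_test + 1) % N_TESTS
    return allocation


for label, n_schools in (("secondary", 16), ("primary", 32)):
    allocation = allocate_forms(n_schools)
    per_test = Counter(test for _, _, test in allocation)
    print(label, TOTAL_PUPILS // n_schools, "pupils per school;",
          "pupils per test:", sorted(per_test.values()))
```

Handing out the tests in rotation means that each test is taken by the same number of pupils and that neighbouring pupils never receive the same one.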

3.20 The figures we have cited are merely illustrative. The numbers would, of course, be subject to alteration according to the degree of precision required in estimating the Mean score, and that in turn would depend upon the reliabilities achieved in the tests. This emphasises the need for adequate


[page 43]

resources to be provided for instrument development. If testing were carried out at termly intervals a rolling estimate of standards could be made over any given period of time. It would also be possible to acquire gradually an appreciation of how performance varies at different times of the school year, a matter on which little is known.
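
The dependence of precision on sample size, and the idea of a rolling estimate, can be sketched in Python as follows. The termly figures are invented, the window of three terms is an arbitrary choice, and the standard error shown treats the pupils as a simple random sample, which ignores the grouping of pupils within schools and so is likely to understate the true uncertainty. The function names are hypothetical.

```python
import math
import statistics


def standard_error(scores):
    """Standard error of the mean for one sample of scores.

    Simplification: treats the scores as a simple random sample.
    """
    return statistics.stdev(scores) / math.sqrt(len(scores))


def rolling_means(termly_means, window=3):
    """Average each run of `window` consecutive termly means."""
    return [
        statistics.mean(termly_means[i - window + 1 : i + 1])
        for i in range(window - 1, len(termly_means))
    ]


# Invented mean scores for six successive terms.
termly_means = [24.1, 25.0, 24.6, 24.9, 25.3, 25.1]
print(rolling_means(termly_means))

# Quadrupling the number of pupils halves the standard error,
# other things being equal.
small_sample = [20, 22, 24, 26, 28] * 10   # 50 invented scores
large_sample = small_sample * 4            # 200 invented scores
print(standard_error(small_sample), standard_error(large_sample))
```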

3.21 In the past, large surveys have been afflicted by unforeseen difficulties, e.g. gas and postal strikes. With light sampling such problems would become only temporary inconveniences. The few schools affected could be picked up as soon as administratively convenient, or in extreme circumstances omitted from the sequence of surveys altogether. The disruption would affect only a small proportion of the total sample entering the period over which Means were averaged to give rolling estimates. Spreading the amount of testing evenly over time and distributing the content and skills between pools of questions would reduce the demands on any one pupil's time to a very reasonable level; only one school period of about 45 minutes would be required. Moreover, the work required of participating schools would be no more onerous than under the old procedure. Careful organisation at the distribution stage would ensure that teachers were not asked to 'shuffle and deal' sets of tests. Instead they would be given a prepared package with the tests in order. Distributing different test forms to adjacent pupils would reduce the need for close supervision and the setting up of 'examination conditions' for whole year groups of children.

3.22 The operation of surveys in the past may have tended to underestimate the variation in reading difficulties in different parts of the country. It was suggested to us in evidence that there is a need for more detailed information about standards in certain localities, e.g. Educational Priority Areas. There seems to us a good case for monitoring to be selectively applied in this way where the information would be of additional value. The flexibility of the system we are recommending would allow such needs to be accommodated. We feel it necessary to emphasise, however, that the new system should be firmly established before any such extension is introduced and that the principle of light sampling should not be impugned.

3.23 Once the item pools have been established the survey can be operated by a small team supported by the consultative committee. The flow of work will be continuous, unlike that engendered by the large four-yearly survey which demanded an intensive effort over a short period from a temporary staff. This small team will be permanent in the sense that, although its members may change, it will maintain a continuity of function and experience. This will enable it to build up an expert knowledge of the growing body of data and its interpretation. One of its tasks will be to develop methods of presentation which will enable the results to be readily assimilated at all levels.

3.24 In conclusion we recommend that adequate research and development work should precede the introduction of such a system of monitoring. There are, of course, several aspects of it which would require investigation and detailed preparation. Fundamental to our concept of monitoring is an acceptance of the view that reading and writing are highly complex activities. If they are to be assessed with a subtlety which reflects this the instruments cannot depend entirely on simply scored objective measurements. There is


[page 44]

an obvious difficulty when impression marking is introduced into the process. Nevertheless, the stability of this kind of scoring over a period of time can be ensured by taking appropriate steps: (i) careful selection and initial training of marker teams, (ii) preserving some continuity of markers over a number of years, (iii) recycling earlier scripts for comparison with current ones, (iv) periodic agreement trials with selected materials. We believe the benefits of impression marking to be considerable and that every effort should be made to overcome the difficulties.

3.25 It will also be necessary to conduct research into the nature of the materials most suitable for the assessment tasks we have suggested. This would be an essential prelude to the creation of item pools, the character of which would itself require a good deal of thought. In addition, consideration would have to be given to the cost and time required for the pre-testing programme referred to earlier. Objective tests and coded assessment would require proper validation. Consultation and experiment would be necessary for the specification of a suitable coding scheme, which would have to be simple, economical, and rigorous.

3.26 It seems to us beyond question that standards should be monitored, and that this should be done on a scale which will allow confidence in the accuracy and value of the findings. We do not underestimate the complexities involved in establishing such a system as we have outlined. Nevertheless, we believe that if a monitoring system is to command the confidence of both the teaching profession and the general public it must present a comprehensive picture of the various skills that constitute literacy.

REFERENCES

1. W Kellogg Hunt, Grammatical Structures Written at Three Grade Levels. National Council of Teachers of English Research Report No. 3. Urbana, Illinois: NCTE: 1965.

2. JN Britton, NC Martin and H Rosen, Multiple Marking of English Compositions. Schools Council Examination Bulletin No. 12. HMSO: 1966.

3. See, for example, A Wilkinson, L Stratta and P Dudley, The Quality of Listening. Schools Council: 1974.
