The Hawthorne Effect

Dear Habermas
A Justice Site

Methods and Analysis

California State University, Dominguez Hills
University of Wisconsin, Parkside
Soka University Japan - Transcend Art and Peace
Created: April 16, 2004
Latest Update: April 16, 2004

jeannecurran@habermas.org
takata@uwp.edu

Backup of Chapter 7: Field Experiments
Included here under Fair Use Doctrine for teaching purposes.
From The New Precision Journalism by Philip Meyer

jeanne's comments: here's a different interpretation of the Hawthorne effect and a modern study looking for the effect.

Hawthorne effect

A better-known problem occurs when the subjects in an experiment realize that something special is happening to them. Just the feeling of being special can make them perform differently. Surely knowing that a Black Panther bumper sticker is on one's car could make one feel special.

This phenomenon is called the Hawthorne effect after a series of experiments at Western Electric Company's Hawthorne Plant in Chicago in 1927. Six women were taken from a large shop department that made telephone relays and placed in a test room where their job conditions could be varied and their output measured. Their task was fairly simple: assemble a coil, armature, contact springs, and insulators by fastening them to a fixture with four screws. It was about a minute's worth of work. Each time a worker completed one, she dropped it into a chute where an electric tape-punching device added it to the total for computing the hourly production rate.

To establish the base for a pretest-posttest design, the normal production rate was measured without the assemblers being aware of the measurement. Then the experiment was explained to them: how it was to test the effect of different working conditions such as rest periods, lunch hours, or working hours. They were cautioned not to make any special efforts but to work only at a comfortable pace.

What happened next has achieved the status of myth in the separate literature of both social science and business administration. In the former, it is regarded as a horror story. For the latter, it is considered inspirational.

The second variable in the experiment (Time 2) was the production rate for five weeks in the test room while the subjects got used to the new surroundings. Time 3 changed the piece rate rules slightly. Times 4, 5, and 6 changed the rest periods around. And so it went for eleven separate observations. And for each observation, production went up -- not up and down as the conditions were varied. Just up.

Nonplussed, the experimenters threw the test into reverse. They took away all the special work breaks, piece rates, and rest periods. Production still went up. They put back some of the special conditions. More improvement. No matter what they did, production got better.

Something was going on. It was "testing effect." The six women knew they were in an experiment, felt good about it, enjoyed the special attention, and were anxious to please. They formed a separate social set within the plant, had frequent contact with management, and took part in the decisions over how the experimental conditions were to be manipulated. Their participation and the sense of being special overrode the effect of the initial admonition to make no special effort and work only at a comfortable pace. The study never found out what combination of rest periods, lunch hours, or payment methods has the most effect on productivity. But it was not wasted. The company learned that production improves when management shows concern for workers, and when workers are "organized in cooperation with management in the pursuit of a common purpose."10 American management theorists took that idea to Japan after World War II, where it flourished, and it was eventually reintroduced to our shores in the 1980s. Those Hawthorne plant women were the first quality circle.

One of the flaws in the Hawthorne research design was that it tried to do too much. Expressed diagrammatically, it would look like this:

X1 X2 X3 X4 X5 X6 . . .

Following the notation system of Samuel Stouffer, we see many observations at different points in time. An experimental manipulation is inserted between each of the adjoining pairs of observations. A better design would have had a row of Y's parallel to the X's to represent a control group with a similar special room and the same amount of special attention but no changes in working conditions. Better yet, get a different group (randomly selected, of course) for each experimental condition. Make repeated measurements and insert the change somewhere in the middle, say between the third and fourth observations. In that way you can verify that the control group and the experimental group are not only alike to start with but are also responding in the same way to the passage of time and to the effects of being measured.
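
A small simulation can make the value of that parallel row of Y's concrete. The sketch below is illustrative only: the function names, the baseline of 50 units per hour, the size of the attention gain, and the size of the treatment effect are all invented assumptions, not figures from the Hawthorne study. Both groups drift upward simply from being watched; only the experimental group gets a change between the third and fourth observations.

```python
import random
import statistics


def simulate_group(n_obs, attention_gain, treatment_effect, change_after, rng):
    """Return one group's output at each observation time.

    Every group rises by `attention_gain` per observation (an assumed
    effect of special attention). Only observations made after
    `change_after` also receive `treatment_effect`.
    """
    baseline = 50.0  # hypothetical units per hour
    series = []
    for t in range(1, n_obs + 1):
        value = baseline + attention_gain * t + rng.gauss(0, 0.2)
        if t > change_after:
            value += treatment_effect
        series.append(value)
    return series


def before_after_gain(series, change_after):
    """Mean output after the change minus mean output before it."""
    before = statistics.mean(series[:change_after])
    after = statistics.mean(series[change_after:])
    return after - before


rng = random.Random(0)
# X row: experimental group, change inserted between observations 3 and 4.
experimental = simulate_group(6, 1.0, 3.0, change_after=3, rng=rng)
# Y row: control group, same attention, no change in conditions.
control = simulate_group(6, 1.0, 0.0, change_after=3, rng=rng)

naive_estimate = before_after_gain(experimental, 3)  # treatment + attention
control_gain = before_after_gain(control, 3)         # attention alone
adjusted_estimate = naive_estimate - control_gain    # treatment alone

print(f"naive before/after gain: {naive_estimate:.2f}")
print(f"gain in control group:   {control_gain:.2f}")
print(f"adjusted estimate:       {adjusted_estimate:.2f}")
```

Under these assumptions the naive before-and-after comparison credits the treatment with the gain that attention alone produced; subtracting the control group's gain recovers something close to the assumed treatment effect of 3.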

Factors that correlate with the passage of time are an ongoing problem with field experiments. Your subjects get older and wiser, public policy makers change their ways, record keeping methods change, and the people making the observations, maybe even you, change too. The traditional way of coping with differences that correlate with time is with imagination. Stouffer noted that the basic research design, with all the controls and safeguards taken away, looks like this:

X2

One measurement of one thing at one point in time. He was complaining about 1940s social science, but what he said is still relevant to 1990s journalism. With such a research design, one phenomenon looked at once and not compared to anything, we "do not know much of anything," he said. "But we can still fill pages ... with 'brilliant analysis' if we use plausible conjecture in supplying missing cells from our imagination. Thus we may find that the adolescent today has wild ideas and conclude that society is going to the dogs." The result is a kind of pretest, posttest comparison:

X1   X2

The italic cell, X1, is not an observation, but "our own yesterdays with hypothetical data, where X1 represents us and X2 our offspring. The tragicomic part is that most of the public, including, I fear, many social scientists, are so acculturated that they ask for no better data."

Since Stouffer's time, social scientists have become more careful. There is a tendency to add more control groups. The Black Panther bumper sticker experiment, for example, could have profited from this design:

X1   X2
     Y2

The Y2 represents a control group that drives without bumper stickers as a test of the possibility that police are cracking down at Time 2. The control group would be even better if both X and Y drivers had either a Black Panther sticker or a neutral sticker applied before each trip, and if the application were performed at random after each driver was already in the car and had no way of knowing which sticker was being displayed.

An even more thorough design could look like this:

X1   X2
Y1   Y2
     X'2
     Y'2

Here the nonstickered or neutral-stickered Y group is present at Time 1 to verify its initial comparability. X' and Y' are present as a test of the possibility that the experiment made the original two groups of drivers too aware of their roles as subjects and made them behave differently, like the women in the Hawthorne experiment. Such an effect would be indicated by a difference between X2 and X'2 as well as between Y2 and Y'2. Donald Campbell, with Julian Stanley in an early evaluation of designs for experimental and quasi-experimental social research, pointed out that the above design includes four separate tests of the hypothesis.11 If the police are really prejudiced against the Black Panthers, then there should be the following differences in the quantity of arrests:

X2 > X1
X2 > Y2
X'2 > Y'2
X'2 > Y1
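
The four comparisons can be written down as a simple check. The sketch below is only an illustration: the function name `four_tests` and the arrest counts are invented, and comparing raw counts ignores sampling error, which a real analysis would have to address.

```python
def four_tests(arrests):
    """Evaluate the four directional comparisons from the six-cell design.

    `arrests` maps a cell label (X1, X2, Y1, Y2, X'2, Y'2) to an arrest
    count. Each comparison is True when the stickered cell shows more
    arrests than the cell it is compared with.
    """
    comparisons = [
        ("X2", "X1"),    # stickered drivers, Time 2 vs. their own Time 1
        ("X2", "Y2"),    # stickered vs. control drivers at Time 2
        ("X'2", "Y'2"),  # same contrast among drivers with no pretest
        ("X'2", "Y1"),   # unpretested stickered drivers vs. control at Time 1
    ]
    return {f"{a} > {b}": arrests[a] > arrests[b] for a, b in comparisons}


# Invented counts for illustration only; not data from any study.
example = {"X1": 2, "X2": 9, "Y1": 2, "Y2": 3, "X'2": 8, "Y'2": 2}
results = four_tests(example)
for label, passed in results.items():
    print(label, passed)
```

If all four comparisons point the same way, the hypothesis has survived four separate chances to fail, which is what makes the six-cell design more persuasive than a single before-and-after contrast.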

Tacking on control groups can be a good idea in survey research when you return to the same respondents to get a pretest, posttest measure. The Miami Herald did that when Martin Luther King was assassinated just after it had completed a survey of its black population. The hypothesis was that King's nonviolent ideology was weakened by his death and the advocates of violence had gained. Fortunately, the earlier survey had asked questions about both kinds of behavior, and records of who had been interviewed had been retained. At the suggestion of Thomas Pettigrew, the Herald added a control group of fresh respondents to the second wave of interviews. The original respondents had had time to think about their responses, might have been changed by the interview experience, might even have read about themselves in the Miami Herald. Any differences at Time 2 might simply be the effect of the research process rather than any external event. The second-wave control group provided a check against that.

As it turned out, the control group's attitudes were indistinguishable from those of the panel, providing evidence that the experience of being interviewed had not altered the Herald's subjects. Knowing that was important, because there was a big change between Time 1 and Time 2. Miami blacks, after King's death, were more committed than ever to his nonviolent philosophy. The proportion interested in violence did not change.12
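
One common way to judge whether a panel and a fresh control group are "indistinguishable" on a yes-or-no item is a two-proportion z-test. The sketch below uses that test as an assumption; the source does not say how the Herald made the comparison, and the counts are invented, not the Herald's data.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two independent proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pool the two samples to estimate the common proportion under the
    # hypothesis that the groups do not differ.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / standard_error


# Hypothetical counts: respondents endorsing nonviolence at Time 2.
panel_yes, panel_n = 150, 250   # re-interviewed panel (invented)
fresh_yes, fresh_n = 148, 250   # fresh control group (invented)

z = two_proportion_z(panel_yes, panel_n, fresh_yes, fresh_n)
print(f"z = {z:.2f}")
print("groups differ at the 5% level" if abs(z) > 1.96 else "no detectable difference")
```

A z value well inside plus or minus 1.96 is consistent with the interview experience not having changed the panel, although a small sample can also hide a real difference.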

Once you start looking for spurious effects, it is difficult to know where to stop. Donald T. Campbell, first with Julian Stanley and later with Thomas D. Cook, has made an intensive effort to figure out how to do things right. To do that, he first had to list the things that can go wrong. At Harvard, they used to call his list of foul-ups "Campbell's demons." Here is a partial listing:

Campbell's demons

  1. History. If you measure something at two different times and get a difference, it could be because of any number of historical events that took place in the intervening period.

  2. Maturation. Subjects and experimenters alike get older, tired, bored, and otherwise different in the course of an experiment.

  3. Testing. Measuring the way a person responds to a stimulus can change the way he or she responds the next time there is a measurement. School achievement tests are notorious because teachers learn what is in the tests and start teaching their content. Pretty soon all the children are above average, just like in Lake Wobegon.

  4. Statistical regression. Journalists have been easy prey to this one. A school board announces a program to focus on its worst-performing schools and improve them. It picks the two or three schools with the worst test scores in the previous year and lavishes attention and new teaching methods on them. Sure enough, the next year those schools have better test scores. The problem is that they would have done better even if there had been no special attention or new technique.

    The reason is that there is a certain amount of random error in all tests and rankings. The schools at the bottom of the list got there partly by chance. Give them a new roll of the dice with next year's testing, and chance alone will move them closer to average. The phenomenon is called regression toward the mean, because it always moves the extreme performers, top and bottom, closer to the mean on the second test. It is a danger any time that you select the extremes of a distribution for treatment. Most educators know about it, but knowing about it doesn't stop them from taking the credit for it.

  5. Selection. If comparison groups are not chosen strictly at random, then hidden biases can destroy their comparability. Self-selection is the worst kind. If you were doing the Black Panther experiment, and you let students volunteer to display the bumper stickers, you might get the risk takers and therefore the most reckless drivers.

  6. Mortality. Not all of the subjects remain available during an experiment that lasts over a period of time. Those who drop out or get lost may be different in some systematic way. In the evaluation of Head Start programs for preschool children, for example, the children with the most-motivated parents were more likely to finish the treatment. The selective dropping out of the less motivated took away children who had poorer family situations and maybe other strikes against them. Their absence for the final comparisons made Head Start look better than it really is.

  7. Instrumentation. The measuring scale may have more flexibility in the middle than at the extremes. Audiences rating different moments of a presidential debate on a seven-point scale can make wider swings from the midpoint than when the comparison is made from an extremely high or low point.

  8. The John Henry effect. Members of a control group might know they are in a control group and try harder just out of rivalry. Students in some educational experiments have been suspected of doing this. John Henry, you may remember, was the steel-driving man who "wouldn't let a steam drill beat him down" in the traditional folk ballad.

  9. Resentful demoralization. Just the reverse of the John Henry effect. Control groups see the experimental group as being more favored, and they stop trying.

    See Cook and Campbell13 for the full list of threats to experimental validity. But don't be discouraged by them. As Campbell noted many years ago, "all measures are complex and all include irrelevant components that may produce apparent effects."14 It is not necessary to become so frightened of those irrelevant components that one avoids field experiments. It is only necessary to be aware of the things that can go wrong and treat your own work with the appropriate skepticism.
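
Statistical regression, the fourth demon above, is easy to demonstrate by simulation. In the sketch below every number is an invented assumption: 100 schools, a stable underlying level for each, and independent test-day noise of the same size in both years. No school receives any program at all.

```python
import random
import statistics


def worst_schools_gain(rng, n_schools=100, n_worst=3):
    """Year-to-year score change for the lowest-ranked schools, untreated."""
    # Each school has a stable underlying level plus independent noise
    # on each year's test.
    true_level = [rng.gauss(500, 30) for _ in range(n_schools)]
    year1 = [level + rng.gauss(0, 30) for level in true_level]
    year2 = [level + rng.gauss(0, 30) for level in true_level]
    # Select on year-1 results only, as the school board in the example does.
    worst = sorted(range(n_schools), key=lambda i: year1[i])[:n_worst]
    return statistics.mean(year2[i] - year1[i] for i in worst)


rng = random.Random(2004)
gains = [worst_schools_gain(rng) for _ in range(500)]
average_gain = statistics.mean(gains)
share_improved = sum(g > 0 for g in gains) / len(gains)

print(f"average gain with no program: {average_gain:.1f} points")
print(f"trials in which the worst schools 'improved': {share_improved:.0%}")
```

Because the bottom schools were selected partly for their bad luck in year one, they gain on average in year two with no intervention, which is the improvement a school board would be tempted to claim.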



Site Copyright: Jeanne Curran and Susan R. Takata and Individual Authors, April 2004.
"Fair use" encouraged.