EDF600 – Research Methods

Lola Aagaard’s Notes

Chapter 9 / 13 – Experimental Designs

 

I. Experimental designs – these studies involve a treatment of some sort that is under the control of the researcher.  This is called manipulation of the independent variable.

Something happens to the participants – they take part in a class, a new method of instruction, use computers, have drugs administered to them, change the amount of time they spend doing something.  The group that gets something “new” is the treatment group.  The group that gets the same old thing or nothing or something different is the control group.  The researcher chooses which group is which before the study is implemented and data are collected.  The best way to make that choice is through random assignment.

            Example:  A randomly assigned treatment group of 25 high school students is given no homework for a period of nine weeks, while a control group has homework as usual.  Both groups are tested for attitude toward school and content knowledge in that subject at the end of the nine weeks.

            Example 2:  This is a real one – the High/Scope Perry Preschool Program.  In the 1960s, 123 poor black 3 and 4-year-old kids (who had low IQ scores and parents with very little education) in Ypsilanti, MI, were chosen for an experimental study.  These kids were randomly assigned to go to preschool or not.  (And the researchers have been following them ever since – it’s also a longitudinal study.  They just came out with the report of these kids, now age 40, and the cost-benefit of their preschool experience.  The costs of the preschool program are outweighed more than 5 to 1 by the gains in tax paid by gainfully employed former students and the lessened taxpayer burden of them not being on welfare or in jail, as compared to the kids in the study who did NOT go to preschool.)

 

A. Threats to validity – these come in two flavors:  internal and external. 

Internal validity is the confidence you have that your results (changes in the dependent variable) were actually due to your treatment (independent variable) and not the consequence of intervening/extraneous/confounding variables.

            External validity is the confidence that the results you found with your sample are generalizable outside the study to the population of interest. 

            These two types of validity balance against each other – it’s difficult to have high levels of both.  When you control your experiment very closely, you can raise internal validity, but lose external validity because the setting is so removed from the real world.  Or if you use a more “real world” setting, you generally have less control over what happens and things intervene (schedules get changed, kids drop out, etc.), causing a lowering of internal validity.

            But there are general suggestions for keeping both as high as possible:

 

1. Threats to internal validity -- these things need to be “controlled” in order to be confident in talking about cause and effect as a result of the experimental study.

a. History -- something happens external to the study that influences results (kids see a documentary on the topic that is being taught to them during your study and learn more from it than from their instruction; a fellow student in one school in your study is killed in a car wreck the night before CATS testing and that affects the other students’ achievement because of the emotional upset)

b. Maturation – subjects mature physically, emotionally, or psychologically during study.  Taking the first semester of 1st grade as a pretest measure of reading and then instituting a new way of teaching reading and collecting post-test data at the end of the second semester would be threatened by normal maturation of students’ abilities over the course of 1st grade.  We expect them to read better at the end, almost no matter what kind of instruction they have.  Or you give 3rd graders soy milk rather than regular milk at lunch for a year, then compare their end of year height to the beginning of the year.  How can you say changes are due to the soy milk – kids naturally grow over the course of a year!

c. Testing – pretest influences results of the posttest, due to practice effect or sensitizing to purpose of study.  If what you’re interested in is the social class bias of teachers, you can hardly give them a pre-test – it will alert them to what the treatment is going to be about.  Or if you give exactly the same pre and post-test to look for gains in science knowledge, some kids may remember some of the questions from the pretest and look them up in between.  So the change in their score comes more from the pretest than from any instruction that happened in between.

d. Instrumentation – validity/reliability problems with instruments, non-equivalent versions, or scoring problems – a non-reliable (unstable in a test-retest manner) instrument won’t allow you to conclude anything about the change in scores from pre to post-test because of the error variability in the scores from one administration to the next. Any changes might be totally due to the unreliability of the instrument.  Or if you’re giving that 9th grade math test to 1st graders, you have validity problems that will definitely affect your results.  Or if, like in open response questions, scoring is more subjective, then it may make a difference who is doing the scoring and any change or lack of it from one time period to the next could be solely due to the scoring of the test.

e. Statistical regression (regression to the mean) – extremely low or high scores at the beginning of the study mean that without any treatment at all it is extremely likely that the very high scorers on the pretest will drop on the posttest, and the very low scorers on the pretest will gain on the posttest. 

            Why?  Because it is assumed that a larger portion of an extreme score (whether high or low) is error rather than true score.  The odds of getting that much error again the very next time the test is taken are pretty low – so the next testing results in a score closer to the true score.

            This was a real issue early in the 1990s when the baseline scores for the accountability system were established.  There were all sorts of accusations aimed at schools that scored very low – the feeling was that they had done it on purpose just so they would be sure to score higher the next time.  And the schools that had scores quite high were upset when some of their scores dropped the second year rather than improving.  But it could have been all regression to the mean and perfectly normal.

            This is why the accountability cycle is a biennium – two years averaged together and compared to the goal rather than looking at one year at a time.  It is hoped that the two years together give a better estimate of the school’s true score compared to one year that might be an extreme score with lots of error in it.
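A small simulation can make regression to the mean concrete.  This is only a sketch under made-up assumptions: the function name, the 10,000 simulated schools, and the score distributions (true score plus fresh random error on each testing) are all hypothetical and are not taken from the accountability data.  No treatment happens between the two testings.

```python
import random


def simulate_regression_to_mean(n_schools=10_000, seed=0):
    """Simulate two testings with NO treatment in between.

    Assumption for illustration: observed score = true score + random error,
    and the error is drawn fresh at each testing.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n_schools)]
    first = [t + rng.gauss(0, 10) for t in true_scores]
    second = [t + rng.gauss(0, 10) for t in true_scores]

    # Rank schools by their FIRST score and pick the extreme 10% at each end.
    order = sorted(range(n_schools), key=lambda i: first[i])
    cut = n_schools // 10
    bottom, top = order[:cut], order[-cut:]

    def mean(values):
        return sum(values) / len(values)

    return {
        "bottom_first": mean([first[i] for i in bottom]),
        "bottom_second": mean([second[i] for i in bottom]),
        "top_first": mean([first[i] for i in top]),
        "top_second": mean([second[i] for i in top]),
    }


result = simulate_regression_to_mean()
for label, value in result.items():
    print(f"{label}: {value:.1f}")
```

In this simulation the lowest scorers rise and the highest scorers fall on the second testing, even though nothing was done to anyone.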

f. Differential selection – if there is no random assignment to groups, then the  non-equivalent groups that result mean the differences in results may be due to the existing group differences that you may not know about (this is the big weakness in causal-comparative research, also – it’s why matching, etc. is a good idea)

g. Mortality – it doesn’t mean that subjects die.  It is attrition -- participants drop out of one group or the other in non-random fashion.  Depending on who is dropping out and why, it may leave you with nonequivalent groups and influence your results. 

            Let’s say you did a study of Extended School Services, with one group getting additional homework to do in ESS, while the other got fun enrichment activities.  You have to allow students to withdraw from the study when they want to – it’s in the informed consent, after all.  Guess which group would have more drop-outs?  Your conclusions would very likely be threatened by mortality in the additional-homework group.

h. Selection interaction effects – if your groups are not equivalent at the start (not randomly assigned), then one group may benefit more from the treatment, or mature more, or react to the test differently, etc.  Even the seven threats above may not be equally threatening to both groups, thus messing up your results some more.

 

2. Threats to external validity – some of the internal validity problems also make trouble for external validity, for instance, use of pretests and differential selection (using volunteers, for instance).  But there are some different threats specific to external validity.

a. Multiple-treatment interference – using the same subjects for the study of several treatments in a row means that you can’t really tell what is the effect of the second and third treatments, due to possible carry-over from previous ones.  So any results from those could only be generalized to a situation in which all treatments were given in the population.  That isn’t useful or very likely.  The solution is to just give one treatment per group.

            For instance, if you were interested in how to get kids to best learn their spelling words, you might come up with several different methods:  1) repeating every word three times aloud and then writing it twice; 2) using a finger to write the words in shaving cream that is smeared on a flat surface; 3) making up rhythmic actions that would be repeated as each word is spelled aloud.  If you try out these methods on the SAME kids, one method after the other for three weeks, how do you know that by the third week the kids aren’t combining all three methods as they study at home?  If they are, it messes up your conclusions about the usefulness of any method by itself.  You’ll have valid data only for the very first method you tried with them.

b. Specificity of variables – In order to generalize to a larger group outside the study, readers of the study need to know exactly what the treatment in the study was so they can duplicate it.  If the operational definitions or discussion of methods is not clear enough in the study, then readers trying to implement the findings may not actually be doing the very same thing and will get quite different results.  “Hands-on science” or “centers” may mean different things to different people, so the variables need to be specified in detail in the research report.

c. Treatment diffusion – spill-over of the treatment into the control group.  Whether teachers share information or kids talk about the different way they are studying or students show their friends the neat hands-on stuff they’re doing – somehow the control group actually gets a bit of the treatment.  The best way to keep this from happening is to have the treatment group in a different location from the control group!  Of course, putting one group at one school and one at another raises other issues of non-equivalence (different school cultures, teachers, etc), but it would solve THIS problem.

d. Experimenter effects – the participants may not like something about the experimenter, or a researcher may treat boys and girls differently in some way, or be so anxious about the experiment that the anxiety rubs off on the participants.  Or there is experimenter bias, where the researcher influences the results in the way he/she wants them to go, whether intentionally or unintentionally.  This is why it is a good idea to do “blind” studies, where the researcher doesn’t know when doing scoring whether the information came from the treatment or control group.

e. Reactive arrangements – these are participant effects

i. Hawthorne effect -- just getting the attention of being in a study, whether in the treatment or control group, changes participants’ behavior and scores.  Everybody likes to please someone who is paying attention to them.

ii. John Henry effect – the control group works harder because its members know they’re number 2, and thus it does nearly as well as, or sometimes better than, the treatment group.

iii. Novelty effect – just doing something new can increase participants’ interest and motivation, thus changing their behavior and scores.   If you carry on the treatment long enough for it to become routine, this is less of a problem.  Computers were like this – they were lots of fun to use for ANYTHING when they weren’t so common.  Now the novelty has worn off for lots of people.

 

                        3. Control of threats to validity

a. Randomization – this means random assignment of subjects to treatment groups for internal validity and random selection from the population for external validity.

 If everyone in our sample has an equal chance of being in either group, then any differences in the groups are purely due to chance and aren’t a threat.  This is an adequate control only with relatively good size samples – the 30 per group idea.  If you’ve only got 10 people in your sample, then randomly assigning to groups isn’t going to make much difference one way or the other. 

Of course, this means that your independent variable can’t be something like gender, or grade, or athlete/non-athlete – you can’t randomly assign into those kinds of groups.
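Random assignment itself is simple to carry out.  Here is a minimal sketch, assuming a sample of 60 students so that each group gets the 30 mentioned above; the function name and the student labels are hypothetical.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle the sample and split it in half.

    Every participant has the same chance of landing in either group.
    The seed is only there so the assignment can be reproduced.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the original roster is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


students = [f"student_{i:02d}" for i in range(1, 61)]  # hypothetical roster
groups = randomly_assign(students, seed=42)
print(len(groups["treatment"]), len(groups["control"]))
```

Note that the researcher never chooses who goes where; the shuffle does.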

 

b. Use a control group  -- this is a group that doesn’t get any treatment or gets a different treatment [comparison group]

If we do a study of a nifty new way of teaching a unit on space, and a space documentary comes on TV, it won’t matter so much because (theoretically) our control group will be influenced by it just as much as our treatment group, so we can tell how much of the results were due to the documentary and adjust for that.

 

c. Give the control group a placebo – if everyone thinks they are getting the same thing, then you may be able to spot the Hawthorne effect and limit the John Henry effect.  You just give a placebo to the controls – a treatment that isn’t the real thing and one that won’t interfere with the results of your study.  Then nobody feels like a “control,” and if the treatment group doesn’t do better than the placebo group, then either you have a Hawthorne effect with everyone or the treatment really didn’t do anything.

            It was very common for a while in educational research to do the treatment with the treatment group in one room while the controls watched a movie about an unrelated topic in another room.  The movie was the placebo.

 

d. Use participants as their own controls – rather than have a separate control group, run the same people through both treatments that are being compared.  Do one method of teaching sight words for 4 weeks, then switch to the other method for the next 4 weeks, and compare the results.  You do have to watch for multiple treatment interference here, though.

 

e. Hold certain variables constant and compare homogeneous subgroups.  You can eliminate the possible effects of a variable (one that might be important in differential selection, for instance) by removing that variable from the study entirely.  For instance, if you’re afraid that males and females will respond differently in your study, you can limit your subjects to just females.  Problem solved – no males are in the study.  Of course, you’ve limited your external validity because now you can only generalize to females….  This type of thing is often done for demographic variables of all sorts to control for differential selection.

            Or you can make sure that certain variables are exactly the same in each group.  Keep the time of day the same for both groups receiving treatments, or make sure their teachers have the same number of years of experience, or that none of their parents went to college.  So you don’t limit your study to one type of participant, but you equate the groups on particular variables that seem important – holding those variables constant across groups so they don’t confound your results.

 

f. Build the variable into the design.  Rather than eliminate males, you could make sure you include both males and females in your treatment groups and then analyze for the effects of gender as well as for the effects of your treatment. This is known as a factorial design.
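To see what “analyze for the effects of gender as well as the treatment” looks like, here is a sketch of the cell means in a 2 x 2 factorial layout.  The scores are invented purely for illustration, and the variable names are hypothetical.  A real analysis would run a two-way ANOVA to test whether these differences are significant; this only shows the arithmetic behind the main effects and the interaction.

```python
from collections import defaultdict

# Hypothetical post-test scores: (teaching method, gender, score)
records = [
    ("new", "F", 82), ("new", "F", 78), ("new", "M", 74), ("new", "M", 70),
    ("old", "F", 72), ("old", "F", 68), ("old", "M", 71), ("old", "M", 69),
]


def cell_means(rows):
    """Average the scores within each method-by-gender cell."""
    cells = defaultdict(list)
    for method, gender, score in rows:
        cells[(method, gender)].append(score)
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}


means = cell_means(records)

# Main effect of method: new vs. old, averaged over gender.
method_effect = ((means[("new", "F")] + means[("new", "M")]) / 2
                 - (means[("old", "F")] + means[("old", "M")]) / 2)

# Main effect of gender: females vs. males, averaged over method.
gender_effect = ((means[("new", "F")] + means[("old", "F")]) / 2
                 - (means[("new", "M")] + means[("old", "M")]) / 2)

# Interaction: does the new method help one gender more than the other?
interaction = ((means[("new", "F")] - means[("old", "F")])
               - (means[("new", "M")] - means[("old", "M")]))

print(method_effect, gender_effect, interaction)
```

With these made-up numbers the new method helps females by 10 points and males by only 2, which is the kind of pattern you would miss entirely if you had dropped males from the study.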

 

g. Matching.  If you feel that age, gender, ethnic group, and SES are going to make a difference in your study, you can find matched pairs of students – find two 12-year-old Hispanic girls on free lunch and randomly assign one to the treatment group and one to the control group.  You can do the same with any other combination of those variables you can find.  The idea is to make your groups as similar as possible on those variables.
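The matched-pairs procedure can be sketched in a few lines.  Everything here is hypothetical: the function name, the field names (age, gender, ethnicity, lunch), and the five made-up students.  Students with no match are set aside rather than forced into a group.

```python
import random
from collections import defaultdict


def matched_pair_assignment(students, keys, seed=None):
    """Pair students who are identical on the matching variables, then
    randomly send one member of each pair to treatment and one to control."""
    rng = random.Random(seed)

    # Group students who share the same values on every matching variable.
    strata = defaultdict(list)
    for student in students:
        strata[tuple(student[k] for k in keys)].append(student)

    treatment, control, unmatched = [], [], []
    for members in strata.values():
        rng.shuffle(members)  # random order decides who gets which condition
        while len(members) >= 2:
            treatment.append(members.pop())
            control.append(members.pop())
        unmatched.extend(members)  # odd one out has no match
    return treatment, control, unmatched


# Hypothetical sample
students = [
    {"name": "A", "age": 12, "gender": "F", "ethnicity": "Hispanic", "lunch": "free"},
    {"name": "B", "age": 12, "gender": "F", "ethnicity": "Hispanic", "lunch": "free"},
    {"name": "C", "age": 11, "gender": "M", "ethnicity": "White", "lunch": "paid"},
    {"name": "D", "age": 11, "gender": "M", "ethnicity": "White", "lunch": "paid"},
    {"name": "E", "age": 12, "gender": "M", "ethnicity": "Black", "lunch": "free"},
]
match_keys = ["age", "gender", "ethnicity", "lunch"]
treatment, control, unmatched = matched_pair_assignment(students, match_keys, seed=0)
print([s["name"] for s in treatment], [s["name"] for s in control],
      [s["name"] for s in unmatched])
```

The practical catch shows up right away: the more variables you match on, the more students end up in the unmatched pile.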

 

h. Statistical control – You can use specific statistical techniques to control for differential selection.  These are better than nothing, but not as good as controlling in your design.  Analysis of covariance is one possibility under some circumstances.  So is partial correlation, where the variance shared with some extraneous variable is removed before looking for a relationship between the variables of interest in your study.  For example, if you wanted to know whether there was a relationship between parental income and student CTBS scores, but you were afraid that attitudes toward school might be an intervening variable, you could be helped by partial correlation.  You could statistically remove the variance (or overlap) that school attitude shares with CTBS scores and with parental income, then look for a relationship between what is left of each (unaffected by attitude now!).
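Here is a sketch of a first-order partial correlation, using the standard formula that removes the extraneous variable from both of the other two (removing it from only one of them gives what is usually called a semipartial correlation).  The data are simulated under a deliberately artificial assumption: attitude drives both income and test scores, and nothing else connects them.  That setup is invented for illustration and says nothing about real CTBS data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y with z partialled out of both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated (made-up) data: attitude influences both other variables.
rng = random.Random(1)
attitude = [rng.gauss(0, 1) for _ in range(5000)]
income = [a + rng.gauss(0, 1) for a in attitude]
scores = [a + rng.gauss(0, 1) for a in attitude]

raw_r = pearson(income, scores)
partial_r = partial_correlation(income, scores, attitude)
print(f"raw r = {raw_r:.2f}, partial r = {partial_r:.2f}")
```

In this artificial case the raw correlation is sizable but the partial correlation is close to zero, because the whole relationship ran through attitude.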

 

i. Unobtrusive pretest measures – these can help with pre-test effects.  Instead of giving a standard pre-test, you see if you can get similar information in another way, whether through documents or observation or asking someone who is knowledgeable about the participant. 

 

            B. Types of experimental designs

1. Pre-experimental (weak designs – lots of problems on the chart on p. 252 [8th] or p. 373 [7th]) – either no control group or non-equivalent groups

a.  One-shot case study – one group, one post-test, no control of anything –   can’t even tell if a change took place

                                    X O

(X = treatment, O = measurement of dependent variable)

 

b. One-group pretest-posttest – one group, pre/post test – can now tell about change, but no other controls in place.

                                    O X O

 

c. Static group comparison – has a control group, but non-equivalent (no randomization or checking), post-test only.  Some control over testing, history, and maturation.

            X O                 or                     X1 O

             - O                                       X2 O

 

2. True experimental (very strong designs – few or no problems on the chart on p. 255 [8th] or p. 375 [7th]) – all have random assignment to groups, and the presence of a control group

a. Post-test only control group – random assignment, equivalent control group, only post-test – controls for everything, but you can’t look at change over time.

            R X O

            R  - O

(R = random assignment to groups)   

 

b. Pretest-posttest control group – random assignment, equivalent control group, pre/post tests, and you CAN look at change over time.

R O X O

R O  - O
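With a pretest and a posttest for both groups, one simple way to look at change is to compare gain scores.  This is a sketch with invented scores and a hypothetical function name; a real study would follow it with a significance test (or use analysis of covariance) rather than stop at the subtraction.

```python
def mean(values):
    return sum(values) / len(values)


def estimated_treatment_effect(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Treatment group's average gain minus the control group's average gain.

    Whatever raised BOTH groups (history, maturation, practice from the
    pretest) is in the control gain too, so it subtracts out.
    """
    treat_gain = mean(treat_post) - mean(treat_pre)
    ctrl_gain = mean(ctrl_post) - mean(ctrl_pre)
    return treat_gain - ctrl_gain


# Hypothetical scores for four students per group
treat_pre, treat_post = [50, 55, 60, 45], [65, 70, 72, 61]
ctrl_pre, ctrl_post = [52, 54, 58, 46], [58, 60, 63, 51]

effect = estimated_treatment_effect(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(effect)
```

Here the treatment group gained 14.5 points and the controls gained 5.5 with no treatment at all, so only the 9-point difference can reasonably be credited to the treatment.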

 

c. Solomon four-group – four groups (treatment/control x pre/no pretest).  Very strong design, but takes massive numbers of subjects because of four groups.

R O X O

R O  - O

R  - X O

R  -  - O

 

3. Quasi-experimental (on same chart as true experimental designs in textbook) – no random assignment of individuals to groups, but there is random assignment of groups to treatment and control conditions

a. Nonrandomized (nonequivalent) control group – two groups, random assignment of groups to treatment conditions, pre/post test.  MANY educational studies are of this type, using intact groups (classes, for instance) and introducing a new treatment (teaching method) into one of them.  Selection interactions are a problem, but if you add in the technique of matching, it becomes stronger.

            O X O

            O  - O

 

b. Time-series – make a series of observations (measure the dependent variable), then introduce a treatment and make another series of observations.  Keep track of weekly spelling test scores for a month, then try a new way of teaching the spelling words and keep track for another month.  You’ve got history and instrumentation threats here.

            O O O O X O O O O

 

c. Control group time-series (multiple time series)  – time-series with a control (that doesn’t get the new method of spelling instruction) – this helps with the history threat of a straight time-series

            O O O O X O O O O

            O O O O  - O O O O

 

d. Counterbalanced design – When you have an equal number of groups and treatments, you can give every group every treatment, but in different orders.  My colleague and I did something like this with our study of testing formats in our undergraduate class.  We had 5 sections of classes, but only three treatments.  So everybody got each treatment once and then got a repeat of two.  But we didn’t give the same type of testing format to all 5 sections at the same time; we sort of counter-balanced the order – two sections got to use a cheat sheet on Test 2, while two others used a cheat sheet AND had group discussion, while the final one had only group discussion.  On Test 3 we changed that for each section and they did something they hadn’t done before.

            Choose this design only if the treatments aren’t going to interact with each other; otherwise you get multiple-treatment interference.
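For the textbook case of equal numbers of groups and treatments, the orders can be laid out as a simple rotation (a Latin square).  This is a sketch of that idealized case only, with a hypothetical function name; it is not the 5-section schedule described above.

```python
def counterbalanced_orders(treatments):
    """Rotate the treatment list so each group starts one step later.

    Row g is the order for group g.  Every group gets every treatment once,
    and every treatment appears once in each time slot.
    """
    n = len(treatments)
    return [[treatments[(group + slot) % n] for slot in range(n)]
            for group in range(n)]


formats = ["cheat sheet", "group discussion", "cheat sheet + discussion"]
orders = counterbalanced_orders(formats)
for group_number, order in enumerate(orders, start=1):
    print(f"Group {group_number}: {order}")
```

A plain rotation like this balances which time slot each treatment falls in, but not which treatment comes right before which, so carry-over from one treatment to the next can still be a problem.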

 

4. Single-subject designs – these are very important in special education where the behavior of only one child needs to be modified, or in special cases where there are very few students in need of a particular treatment.

            We’re not going to go into them in depth, but you can read about them in your textbook.

            Generalizability of these studies is accomplished through replication or repeating the study many times with different individuals.  If the same results appear in most or all cases, then the results can be generalized with some confidence.