Key Concepts: Observation and Measurement


Chapter Learning Outcomes

After reading and studying this chapter, students should be able to:

• comprehend the importance of accurate and precise observation and measurement, because these operations are at the heart of what psychologists do.

• explain how accurate measurement begins with operational definitions and how these definitions must lead to measures that are both valid and reliable.

• differentiate between the various types of reliability and validity.

• know what information is needed to make an appropriate selection of a statistic to answer questions of interest.

• recognize pitfalls in constructing graphs and avoid those pitfalls when creating their own graphs.

• appreciate the various challenges and threats to collecting data for an applied project and know the steps that would lead to the initiation of such a project, such as pilot testing and data storage.

• draw appropriate conclusions from empirical data and understand that psychologists seek to falsify incorrect hypotheses and never prove a theory or a hypothesis.


lan66845_07_c07_p191-228.indd 191 4/20/12 2:50 PM


CHAPTER 7 Introduction


The fundamental goals of psychology are to understand, explain, predict, and control behavior; thus, projects with a psychological perspective strive to fulfill one or more of these goals. Each task is a large undertaking for anyone interested in psychology and human behavior. Sometimes these tasks seem daunting; it is hard to know where to start such a project. The foundations of research in psychology start with observation and measurement. Like building a house, if you don’t have a solid foundation, whatever you do afterward will be on shaky ground. So to work toward our goals in psychology, we acquire basic skills in observation and measurement. In previous chapters, we discussed some of the major types of research designs; in this chapter, we will put the pieces together by discussing matters that are important to all research designs.

Voices from the Workplace

Your name: Rachel W.

Your age: 20

Your gender: Female

Your primary job title: University Relations Recruiter

Your current employer: Whirlpool

How long have you been employed in your present position?

3 months

What year did you graduate with your bachelor’s degree in psychology?


Describe your major job duties and responsibilities.

Recruiting and attracting new talent from the university setting; interviewing; managing hiring processes; planning and implementing new strategies for gaining candidate interest and building the Whirlpool Brand presence on campuses.

What elements of your undergraduate training in psychology do you use in your work?

Personality analysis, counseling skills; statistics.

What do you like most about your job?

Meeting and working with new people. (continued)

Psychological observation strives to understand, predict, explain, and control behavior. Why is it important that there are guidelines for observation and measurement?



7.1 Variables: Independent, Dependent, and More

Let’s begin by reviewing the definition of a variable. A variable is an entity that can take on different values (Harmon & Morgan, 1999). In reading this text, this is probably not your first encounter with the word variable. If you had any math in grade school, junior high, and high school, you should be familiar with variables. You solved equations for variable X or variable Y. You probably solved single-variable equations and multivariable equations. So in math, the X variable stood for an entity that had a certain value, and your task was to figure out the value. Sometimes in an equation you might have been given X and asked to solve for Y. You’ve probably used the word variable in everyday language also, and not just in reference to math. You might have a variable-speed drill press out in the shop, or you might think that the meteorologist has variable success in predicting the weather. In each case, the term variable refers to varying or different values (numbers, scores, speeds, and so on).

What do you like least about your job?

Sometimes the travel becomes tiring.

What is the compensation package for an entry-level position in your occupation?


What benefits (e.g., health insurance, pension, etc.) are typically available for someone in your profession?

Typical for most business settings.

What are the key skills necessary for you to succeed in your career?

People skills and strategic thinking.

Thinking back to your undergraduate career, what courses would you recommend that you believe are key to success in your type of career?

Social psychology, personality, research methods, counseling.

Thinking back to your undergraduate career, can you think of outside of class activities (e.g., research assistantships, internships, Psi Chi, etc.) that were key to success in your type of career?

Internships and leadership positions within organizations.

What advice would you give to someone who was thinking about entering the field you are in?

Get some business background in addition to psychology. Get internship experience and get involved in organizations including the leadership roles.

If you were choosing a career and occupation all over again, what (if anything) would you do differently?

Nothing at this point in my career.

Copyright © 2009 by the American Psychological Association. Reproduced with permission. The official citation that should be used in referencing this material is R. Eric Landrum, Finding Jobs With a Psychology Bachelor’s Degree: Expert Advice for Launching Your Career, American Psychological Association, 2009. The use of this information does not imply endorsement by the publisher. No further reproduction or distribution is permitted without written permission from the American Psychological Association.



In psychology, we build on this general definition of variable, with more specificity. We divide the variables into two broad categories: independent variables and dependent variables. The key idea to remember is that a variable, whether independent or dependent, must be able to take on different scores, numbers, outcomes, or values.

Recall from Chapter 3 that the independent variable is the variable that is manipulated, controlled, or arranged/organized by the researcher. When the independent variable is manipulated or controlled, this is sometimes referred to as an active independent variable (Harmon & Morgan, 1999; Townsend, 1953). The manipulated or controlled version of the independent variable is easier to understand than the arranged/organized independent variable.

The dependent variable is the one that is measured; ideally, its values are the direct result of the manipulations of the independent variable. Dependent variables can be either qualitative or quantitative. A qualitative variable is one in which the responses differ in kind or type. That is, there is a difference in quality (what form) rather than quantity (how many), and the outcomes of these qualitative variables are usually described in words. Quantitative variables differ in amount; there is more or less of some known entity. Quantitative variables are usually described by numbers, and psychologists tend to strive to develop measures of behaviors (dependent variables) that yield a number. The particular approach utilized throughout this textbook focuses on quantitative approaches. Dependent variables can also be described in terms of the measurement process. Each quantitative approach is designed to yield a number; see Table 7.1 for types and examples of dependent variables.

Table 7.1: Types of dependent variables, with examples

Dependent Variable Type | Examples

Frequency (how often a behavior occurs) | Number of cigarettes smoked in a day; number of text messages sent in an hour; number of times you studied before a test; number of times you hit the brakes as you approached an intersection

Latency (the amount of time until a behavior occurs) | How long it took you to learn the lyrics to a new song; after the semester started, how many days (weeks) it was until you opened this textbook; once you saw a red light, the amount of time it took until you started braking

Duration (the amount of time a behavior lasts) | The amount of time you spent playing XBOX 360; the amount of time you studied (in minutes); the amount of time your foot was on the brake

Amplitude (the intensity of a behavior) | The amount of noise (in decibels) generated by a class of third graders; the degree of test anxiety (high, medium, low) exhibited by high school students taking the SAT; the intensity of your braking (tapping the brakes versus slamming on the brakes)

Choice Selection (a decision from a number of alternatives) | Your answers to a multiple-choice test; your responses on a personality inventory to determine if you are introverted or extroverted; at a repair shop, which type of new brakes you select to be installed on your car
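To make the first four dependent variable types in Table 7.1 concrete, here is a minimal Python sketch built on the braking examples. The `BrakePress` record, its field names, and every number are hypothetical, invented only for illustration; the point is simply that each type reduces one episode of behavior to a single number.

```python
from dataclasses import dataclass


@dataclass
class BrakePress:
    """One hypothetical brake press, timed from the moment the light turned red."""
    start_s: float     # seconds after the red light appeared
    end_s: float       # seconds after the red light appeared
    peak_force: float  # pedal force in arbitrary, made-up units


# Invented observations for a single approach to an intersection.
presses = [
    BrakePress(start_s=1.5, end_s=2.0, peak_force=10.0),
    BrakePress(start_s=3.0, end_s=5.5, peak_force=35.0),
    BrakePress(start_s=6.0, end_s=6.5, peak_force=5.0),
]

frequency = len(presses)                             # how often the behavior occurred
latency = min(p.start_s for p in presses)            # time until the first press
duration = sum(p.end_s - p.start_s for p in presses) # total time on the brake
amplitude = max(p.peak_force for p in presses)       # intensity of the hardest press

print(frequency, latency, duration, amplitude)
```

Choice selection is left out of the sketch because its raw outcome is a category (which brakes you picked) rather than a quantity.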

When all goes well in a study, the measurements from the dependent variable are a function of the independent variable; in other words, the manipulations of the independent variable lead to changes in the values of the dependent variable. These terms for variables, independent and dependent, were popularized in psychology by Woodworth (1938) and later by Woodworth and Schlosberg (1954) in their first and second editions, respectively, of Experimental Psychology. The terms were used as a means of emphasizing the cause-and-effect relationship between what the researcher does (independent) and the subsequent outcome (dependent); however, not all studies yield cause-and-effect conclusions. But where did these terms come from? And how does the use of these concepts help us to further observe and measure human behavior? It all comes down to operational definitions, which we will discuss in the next section.

Classic Studies in Psychology: The Hawthorne Studies

Generally speaking, the Hawthorne effect refers to the situation where participants in a study may band together to work harder than normal, perhaps because they have been specially selected for a study or they feel loyalty to the researchers or the experimental situation. The Hawthorne effect is described frequently in Research Methods texts, and Adair’s (1984) examination of texts from the 1970s and early 1980s found many erroneous descriptions of the studies. My goal is to avoid repeating such errors here. To that end, I consulted the original Roethlisberger and Dickson (1939) text, Management and the Worker, as well as other references on the topic. These studies began in the 1920s and ended in the 1930s. They were conducted at the Western Electric Company’s Hawthorne plant, which was adjacent to both Chicago and Cicero. By the mid-1920s, Western Electric employed 25,000 people at the Hawthorne plant and served as the manufacturing and supply branch of American Telephone and Telegraph, better known today as AT&T (Baritz, 1960). F. J. Roethlisberger of Harvard University and W. J. Dickson of the Western Electric Company were chiefly involved in these efforts, but many consultants were brought in over the course of the multiyear studies. In fact, there were many different studies within the “Hawthorne studies,” which dubiously leads us to the Hawthorne effect.

The first set of studies, beginning in November 1924, consisted of illumination studies conducted to examine the impact that different levels of lighting had on worker productivity. In one variation, individuals were tested with lighting at 10 foot-candles (roughly speaking, 1 foot-candle is the amount of light that one candle generates 1 foot away from the candle), and over successive work periods the lighting was decreased 1 foot-candle at a time. Interestingly, when lighting was decreased from 10 foot-candles to 9 foot-candles, productivity increased. In fact, productivity continued to increase with decreased lighting until about 3 foot-candles, at which point productivity decreased (although it is reported that one employee was able to operate at the level of .06 foot-candles, or an ordinary amount of moonlight) (Adair, 1984; Roethlisberger & Dickson, 1939). If nothing else, the researchers learned from this study that understanding productivity was much more complicated than lighting alone.

Around April 1927, a second series of studies began, which would typically be referred to as the Relay Assembly Test Room Studies (Adair, 1984; Baritz, 1960). Experimentally speaking, Roethlisberger and Dickson became more rigorous in this series of studies. For example, they selected five female employees who were relay assemblers out of a large department, and placed these employees in a special test room for better control of the conditions and variables to be tested. As a dependent variable, one could measure the daily and weekly output of test relays assembled by each woman.

Because employee records were available before the workers were moved into the test room, the researchers had a baseline of productivity in assembling test relays.

Over the course of 270 weeks (yes, 270 weeks), the researchers systematically varied the conditions in the relay assembly test room, all the while recording dependent variable data on the number of test relays. These experimental variations were referred to as periods, and periods lasted weeks at a time. So, for example, for some periods the amount of voluntary rest time was increased, while for other periods voluntary rest time was decreased. For some periods the rest breaks were decreased in the morning but lengthened in the afternoon; for one period workers were given Saturday mornings off (a 48-hour work week was customary at the time). During one period (Period XII), there was a return to baseline control conditions, a nice experimental comparison to approximately where the employees started the experiment. Productivity in each period seemed to increase regardless of the manipulation introduced (Adair, 1984), in many but not all cases. In other words, when experimental conditions were manipulated to attempt to decrease productivity, oftentimes productivity increased. When the employees returned to baseline control conditions in Period XII, “unexpectedly, rather than dropping to preexperiment levels, productivity was maintained” (Adair, 1984, p. 336). See Figure 7.1 for the single-subject data from the Hawthorne relay assembly test room.


[Figure: five panels, one each for Operator 1 through Operator 5, plotting relays against experimental periods 1 to 13, with weeks ending from 1927 through 1929 along the horizontal axis.]

This graph shows the single-subject data from the Hawthorne relay assembly test room.

Source: Roethlisberger and Dickson (1939)

Figure 7.1: Single-subject data from Hawthorne study



7.2 Operational Definitions and Related Ideas

During your study of psychology, you may have heard about operational definitions. An operational definition is a translation of the key terms of the hypothesis into measurable (i.e., public, observable) quantities. If you wanted to study depression, then you would need to operationally define depression in such a way as to obtain a numerical score; or, if you wanted to measure hunger, you would have to define hunger in such a way as to get a rating or score (from a qualitative perspective, the definitions would come from non-numeric sources). Even though this makes sense, and this approach can be useful in framing how we approach measuring independent and dependent variables, it isn’t the original intention of operational definitions, or operationism. Accounts of the origins of operational definitions typically point to Percy Bridgman (1927), who wrote The Logic of Modern Physics. This book by Bridgman appears to have been widely read in psychology, but sometimes perhaps not far past the first few pages (Koch, 1992). To be clear, Bridgman never proposed the notion of operational definitions, but is credited with the idea of operationism (Boring, 1950). The key notion, however, was in making the connection between the behavior to be studied and the measurement of that behavior. In theory, that’s where the concept of operational definition would be so crucial to what a psychologist does.

There were also additional studies as part of the Hawthorne studies, such as the Mica Splitting Test Room and the Bank Wiring Room. Taken together, what do we learn from the Hawthorne studies, and what is the Hawthorne effect? Simply put, Stagner (1982) defines the Hawthorne effect as “a tendency of human beings to be influenced by special attention from others” (p. 856). The results of the Hawthorne studies are used by many for very different purposes, ranging from discussion of the docile worker (Baritz, 1960) to discussions about whether the Hawthorne effect exists (Jones, 1992). In his review of previous studies, Adair (1984) found claims of Hawthorne effects from previous studies, and a few studies were successful when they purposely attempted to generate a Hawthorne effect. What should we take away from all this? It’s important for us to realize that when people are given special attention, they may behave differently than normal. Although this effect is known as the Hawthorne effect (and it seems from the literature that Hawthorne effects have been demonstrated), it is unclear if the actual Hawthorne studies conducted in the 1920s and 1930s bear much relation at all to what we now call the Hawthorne effect.

Reflection Questions

1. Think about your own prior work experience and the environment in which you worked. Did your surroundings affect your productivity? Did you work in an office where you could close the door and work in privacy, or did you work in an open space or outdoors under different working conditions? How might the ready access to coworkers positively or negatively influence your productivity?

2. Consider the increasing number of individuals who telecommute and work from home. Couple this with the availability of Skype, FaceTime, and other software packages that allow for electronic “face-to-face” interactions. In what types of projects or jobs might electronic interaction be sufficient? What are the conditions under which you would know you would need to meet and work with someone in person, versus knowing that an electronic interaction would be OK?



So what does this all mean? The best idea would be to remember that clearly defining the key terms of psychological research is important. Measuring human behavior in a meaningful way requires a rigorous approach, striving for both reliability and validity (more on these topics in the next section). You should realize that however we choose to define the behaviors we study, there are philosophical assumptions that underlie that methodological approach, whether we are aware of those assumptions or not (Green, 1992). The concept of operational definitions can be a useful concept in general, but it appears to have changed from its original inception. Even so, researchers still use this notion to help develop research ideas, such as Bishop et al. (2004) working to develop an operational definition of mindfulness. Clear, precise definitions benefit all, regardless of whether they are truly “operational” or not.

7.3 The Measurement Process

The measurement process is central to any area of research. For our purposes, measurement involves how we capture the responses of individuals (either quantitatively or qualitatively) in such a manner as to allow for their systematic analysis. In any measurement process, however, there is always the possibility of error. Psychologists know this, and they keep this in mind when drawing conclusions by stating them in the context of probability. In essence, whenever we measure anything, there is the potential for error. Classical test theory suggests that when a measurement is obtained, that measurement is composed of true score plus error (or X = t + e, where X is the obtained measurement, t is the true score, and e is the error). Suppose you wanted to know the height of your best friend. Your best friend has a true height; in other words, there is one answer that is correct (but we don’t know the true height, so we measure). However, in measuring your best friend, there is the potential for error. You might use a yardstick or tape measure, and you could make an error in reading the number, or your friend could be wearing shoes with thick soles or slumping over. The resulting height is composed of part true score plus part error (by the way, the error could be an overestimation or an underestimation). How could we increase our confidence in minimizing the error of measurement? Other measurements could be taken and the results compared (a test-retest situation). Although the potential for measurement error can never be eliminated, error can be minimized by using the research methods of experimental psychology. The amount of error is never definitively known for a particular individual, but the amount of error is estimated when studying a group of people.
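The true score plus error idea can be illustrated with a short Python simulation. This is only a sketch under stated assumptions: the true height, the size of the error, and the choice of a zero-mean normal distribution for the error are all hypothetical. In particular, it models random error only; a systematic error such as thick-soled shoes would shift every measurement in the same direction and would not average out.

```python
import random
import statistics

TRUE_HEIGHT_CM = 170.0  # hypothetical true score t (unknown in real life)
ERROR_SD_CM = 2.0       # hypothetical spread of the random error e


def measure(true_score, error_sd, rng):
    """Return one observed score X = t + e, with e drawn from a zero-mean normal."""
    return true_score + rng.gauss(0.0, error_sd)


def average_of_measurements(true_score, error_sd, n, rng):
    """Average n independent measurements of the same true score."""
    observations = [measure(true_score, error_sd, rng) for _ in range(n)]
    return statistics.fmean(observations)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
single = measure(TRUE_HEIGHT_CM, ERROR_SD_CM, rng)
averaged = average_of_measurements(TRUE_HEIGHT_CM, ERROR_SD_CM, 1000, rng)

print(f"one measurement:     error = {single - TRUE_HEIGHT_CM:+.2f} cm")
print(f"mean of 1000 values: error = {averaged - TRUE_HEIGHT_CM:+.2f} cm")
```

Because the simulated errors are as likely to overestimate as to underestimate, they tend to cancel when many measurements are averaged, which is one way to see why repeated measurement increases confidence in the result.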

If you wanted to study depression, how would you operationally define it?



Similarly, measuring any aspect of a person’s behavior yields a result containing true score plus error. Psychologists strive to minimize the error in measurement through the use of methodology and statistics. Where does measurement error come from? A number of sources can lead to the underestimation or overestimation of the true score. A person can contribute to measurement error by being unusually motivated to do well or by not feeling his or her best. The instruments (surveys or questionnaires, for example) may be too demanding, too complicated, or too lengthy, leading to frustration, fatigue, or boredom. The researcher may also be a source of measurement error by being too friendly or too stern with the participants. The researcher may also provide inadequate instructions about the task or may simply make errors in recording participant responses. Finally, the location and specifics of the situation may lead to measurement errors; for example, the temperature, humidity, and how crowded the room is may hinder the acquisition of the true score. The techniques that you have learned throughout this book will help you to make better approximations of the true score while attempting to minimize the influence of measurement error.

Reliability

Simply put, reliability refers to consistency in measurement. If only it were that simple. If we are to have confidence that a behavior we measure is meaningful, then we have to have confidence that the measurement is reliable and consistent. There are a number of ways to think about reliability, and we’ll briefly discuss the main ones. It is important to note that reliability is estimated, not measured (Colosi, 1997).

Test-Retest Reliability

This type of reliability is perhaps one of the easier types to understand. Test-retest reliability refers to the consistency in scores when the same test is administered twice to the same group of individuals. Test-retest reliability is calculated by correlating each person’s score from the first administration with that person’s score from the second. Test-retest reliability makes the most sense when you are trying to measure a trait or quality that is assumed to be relatively stable over time (Cohen & Swerdlik, 2005). For example, a researcher may be interested in studying the trait of humility. Many personality traits are assumed to be relatively stable over time, so your humility levels at the beginning of the semester should not be too different from your humility levels one month into the semester. We could then correlate your humility test score with your humility retest score; the resulting correlation coefficient is known as the coefficient of stability (Aiken & Groth-Marnat, 2006; Cohen & Swerdlik, 2005). Generally speaking, the longer the time between test and retest, the lower test-retest reliability is likely to be.
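As a minimal sketch of how a coefficient of stability might be computed, the Python example below correlates test and retest scores for six people. The scores are invented for illustration, and the helper `pearson_r` is written out by hand so the example has no outside dependencies; it assumes the Pearson correlation is the coefficient being used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(variation_x * variation_y)


# Hypothetical humility scores; position i in each list is the same person.
test_scores = [12, 15, 9, 20, 17, 11]    # start of the semester
retest_scores = [13, 14, 10, 19, 18, 10] # one month later

coefficient_of_stability = pearson_r(test_scores, retest_scores)
print(f"coefficient of stability = {coefficient_of_stability:.2f}")
```

A value near 1 would suggest that people kept roughly the same relative standing from test to retest; a value near 0 would suggest little consistency.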

Parallel Forms/Alternate Forms Reliability

One of the benefits of the test-retest approach is that you create a single instrument and administer that instrument twice to the same group of people. However, one of the drawbacks to this approach is that, depending on the interval between testing, some individuals might remember some of the items from test to retest. To avoid this, someone interested in constructing a reliable test could use a parallel forms or alternate forms approach. Although related, these two approaches are technically different (Cohen & Swerdlik, 2005). In a parallel forms test, you would have two versions of a test, Test A and Test B. You would then give both Test A and Test B to the same group of individuals, and you could correlate the outcomes between the two test administrations; this resulting correlation coefficient is known as the coefficient of equivalence (Aiken & Groth-Marnat, 2006). With true parallel forms tests, we would want identical means and standard deviations of test scores, but in practice, we would hope that each parallel form would correlate equivalently with other measures (Cohen & Swerdlik, 2005).

With alternate forms reliability, two different forms of the test are designed to be parallel, but do not meet the criteria for parallel forms (for example, they have non-equivalent means and standard deviations). For instance, instructors often distribute two (or more) versions of a test (perhaps on different colors of paper). This is usually done by the instructor to minimize cheating in a large lecture hall testing situation. One hopes that the different versions of the test (that is, alternate forms) are truly equivalent. This example captures the spirit of alternate-forms testing, but doesn’t qualify as such. In true alternate-forms testing, each student is asked to complete all alternate forms so that reliability estimates can be calculated.

Internal Consistency Reliability

Test-retest, parallel forms, and alternate forms reliability all require that a participant complete two (or more) versions of a measure. In some cases this may not be methodologically possible or prudent. A variety of methods have been developed to estimate the reliability of a measure in a single administration, rather than requiring multiple administrations or multiple forms. The split-half method of estimating the internal consistency of a measure involves splitting the instrument in half and then correlating the scores from the resulting halves. For example, say I created in my Statistical Methods course a 100-item test about measures of central tendency (mean, median, and mode) and measures of variability (range, variance, and standard deviation). Using the split-half method, I would ask students to take the 100-item test, but then I would separate that test into two (hopefully) equivalent halves, such as the 50 odd-numbered items and the 50 even-numbered items. I could then correlate the score from the odds and the score from the evens to obtain an estimate of internal consistency reliability (the correlation coefficient then needs to be adjusted using a separate formula). In fact, this approach is widely used in testing (Aiken & Groth-Marnat, 2006; Cohen & Swerdlik, 2005).
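The odd-even procedure can be sketched in a few lines of Python. Several things here are assumptions rather than part of the text: the test is shortened to 8 items for readability, the six students’ responses (1 = correct, 0 = incorrect) are invented, and the adjustment shown is the Spearman-Brown formula, a common choice for this step; the text itself does not name the formula it has in mind.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(variation_x * variation_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)


def split_half(item_scores):
    """Return (half-test correlation, adjusted estimate) using an odd-even split."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical data: one row per student, one column per item.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 1],
]

r_half, r_full = split_half(responses)
print(f"half-test r = {r_half:.2f}, adjusted estimate = {r_full:.2f}")
```

The adjustment is needed because each half is only half as long as the real test, and shorter tests tend to be less reliable; the adjusted value is therefore higher than the raw correlation between the halves.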

Interrater/Interobserver Reliability

Each of the above reliability estimates focuses on participants’ responses to a test or questionnaire, attempting to address, from a particular sample, the reliability of responses. Sometimes, however, an expert panel of judges is asked to observe a particular behavior and then score that behavior based on a predetermined rating scheme. The reliability between the scores from the raters is known as interrater reliability (also known as interobserver reliability, scorer reliability, or judge reliability; Cohen & Swerdlik, 2005). Let’s say, for example, that you are interested in the level of aggression on a playground at a local grade school. You develop a scoring system for aggressive behaviors, such as name-calling, pushing, shoving, hitting, biting, fighting, and so on. You videotape multiple sessions at the local playground from a location where the children cannot see the videotaping (of course, you’ve gone through all the Institutional Review Board procedures to ensure that you are ethical). Then you have a panel of developmental psychologists individually view the videotapes, coding children’s behavior based on your aggressive behavior scale. Interrater reliability would be useful to determine the level of agreement between the raters in using the scoring system.

There are a variety of methods you could use to calculate interrater reliability. You could first look at a percentage agreement score; that is, on all the behaviors rated, how often did your raters/judges code a behavior in the same behavioral category? The formula you would use to calculate percentage agreement would be: (number of agreements / (number of agreements + number of disagreements)) × 100. Another technique, with two judges, is to calculate a correlation coefficient between the pairs of scores on each behavioral instance. There are multiple approaches to capturing the consistency of raters and judges in these situations.
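The percentage agreement formula is easy to sketch in Python. The two lists of codes below are hypothetical ratings of the same ten playground incidents by two raters, using category names from the aggression example; the function name is likewise just an illustration.

```python
def percentage_agreement(rater_a, rater_b):
    """Percentage of rated behaviors that two raters placed in the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of behaviors")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    # Every coded behavior is either an agreement or a disagreement,
    # so agreements + disagreements equals the number of behaviors rated.
    return 100 * agreements / len(rater_a)


# Hypothetical codes: position i in each list is the same videotaped incident.
rater_a = ["pushing", "hitting", "name-calling", "pushing", "biting",
           "hitting", "shoving", "name-calling", "pushing", "fighting"]
rater_b = ["pushing", "hitting", "name-calling", "shoving", "biting",
           "hitting", "shoving", "name-calling", "hitting", "fighting"]

agreement = percentage_agreement(rater_a, rater_b)
print(f"percentage agreement = {agreement:.0f}%")
```

In this invented data set the raters disagree on two of the ten incidents, both involving the boundary between pushing and a neighboring category, which is the kind of pattern that might prompt a clearer definition in the scoring system.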


Validity

Whereas reliability addresses consistency in measurement, validity addresses the question “Are we measuring what we think we are measuring?” There are at least two major approaches to how we think about validity. One approach comes from the psychometric literature and how psychologists construct new measurement instruments. The classic approach here is to discuss content validity, construct validity, criterion-related validity, and face validity. The other approach comes from our study of experimental design and, in particular, quasi-experimental designs; in fact, some refer to this latter approach as a “Cook and Campbell” approach, in part due to an influential book (Cook & Campbell, 1979) that brought together this conceptualization of validity, as well as a classic listing of threats to validity. We’ll briefly review both major approaches here.

In the classic psychometric approach, there is a trio of C’s: content validity, criterion-related validity, and construct validity. Note that for a measure (an instrument, survey, questionnaire, test) to have validity, all three types of validity mentioned here are important; each is necessary, but not sufficient alone, to establish validity (Morgan, Gliner, & Harmon, 2001). Content validity refers to the composition of items that make up the test or measure. Do the contents of the test adequately reflect the universe of ideas, behaviors, attitudes, etc., that compose the behavior of interest? For example, if you are interested in studying introversion, and you are developing an introversion inventory to measure one’s level of introversion, do the items on the inventory capture the totality of the concept of introversion? More formally, “content validity is concerned with whether the content of a test elicits a range of responses that are representative of the entire domain or universe of skills, understandings, and other behaviors that a test is designed to measure” (Aiken & Groth-Marnat, 2006, p. 97). If you are taking the Graduate Record Exam (GRE) subject test in psychology, content validity asks the question “Are the items truly capturing your knowledge of psychology?” Content validity alone does not mean that an instrument is valid (remember, each type of validity is necessary but not sufficient). Instead, content validity is established through the process of creating test items, including a thorough review of the literature and consultation with experts (Morgan et al., 2001).

Interrater reliability can be used to determine the level of agreement between raters who used a scoring system to find out the level of aggression on a playground.

Science Faction/SuperStock

Criterion-related validity refers to how the measurement outcome, or score, relates to other types of scores. A general way to think about criterion-related validity would be “Given that we now have a score that is reliable, what will this score predict?” Psychologists are very interested in making predictions about behavior, so criterion-related validity can be very useful in practice. Two subcategories of criterion-related validity are concurrent validity and predictive validity. Concurrent validity refers to how the score on a test or inventory is related to your current state of affairs. For example, if a person were to take a mental status exam right now, and this person scores in a certain range, this might tell us right now (concurrently) that this person is suffering from a particular type of mental disorder. Or, when you go to take your driver’s license test, and you receive a passing score, this indicates (hopefully) that you possess current knowledge of safe driving practices. Predictive validity takes current knowledge and attempts to make a prediction about the future, such as a college admissions office using high school GPA as one of the predictors of future success in college, or the scores on a pre-employment test attempting to predict whether you will be a good hire and if you will become an effective manager. Essentially, criterion-related validity addresses the predictability of current events or future events.

Construct validity has been called “umbrella validity” (Cohen & Swerdlik, 2005, p. 157) because all types of validity feed into the overall conclusion about construct validity. Generally speaking, construct validity exists when a test measures what it purports to measure. A construct is a hypothetical idea that is intended to be measured but does not exist as a tangible thing. For example, intelligence is a construct. That is, intelligence is a hypothetical idea that we believe humans and animals possess to varying degrees. If we were to do a post-mortem examination of a person’s brain, we would not be able to extract the part of the brain known as intelligence; intelligence is not a tangible, physical entity. Intelligence is a hypothetical idea that psychologists (and others) construct, and we spend considerable time and energy measuring this hypothetical idea. Much of what we study in psychology consists of constructs such as humility, sympathy, depression, happiness, anxiety, altruism, success, dependence, and self-esteem. To accumulate evidence in support of construct validity, Cohen and Swerdlik (2005) suggest showing that (a) the test measures a single construct; (b) test scores change as a theory would predict, such as increasing or decreasing with age, time, or an experimental manipulation (independent variable); (c) in a pretest-posttest design, test scores change in a predictable and theoretically relevant manner; (d) test scores among different groups of people differ in a theoretically relevant way; and (e) test scores correlate with scores from other tests that theory predicts should be related, and they are related in the manner predicted by theory.

With construct validity, intelligence is a hypothetical idea that psychologists spend considerable time and energy measuring.

Although not part of the “C” trio of validities, face validity is often mentioned as a type of validity, referring to whether the person taking the test believes that it measures what it purports to measure. Face validity is ascertained from the perspective of the test taker, and not from the responses to test items. If the test taker believes that the items are unrelated to the stated purpose of the test, this might affect the quality of responses, or our confidence that the test was taken seriously. Face validity may be more relevant to the “public relations” of the test, rather than the test results themselves (Cohen & Swerdlik, 2005). The 3 C’s of validity (plus face validity) compose the classic test construction approach to validity. These are important concepts to consider when developing a measure of behavior. But there are other considerations as well, such as the level of confidence we have in the conclusions we draw or the generalizability of the results from the present study to other times, places, or settings. Cook and Campbell (1979) offered a different conceptualization of validity, and their ideas are particularly relevant to these considerations.

Cook and Campbell (1979) described four categories of validity: internal, external, statistical conclusion, and construct validity. In fact, these authors conceptualize validity a bit differently from psychometricians when they define validity as “the best available approximation to the truth or falsity of propositions, including propositions about cause” (p. 37). Internal validity refers to the general nature of the relationship between the independent variables and the dependent variables. The chief goal in establishing internal validity is the determination of causality: did the manipulation of the independent variables cause changes in the dependent variables? External validity refers to the question of whether, if a causal relationship does exist between these variables, the relationship can be generalized to other research settings, other samples, or other times. Statistical conclusion validity refers to the sensitivity or statistical power of the experimental situation. In our attempt to determine cause-and-effect relationships, are we using both methodological and statistical approaches sensitive enough to capture causal relationships, if they exist? Finally, construct validity concerns how the operations used in measurement are related to the higher-order constructs they are presumed to measure. For example, researchers may develop a test to measure intelligence, and this new test may have internal, external, and statistical conclusion validity, but does the test truly measure intelligence? This is the question construct validity attempts to answer; in this case, construct validity overlaps with the psychometric approach to validity in asking the question “Are we measuring what we think we are measuring?”

One of the benefits of the Cook and Campbell approach to validity is that it provides not only a framework to evaluate research but also insight into the adequate design of research before it is conducted. Cook and Campbell (1979) painstakingly listed many of the threats to each of the four types of validity. We won’t re-create that entire listing here, but by thinking about the threats to internal validity, for example, you can begin to see many of the factors that are relevant to designing a study as well as conducting the study. Our goal, of course, is to design a research study in such a way as to avoid or minimize threats to validity. Table 7.2 lists the classic threats to internal validity, with a brief definition and example.


Table 7.2: Classic threats to internal validity

| Threat to Internal Validity | Brief Definition | Research Example |
| --- | --- | --- |
| History | Something happens during the experimental session that might change responses on the dependent variable. | If you are collecting data in a large classroom and the fire alarm goes off, this may impact participants’ responses. |
| Maturation | Change in behavior can occur on its own due to the passage of time (aging), experience, etc., with the change occurring separate from the independent variable manipulation. | In a within-subjects design, you have participants view visual information on a computer screen in 200 trials. By the end of the study, participants may be fatigued, and their responses on the dependent variable may change because of time and experience. |
| Testing | When testing participants more than once, earlier testing may influence the outcomes of later testing. | Suppose you design a course to help students do better on the GRE, and students take that test at the beginning of the course and again at the end of the course. Mere exposure to the GRE the first time may influence scores the second time, regardless of the intervention. |
| Instrumentation | A change occurs in the method by which you are collecting data; that is, your instrumentation changes. | If you are collecting data through a survey program on a website, and the website crashes during your experiment, then you have experienced an instrumentation failure. |
| Statistical Regression | When experimental and control group assignments are based on extreme scores in a distribution, these individuals at the extremes tend to have scores that move toward the middle of the distribution (extreme scores, when they change, become less extreme). | In a grade school setting, children who are scoring the absolute lowest on a reading ability test are given extra instruction each week on reading and are retested after the program is complete. Because these children’s scores were at the lowest part of the distribution, these scores, when they change, have nowhere else to go but up. Is that change due to the effectiveness of the reading program, or statistical regression? |
| Selection | When people are selected to serve in different groups, such as an experimental group and a control group, there may be pre-existing group differences even before the introduction of the independent variable. | In some studies, volunteers are recruited because of the type of study or potential implications (such as developing a new drug in a clinical trial). Volunteers, however, are often motivated differently than non-volunteers. This pre-existing difference at the point of group selection may influence the effectiveness of the independent variable manipulation. |
| Mortality | Individuals drop out of a study at a differential rate in one group compared to another group. | In your study, you start with 50 individuals in the treatment group and 50 individuals in the control group. When the study is complete, you have 48 individuals in the control group but only 32 individuals in the experimental group. There was more mortality (“loss of participants”) in one group than the other, which means there is a potential threat to the conclusions we draw from the study. |
| Interaction with Selection | If some of the above threats happen in one group but not in the other group (selection), then these threats are said to interact with selection. | If the instrumentation fails in the control group but not in the experimental group, this is known as a selection × instrumentation threat. If something happens during the course of the study to one group but not the other, this is a selection × history threat. |
| Diffusion/Imitation of Treatments | If information or opportunities for the experimental group spill over into the control group, the control group can obtain some benefit of an independent variable manipulation. | In an exercise study at the campus recreation center, students in the experimental group are given specific exercises to perform in a specific sequence to maximize health benefits. Control group members also working out at the Rec Center catch on to this sequence, and start using it on their own. |
| Compensatory Equalization of Treatments | When participants discover their control group status and believe that the experimental group is receiving something valuable, control group members may work harder to overcome their underdog status. | In a study of academic achievement, participants in the experimental group are given special materials that help them learn the subject matter better and perform better on tests. Control group members, hearing of the advantage they did not receive, vow to work twice as hard to keep up with the experimental group and show them that they can do work at equivalent levels. |
| Resentful Demoralization | When participants discover their control group status and realize that the experimental group is receiving something valuable, they decide to give up and stop trying as hard as they normally would. | In the same academic achievement example as above, rather than vowing to overcome their underdog status, the control group simply gives up on learning the material, possibly believing that the experiment is unfair and wondering why they should bother to try. |
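The statistical regression threat can be hard to picture, so here is a minimal simulation sketch. It assumes a simple hypothetical model (observed score = stable ability + test-day noise, with made-up means and standard deviations) and applies no intervention at all between the two tests; none of these numbers come from a real study.

```python
import random
import statistics

def simulate_regression_to_mean(n_children=10_000, seed=0):
    """Simulate two reading tests with NO intervention in between."""
    rng = random.Random(seed)
    # Hypothetical model: observed score = stable ability + test-day noise.
    ability = [rng.gauss(100, 10) for _ in range(n_children)]
    test1 = [a + rng.gauss(0, 10) for a in ability]
    test2 = [a + rng.gauss(0, 10) for a in ability]

    # Select the children scoring in the lowest 10% on the first test.
    cutoff = sorted(test1)[n_children // 10]
    lowest = [i for i in range(n_children) if test1[i] < cutoff]

    mean_before = statistics.mean(test1[i] for i in lowest)
    mean_after = statistics.mean(test2[i] for i in lowest)
    return mean_before, mean_after

before, after = simulate_regression_to_mean()
print(f"Lowest 10% on test 1: {before:.1f}")
print(f"Same children on test 2: {after:.1f}")
```

Under these assumptions, the lowest scorers improve on the retest even though nothing was done to help them, which is why a reading program evaluated only on extreme scorers can look effective when it is not.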

The ideas of reliability and validity are central to both the observation and measurement of behavior. These are everyday ideas that psychologists utilize to improve their work. To conclude our discussion of those ideas here, let me leave you with a practical example of how researchers use both ideas of reliability and validity in their measure of behavior. Let’s say you are working in a drug rehabilitation clinic where you are helping people overcome an addiction to cocaine. Knowing when someone’s craving for cocaine is elevated in this setting could be important; you might want to be more vigilant with the person, provide greater assistance, offer more intense counseling, and so on. There are multiple scales that measure cocaine craving, but they are rather long. There is a 45-item Cocaine Craving Questionnaire-Now scale and a 33-item Questionnaire of Cocaine Use scale (see Sussner, Smelson, Rodrigues, Kline, Losonczy, & Ziedonis, 2006, for more details about these scales). But administering such long surveys in the midst of someone’s recovery from a cocaine addiction might be unwieldy. So Sussner et al. (2006) set out to create a shorter cocaine craving questionnaire that would be easier to administer but at the same time possess both validity and reliability.


Sussner et al. (2006) called their new instrument the Cocaine Craving Questionnaire-Brief (CCQ-Brief). It is a 10-item survey derived from the longer 45-item CCQ-Now. The key question becomes this: Does the new scale possess reliability and validity? To establish validity, these researchers correlated scores on the CCQ-Brief with pre-existing valid measures of cocaine cravings, and high positive correlations between previous measures and new measures were taken as one indicator of validity. That is, if you create a new measure, and your new scores correlate highly with a measure that has already been shown to be valid, then your new measure begins to accumulate evidence of validity. As for reliability, the CCQ-Brief was studied using an internal consistency (inter-item) approach, yielding a Cronbach’s α = .90. Thus validity and reliability are regular components of the research process, as demonstrated by the Sussner et al. (2006) study. One other note to make about the relationship between validity and reliability: an instrument can be reliable without being valid, but an instrument can be valid only if it measures reliably. We understand the importance of measuring behavior reliably and in a valid manner, but how do we collect data? When we rely on numerical (quantitative) scores, what do the actual numbers mean, and how might we analyze the data once we have them?
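To make the two checks in this example concrete, the sketch below computes Cronbach’s α from item-level responses and a Pearson correlation between a short form and a longer form. The function names and all of the response data are hypothetical illustrations; they are not the Sussner et al. (2006) data or their analysis code. The α computation uses the standard formula α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores).

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per respondent, each holding k item responses."""
    k = len(item_scores[0])
    items = list(zip(*item_scores))                      # one tuple per item
    item_vars = [statistics.variance(col) for col in items]
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: 6 respondents answering a 4-item short form,
# plus their total scores on an established longer questionnaire.
short_form = [[4, 5, 4, 5], [2, 1, 2, 2], [3, 3, 4, 3],
              [5, 5, 5, 4], [1, 2, 1, 1], [3, 4, 3, 3]]
long_form_totals = [160, 70, 120, 170, 50, 115]

short_totals = [sum(row) for row in short_form]
print(f"Cronbach's alpha (reliability): {cronbach_alpha(short_form):.2f}")
print(f"Correlation with long form (validity evidence): "
      f"{pearson_r(short_totals, long_form_totals):.2f}")
```

A high α suggests the items hang together, and a high positive correlation with an established measure is one piece of evidence for validity; neither number alone establishes that a scale is valid.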

7.4 Scales of Measurement and Statistic Selection

This chapter is about observation and measurement. We already know that when we measure the dependent variable in a quantitative fashion, we want numerical scores that are both reliable and valid. But how do we obtain those scores; that is, how do we measure human behavior? The process of translating observations into scores involves scales of measurement. Based in part on a seminal article by Stevens (1946), there are four general scales of measurement: nominal, ordinal, interval, and ratio. This order of presentation is important, because it is generally thought that the nominal scale has the least utility in terms of value and statistical analysis options, and the ratio scale has the most utility and the most statistical options. Said another way, we’d prefer to have ratio scale data rather than nominal scale data in most situations. But before we address data analysis options, let’s review a bit about each type of scale of measurement.

Nominal Scales

On the nominal scale, individuals are placed (or coded) into classifications or categories that are used to keep track of similarities and differences. For example, each basketball player on the court wears a different number on his or her jersey to help others keep track of the players. A higher jersey number does not mean that the player is better, nor does a lower jersey number mean that the player is worse. The numbers themselves do not express relative value, but the numbers are used to track differences. Numbers can also be used to track similarities. For example, if we were interested in conducting a poll on campus about who you plan to vote for in the next presidential election, we might also want to ask prospective voters about their political affiliation (Republican, Democrat, or Independent). As we recorded this information, we might code the data in such a way as 1 = Republicans, 2 = Democrats, and 3 = Independents. Note that here the use of numbers is to classify people into similar categories, and different numbers are used to denote different political affiliations. The numbers themselves do not have implicit meanings; that is, Independents are not one and one-half times better than Democrats, nor do Republicans have half the value of Democrats. The numbers themselves are arbitrary placeholders allowing us to keep track of differences; we could have just as easily coded 14 = Republicans, 3 = Democrats, and 77 = Independents; the numbers selected are arbitrary, used to classify those in a similar category.

Why would we want to classify nominal scale categories with numeric labels? This process facilitates data analysis in statistical programs such as SPSS (Statistical Package for the Social Sciences). But only certain types of analyses are relevant with nominal scale data. Take, for example, the last four digits of your cell phone number. These data are nominal scale data. The last four digits are used to keep track of different phone accounts, but a higher phone number like x8783 does not mean you have a better number than x2334. Those four digits just help keep track of different telephone accounts and lines. However, we could ask a classroom of students to provide us with the last four digits of their cell phone number and then calculate the average phone number (try this sometime with your classmates). You can do this on a calculator, and SPSS will do it for you as well. However, calculating the mean of nominal scale data doesn’t make a lot of sense: with our two numbers above, the average phone number is 5558.50, which doesn’t mean much. You will need to know the appropriate data analysis techniques for different scales of measurement; later in this chapter there is some guidance on these issues. If you truly wanted to have an idea about the central tendency of nominal scale data, the mode would be a better choice. The mode is the most frequently occurring score in the distribution. With our political affiliation example, you might discover that code 2 (Democrats) is the most frequently observed political affiliation on campus. It doesn’t make any sense to average together the 1s, 2s, and 3s, but it is meaningful to know that 2 is the modal score.
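As a quick illustration of the point about central tendency, the sketch below averages the two phone-number endings from the text and then finds the mode of a small set of affiliation codes. The poll responses are hypothetical, invented only to show the calculation.

```python
import statistics
from collections import Counter

# Last four digits of the two phone numbers mentioned in the text.
phone_suffixes = [8783, 2334]
print(statistics.mean(phone_suffixes))   # 5558.5 -- computable, but meaningless

# Hypothetical campus poll, coded as in the text.
labels = {1: "Republican", 2: "Democrat", 3: "Independent"}
codes = [2, 1, 2, 3, 2, 1, 2, 3, 1, 2]

counts = Counter(codes)                  # frequency of each category
modal_code = statistics.mode(codes)      # most frequently occurring code

for code, n in sorted(counts.items()):
    print(f"{labels[code]}: {n}")
print(f"Modal affiliation: {labels[modal_code]} (code {modal_code})")
```

The software will happily compute either statistic; it is up to you to know that only the mode (and the frequency counts behind it) says something meaningful about nominal codes.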

Ordinal Scales

On the ordinal scale, the magnitude of the numbers means something; in other words, a higher number means more, and a lower number means less. There is an underlying continuum expressed with the numbers on the ordinal scale. One example would be when items are rank ordered. If the data are rank ordered in some way, then you are dealing with ordinal scale numbers. Another property of the ordinal scale is that the distance or difference between adjacent numbers is not assumed to be equal; in fact, we assume unequal intervals.

Similar to how jersey numbers help fans keep track of their favorite players, nominal scales help researchers categorize study participants.



On the ordinal scale, numbers are unlike the arbitrary values on the nominal scale. So, if you were to rank order your top 10 movies of all time, the No. 1 movie would be your most favorite, and your No. 10 movie would be your tenth favorite. In this rank order scenario, the lower the number, the better the movie; the number has meaning. Anything with rank order is ordinal scale: your class rank when you graduated from high school, the national rankings of college football BCS polls, the gold-silver-bronze medals of the Olympic Games, and so on. We can analyze ordinal scale data, and there are many techniques available. The statistical approaches utilized to analyze both nominal and ordinal scale values fall under the heading of nonparametric statistics. This term refers to the idea that the data from nominal and ordinal scale measurements may not necessarily be normally distributed, hence specialized statistical procedures are used (more later on how the underlying assumptions of the data influence the statistical approach). One last thought about the ordinal scale: The intervals are assumed to be unequal. Do you remember when Michael Phelps from the United States won an Olympic gold medal in 2008 by .01 of a second? In the men’s 100-meter butterfly, Michael Phelps won the gold with a time of 50.58 seconds, Milorad Cavic of Serbia won the silver with a time of 50.59 seconds, and Andrew Lauterstein won the bronze with a time of 51.12 seconds. The distance between the first place and second place medals was .01 second, while the distance between the second and third place medals was .53 seconds. This is what is meant by uneven intervals: the distance between first and second place is not necessarily the same as between second place and third place.
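The uneven-interval idea can be checked directly from the three finishing times quoted above. This short sketch converts the times to ranks and then compares the rank differences with the underlying time differences; the variable names are just illustrative.

```python
# Finishing times (seconds) quoted in the text.
times = {"Phelps": 50.58, "Cavic": 50.59, "Lauterstein": 51.12}

# Rank order the swimmers: fastest time gets rank 1.
ranked = sorted(times.items(), key=lambda pair: pair[1])
ranks = {name: position + 1 for position, (name, _) in enumerate(ranked)}

# Gap in seconds between each pair of adjacent finishers.
gaps = [round(ranked[i + 1][1] - ranked[i][1], 2)
        for i in range(len(ranked) - 1)]

print(ranks)  # {'Phelps': 1, 'Cavic': 2, 'Lauterstein': 3}
print(gaps)   # [0.01, 0.53]
```

Adjacent ranks always differ by exactly 1, yet the time gaps behind them differ by a factor of more than fifty. Once the data are reduced to ranks, that information is gone.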

Interval Scales

The interval scale builds on the properties of the ordinal scale (Stevens, 1946). So, on the interval scale, the numbers are meaningful. Higher numbers mean something; that is, there is a continuum underlying the number system. In our typical thinking about the interval scale, the number zero is just another number on the scale. Another new addition to our thinking about the interval scale is that the intervals are now uniform and meaningful (thus, the interval scale). One good example of the interval scale (although not overly psychological in nature) is the Fahrenheit scale. A higher number means more heat, and a lower number means less heat. The intervals are uniform and meaningful: the distance between 20° and 40° is the same as the distance between 50° and 70°. Finally, 0° does not mean lack of heat on the Fahrenheit scale; it’s just another number on the scale. There are other examples of interval scales, but they don’t tend to be psychological in nature: latitude and longitude, altitude, a person’s net financial worth, clothing sizes, etc. But what about psychological variables?

The Fahrenheit scale on a thermometer is a good example of an interval scale.

The challenge for the interval scale with psychological variables deals with zero as just another number on the scale. For the next and final scale of measurement (ratio scale), zero is the absence of value, which makes ratios meaningful. But on the interval scale, zero is typically thought of as just another number on the scale. Let’s say you know someone who scored a zero on an intelligence test. Is zero a legitimate score on an intelligence test? Are negative values possible? More importantly, would a zero on an intelligence test imply a lack of intelligence? It’s hard to know what a zero means on a test of intelligence. In actual practice, there may not be many true psychological interval scales. Most psychologists make the assumption, then, that these types of numeric scales with equal intervals are treated the same as ratio scales. In fact, in some places you’ll hear these types of data referred to as interval/ratio scale data. In SPSS, for example, the only options when you select a scale for your variable data are nominal, ordinal, and scale. Of course, there are some dangers to this type of assumption (Labovitz, 1967). Interval and ratio scale data also fall under the category of parametric data, which means additional assumptions are made about the underlying distribution of the data gathered, namely, that the distribution is relatively normal (if the data are not normally distributed, this has implications for the conclusions drawn from the analysis and might dictate that a non-parametric approach be used). But in actual practice, even though there are concerns and limitations, many psychological variables are considered interval/ratio (or scale) variables.

Ratio Scales

On the ratio scale, many of the characteristics previously mentioned still apply. In fact, the ratio scale is our usual use of numbers. There is a quantitative dimension and an underlying continuum for the numbers used, and on the ratio scale, zero is used to identify the lack of something (not just another number on the scale like in true interval scales). When zero is the endpoint on the scale, ratios are now meaningful. For example, 10 inches is twice as long as 5 inches (a ratio), because 0 inches means no length. Four hours is half as much as 8 hours because 0 hours means no time. When 0 means the lack of value, then ratios become meaningful. Ratios are not meaningful, however, on the interval scale. Is 20° twice as warm as 10°? On a psychological test of intelligence, is someone who has an IQ of 120 twice as intelligent as someone who has an IQ of 60? When 0 means something (that is, when 0°F is just another number on the scale) then ratios are difficult to interpret meaningfully.
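One way to see why ratios fail on an interval scale is to re-express the same temperatures on a scale whose zero really does mean “no heat.” The sketch below, offered as an illustration rather than anything from the chapter’s sources, converts Fahrenheit readings to kelvins (a scale with a true zero) using the standard conversion, and compares that with lengths, where changing units leaves the ratio alone.

```python
def fahrenheit_to_kelvin(f):
    """Standard conversion; 0 K is a true zero (absence of heat)."""
    return (f - 32) * 5 / 9 + 273.15

# On the Fahrenheit (interval) scale the arithmetic says "twice as warm"...
ratio_f = 20 / 10
# ...but on a scale with a true zero the same two temperatures barely differ.
ratio_k = fahrenheit_to_kelvin(20) / fahrenheit_to_kelvin(10)

# Length is a ratio scale: changing units does not change the ratio.
ratio_inches = 10 / 5
ratio_cm = (10 * 2.54) / (5 * 2.54)

print(f"Fahrenheit ratio: {ratio_f:.2f}, kelvin ratio: {ratio_k:.2f}")
print(f"Inches ratio: {ratio_inches:.2f}, centimeter ratio: {ratio_cm:.2f}")
```

Because the Fahrenheit zero is arbitrary, the “2 to 1” ratio is an accident of the scale; with a true zero, 20°F turns out to be only about 2% warmer than 10°F.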

So think of ratio scale data as our usual use of numbers, such as counting the frequency of a behavior or asking a person to respond on a scale from 1 to 10. We certainly make assumptions about what these numbers mean, and because the interpretation of true interval scale data is difficult for psychological variables, often in practice we lump together interval and ratio, sometimes referring to this type of data as interval/ratio. In fact, some would say that true ratio scale data are rare in psychology (Becker, 1999). Means and standard deviations make sense with interval/ratio scale data, but not with nominal or ordinal scale data. Oftentimes, psychologists take advantage of these fuzzy boundaries between scale types. As discussed in Chapter 6, a very common scale used in survey research is a Likert-type agreement scale, where the items are declarative statements and you are asked to respond on a scale such as 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, and 5 = strongly agree. We treat these responses as interval/ratio scale data when we calculate the mean response for any particular item. However, when carefully examined, these data are not ratio scale and probably not interval scale, but rather ordinal scale. Let’s say that two instructors are being evaluated at the end of the semester, and one of the items on the course evaluation is “This instructor seemed well prepared for class.” Dr. A might receive a mean score of 4.22, whereas Dr. B receives a mean score of 3.75. We treat these data like interval/ratio data, but the reported score of a value between 1 and 5 is more similar to a rank order score. For example, is the distance on this scale between 2 and 3 the same as the distance between 4 and 5? (Remember, on the ordinal scale, intervals do not need to be uniform; but on the interval and ratio scales, they must be.)
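To see how the choice of scale changes the summary, the sketch below computes both the mean (an interval/ratio summary) and the median (an ordinal summary) for two sets of ratings. The individual ratings are hypothetical; they were chosen only so that the means reproduce the 4.22 and 3.75 quoted above.

```python
import statistics

# Hypothetical ratings (1 = strongly disagree ... 5 = strongly agree),
# invented so that the means match the values quoted in the text.
dr_a = [5, 5, 5, 5, 4, 4, 4, 3, 3]
dr_b = [5, 4, 4, 4, 4, 3, 3, 3]

for name, ratings in (("Dr. A", dr_a), ("Dr. B", dr_b)):
    mean = statistics.mean(ratings)      # treats the data as interval/ratio
    median = statistics.median(ratings)  # treats the data as ordinal
    print(f"{name}: mean = {mean:.2f}, median = {median}")
```

With these made-up ratings the means differ (4.22 versus 3.75) while the medians are both 4, so the conclusion about which instructor “scored higher” depends partly on which scale of measurement you assume.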

So why would studies treat ordinal scale “agreement data” like interval/ratio data? One answer is that the types of statistics that are applied to interval/ratio scale data are taught more, and are more familiar to most psychologists, such as the t test or ANOVA, as opposed to non-parametric statistics used with ordinal scale data. Also, it’s easier to understand and interpret the means of interval/ratio scale data, rather than the median rank orders of ordinal scale data. For now, you just need to know that we make multiple assumptions about data analysis, and sometimes we violate those assumptions. But how would you determine what statistic to use in which data analysis situation? To answer that question, we need to know (1) what types of scales your variables are measured on, and (2) what type of conclusion you want to draw.

There are a number of ways to approach this complex issue (Morgan, Gliner, & Harmon, 2002; Vowler, 2007); a broad approach would be to ask if you are interested in examining the differences between groups or the associations or relationships among variables. If you are interested in understanding the differences between groups, you may take an approach where you would use a t test or F test from an ANOVA. If you are more interested in associations, you might be using a chi-square, correlation, or multiple regression approach. However, we need more answers to more questions prior to proceeding. It would be good to be able to clearly identify our independent variables and our dependent variables (not as easy as it might seem, especially for beginning students). We would also need to know about the type of research design utilized, for example, between groups design, within groups design, or mixed group design. When examining our variables (both independent and dependent), we’d want to know their scales of measurement (nominal, ordinal, interval/ratio); sometimes the distinction you might hear is continuous variable versus discrete variable. There are also assumptions that underlie not only the data collection process but also the requirements of certain data analysis approaches (a common assumption is that the data being analyzed are normally distributed).

So there are many considerations when selecting the appropriate statistic, and what we’ve discussed here is just a brief overview. If you continue in psychology, and especially if you continue doing research, your confidence will likely grow in knowing which statistic to use in which situation.
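The two questions above (what scale is the variable measured on, and what kind of conclusion do you want to draw) can be sketched as a small lookup. This is a deliberately oversimplified illustration built only from the pairings mentioned in this section; the function and its categories are hypothetical, and a real decision would also weigh the research design and distributional assumptions.

```python
# Measures of central tendency this section pairs with each scale.
CENTRAL_TENDENCY = {
    "nominal": "mode",
    "ordinal": "median",
    "interval/ratio": "mean",
}

def suggest_statistic(goal, scale):
    """Return a rough starting point, not a definitive recommendation.

    goal:  "differences" (between groups) or "associations" (among variables)
    scale: "nominal", "ordinal", or "interval/ratio" (scale of the outcome)
    """
    if scale not in CENTRAL_TENDENCY:
        raise ValueError(f"unknown scale: {scale!r}")
    if goal == "differences":
        if scale == "interval/ratio":
            return "t test or ANOVA F test"
        return "a nonparametric test of group differences"
    if goal == "associations":
        if scale == "nominal":
            return "chi-square"
        if scale == "ordinal":
            return "a nonparametric (rank-based) measure of association"
        return "correlation or multiple regression"
    raise ValueError(f"unknown goal: {goal!r}")

print(CENTRAL_TENDENCY["ordinal"])
print(suggest_statistic("differences", "interval/ratio"))
print(suggest_statistic("associations", "nominal"))
```

Treat the output as a prompt for further reading rather than an answer; the point is simply that the scale of measurement and the research question jointly narrow the choices.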


7.5 Graphing Your Results

After you’ve completed your observation and measurement of behavior, eventually it will be time to tell your story. Storytellers have many conventions that they can use to best communicate a story, such as foreshadowing, building action, conflict resolution, etc. In thinking about scientific storytelling (for more, see Landrum, 2008), a graph can help tell a complicated story in an efficient manner. Tufte (1983) suggests that clear, precise, and efficient graphs should: (a) show the data; (b) encourage the viewer to think about the content of the graph rather than focus on the graphic design; (c) avoid distortions; (d) present much data in a small space, making large data sets more coherent; (e) reveal the complexity of the data on both a broad level and fine level; (f) serve a clear purpose: description, exploration, tabulation, or decoration; and (g) have close integration with the text that accompanies the graph.

Not that undergraduates have a lot of spare time, but if you are truly interested in graphical design, read Tufte’s (1983) The Visual Display of Quantitative Information; it is a true classic. But as you can see from the previous paragraph, designing a graph that conforms to all of these characteristics is a fairly tall order, which may be why graphs are not used as much in psychology journals. However, the lack of graphs in psychology writing could have a detrimental effect. When Smith, Best, Stubbs, Archibald, and Roberson-Nay (2002) looked at perceptions of hard and soft sciences, these researchers found that a higher usage of graphs contributes to the perception of a hard science, whereas the greater usage of tables and inferential statistics contributes to the perception of a soft science. One suggestion would be to not select a graph based on whether it will be perceived as hard or soft, but to ask instead: Does the graph help explain a complicated story with clarity, precision, and efficiency (Tufte, 1983)?

In creating your graph, you must also be fair with the data. There are many good guides that can help you with this, including your Publication Manual (APA, 2010), as well as Nicol and Pexman’s (2010) Displaying Your Findings: A Practical Guide for Creating Figures, Posters, and Presentations. Kosslyn (1994) also offered sage advice, with excellent examples; for instance, he presented the graph shown in Figure 7.2 to depict the number of warheads in possession of the United States and (then) U.S.S.R. in 1991. The actual data at the time were that the United States had 11,877 warheads and the U.S.S.R. had 11,602 warheads.

You should present your data in an efficient manner, using clear, concise charts, graphs, and tables.



[Bar graph comparing the United States and the U.S.S.R.; y-axis: Number of strategic warheads (thousands), beginning at 0]

Here is an example of a graph where the data are accurate, but because of the scale selected for the Y (vertical) axis, the graph makes it appear that there is not much difference in the number of warheads in the two countries.

Source: Statistical Program for the Social Sciences

Figure 7.2: Number of warheads, version 1

Depicting the data in this purposeful way tells a particular story, and in this case, the author would be emphasizing the near equivalence of the number of strategic warheads in possession by both entities. But look at the same data as graphed in Figure 7.3. This presentation certainly tells a different story.

[Bar graph comparing the United States and the U.S.S.R.; y-axis: Number of strategic warheads (thousands)]


Here the data are accurate, but because of the scale selected for the Y (vertical) axis, it appears that there is a large difference in the number of warheads in the two countries.

Source: Statistical Program for the Social Sciences

Figure 7.3: Number of warheads, version 2


This graph clearly emphasizes differences. By truncating the y-axis (note the zigzag line on the axis), this bar graph is meant to send a message. Of course, you’ll want to use a graph to send a message, but you want to be fair with the data. So if you truncate an axis, be sure to label it properly. Try not to distort the graph to tell the story; ideally, the data are telling a compelling story, and the graph is the mechanism you choose for effective communication. For tips on graph creation, see the following list.

1. Use bright white paper and black ink in drawing graphs.
2. The ordinate (y axis, vertical line) always depicts the dependent variable, and this line should be about 2/3 the length of the abscissa (x axis, horizontal line), which always depicts the independent variable. For every inch up the ordinate, the abscissa should be 1.5 to 1.6 inches long.
3. Label both axes and provide a figure legend in the graph if necessary; the figure caption is placed on a separate page.
4. In its final form, the lettering on the graph should be no smaller than 1/16 of an inch.
5. The demands of your audience (e.g., your instructor or a professional journal) may dictate other procedures for creating acceptable graphs. When in doubt, consult the Publication Manual (2010) or Nicol and Pexman (2010).

Even when accurate, the depiction of data can be manipulated in a number of ways. It is important to attend to the details of graphs so that you can draw your own conclusions about their meaning.
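The effect of the y-axis scale in Figures 7.2 and 7.3 can be made concrete with a little arithmetic. The sketch below uses the two warhead counts given in the text and compares how tall the two bars are drawn relative to each other when the axis starts at 0 versus when it is truncated. The truncation point of 11,500 is an assumption chosen for illustration (the baseline actually used in Figure 7.3 is not stated), and the function name `bar_height_ratio` is hypothetical.

```python
US_WARHEADS = 11_877    # value given in the text
USSR_WARHEADS = 11_602  # value given in the text


def bar_height_ratio(value_a, value_b, baseline):
    """Ratio of the drawn bar heights when the y-axis starts at `baseline`."""
    if baseline >= min(value_a, value_b):
        raise ValueError("baseline must be below both values")
    return (value_a - baseline) / (value_b - baseline)


# Axis starting at zero: bar heights are proportional to the data.
full_axis = bar_height_ratio(US_WARHEADS, USSR_WARHEADS, baseline=0)

# Truncated axis: 11,500 is a hypothetical baseline, not taken from the figure.
truncated = bar_height_ratio(US_WARHEADS, USSR_WARHEADS, baseline=11_500)

print(f"Axis from 0:      U.S. bar is {full_axis:.2f} times the U.S.S.R. bar")
print(f"Axis from 11,500: U.S. bar is {truncated:.2f} times the U.S.S.R. bar")
```

With the axis starting at 0, the U.S. bar is only about 2% taller; with the assumed truncation, it is drawn more than three and a half times as tall, even though the underlying numbers have not changed.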

7.6 Procedural Matters: Experimental Pitfalls and Precautions

As you go about designing a study and collecting data for analysis and interpretation, you will want to think ahead about some of the general issues involved in drawing conclusions from data. There are some that apply to all of the specific designs we discussed earlier in this book.


In designing and conducting your study, you will want to avoid confounds or confounding variables as much as possible. As a reminder, a confound is a complication in the research design, based on the idea that something outside of the plan of the experiment has influenced your measurement of the dependent variable in addition to (or in place of) your independent variable. In other words, a confound means something else may account for your results, other than the purposeful design of your study. For example, a confound might influence one level of your independent variable manipulation, but not the other levels.

Let’s say that you were interested in testing the effectiveness of a new teaching technique in the classroom. For comparison purposes, you want to teach the same course to the same level of students, grade the same way—in fact, try to do everything the same as much as you possibly can except to manipulate the independent variable, the teaching technique (perhaps traditional lecture style versus a service learning approach). You then decide that


the best way to do this is to find a course big enough that it offers two similar sections taught by the same instructor, but the sections are taught at different times of the day. After your study you find that the service learning group scored significantly better on tests, so you conclude that it is the better technique, right? Well, there may be a confound. It may be that time of day is confounded with your independent variable and also influences your dependent variable. What if one section is taught at 8 a.m. and the other section at 2 p.m.? Is it reasonable to assume that performance might already be different between these two groups before the introduction of the independent variable? If you answer yes, then time of day is a potentially confounding variable.

How do we handle confounds? If you can show that previous studies ruled out these confounding variables influencing your dependent variable, then you can be more confident in your results. You could try to find two sections closer in time to minimize any confounding. You could expand the study and incorporate time of day as an independent variable, with the aid of other instructors teaching the same types of multiple-section courses. Confounds are not necessarily fatal flaws, but they do detract from drawing strong conclusions from your study. While a confound may only influence one level of the independent variable, an artifact influences all levels of the independent variable. Confounds threaten internal validity, whereas artifacts threaten external validity.


When a data collection artifact occurs, the measurement process is distorted, biased, or corrupted in some fashion. In fact, the direction of the distortion is often unknown (it may inadvertently support the experimenter’s hypothesis or detract from it); in essence, we do not know if the artifact is leading us to a Type I or Type II error. The four general categories of artifacts presented here are physical setting, within subjects, demand characteristics, and experimenter expectancy.

Physical Setting

In some cases the physical setting may influence participant performance and lead to data artifacts. A setting that is too warm, too cool, or too humid may detract from participants’ true performance. Noise, general atmosphere, and crowdedness may also influence performance. By being sensitive to these conditions, experimenters can usually provide an adequate atmosphere for participants. If some of these conditions are out of the experimenter’s control, then consistency is the goal: If you believe that the room temperature may affect participant performance, then test all participants at the

The physical setting of your study must be carefully chosen because it may cause data artifacts. What might be problematic about this setting?



same temperature, or turn the potential data collection artifact into an independent variable and systematically test the hypothesis of whether or not the physical setting variable affects participants’ performance. If you think that room temperature is affecting performance on a task, then you could test the hypothesis empirically. Try to arrange for three different rooms, each at a different temperature, and then test to see if temperature does indeed influence task performance.
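One way the three-room test might be analyzed is with a one-way ANOVA, which compares the variability between the room means with the variability within each room. The sketch below computes the F ratio by hand using only the Python standard library. The scores are made up purely for illustration, the function name `one_way_anova_f` is hypothetical, and a between-groups design (different participants in each room) is assumed.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of lists of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-groups sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Made-up task scores for three rooms (illustration only, not real data).
cool_room = [1, 2, 3]
moderate_room = [2, 3, 4]
warm_room = [3, 4, 5]

f_ratio, df_b, df_w = one_way_anova_f([cool_room, moderate_room, warm_room])
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

The resulting F value would then be compared with a critical value from an F table, or a statistical package would report the corresponding p value, to decide whether temperature appears to influence task performance.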

Within Subjects

There are a number of subject-related artifacts to be aware of when collecting data. (In prior editions of the APA Publication Manual, the human participants in a study were called subjects. Even though they are called participants today, the term “subject” is still used in some cases, such as a within-subjects design.) In particular, response sets can influence participant performance. A response set is a pattern of responding seen in a participant that may not accurately reflect the participant’s true feelings on a topic. For example, response set acquiescence involves the participants getting stuck in saying yes repeatedly in a survey or questionnaire. If participants see their own pattern of responding as all yeses, then they may stop reading the questions carefully and answer yes to everything (of course, the way to avoid this is to have questions worded in both directions; that is, to have both yes and no answers indicate whatever measure of interest you are studying in your experiment). See Table 7.3 for an example of how to avoid response set acquiescence.

Table 7.3: Sample survey items’ susceptibility to response set acquiescence

Susceptible to Response Set Acquiescence

1. The instructor held my attention during class lectures.
2. The instructor wrote exams that fairly tested the material covered.
3. The instructor seems to be well prepared for class.
4. The instructor was available for extra help outside of class.
5. The instructor regularly answered students’ questions.

Less Susceptible to Response Set Acquiescence

1. The instructor was seldom able to hold my attention during class lectures.
2. The instructor wrote exams that fairly tested the material covered.
3. The instructor often appeared to be unprepared for class.
4. The instructor was available for extra help outside of class.
5. The instructor rarely answered students’ questions.

Note. These items could be answered on a scale from 1 = strongly disagree to 5 = strongly agree.
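When items are worded in both directions, as in the less susceptible set of items in the table, the negatively worded items are usually reverse scored before the ratings are combined, so that a high score always means a favorable evaluation. Reverse scoring is not described in the text above; it is introduced here as a common convention. The sketch below assumes the 1 to 5 agreement scale from the table note and treats items 1, 3, and 5 of the less susceptible set as the negatively worded ones. The names `reverse_score` and `score_items` are hypothetical.

```python
SCALE_MIN, SCALE_MAX = 1, 5   # 1 = strongly disagree, 5 = strongly agree
REVERSE_WORDED = {1, 3, 5}    # negatively worded items in the less susceptible set


def reverse_score(rating):
    """Flip a rating so that 1 becomes 5, 2 becomes 4, and so on."""
    return SCALE_MIN + SCALE_MAX - rating


def score_items(ratings):
    """Map item number -> rating, with negatively worded items reverse scored."""
    return {
        item: reverse_score(r) if item in REVERSE_WORDED else r
        for item, r in ratings.items()
    }


# A hypothetical respondent who answers "strongly agree" to every item.
all_agree = {1: 5, 2: 5, 3: 5, 4: 5, 5: 5}
scored = score_items(all_agree)
mean_score = sum(scored.values()) / len(scored)

print(scored)
print(mean_score)
```

On the all-positive version of the survey, a respondent who agrees with everything would earn the highest possible average. With mixed wording, the same pattern produces a middling average and contradictory item scores, which makes acquiescent responding easier to spot.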

Response set social desirability comes from participants’ responding in a pattern that they believe makes them look good, or look better than they are. That is, participants are presenting themselves as socially desirable when, in fact, they may not be. If you were to ask participants if they are racist, you would probably obtain an underestimation of the actual number of people who could be considered racist. With socially charged issues it is often difficult to overcome response set social desirability, but with carefully worded questions and multiple approaches to the concept (such as role-playing or simulations), such issues can be studied effectively. Also, there are scales that are used to attempt to


measure one’s level of response set social desirability, such as the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960; Marlowe & Crowne, 1961) and the lie subscale of the MMPI-2 (Pearson Education, Inc., 2007). Interestingly, the MMPI-2 has a fake bad subscale in addition to its lie subscale.

One other common type of within-subjects artifact is known as participants’ self-perception, which occurs when the participants change themselves (on their own) during the course of the study. In many cases, the experimenter wants an assessment of current behavior (although sometimes the goal of a study is, indeed, to change a participant’s behavior). However, this within-subjects artifact occurs when participants decide for themselves to change their own behavior, and this behavior change is not a planned part of the study. A classic example of this occurring comes from the industrial/organizational psychology literature, where assembly line workers placed in a special situation banded together to work hard to impress the researchers (Roethlisberger & Dickson, 1939). The details of this classic study are presented in this chapter.

Demand Characteristics

Demand characteristics, first introduced in Chapter 3, are another data collection artifact, stemming from the participants’ understanding of what the experiment is all about, and potentially responding the way the experimenter wants them to (in a manner of speaking, giving in to the demands or expectations of the experimenter). One method of dealing with this is to disguise the nature of the study so that the participant has difficulty discerning the hypothesis and giving the experimenter what he or she is looking for. Along those lines, the participants could be uninformed about the complete nature of the study and not told about it until the study’s conclusion. This approach is called a single-blind study because the participants are “blind” to (that is, they do not know) the condition of the experiment they are participating in (this has nothing to do with visual abilities, and this term may be considered offensive by some). You should note, however, that this involves the use of deception, and such steps should be considered at length (we mentioned the pitfalls of deception in Chapter 2). Often it is sufficient that the participants know in general about the study, but they do not know what specific condition or group they are in, hence not knowing how to respond to a demand characteristic. If participants cannot ascertain whether they are in the experimental or control group, then a single-blind study is under way and the demand characteristics can be minimized. One method to determine if the independent variable manipulation worked is to simply ask participants about it in a post-experimental interview.

In a single-blind study, participants do not know all the conditions of the experiment they are participating in.



Experimenter Expectancy

One additional type of data collection artifact is experimenter expectancy. This bias occurs because the experimenter (in this case, the person conducting the experimental session) accidentally influences the participants to perform in a certain, unnatural manner. This might happen if two different experimenters were used for the experimental and control groups, if different instructions were used, or if one experimenter was very friendly to the experimental group but cold to the control group. To avoid these effects, the experiment could be performed (run) in one session if feasible, experimenters can be trained to avoid experimenter expectancy cues, or a double-blind study can be performed. In a double-blind study, neither the participants nor the experimenter in the room know which participants are in which group (experimental or control). In this case, the experimenter cannot unknowingly provide performance cues (i.e., expectations) to the participants, because the experimenter does not know which group is which. Someone else helping to administer the experiment knows of the group assignments and reveals them only when the data collection segment of the experiment is over. For more of the classic work on experimenter expectancy, see Rosenthal’s work (1966, 1967).

Pilot Testing Your Study

Think of a pilot test or pre-test as a dress rehearsal prior to conducting your study. It is wise to pilot test, because in measuring human behavior, elements of an experiment can go wrong if details are not attended to. For example, in survey research, a pilot test can help you to determine if participants understand your survey questions and if you are completely covering the topic as you intended (Collins, 2003). There are typically four goals to achieve when pilot testing your survey prior to launch. The survey researchers want to evaluate the draft survey items, optimize the length of the scale for adequate response rate, detect any weaknesses in the survey, and attempt to duplicate the conditions under which the survey will be administered.

In a study asking college students about health risk-taking behavior, Daley, McDermott, McCormack Brown, and Kittleson (2003) effectively used multiple rounds of pilot testing before wide distribution of multiple web-based surveys. Pilot tests indicated, for example, that the time to completion was 22 minutes, that the response rate was 75%, and that students believed the web interfaces were poorly designed. With this information gleaned from the pilot testing, Daley et al. (2003) were able to make changes to the design of the survey prior to launching a data collection effort aimed at over 1,500 college students.

When designing survey research, you may want to ensure respondents (1) know the answers, (2) can recall the answers, (3) understand the questions, and (4) are comfortable reporting the answers in the survey context. For instance, make sure that the survey items are at a reading level that is appropriate for first-year college students. Additionally, when you are collecting data with a Likert-type agreement scale (strongly disagree, disagree, neutral, agree, strongly agree), the survey items should be declarative sentences, and not phrased in the form of a question. By assuring participants that their data are anonymous, and not linking their identity to responses, you encourage honesty about sensitive topics or illegal behaviors. Pilot testing allows you to find most problems that may occur in your study before conducting your study. Here are some quick reminders for you to consider before your pilot testing phase (from Litwin, 1995):


• Are there any typographical errors?
• Are there any misspelled words?
• Does the item numbering make sense?
• Is the font size big enough to be easily read (on paper, on the screen)?
• Is the vocabulary appropriate for the respondents?
• Is the survey too long?
• Is the style of the items too monotonous?
• Are there easy questions mixed in with the difficult questions?
• Are the skip patterns too difficult to follow?
• Does the survey format flow?
• Are the items appropriate for the respondents?
• Are the items sensitive to possible cultural barriers?
• Is the survey in the best language for the respondents?

Manipulation Checks and the Post-Experiment Interviews

In some research scenarios, the independent variable involves a manipulation where the participant is intended to undergo some temporary change in state. For instance, when people are slightly depressed, how do they react when listening to music that has lyrics that are remorseful? A researcher who is testing normal, healthy volunteers may attempt to induce a moderate degree of sadness in his or her participants prior to the exposure to the lyrics. A manipulation check is a methodological procedure that occurs toward the end of the study (sometimes during a post-experiment interview) where the researcher ascertains just how sad (or not) the participant became during the study. In other words, did the intended effect of the independent variable occur? Manipulation checks may be fairly common in certain types of research. For example, in research published in the Journal of Personality in the 1980s and 1990s, over 50% of the published articles included manipulation checks (Mallon, Kingsley, Affleck, & Tennen, 1998). A manipulation check can be used just to see if the independent variable manipulation worked (e.g., Keller & Bless, 2005), or a score can be generated for the level of success of the independent variable manipulation, and this score can be used as a mediating variable for further analysis (that is, the strength of the independent variable can be used statistically to help explain the outcomes on the dependent variables).

A post-experiment interview may not necessarily involve a manipulation check. The post-experiment interview is precisely what it says: after the experiment is complete, the researcher interviews the participant to get an idea about the participant’s perceptions of the research experience. Did he or she understand the task? Did the debriefing provide enough information? A manipulation check may also occur during this sequence. Say, for example, that a researcher wanted to understand the impact of a happy or sad mood on completing an instructor’s course evaluation. Course evaluations are one important component for evaluating an instructor’s teaching effectiveness at the end of a course, and they are more important than students may realize. So, if the student is happy or sad at the moment of the evaluation, does that impact teaching evaluation scores? A researcher in the laboratory may attempt to induce a happy or sad state in a group of participants, and then ask them to complete a teaching evaluation. The manipulation check will attempt to confirm if the participant became happy or sad. Kidd (1977) reminded us that “valid manipulation check measures may be obtainable only in certain types of circumstances, namely those in which the subjects have the opportunity to reflect and report on their psychological state and are


willing to do” (p. 96). Thus, manipulation checks during the post-experiment interview may not always be necessary—it depends on the type of independent variable being used.

Data Collection and Storage

At first, the notion of data collection appears straightforward: In this part of the research, you collect your data. However, like most processes, it’s much more complicated than that. Thomas and Selthon (2003) described the steps that are involved in data collection: (a) plan for the data collection process; (b) test data collection procedures in a pilot test (presented earlier); (c) collect data; (d) code the data for further data analysis (could involve creating a codebook; see more below); and (e) edit the data, checking for accuracy (addressing issues such as missing data and outliers).

There are many methods of data collection available to researchers, and a comprehensive review is not possible here. But for each of these variations, you are likely to find expert advice, whether it be for collecting data on the Internet (Birnbaum, 2004; Cantrell & Lupinacci, 2007; Courtney & Craven, 2005), or collecting and storing qualitative data (Levine, 1985), for example. Regardless of your approach, you planned for data collection, you conducted your pilot tests, and you collected your data; now it’s time to prepare for data analysis. It’s time to explore the data that you have. You’ll hear different terms for this, such as data screening (Pryjmachuk & Richards, 2007) or data verification (Thomas & Selthon, 2003). Pryjmachuk and Richards (2007) advised that “caution demands that, prior to full data analysis, researchers employ procedures such as data cleaning, data screening, and exploratory data analysis” (p. 43). By making sure, as much as possible, that your data are accurate, you help to ensure the integrity of your results. Said another way, if you based your statistical analysis and research conclusions on faulty data, then the conclusions themselves are faulty. It is important to check to see that the data the participants provided were entered and coded correctly. When discussing data cleaning, two frequent topics emerge that warrant our attention here: outliers and missing data.

There are different ways that we can think about outliers. Some can be data entry mistakes, and others can be implausible entries that could be correct but have a high likelihood of being incorrect. Data entry mistakes are sometimes easy to find. For instance, consider the results for a survey using a response scale from 1 = not at all confident to 5 = extremely confident. If you were looking at this type of data, and you saw an entry of 0 or an entry of 6, then you would know that this was an error. Ideally, you could go back to the original survey and replace this value with the correct answer. Sometimes you might see a 22 or 34, which is a data-keying error. Perhaps the actual answer was 2 (for the first example), but for the second example, was the actual response a 3 or a 4? Ideally, you’d go back to the original survey and correct the mistake. If the original data were not available to you, you’d probably delete that particular observation (more on missing data in a moment). Some outliers are easy to identify (0, 6, 22, 34), and there are multiple fixes available.
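The range check described above is simple to automate. The sketch below scans a column of entries for a 1 to 5 response scale and reports the position and value of anything outside that range so that it can be checked against the original survey. The data are hypothetical, and the function name `find_out_of_range` is an assumption made for illustration.

```python
def find_out_of_range(entries, low=1, high=5):
    """Return (position, value) pairs for entries outside the valid scale range."""
    return [(i, v) for i, v in enumerate(entries) if not (low <= v <= high)]


# Hypothetical keyed-in responses on a scale from 1 to 5.
confidence_ratings = [3, 0, 5, 22, 4, 6, 34, 2]

flagged = find_out_of_range(confidence_ratings)
for position, value in flagged:
    print(f"Entry {position} has an impossible value: {value}")
```

A check like this only finds values that are impossible on the scale. A mistyped 4 that should have been a 3 is still within range and can be caught only by comparing the file with the original surveys.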

Outliers typically arise from two sources: coding errors or two different populations within the sample (DiLalla & Dollinger, 2006). Coding errors need to be corrected, either by checking against the original data or deleting that data point (more options exist, as discussed momentarily). Having two different populations in your sample is more


difficult to ascertain. DiLalla and Dollinger (2006) provide an example where you might be testing, in a sample, some individuals who are psychologically healthy and others who suffer from a mental disorder. Ideally, you would have the ability to separate these two populations and then conduct separate statistical tests on the two samples. You can only do this, however, if you suspect ahead of time that you will be capturing two or more populations and you have a reliable and valid measure to allow you to separate the populations after the data are collected. It is important to deal with outliers because they can wreak havoc on your statistical conclusions (Pryjmachuk & Richards, 2007), particularly if your sample is relatively small.

Dealing with missing data can be equally complex. You may give instructions to participants that they should leave blank any questions that they do not want to answer. Or, you instruct participants to answer every question, but they don’t follow instructions. What do you do when the data are missing? As a student, if you are in this situation in a course (or working with a faculty member as a research assistant), your instructor or supervisor will have some guidelines for you.

One conservative approach would be to not use data from a participant unless the data are complete, that is, none are missing. However, we often instruct participants not to answer questions that don’t apply to them, so in many cases the absence of data indicates that the participant was following instructions. You could set a priori (meaning before the fact) a level of acceptable missing data. DiLalla and Dollinger (2006) report that in personality research using surveys, a common threshold is 5% missing data. Thus, if you are asking 100 survey items on a questionnaire, you would keep a participant’s data if he or she left blank up to 5 items, but if 6 or more items were left blank, your a priori rule states that you would not include that person’s data in your data set. Handling decisions about missing data in this way makes much more sense if you set the decision rule prior to the study.
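The a priori rule described above can be written down before data collection begins and then applied mechanically. The sketch below assumes that blank items are recorded as `None` and uses the 5% threshold from the text; the function name `keep_participant` and the example data are hypothetical.

```python
MAX_MISSING_PROPORTION = 0.05  # a priori threshold from the text (5% missing data)


def keep_participant(responses, max_missing=MAX_MISSING_PROPORTION):
    """Return True if the proportion of blank (None) items is within the threshold."""
    if not responses:
        raise ValueError("participant has no items")
    n_missing = sum(1 for r in responses if r is None)
    return n_missing / len(responses) <= max_missing


# Hypothetical participants on a 100-item questionnaire.
five_blank = [3] * 95 + [None] * 5   # exactly 5% missing
six_blank = [3] * 94 + [None] * 6    # 6% missing

print(keep_participant(five_blank))
print(keep_participant(six_blank))
```

Under this rule the first participant is retained and the second is excluded, which matches the 100-item example in the text. Writing the rule as code before the study also documents that the decision was made before the fact.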

A final consideration in the data collection process is data storage. Although this may seem straightforward, it is an important consideration. You’ll want to keep original surveys or files that contain data throughout your project. If you are collecting data with the eventual intention of publishing that research, you’ll need to keep and archive your data for at least 5 years. Not only will you need to keep the originals, but make sure you keep (and back up) your electronic files, such as the SPSS data file and your codebook. You may need to think about a storage plan as well. For instance, when you apply for IRB approval, part of the application asks you about where the data will be stored, and how it will be secured. Depending on the type of research you are conducting, your data may be linked to individuals, and you’ll want to be sure that you take steps to ensure anonymity and confidentiality. Thus, if you are keeping paper files, where will these be located? Will they be stored in a locked filing cabinet? For electronic files, will these be stored on a computer, multiple computers, or USB memory stick? If so, who will have access, and will

Outliers can skew data if there are coding errors or if the sample used to collect data came from two different populations.



the individual files be password protected in case someone finds your data files? There are steps you can take to store your data anonymously, but if that is not possible or plausible, you’ll need to carefully consider a data storage plan.

Once you’ve completed your data collection process, and you are as confident as you can be that your data are clean (i.e., accurate), you’ll be ready for statistical analysis. But the type of data you glean from psychological research very much depends on how the research is designed. That is, the type of research you design will substantially influence both your data collection and data analysis options. There are research basics inherent in all good research designs. The incorporation of the information in this chapter along with previous chapters will help to improve your research efforts and ultimately lead you to better and more meaningful conclusions from the data collected.

Case Study: Precision Matters (Careful: One Size Rarely Fits All)

The entire premise of this chapter is that observation practices and measurement operations matter. They truly do. If we as psychology students are not careful to apply the critical thinking skills learned throughout our undergraduate education, we can be as gullible as the average citizen. Thus, the measures that we use to capture behavior (dependent variables) need to be meaningful if the results from our studies and projects are to have any impact on the discipline. Testing your hypothesis within the context of an applied project means that your study goes out on a limb (so to speak) to make a prediction that could be supported or refuted. That is, there needs to be specificity regarding the hypothesis and what the eventual outcomes would look like if supported or refuted.

But what if your hypothesis were so vague or broad that it would be hard to refute? If it were too vague or too broad, then the outcomes would not have much meaning; sometimes a one-size-fits-all approach results in lackluster outcomes. For example, if your hypothesis were something like “tomorrow the sun will rise in the east and set in the west,” well, that’s not much of a hypothesis. But what if the setup to the hypothesis were not so obvious? Say, for example, you are in a study where the researcher is interested in identifying personality characteristics of the general population. You’ve just completed a 25-item personality inventory (online), and 10 minutes later you receive an email with your results. Based on the analysis of your data, your feedback looks like this:

“You have a great need for other people to like and admire you. You have a tendency to be critical of yourself. You have a great deal of unused capacity that you have not turned to your advantage. While you have some personality weaknesses, you are generally able to compensate for them. Disciplined and self-controlled outside, you tend to be worrisome and insecure inside. At times you have serious doubts as to whether you have made the right decision or done the right thing. You prefer a certain amount of change and variety and become dissatisfied when hemmed in by restrictions and limitations. You pride yourself as an independent thinker and do not accept others’ statements without satisfactory proof. You have found it unwise to be too frank in revealing yourself to others. At times you are extroverted, affable, and sociable, while at other times you are introverted, wary, reserved. Some of your aspirations tend to be pretty unrealistic. Security is one of your major goals in life.”

Most people would tend to agree with this assessment of their personality, precisely because it is so broad and vague! At times this is called the Barnum effect in psychology (after the circus showman P. T. Barnum who frequently promised “something for everyone”); the demonstration originated with Forer (1949). Often we are gullible when it comes to interpreting information about ourselves, and we may not apply the same critical thinking approach to ourselves as we do to the study of others. Whitbourne (2010) wrote about this topic from the perspective of fulfillment. She suggested that we may be gullible in these types of situations because (1) the message is so broad that there literally is something that applies to everyone in such a vague statement; (2) we welcome comforting predictions about the future because the unknown aspect of the future is scary to some; and (3) we are motivated to want to believe statements about our own personality, so we read more into the vague statements than usual, searching for nuggets of truth.

The same broad, vague approach that applies to personality statements can also apply to situations where you want to believe what is on your fortune cookie or an astrology reading or daily horoscope (Ward & Grasha, 1986). Careful observation and measurement can assist in the psychological myth-busting of being gullible and believing in such broad, vague statements. There are risks involved in believing in such myths (Whitbourne, 2010), such as potentially wasting your money, being given poor advice, and ignoring good advice because it lacks the entertainment value of a horoscope or an astrology reading. Assessments that are based on solid science, applying the fundamental principles of observation and measurement, are likely to be much more accurate and predictive of your future than Barnum effect-type statements that are so vague that they nearly fit all.

Reflection Questions

1. Do you read your horoscope? How often? Do you read it for fun, or are there times when you follow the advice given on a particular day? Did you ever follow the advice and discover that it led to a good outcome or a bad outcome? What benefit might there be in applying a scientific approach to recording and measuring systematically the successes and failures of your horoscope readings?

2. What are the types of situations in life where you may be more susceptible to persuasion and influence regarding decisions involving scenarios where a description fits you to a “T”? Ever gone shopping for a new car (or at least a car that is new to you)? What about shopping on Craigslist or eBay? Are there other scenarios besides shopping where you might be more gullible to believe what someone is saying about you?

3. As you think about and reflect upon the applied project that you have designed, what are the key observation and measurement components of your study? What is your dependent variable (or what are your dependent variables)? Are they measured in such a way that precise measurements are possible and a viable test of your hypotheses can occur? Have you avoided a “one-size-fits-all” scenario? How so?

7.7 Causality and Drawing Conclusions from Evidence

Determining cause and effect is one of the most powerful and difficult conclusions achievable in science.

The most powerful conclusion that we can make using science is a cause-and-effect conclusion. To be able to determine the causality of events is powerful, because in theory we could make positive outcomes occur more often and work to prevent negative outcomes from happening as often. For example, it would be beneficial to know what causes marital satisfaction, what causes happiness, what causes college student success, and what causes self-actualization so that we could promote those causes and help individuals strive toward achieving their goals. Conversely, it would be nice to know what causes Alzheimer’s disease, what causes autism, what causes clinical depression, what causes low self-esteem, and what causes suicide so that we could work to prevent antecedent (before-the-fact) causes that lead to these negative outcomes. However, it takes precise methodology to arrive at any level of confidence about causality, and there are many different forms of research questions to ask. Meltzoff (1998) does a very nice job of describing the types of research questions. Meltzoff’s description is summarized in Table 7.4, using generic statements but with realistic examples.

Table 7.4: Types of research questions, with examples

Types of Research Questions | Generic Example | Specific Example
Existence Questions | Does x exist? | Can people have a Facebook addiction? Does sincere altruism exist?
Questions of Description and Classification | What is x like? To what extent does x exist? | What are the best practices of master teachers? What is graduate school like? To what extent are teacher-created tests like the GRE?
Questions of Composition | What are the components that make up x? What are the factors that make up x? | What variables lead to high student satisfaction with college? What are the leading indicators that someone is clinically depressed?
Statistical Relationship Questions | Is there an association or relationship between x and y? | Is one’s age related to GPA? Is there an association between gender and political affiliation?
Descriptive-Comparative Questions | Is Group x different from Group y? | Are males or females more likely to stay in college? For adults returning to school, do parents or non-parents have a better GPA in college?
Causality Questions | Does x cause, lead to, or prevent changes in y? | Does psychotherapy help individuals with dissociative identity disorder? Does attending tutoring sessions lead to better student test performance?
Causality-Comparative Questions | Does x cause more of a change in y than z does? | Is Prozac better than Xanax at helping people deal with depressive symptoms? Does caffeine help with better concentration skills as compared to a placebo?
Causality-Comparative Interaction Questions | Does x cause more change in y than does z under certain conditions but not under others? | Are male Republicans more likely than female Republicans to vote for a Democratic nominee? Are psychology majors more likely to be successful in their health care-based careers than non-majors, but only for psychology majors who attend graduate school?

For many scientists, the ultimate goal is the determination of causality; that is, understanding cause-and-effect relationships. There are three criteria for establishing causality, as summarized by Burns (1997). First, there must be clear temporal precedence. This means that for the cause to be the cause, and the effect to be the effect, the cause must come first and the effect must come second. If both cause and effect occur simultaneously, then we cannot know the cause or the effect. There must be a clear time sequence here. Second, measures of cause and effect must covary. That is, if there is a cause-and-effect relationship, the presentation of the cause needs to yield the effect, but if there is no presentation of the cause, then there should be no effect. If you change the nature of the cause, then you should also be changing the nature of the effect. Lastly, there should be no plausible alternative explanation. If we have adequately applied our research methods, experimental controls, methodological designs, and so forth, then we need to say, with confidence, that there is no other logical explanation for the effect other than the cause. Note that we do not say that we proved that the cause is the reason for the effect, but we infer the relationship when we (a) have temporal precedence, (b) have covariation, and (c) have ruled out plausible alternative explanations (Burns, 1997). Technically speaking, we don’t “prove” anything in psychology, but we disprove.
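The second criterion, covariation, is the one most easily checked with numbers. The sketch below is only an illustration under stated assumptions: the data are hypothetical (tutoring sessions attended and later test scores for eight students, echoing the tutoring question in Table 7.4), and the Pearson correlation is one common way, not the only way, to quantify how strongly two measures vary together. The function name `pearson_r` is ours, not something from the chapter.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    if ss_x == 0 or ss_y == 0:
        # A measure that never changes cannot covary with anything.
        raise ValueError("correlation is undefined when a variable never changes")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data (assumed for illustration only):
# tutoring sessions attended (presumed cause) and later test score (presumed effect).
sessions = [0, 1, 2, 3, 4, 5, 6, 7]
scores = [61, 64, 70, 69, 75, 80, 83, 88]

r = pearson_r(sessions, scores)
print(f"covariation (Pearson r) = {r:.2f}")
```

A strong correlation here would satisfy only criterion (b). It says nothing about temporal precedence or about plausible alternative explanations, which have to be handled by the research design itself.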

7.8 Proving Versus Disproving in Psychology

The notion of falsificationism is important to science and psychology. A key contributor to this notion was Karl Popper, who suggested that the goal of science should not be to confirm or prove theories, but to falsify or disprove theories. The approach taken by Popper and others is not merely semantic double-talk, but has serious methodological implications for how we carry out science and how we advance our knowledge of the human condition. Newell (2005, para. 3–5) summarizes this position nicely when he gives an example of the falsification approach if we were to test

the proposition ‘all swans are white.’ This can never be proven, since that would require checking each and every one of them everywhere; but it can be disproven by finding a single instance of a non-white swan. A theory is scientific, then, if we can say what would possibly cause us to reject it. Although a theory is never proven, if we can falsify it then we force ourselves to look again and come up with a better one. (italics in original)

Researchers develop a general theory and then generate a number of plausible alternative explanations or hypotheses that would defeat or disprove the theory. If the alternative explanations turn out to be correct, then the theory lacks support. If the alternative explanations are not supported, then the theory is still alive and well. We begin with a general idea and numerous alternative explanations; our goal is to disprove the alternative explanations so that the only rational idea left standing is our theory (sort of a “king of the hill” situation, but with ideas). That is how psychological theories are supported: by disproof, not proof. As psychologists-in-training, be careful with the language you use. Sometimes students want to be able to say that they “proved” something, particularly when writing the Discussion section. Remember, we don’t prove anything, but we attempt to disprove competing theories until the only plausible explanation left standing is our working hypothesis.

According to Karl Popper, the purpose of science is to falsify or disprove scientific theory. Do you agree with Popper?

Chapter Summary

The principles of the scientific approach in psychology (observation and measurement) are presented in this chapter. In any experiment or quasi-experiment, basic fundamental decisions have to be made concerning independent and dependent variables. For dependent variables, how will they be measured, and if measured quantitatively (which is frequently the case in psychology), on what scale will they be measured? What operations will be followed to ensure reliability and validity of the data gathered and the conclusions drawn? Once the foundational questions are answered, then a plethora of practical matters must be considered, such as avoiding confounding variables, avoiding data collection artifacts (and threats to validity), pilot testing, manipulation checks, and data collection and storage. With all the care applied to every step of the process, meticulous decisions are made based on the outcomes of the study, taking care not to draw conclusions that are overzealous and/or not supported by the data presented. The complexities and intricacies of this process are just some of the reasons why advanced training is needed (such as the undergraduate degree) so that psychological findings can be properly reported and utilized in an applied manner to help improve the human condition.

Concept Check

1. Which of the following would NOT be an example of a variable?

A. 10 years old
B. Gender
C. Breed of dog
D. Shoe size

2. Which of the following would NOT be a quantitative variable?

A. Ounces of liquid
B. Speed of completion
C. Genre of book
D. Number of items answered correctly

3. Classical test theory claims that a measurement is the

A. sum of knowledge and experience.
B. sum of true score and error.
C. difference between ability and aptitude.
D. difference between right and wrong answers.


4. The split-half method of reliability is a form of

A. test-retest reliability.
B. internal consistency.
C. interrater reliability.
D. alternate forms reliability.

5. A priori analysis refers to analysis

A. done on more than two groups.
B. executed before data collection.
C. completed before other analyses.
D. planned before data collection.

Answers

1. A. 10 years old. The answer can be found in Section 7.1.

2. C. Genre of book. The answer can be found in Section 7.1.

3. B. Sum of true score and error. The answer can be found in Section 7.3.

4. B. Internal consistency. The answer can be found in Section 7.3.

5. D. Planned before data collection. The answer can be found in Section 7.6.

Questions for Critical Thinking

1. Think about the perceptions you had about psychology before you began your formal, college-level study of psychology. Did you think that psychology would be easy compared to some of the other disciplines you might have studied? How do you think about psychology now? Is it as easy as you once thought? Which components of an education in psychology are you finding the most worthwhile, and which components seem disconnected from other avenues of study you are pursuing?

2. You have completed a number of courses in different disciplines, and probably other courses in the social sciences outside of psychology (sociology, criminal justice, anthropology, economics, and so on). How does a psychological approach to studying human behavior differ from the approaches of other social sciences in studying human behavior? To what extent are these principles of observation and measurement similar to or different from the approaches in other social science disciplines?

Key Terms to Remember

alternate forms A test where a researcher develops two different forms of a test that are designed to be parallel but do not meet the same criteria levels for parallel forms.

artifact When the measurement process is distorted, biased, or corrupted in some fashion.

coefficient of equivalence The correlation coefficient that results from a parallel forms test. See parallel forms.

coefficient of stability A correlation coefficient that results from testing and retesting a score over time.


concurrent validity The assessment of how the score on a test or inventory is related to your current state of affairs.

confound An event or occurrence that happens at the same time as your study that is not part of your designed study but can influence its outcome.

construct validity When a test measures what it purports to measure. Also known as “umbrella validity.”

content validity The determination as to whether or not the composition of items that make up a test reflects the universe of ideas, behaviors, and attitudes that compose the behavior of interest.

covary To establish a cause-and-effect relationship, measures of the cause and the effect must vary together: the effect must be evident upon presentation of the cause. If there is no presentation of the cause, then there should be no effect. See temporal precedence.

criterion-related validity The assessment of how the measurement outcome, or score, relates to other types of scores.

external validity The assessment of whether or not a causal relationship can be generalized to other research settings, samples, or times in the event that a causal relationship has been determined to exist between the independent and dependent variables.

face validity The assessment of whether or not the person taking the test believes that the test is measuring what it purports to measure.

falsificationism The concept that the goal of science should not be to confirm or prove theories but rather to falsify or disprove theories.

internal validity The assessment of the general nature of the relationship between the independent variables and the dependent variables. It primarily focuses on the determination of causality and whether or not the manipulation of the independent variables caused changes in the dependent variables.

interrater reliability A method of determining reliability in which two or more raters categorize nominal data and obtain the same result when using the same instrument to measure a concept.

interval/ratio An interval scale presents numbers in a meaningful way and provides equal intervals including zero. In a ratio scale, numbers are used in the typical manner, where 0 = a lack of something. The two scales of measurement are usually combined in psychological research since their interpretation individually can present challenges.

measurement How the responses of individuals are captured for the purposes of research.

operational definition A concise definition that exhibits precisely what is being measured.

parallel forms A test where a researcher administers two versions of a test to the same group of individuals, resulting in a correlation of the outcomes between the two test administrations. See coefficient of equivalence.

pilot test A “practice run” of a questionnaire used to determine weaknesses and optimize the length of the scale for adequate response rate. The conditions in which the survey will be administered are typically replicated as closely as possible to the actual survey administration.


predictive validity When a researcher takes current knowledge and attempts to make a prediction about the future.

plausible alternative explanation The ability to state, with confidence, that there is no other logical explanation for the effect other than the cause.

qualitative variable A variable in which the responses differ in kind or type.

quantitative variables Variables that are measured on a numeric or quantitative scale.

response set A pattern of responding seen in a participant that may not accurately reflect the participant’s true feelings on a topic.

response set acquiescence When participants get stuck in the trend of responding yes repeatedly in a survey or questionnaire.

response set social desirability When participants respond in a pattern that they believe makes them look good, or look better than they are.

scales of measurement Tools used to translate observations into scores in nominal, ordinal, interval, or ratio scales.

split-half method A method of estimating internal consistency that involves splitting the instrument in half and then correlating the scores from the resulting halves.

statistical conclusion validity The assessment of whether or not methodological and statistical approaches used in an experimental situation are sensitive enough to capture a causal relationship.

temporal precedence To determine what is the cause and what is the effect, the cause must come first and the effect must come second. If they occur at the same point in time, then the determination of which is the cause and which is the effect cannot be made.

validity The determination as to whether or not researchers are truly “measuring what they think they are measuring” for the purposes of their research.

variable An entity that can take on different values.
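Of the terms above, the split-half method is the one that describes a concrete computation, so a minimal sketch may help. Everything here is an assumption made for illustration: the responses are hypothetical (five participants, six items scored 1 to 5), the instrument is split into odd- and even-numbered items (one of several possible ways to halve it), and the function names are ours.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


def split_half_correlation(item_scores):
    """Correlate each participant's odd-item total with their even-item total.

    item_scores: one list of item scores per participant.
    """
    # Items 1, 3, 5, ... sit at indexes 0, 2, 4, ...
    odd_totals = [sum(person[0::2]) for person in item_scores]
    # Items 2, 4, 6, ... sit at indexes 1, 3, 5, ...
    even_totals = [sum(person[1::2]) for person in item_scores]
    return pearson_r(odd_totals, even_totals)


# Hypothetical responses (assumed for illustration only).
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 2, 2, 1],
]

half_r = split_half_correlation(responses)
print(f"split-half correlation = {half_r:.2f}")
```

A high correlation between the two halves suggests that the items are measuring the same thing, which is what internal consistency means.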

Web Resources

This website provides a video that gives a short lesson on dependency relationships and establishes the differences between dependent and independent variables.

This website provides definitions, examples, and in-depth interpretation of validity and establishes the differences between different types of validity.

This website outlines how to collect data, including a brief introduction to the process of getting approved by an Institutional Review Board and obtaining informed consent.

This website explains levels of measurement and the best way to apply them in a research setting through defining types of scales and explaining when it is appropriate to use specific scales.
