Popham, Chapter 3 Pondertime & Chapter 4 Pondertime, Due 1/25/2012

Chapter 3 Pondertime (p. 80, #3, #5)
3.  What kinds of educational assessment procedures do you think should definitely require assembly of reliability evidence?  Why?

Popham describes three types of reliability evidence: stability, alternate form, and internal consistency (pg. 62 ff.). If we suppose that some assessment procedures are high stakes, then it seems logical to demand that those procedures be reliable. In other words, we want to have confidence in high-stakes decisions, so the instruments we are using should not vary in test-retest situations, should be accurate measures no matter their particular format, and should not give mixed results at different points in the procedure.
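To make the third type a bit more concrete for myself, here is a minimal sketch of one common internal-consistency statistic, Cronbach's alpha. The function name `cronbach_alpha` and the tiny table of right/wrong item scores are my own hypothetical illustration, not something taken from Popham.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score table.

    item_scores: list of rows, one row per student, one column per item.
    Assumes at least two items and some spread in the total scores.
    """
    k = len(item_scores[0])                       # number of items
    columns = list(zip(*item_scores))             # scores grouped by item
    sum_item_vars = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum_item_vars / total_var)


# Hypothetical data: 4 students, 3 items, scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(cronbach_alpha(scores))
```

For this made-up table the value works out to 0.75. The closer alpha is to 1, the more the items appear to be pulling in the same direction.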

So what are high stakes tests?  If I may be so bold (to co-opt some of Popham’s informal style), any test which groups students, tracks a student, or determines some significant future course of action could be deemed a high stakes test.

Conversely, am I saying that low-stakes assessments need not require any assembly of reliability evidence?  Yes, low-stakes assessments do not require reliability analyses, and that agrees with Popham’s recommendation on pg. 75: “In general, if you construct your own classroom tests with care, those tests will be sufficiently reliable for the decisions you will base on the tests’ results.”


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

5.  What is your reaction to classification consistency as an approach to the determination of reliability?

I would react with shock and surprise, had Popham (2011) not warned us already that “even those educators who know … sometimes mush the three brands of reliability together” (pg. 73).  In checking some other articles, I found that there was an argument going back and forth in Educational Research in 2009 and 2010 on this very topic.  It seems that Newton (2009) wrote claiming that based on “internal consistency…a substantial percentage of students would receive different levels [scores] were the testing process to be replicated.”  At which point Bramley (2010) wrote back to “show that it is not possible to calculate classification accuracy from classification consistency.”

The argument Bramley (2010) uses to refute Newton basically reminds us that reliability, i.e. classification accuracy, depends on some uncertainties in the tested population, which can vary widely irrespective of the questions being consistent.  Interestingly enough, once you admit that the measurements are different and unrelated, it is fair to ask how much they may differ from one another in practice.  Bramley’s (2010) last sentence reads “The author’s experience with both simulated and real data suggests that values for classification accuracy and consistency are often quite close – within about 5 percentage points.”  Talk about a storm in a teacup!
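To keep the two terms straight, I found it helpful to write out how each would be computed. This is only a sketch under my own assumptions: the scores, the cut score of 50, and the function names are all hypothetical. Classification consistency compares the categories from two administrations of a test, while classification accuracy compares one administration against the "true" category. In made-up data I can pretend to know the true scores; with real students nobody can observe them directly.

```python
def classify(score, cut):
    """Return True if the score meets or exceeds the cut score."""
    return score >= cut


def agreement(first, second):
    """Proportion of students placed in the same category by both lists."""
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def accuracy_and_consistency(true_scores, form_a, form_b, cut):
    true_cls = [classify(s, cut) for s in true_scores]
    a_cls = [classify(s, cut) for s in form_a]
    b_cls = [classify(s, cut) for s in form_b]
    accuracy = agreement(true_cls, a_cls)   # observed category vs. true category
    consistency = agreement(a_cls, b_cls)   # observed vs. observed on a retest
    return accuracy, consistency


# Hypothetical scores for 10 students; the true scores are invented.
true_scores = [40, 48, 49, 51, 52, 60, 45, 55, 50, 47]
form_a = [42, 51, 47, 49, 53, 58, 44, 56, 52, 46]
form_b = [39, 47, 47, 52, 50, 61, 46, 54, 49, 46]

print(accuracy_and_consistency(true_scores, form_a, form_b, cut=50))
```

With these invented numbers, accuracy comes out at 0.8 and consistency at 0.7. The particular values mean nothing; the point is only that the two are different comparisons, and that the first one needs information we do not have in practice.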


Bramley, T. (2010). A response to an article published in Educational Research’s Special Issue on Assessment (June 2009). What can be inferred about classification accuracy from classification consistency? Educational Research, 52(3), 325-330. SPU EBSCO url.

Newton, P. E. (2009). The Reliability of Results from National Curriculum Testing in England. Educational Research, 51(2), 181-212. SPU EBSCO url.

Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Chapter 4 Pondertime (p. 109, #2, #5)

2. It was suggested that some measurement specialists regard all forms of validity evidence as construct-related validity evidence.  Do you agree?  If so, or if not, why?

I start with a quote that spoke to me recently from Mighton (2011), who believes “that some educators have been so seduced by the language they use that they can’t clearly see the issues anymore.”  With that as a caveat of sorts, on with my answer.

I’m sympathetic to the argument Popham (2011) makes on pg. 102 that construct-related validity evidence is a more powerful concept, since it can adequately express the meaning of both content-related and criterion-related validity evidence.  The power of construct-related validity evidence lies in the use of empirical evidence and the definition of the construct.  Thus, criterion-related evidence of validity can be thought of as construct-related since it relies on a predictor construct, while content-related evidence can be thought of as construct-related since it uses empirical evidence (say, from specialists or others) to define an unobservable construct (content that is useful) and show that it has been suitably measured.



Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Mighton, J. (2011). The End of Ignorance: Multiplying our Human Potential. Vintage Canada.

5.  What kind(s) of validity evidence do you think classroom teachers need to assemble regarding their classroom assessment devices?

Content-related evidence would show that assessments have good representativeness relative to curricular aims, and assuming the curricular aims were aligned with state/school standards, teachers could be relatively sure that their inferences about each student’s success in the next unit or level (i.e. their grade) were valid.

I don’t see much value in criterion-related validity evidence for a classroom teacher, since it is mostly predictive.  That is to say, in the day-to-day operation of the classroom, I doubt that time should be spent predicting student performance on a criterion that will be evaluated in the next grade level at the earliest.  However, the statistician in me would love to see the correlations that could be built between a student’s performance on a summative assessment in grade X, Unit Y, and that same student’s performance in grade X+1, Unit Z.  Probably the amount of data that would need to be collected, crunched, and compared would make for a pretty expensive corroboration of the time-honored truths that “if you don’t do well at arithmetic, you won’t get algebra” and “if you don’t get algebra, you probably won’t get geometry or trigonometry or statistics, and don’t even think about calculus”.  Which is sad, because those are all pretty different from each other, and learners could be quite diverse in their abilities or interests, which a test in arithmetic (basic math) can’t necessarily predict.*
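For the statistician in me, the correlation I have in mind would be something like a Pearson r between the two sets of scores. A minimal sketch follows, assuming each student has exactly one score in each year; the function name `pearson_r` and the score lists are hypothetical and only there to show the arithmetic.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores.

    Assumes both lists have some spread (non-zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


# Hypothetical scores for the same five students in two successive years.
grade_x_unit_y = [62, 70, 75, 81, 93]
grade_x1_unit_z = [58, 74, 71, 85, 90]

print(pearson_r(grade_x_unit_y, grade_x1_unit_z))
```

A value near 1 would count as criterion-related evidence that the earlier assessment predicts the later one. The arithmetic is the easy part; collecting matched scores across years is where the expense would lie.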

As for construct-related validity evidence, while it may be powerful as a concept, it is probably too much overhead for a classroom teacher to worry about in the day-to-day functioning of the classroom.  However, for as long as the debate over standardized tests is raging, I think a classroom teacher needs to be cognizant of types of validity evidence and what assumptions are being made by theorists and administrators that impact the functioning and procedures of the day-to-day classrooms.

* I make this point based on some accounts in Mighton (2011) where he describes students whom he tutored in middle school and who later went on to higher degrees in mathematics.

Mighton, J. (2011).  The End of Ignorance:  Multiplying our Human Potential.  Vintage Canada.


As I was reading this part of the chapter I created the following diagram to help me keep track of the difference between alignment and representativeness.




  • maryalinger  On January 28, 2012 at 9:33 pm

    Hi John,

    When I was reading the chapters, I was struggling to understand how they would relate to everyday classrooms. I think your comment “for as long as the debate over standardized tests is raging, I think a classroom teacher needs to be cognizant of types of validity evidence and what assumptions are being made by theorists and administrators that impact the functioning and procedures of the day-to-day classrooms” says it all.

  • Taylor Jacobsen  On January 29, 2012 at 9:14 pm

    Your references in the question on classification consistency are interesting. I’m still trying to make sure I understand them. So is Bramley saying that even if a test has classification consistency you cannot necessarily imply that students are being classified into the “correct” groups because it is about reliability and not accuracy? It only means that similar numbers of students are grouped the same way each time the test is taken? So it’s not about how valid the classifications are for each student but how consistent the test is overall? But then he points out that they do tend to line up quite well anyway. Let me know if I’m off here; I’ll move forward on the pretense that I understood your quotes.

    In the end, what do you think? Is classification consistency a good way for educators or researchers to assess reliability? Do you think we can assume reliability when particular students may change groups but the total in each classification is similar?

  • Chris Ashcraft  On January 30, 2012 at 12:45 am

    It would be nice to know that my tests have criterion-related validity – meaning that the test scores are good predictors of success in their year-two or college level course. It would be a tragedy for them to do well on my exams, but not be prepared for the next level of their education. That being said I would not want to take on the evaluation necessary to make that determination.
