Tag Archives: EDU6172

Test/Quiz Distractor Analysis for Assessment Course

We were asked to complete an assignment in which an assessment device (a quiz or test) was analyzed for effectiveness using two indices, namely the

Difficulty Index, p: the proportion of students answering an item correctly

(Popham, 2011, p. 257)

And the Item Discrimination Index, D: the difference between the proportions of high-scoring and low-scoring students answering the item correctly

(Popham, 2011, p. 260)


My submission for this assignment is here.

The goal is to improve your assessment tool and instruction by diving deep into each question on your test or quiz and examining patterns in responses, as highlighted by these indices.
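For the curious, the arithmetic behind both indices is small enough to sketch in a few lines of Python. This is a minimal sketch using the common definitions (p is the fraction of students answering an item correctly; D subtracts p for the lower-scoring half of the class from p for the upper-scoring half); the function names and response data are my own invention, not from Popham.

```python
# Sketch of the two item-analysis indices; all data below is invented.

def difficulty_index(responses):
    """p: fraction of students who answered the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

def discrimination_index(responses, totals):
    """D = p(upper half) - p(lower half), halves formed by total test score."""
    ranked = sorted(zip(totals, responses), reverse=True)  # best students first
    half = len(ranked) // 2
    upper = [r for _, r in ranked[:half]]
    lower = [r for _, r in ranked[-half:]]
    return difficulty_index(upper) - difficulty_index(lower)

item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]              # one item, ten students
totals = [95, 90, 88, 85, 80, 60, 55, 50, 45, 40]  # total test scores
print(difficulty_index(item))             # 0.5 (a mid-difficulty item)
print(discrimination_index(item, totals)) # ~0.6 (item favors strong students)
```

A p near 0.5 with a solidly positive D is usually what you want; a negative D flags an item that strong students miss more often than weak ones, which is exactly the kind of pattern this assignment asks us to hunt for.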


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

TPA Task 3 Dry-Run

We were asked in EDU6172 to do a TPA (Teacher Proficiency Assessment) Task 3 assignment.  Here is my submission.

Popham, Chapter 6 Pondertime & Chapter 7 Pondertime, Due February 8, 2012

Chapter 6 Pondertime (p. 161, #1, #3)
1.  If you were asked to take part in a mini-debate about the respective virtues of selected-response items versus constructed-response items, what do you think would be your major points if you were supporting the use of selected response test items?

Selected-response test items can be more numerous; that is, students can complete them more quickly than constructed-response items, and selected-response tests can also be graded more quickly.  I would readily admit, though, that selected-response items do not give nearly the insight into the real thought processes of the test-taker that constructed-response items provide, and there is more chance of guessing a correct answer on a selected-response test than on a constructed-response test.

3.  Why do you think that multiple-choice tests have been so widely used in nationally standardized norm-referenced achievement tests during the past half-century?

To put it quite simply, multiple-choice tests are easier to grade and easier to analyze.  The range of possible scores on a multiple-choice test is simply a function of the number of questions; that is, you can explicitly enumerate all the possible scores and then determine how many students fall into each output category or grouping.

Chapter 7 Pondertime (p. 184, #1, #2)
1.  What do you think has been the instructional impact, if any, of the widespread incorporation of student writing samples in high-stakes educational achievement tests used in numerous states?

Teachers who want their students to do well on high-stakes educational achievement tests are under huge pressure to tailor their instruction to help students succeed on such tests.  For student writing samples, this means teachers deliver instruction that has students practice the skills needed to produce writing samples of high quality.  Tips such as the 5-paragraph essay format, narrowing the topic quickly and precisely, and writing with the right combination of detail and brevity are key.  Another instructional impact may be the loss of time for other classroom topics as preparation is made for doing well on the writing-sample portions.

According to Wikipedia, the writing section of the SAT was added in March of 2005  (almost 20 years after I took the SAT).  I should do some research on whether data since that time has proven the Writing section to be valuable or useful.


SAT. (2012). In Wikipedia. Retrieved February 10, 2012, from http://en.wikipedia.org/wiki/SAT

2.  How would you contrast short-answer items and essay items with respect to their elicited levels of cognitive behavior?  Are there differences in the kinds of cognitive demands called for by the two item types?  If so, what are they?

Short-answer items demand limited levels of cognitive behavior, at least in comparison to essay items.  The longer form demands, on average, more recall and reasoning from the student, or at least more regurgitation of opinions heard or given during classroom discussions.  The real issue, I think, is cognitive demands.  A short-answer item, by definition, may be answered with a mere phrase or paragraph, and thus does not require much application of new learning or thinking (i.e., extrapolation or interpolation from the sources and opinions of others).  The essay-based test question forces a student to mentally perambulate through sources and opinions, hearsay and argumentation, and prove that he or she can either replicate a standard argument or come up with a new one.  In this case I am equating “argument” with a chain of assertions more or less supported by logic or sources, which necessitates some development (i.e., proposition, inference, conclusion).

That perambulation makes extreme cognitive demands if the student is not to fall into some slough of whimsy or crevasse of fallacy.  A student who succeeds on an essay item has definitely met higher cognitive demands than those of the short-answer question.  An interesting follow-up question might be:  “Is more better?”

According to the Wikipedia article on the SAT (2012), the writing section of that test has been studied since its inception in 2005, and there is some evidence that the longer the essay, the better the score.  This may be an artifact that favors the rambling student!


SAT. (2012). In Wikipedia. Retrieved February 10, 2012, from http://en.wikipedia.org/wiki/SAT


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Popham, Chapter 15 Pondertime & Chapter 16 Pondertime, Due March 7, 2012

Chapter 15 Pondertime (p. 384-385, #1, #3)
1.  If you were devising a plan to promote dramatically improved evaluation of the nation’s teachers, how would you go about doing so?

[hmm…where to start on one of the most hotly debated topics of our current political atmosphere.]

First, let’s get something straight: public school teachers are currently bureaucrats, which is to say their environment is fundamentally *not* profit- or results-driven like private industry.  The private sector, the profit sector, is where I come from, having left Microsoft in 2011.

Second, we could dream up ways of improving evaluation, but until teachers overwhelmingly see the value of evaluation, they will be manipulated by district, union, and media.  (Actually, I think the substance of this point goes back to a quote from Bill Gates.)

Third, New York recently (February 2012) published rankings of 18,000 public school teachers.  The “value added” plan for improving evaluation of the nation’s teachers seems to be engendering a lot of debate.  Despite the rhetoric, private and parochial schools seem to have no problem measuring teacher effectiveness based on the “product” or “outcome”: that is, have students learned or not, and can they prove it in some way, i.e., on a standardized test or exam.

Finally, all that is noble and virtuous about our education system, that children are treated fairly and that each one is nurtured to achieve his or her full potential, cannot really be fostered in a cut-throat, competitive, toxic teacher environment.  The best plan for improved evaluation is one that teachers themselves agree to, implement, and believe in.  It must be peer-based, must have real rewards (and consequences!), and must focus on growth and improvement, not punishment and the status quo.

3.  When teachers evaluate themselves (or when supervisors evaluate teachers), various kinds of evidence can be used in order to arrive at an evaluative judgment.  If you were a teacher who was trying to decide on the relative worth of the following types of evidence, where would you rank the importance of students’ test results?
    a.  Systematic observations of the teacher’s classroom activities
    b.  Students’ anonymous evaluations of the teacher’s ability
    c.  Students’ pretest-to-posttest gain scores on instructionally sensitive standards-based tests
    d.  A teacher’s self-evaluation

Would your ranking of the importance of these four types of evidence be the same in all instructional contexts?  If not, what factors would make you change your rankings?

Below is my list of evidence that should be used in evaluating a teacher, ordered from heaviest weight to lightest.

1.  I believe in teacher self-evaluation against clear mutually-acceptable criteria and reasonable expectations informed by work load and experience.  Teachers are making evaluations of their students in both significant and insignificant ways all day, every day.  Teachers are able to evaluate themselves.  If blind spots develop or are recognized they should be highlighted in a coaching atmosphere.

2.  I put pretest-to-posttest gains next, because I think #1 actually drives #2.  If I wanted to prove in a self-evaluation that I was growing and having more impact on authentic student learning then I would jump at the chance to present *data*.  That means I would pull out assessments which show that gains have been made, that I have added value.  Notice that I wouldn’t put that in the newspaper or broadcast in the media, but I would use those data for self-evaluations.

3.  I would put observations next.  I think that nothing beats peer review of teacher practice.  All of the high-flying charter or private schools use something like this as a method of continual process improvement.  It is, I believe, an essential piece of teacher evaluation.

4.  I would put students’ evaluations last.  Recall that students are minors and their maturity is often questionable.  To place high-stakes or career-impacting decisions in their hands seems foolhardy.  Nevertheless, I love getting the data; I would just de-emphasize it, hence it comes in last in my list of priorities.

I realize that instructional contexts vary, but I think my descriptions above are suitably general that the rankings would still hold.  All teachers should set goals and self-evaluate.  All teachers should give before-and-after-type exams or otherwise gauge their students’ improvement.  Observations are key because teachers should get peer and master-teacher feedback and challenges to improve…
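Item #2 above mentions jumping at the chance to present *data* on pretest-to-posttest gains. A hypothetical sketch of the kind of data pull I have in mind (the names and scores are invented):

```python
# Invented pretest/posttest scores for a self-evaluation data pull.
pre  = {"Ana": 45, "Ben": 60, "Cruz": 72}
post = {"Ana": 70, "Ben": 78, "Cruz": 85}

# Per-student gain, plus the class mean gain, as evidence of added value.
gains = {name: post[name] - pre[name] for name in pre}
mean_gain = sum(gains.values()) / len(gains)

print(gains)               # {'Ana': 25, 'Ben': 18, 'Cruz': 13}
print(round(mean_gain, 1)) # 18.7
```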

Chapter 16 Pondertime (p. 409, #1, #3)
1.  If you were asked to make a presentation to a district school board in which you both defend and criticize a goal-attainment approach to grading students, what would  your chief arguments be on both sides of this issue?

Goal-Attainment Grading:  Pros and Cons

Pros:
-  having clear “goals” with clear definitions of “attainment” can be more readily communicated to students, parents, and other staff
-  potential to decrease the variability of grading between students, i.e., it may decrease some common tendencies (to the norm, to be too harsh, to be too lenient) that sometimes exist in grading
-  since instruction is based on goals (standards), and assessment is ideally focused on goals, it is a logical extension that the communication back to students and parents (grades) should be based on goals attained or not

Cons:
-  there is no accounting for effort in the goal-attainment approach; at some grade levels and in some situations, a notion of effort expended by students can be very informative
-  given the number of goals needed/required, this could be a more labor-intensive grading system
-  there are other interesting variables which teachers would like to report on for certain students, and goal-attainment grading doesn’t capture them all

3.  If you were trying to help a new teacher adopt a defensible grading strategy, what advice would you give that teacher about the importance of descriptive feedback and how to communicate it to students and parents?

Grading by its very nature is a sorting process fraught with imprecision.  To reduce the perception that grading is arbitrary or subjective, and thus not defensible to the student or parent, it is very important that feedback be descriptive.  By descriptive we mean that any deviation from the standard that a teacher claims for a student is supported with evidence, and that the evidence inherently points to the improvements being requested of the student, showing the way a grade can be improved or, conversely, worsened.  The beginning teacher should avoid thinking that grading is merely proof of doing one’s job, and instead focus on grading as a communication of goal attainment (or lack of attainment) to all interested parties.  Once that groundwork is laid, the more interesting conversation about how attainment is measured can begin and become the focus of any improvement plans or rewards.


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Popham, Chapter 13 Pondertime & Chapter 14 Pondertime, Due February 29, 2012

Chapter 13 Pondertime (p. 332, #1, #2)
1.  If you had to use only one of the three individual student interpretation schemes treated in this chapter (percentiles, grade-equivalent scores, and scale scores), which one would it be?  Why did you make that choice?

Given the constraint of only using one score interpretation scheme, I would use percentiles. I am imagining that my most prevalent reason for interpreting scores is in talks with parents and students. I like the advantages that percentiles are fairly easy to understand, and I think I could explain norm group issues with most people in a fairly short time or with little difficulty.

(Popham, 2011, p. 322)
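Part of why percentiles are explainable is that the underlying computation is simple. Here is a sketch using one common definition (the percent of the norm group scoring strictly below the student); the norm-group scores are invented:

```python
def percentile_rank(score, norm_group):
    """Percent of norm-group scores falling strictly below the given score."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)

norm = [48, 52, 55, 60, 61, 63, 67, 70, 74, 80]  # invented norm group
print(percentile_rank(63, norm))  # 50.0 -> "better than 50% of the norm group"
```

Note that definitions vary (strictly below vs. at-or-below), which is exactly the kind of norm-group nuance I would want to be ready to explain to parents.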

2.  It is sometimes argued that testing companies and state departments of education adopt scale-score reporting methods simply to make it more difficult for everyday citizens to understand how well students perform.  (It’s easier for citizens to make sense out of “65% correct” than “an IRT scale score of 420.”)  Do you think there’s any truth in that criticism?

By now I think we are all used to Popham’s (2011) cynicism. Given the amount of information that barrages “everyday citizens” today, it is probably encouraging that they even try to understand stories about student performance. I think that a story that is discussing scores probably already has an underlying agenda that will necessarily preclude any extensive discussion of scoring philosophies and methodologies. For example:

“Scores are going down: Pass the Levy to Arrest the Free-Fall” or
“Scores are going up: Defeat the Levy since it is not Needed”

[If I had more time I would find some recent stories about student scores and discuss them here.]

NOTE: I found this chapter pretty successful at staying above the fray of standardized testing, and its opponents and proponents. However, when I saw that Popham (2011) had a reference to an article by Jo Boaler (2003) on Riverside, I had to read it. You ought to read it too, since it gives some detail on how and why standardized testing and the reporting of scores on said tests frequently goes very wrong for our underserved populations.

Chapter 14 Pondertime (p. 350, #1, #2)
1.  Can you think of guidelines, other than the two described in the chapter, to be used in evaluating a classroom teacher’s test-preparation practices?  If so, what are they?

I am in a real quandary on this question.  Part of my dilemma is due to an SAT preparation course that I will start teaching next week at my school.  This class is also the basket into which I have thrown all of my TPA “eggs”.  Before I read this chapter, I was going to base my lesson on readily accessible previous forms (“ethically not OK”?), study guides that have the same format as the test (“educationally indefensible”??!), and going over some test-taking tips.

[This quandary hit me the evening of 2/27, and I was cranky all day 2/28.]

Now, after ruminating on it, I don’t want to add guidelines; I want to remove one, because I think that educational defensibility is a crock, for a few reasons.

First, the test questions are the learning goals.  We learned this in our “Understanding by Design” tasks in EDU6171.  You start with the assessment.  You work backward from there, building a gradual and compelling chain of lessons and learning activities that virtually ensure that the assessment can be successfully completed.

Second, if it isn’t tested then it isn’t learned.  All your best lessons are mere vapor, unless you test for that information and demand recall.  The only way a high school student can prove that they were doing anything for 12-13 years of education is if there is a test on it.

Third, testing is not going away.  I don’t think standardized testing should go away either.  I took a couple of AP tests, I took the PSAT, the SAT, I took the ASVAB, I took the GRE, I took the WEST-B, I took a couple of WEST-E’s and I know people that took the EIT and the Foreign Service Exam.  Popham urges us to not teach to the test but if the test were a work of staggering genius which could really measure what we thought students should all know and measured it in a way that we all thought was fair, we would definitely teach to that standard.  The truth is testing needs to go that way and not retreat before the onslaught of those who want to relax a simple standard of one student, sitting quietly, writing out all that he or she knows about a given topic.

Based on the fact that tests have valid goals, that testing forces students to drill and rehearse,  repeat and remember, and that testing is not going away, I think teaching to the test is fundamentally defensible.  So teach to the test, but if you are still holding onto your quaint purist notions, you can, like Popham, qualify that it is the test or rather its contents to which you are teaching.

I resolve to make real connections between test content and standards.
I resolve to use every ethical means possible to get my students to succeed at standardized and other tests. 
I resolve to view the test as a minimum bar, a flawed, imprecise and quirky bar, but a necessary bar.
I resolve to thus teach to the test and then teach, teach, teach some more. 
I resolve to quit belly-aching about the test, and start teaching. 

Finally, I think testing and test preparation is a social justice issue.  Take a look at the following data and see if you can spot the trend(s)…

(College Board, 2011, p. 4)

I teach to the exam so that students can jump the barriers that their family income has presented to them.  I teach to the exam so that they can get into college, stay there, and then get their children into college.  It’s the long view, and it’s a definite challenge, but it starts with excelling on the standardized tests we have right now, as a minimum.

2.  How about test preparation practices?  Can you think of any other sorts of test-preparation guidelines that are meaningfully different from the five described in the chapter?  If so, using the chapter’s two evaluative guidelines or any new ones you might prefer, how appropriate are such test-preparation practices?

Teachers are in an arms race with the test creators, and that tension and dynamism is exactly what we need.

I find it mildly interesting that there are no test preparation practices that pass the Educationally Defensible test, but fail the Professional Ethics test.  Maybe that is a hint at something we are missing here.  Does faulty ethics automatically imply that something is not educationally defensible?


Boaler, J. (2003). When Learning No Longer Matters: Standardized Testing and the Creation of Inequality. Phi Delta Kappan. 84(7). pp. 502-6. Retrieved February 26, 2012 from EBSCO

CollegeBoard. (2011).  2011 College-Bound Seniors:  Total Group Profile Report.  CollegeBoard.  Retrieved February 28, 2012 from http://professionals.collegeboard.com/profdownload/cbs2011_total_group_report.pdf

Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Guidelines:  for evaluating test-preparation practices.
(Popham, 2011, p. 338).
(Popham, 2011, p. 339).

[Test Preparation] Practices:  methods of getting ready for tests.
1. Previous-form preparation. (e.g. you stole SAT test question booklets out of a dumpster after they were administered)
2. Current-form preparation. (e.g. you photocopied the SAT test which will be given on March 3rd, and sold it to students)
3. Generalized test-taking preparation. (e.g. “if you can rule out 2 answers, consider guessing”)
4. Same-format preparation.  (e.g. “SAT questions look like this, sound like this, use these tricks”)
5. Varied-format preparation.  (e.g. I will not use the same font or layout that the SAT uses, I have my pride!)

The above summarizes practices described in Popham (2011) that have been known to be used in preparation for high-stakes tests.

(Popham, 2011, p. 344).


And finally a Gedankenexperiment:

What if a test could be devised that was:

a. fair to all takers (ELL, low SES, ethnically diverse, any gender or orientation), with no question features that could favor or identify subgroups of test takers, and
b. aligned to standards: it neither left out any concepts nor added anything superfluous.

Such a test—despite all of its perfection—would still have detractors; such a test would still cause people to be against testing in general; such a test would still be blamed for society’s ills.  I think that is because the average person has a deep resentment of ranking, especially when that ranking does not put the average person at the top, where they tend to think they belong, despite the statistical improbability of that view.

Popham, Chapter 11 Pondertime (p. 267, #1, #2), Chapter 12 Pondertime (p. 303, #2, #4), Due February 22, 2012.

Chapter 11 Pondertime (p. 267, #1, #2),
1. Why is it difficult to generate discrimination indices for performance assessments consisting of only one or two fairly elaborate tasks?

Popham (2011) writes that “because educators have far less experience in using (and improving) performance assessments and portfolio assessments, there isn’t really a delightful set of improvement procedures available for those assessment strategies” (p. 265).

But let me see if I can tease out a reason why this might be so.  Discrimination indices are based on students getting an answer flat wrong, which is easy to do on a selected-response test.  On a constructed-response test, one can imagine that the range of scores is more spread out; due to partial credit on two “fairly elaborate tasks,” student scores are fairly dispersed.  Thus “right” and “wrong” on either task depend on the cut score, which doesn’t really indicate exactly what a student knows or doesn’t know on the item, merely that some knew more and some knew less.
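A hypothetical sketch of that cut-score dependence: dichotomizing the same partial-credit task scores at two different cut scores yields two quite different discrimination values (all numbers invented).

```python
def discrimination_at_cut(task_scores, totals, cut):
    """D after dichotomizing partial-credit scores at a pass/fail cut."""
    passed = [1 if s >= cut else 0 for s in task_scores]
    ranked = sorted(zip(totals, passed), reverse=True)  # best students first
    half = len(ranked) // 2
    upper = [p for _, p in ranked[:half]]
    lower = [p for _, p in ranked[-half:]]
    return sum(upper) / half - sum(lower) / half

task   = [9, 8, 7, 6, 5, 4, 3, 2]          # partial-credit scores on one task
totals = [80, 75, 70, 65, 60, 55, 50, 45]  # total scores for the same students

print(discrimination_at_cut(task, totals, cut=7))  # 0.75
print(discrimination_at_cut(task, totals, cut=3))  # 0.25
```

Same students, same responses; only the cut score moved, and D changed radically, which is why the index tells us little about the task itself in this situation.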

Without a breakdown of scoring for each sub-task or criterion, it would be impossible to say which part of instruction was weakest, and I don’t think that breakdown is being given to us in this example.

And I have no idea where to start if the two elaborate tasks are being graded holistically…

2. If you found there was a conflict between judgmental and empirical evidence regarding the merits of a particular item, which form of evidence would you be inclined to believe?

In the case of a conflict between judgmental and empirical evidence, I would tend to go with judgmental evidence, since there is nothing really comparable to human feedback on an exact question.  However, now that I have said that, the number geek in me loves the idea of getting students unfamiliar with the instruction to also take my test, so that I have an approximation of discriminators from uninstructed groups.  It seems like that would remove the bias of colleagues who “want to help me out” and may avoid giving me truly objective feedback.

Chapter 12 Pondertime (p. 303, #2, #4)
2. What strategies do you believe would be most effective in encouraging more teachers to adopt the formative-assessment process in their own classrooms?

A couple of ideas spring to mind; let’s take a look at each in turn.

First, provide a technology or trick that makes it easy to get real-time feedback from the class on how much they are understanding.  I think teachers do this all the time with the hopelessly incomplete and inaccurate question posed to a classroom full of scribbling or sometimes distracted kids:  “How are people doing?  Are you getting this?”

But that’s just enabling the zeroth-order, simplest interpretation of formative assessment.  The real strategy for encouraging more teachers is to make sure they understand formative assessment, and then wage an all-out education blitz (billboards?  formative-assessment trailers on all campaign ads?) proclaiming that formative assessment is a useful strategy.

I was most curious to see that the initial positive impetus for formative assessment happened right around ESEA/NCLB.  That may have tainted it, to be fair, since it is widely held that ESEA/NCLB is at best a large stick without a carrot, and at worst a failed effort.

The re-authorization of ESEA/NCLB is by no means certain, but were it to be championed or improved dramatically, we could write our legislators and ask them to mention formative assessments in the re-authorization legislation?  Hmmm…

4. The chapter was concluded with some speculation about why it is that formative assessment is not used more widely.  If you had to choose the single most important impediment that prevents more teachers from employing formative assessment, what would this one impediment be?  Do you think it is possible to remove this impediment?

In my opinion the single biggest impediment to teachers doing formative assessment is inertia, or as Popham (2011) writes, “the inherent difficulty of getting people to change their ways” (p. 297).  It seems as if teachers are bombarded these days with methods and workshops that claim to make learning more effective, yet there is no planning-period time to actually improve or innovate on lessons.  Teachers are stuck in a cycle of wanting to improve lessons while facing large workloads of grading and keeping up, so that improving lessons takes a back seat.  So I guess I am actually saying that the biggest impediment is resistance to change, magnified by the utter paucity of planning-period time.  The removal of this impediment would require an increase in planning-period time; in other words, give teachers more structured time to re-think lessons.

Recall that all teachers (me included) think they are doing a pretty OK job right now.  To convince them otherwise, and to show that dramatic improvement is possible with a little change or a modest effort, is key to overcoming the inertia impediment.


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.


Definition of Formative Assessment
Graphical Depiction of Typical Learning Progression (Popham, 2011, p. 282)
The Four Levels of Formative Assessment (Popham, 2011, p. 287)

Popham, Chapter 8 Pondertime & Chapter 9 Pondertime, Due February 15, 2012

Chapter 8 Pondertime (p. 209, #1, #5)
1.  What do you personally consider to be the greatest single strength of performance assessment?  How about the greatest single weakness?

A performance assessment, rightly conceived and executed, is the closest you can get to verification that a student has experienced authentic learning.  That is its strength: it skips along the top of Bloom’s Taxonomy, exercising high-level skills ready to be applied to real problems, and ready to be built upon and extended.

The greatest single weakness is the artificial one, namely that “it takes longer to grade.”

But if all of high quality classroom interaction is performance assessment at a high level, does this weakness still carry any weight?

5.  Do you prefer holistic or analytic scoring of students’ responses to performance tests?  And, pray tell, why?

Analytic scoring, I think, combines two important ingredients: uniformity of grading and detailed feedback.  I just can’t help but think that holistic scoring is a cop-out.  How can I look students in the eye and tell them that one number sums up their X hours of work?  I like the following approach:  “Some classroom teachers have attempted to garner the best of both worlds by scoring all responses holistically, then analytically rescoring (for feedback purposes) all responses of low-performing students” (Popham, 2011, p. 197).

Chapter 9 Pondertime (p. 227, #2, #4)
2.  Three purposes of portfolio assessment were described in the chapter:  documentation of student progress, showcasing student accomplishments, and evaluation of student status.  Which of these three purposes do you believe to be most meritorious?  And, of course, why?

First, I should say that my school uses portfolios extensively.  Second, I should also say that we are currently in our Spring Exhibition cycle in which portfolios play a key role.  So, this is a timely chapter and an extremely relevant question.

I believe the use of portfolios that has the most merit is documenting student progress, in part because I really like students having to self-evaluate, and I don’t see that as an option when portfolios are used only for showcase work or purely for evaluation of student status.

I also like that a portfolio that is a living document has more value throughout an educational cycle.  For example, I foresee a portfolio being used on a daily basis, with content culled or archived from the main portfolio as milestone assessments are reached and the best work from a given cycle is carried forward.

So, in a sense, a portfolio documenting student progress is almost a formal superset of both a portfolio that only holds showcase items and one used only to evaluate student status.

4.  If it is true that portfolios need to be personalized for particular students, is it possible to devise one-size-fits-all criteria for evaluating classroom portfolio work products?  Why or why not?

I believe it is possible to create effective general criteria for evaluating classroom portfolio work, even for a diverse population of students.  The criteria could be effective for helping make inferences about student learning and still be relatively simple.  This is most effectively done by making students aware of the criteria that will be used, so that they can react against those criteria in the self-evaluation of their portfolios.  It is in that interaction that the criteria are most effectively refined or broadened to take into account the personalized learning that is taking place.


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.


I liked this graphic as it helps inform a decision to use or not use performance-type assessments.

Evaluative Criteria for Performance-Test Tasks. (Popham, 2011, p. 194)


Sometimes I wish we could select the two questions we wanted to answer from each chapter’s Pondertime®.

Differences in Assessment Outcomes between Portfolios and Standardized Testing Practices. (Popham, 2011, p. 212)


Popham, Chapter 5 Pondertime (p 135-136, #1, #5, #6, #7), Due 2/1/2012

Chapter 5 Pondertime (p. 135-136, #1, #5, #6, #7)
1.  If you were asked to support a high school graduation test you knew would result in more minority than majority youngsters being denied a diploma, could you do it?  If so, under what circumstances?

As Popham (2011) states so eloquently on p. 115, disparate impact does not equal assessment bias.  I would in this case take a hard look at the graduation test and evaluate it for assessment bias using some of the tools suggested in this chapter.  I would also be very careful not to create a prophecy or suspicion, either in my own mind or in the minds of other faculty and staff, and especially of the students, such that it becomes self-fulfilling.  I would support the exam if I found it to be free from assessment bias, and then sign up for summer school duty to help the students who were denied a diploma get back on track!


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

5.  What is your view about how much effort a classroom teacher should devote to bias detection and bias elimination?

Popham (2011, p. 117) lists a few ways to reduce bias, including judgmental approaches:

- bias review panels,

- per-item absence-of-bias judgments, and

- an overall absence-of-bias judgment.

In addition to the judgmental approaches listed above there are also empirical approaches, but finally the author gives some practical tips on how a classroom teacher can remove bias.  While I agree that becoming sensitive to bias is a necessary start, I don’t believe it is sufficient.  I really appreciated the section “Parent Talk” (p. 121), which described how some relevant peer review of assessment tools can go a long way toward making for a more equitable classroom and assessment process, before a parent has to make an accusation to the contrary.  I also believe that occasional peer review could be quite helpful for bias detection and bias elimination, not only for new teachers but even for veterans, since the makeup of the classroom can change.  I would even go so far as to say that former students and even current students can help reduce bias due to “squareness,” i.e. the disconnect between generations that can sometimes make assessments difficult for lack of cultural (i.e. intergenerational) literacy.


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

6.  What kind of overall education assessment strategy can you think of that might make the testing of students with disabilities more fair?  How about LEP students?

Needless to say, this question is a daunting one.  The spectrum of students with disabilities and the spectrum of LEP students are both broad and perhaps even overlapping.

Popham discusses the history of ESEA and NCLB to give context to the current discussion, along with a little history of the IEP as a tool for measuring student progress.  I think the only assessment strategy that would make testing of both groups fair is an individualized one.  I am very sympathetic to the argument that an IEP allows for individualized goals: not a means to water down content requirements or lower standards, but a tool to describe the best ways to assess a particular student given his or her abilities.  I appreciate that as a motivation for discussing accommodations.

In particular, I took a closer look at the report from the CCSSO (Thompson, Morse, Sharpe & Hall, 2005).  I like that the discussion can be shifted to the accommodations necessary to help students do their best on assessments, and especially Popham’s suggestion that we ask the students themselves what accommodations they would most need.


[NOTE:  no points for Popham for using “mentally retarded”, a term which has been out of favor now for quite a few years.]

When it comes to ELL/LEP students, I like a similar approach centered on accommodations.  The conservative in me is reluctant to translate exams and materials so that every student can basically prolong their learning of English as a prerequisite of good citizenship in this country.  Schools are meant to be hothouses of growth, not insular enclaves which seek to create a world that does not exist in the broader society.  That said, I think assessments should continue to be in English, and students who need help should get individualized and focused training to help them get up to speed with their peers.


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Thompson, S.J., Morse, A.B., Sharpe, M. & Hall, S. (2005).  Accommodations Manual:  How to Select, Administer, and Evaluate Use of Accommodations for Instruction and Assessment of Students with Disabilities.  (2nd ed.).  Washington, DC:  Council of Chief State School Officers.

7.  Can bias in educational assessment devices be totally eliminated?  Why or why not?

I don’t believe you can ever totally eliminate bias, for a couple of reasons:
    1. the students change each year, which is to say
    2. at any given time you can’t know all the backgrounds of all your students, i.e. what they have or have not experienced.

Without perfect knowledge of what your students have experienced, or where they have come from, you may always trigger some memory or anxiety-producing experience that you hadn’t intended.

I like to think more about how you could reduce the triggers in an assessment through pure symbolic representations (math) or purely natural world representations (science).  I suppose you might say that the hard sciences are closer to being able to produce bias-free assessments, but when you start talking about the ethics of science, the chance for bias increases.  And, I can’t imagine effective math or science education without talking about the impact of math and science in the ethical realm. 

In the trivial case, I suppose you could have no bias in an educational assessment if you didn’t teach anything, and thus needed no assessments (beyond trivial ones).


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.


Popham, Chapter 3 Pondertime & Chapter 4 Pondertime, Due 1/25/2012

Chapter 3 Pondertime (p. 80, #3, #5)
3.  What kinds of educational assessment procedures do you think should definitely require assembly of reliability evidence?  Why?

Popham describes three types of reliability evidence:  stability, alternate form, and internal consistency (p. 62 ff.).  If we suppose that some assessment procedures are high stakes, then it seems logical to demand that those procedures be reliable.  In other words, we want to have confidence in high-stakes decisions; therefore the instruments we are using should not vary in test-retest situations, should measure accurately no matter their particular form, and should not give mixed results at different points in the procedure.
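As a concrete example of the internal-consistency brand, KR-20 is one common formula for right/wrong (dichotomous) items.  Here is a minimal sketch; the response matrix is entirely hypothetical, just to show the arithmetic:

```python
# Hypothetical quiz data: rows are students, columns are items
# (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
]

def kr20(matrix):
    """Kuder-Richardson 20 internal-consistency estimate."""
    n_students = len(matrix)
    n_items = len(matrix[0])
    # per-item difficulty p (proportion correct) and its complement q = 1 - p
    p = [sum(row[i] for row in matrix) / n_students for i in range(n_items)]
    pq_sum = sum(pi * (1 - pi) for pi in p)
    # variance of total scores across students
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n_students
    var = sum((t - mean) ** 2 for t in totals) / n_students
    return (n_items / (n_items - 1)) * (1 - pq_sum / var)

print(f"KR-20 = {kr20(responses):.3f}")  # ≈ 0.462 for this made-up data
```

A value near 1.0 would suggest the items hang together tightly; a typical short classroom quiz lands well below that, which is part of Popham’s point about not over-engineering low-stakes tests.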

So what are high-stakes tests?  If I may be so bold (to co-opt some of Popham’s informal style), any test which groups a student, tracks a student, or determines some significant future course of action could be deemed a high-stakes test.

Conversely, am I saying that low-stakes assessments need not require any assembly of reliability evidence?  Yes, low-stakes assessments do not require reliability analyses, and that agrees with Popham’s recommendation (p. 75): “In general, if you construct your own classroom tests with care, those tests will be sufficiently reliable for the decisions you will base on the tests’ results.”


Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

5.  What is your reaction to classification consistency as an approach to the determination of reliability?

I would react with shock and surprise, had Popham (2011) not warned us already that “even those educators who know … sometimes mush the three brands of reliability together” (p. 73).  In checking some other articles, I found that there was an argument going back and forth in Educational Research in 2009 and 2010 on this very topic.  It seems that Newton (2009) claimed that, based on “internal consistency…a substantial percentage of students would receive different levels [scores] were the testing process to be replicated.”  At which point Bramley (2010) wrote back to “show that it is not possible to calculate classification accuracy from classification consistency.”

The argument Bramley (2010) uses to refute Newton basically reminds us that classification accuracy depends on uncertainties in the tested population, which can vary widely even when the questions are consistent.  Interestingly enough, once you admit that the two quantities are distinct, it is fair to ask how much they may differ from one another in practice.  Bramley’s (2010) last sentence reads, “The author’s experience with both simulated and real data suggests that values for classification accuracy and consistency are often quite close – within about 5 percentage points.”  Talk about a storm in a teacup!
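To convince myself, I sketched a tiny simulation (every parameter here is hypothetical: true scores around 70, a cut score of 60, 5 points of measurement error) that keeps the two quantities separate and then compares them:

```python
# Sketch: classification consistency vs. classification accuracy.
# Consistency: do two parallel forms give the same pass/fail call?
# Accuracy: does one form agree with the (unobservable) true call?
import random

random.seed(0)
CUT = 60.0                    # pass/fail cut score (hypothetical)
N = 100_000                   # simulated examinees
SD_TRUE, SD_ERR = 10.0, 5.0   # spread of true scores, measurement error

consistent = accurate = 0
for _ in range(N):
    true_score = random.gauss(70, SD_TRUE)
    form_a = true_score + random.gauss(0, SD_ERR)
    form_b = true_score + random.gauss(0, SD_ERR)
    if (form_a >= CUT) == (form_b >= CUT):
        consistent += 1
    if (form_a >= CUT) == (true_score >= CUT):
        accurate += 1

print(f"consistency: {consistent / N:.3f}")
print(f"accuracy:    {accurate / N:.3f}")
```

In runs like this the two numbers land within a few percentage points of each other, which squares with Bramley’s observation, even though neither can be derived from the other: accuracy needs the true scores, which in real testing we never see.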


Bramley, T. (2010). A response to an article published in Educational Research’s Special Issue on Assessment (June 2009). What can be inferred about classification accuracy from classification consistency? Educational Research. 52(3), 325-330. SPU EBSCO url.

Newton, P. E. (2009). The Reliability of Results from National Curriculum Testing in England. Educational Research, 51(2), 181-212. SPU EBSCO url.

Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Chapter 4 Pondertime (p. 109, #2, #5)

2. It was suggested that some measurement specialists regard all forms of validity evidence as construct-related validity evidence.  Do you agree?  If so, or if not, why?

I start with a quote that spoke to me recently from Mighton (2011), who believes “that some educators have been so seduced by the language they use that they can’t clearly see the issues anymore.”  With that as a caveat of sorts, on with my answer.

I’m sympathetic to the argument Popham (2011) makes on p. 102 that construct-related validity evidence is a more powerful concept, since it can adequately express the meaning of both content-related and criterion-related validity evidence.  The power of construct-related validity evidence lies in the use of empirical evidence and the definition of the construct.  Thus, criterion-related evidence of validity can be thought of as construct-related since it relies on a predictor construct, while content-related evidence can be thought of as construct-related since it uses empirical evidence (say, from specialists or others) to define an unobservable construct (content that is useful) and show that it has been suitably measured.



Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc.

Mighton, J. (2011). The End of Ignorance: Multiplying our Human Potential. Vintage Canada.

5.  What kind(s) of validity evidence do you think classroom teachers need to assemble regarding their classroom assessment devices?

Content-related evidence would show that assessments have good representativeness relative to curricular aims; and assuming the curricular aims were aligned with state/school standards, teachers could then be relatively sure that their inferences about each student’s success in the next unit or level (i.e. their grade) were valid.

I don’t see much value in criterion-related validity evidence for a classroom teacher, since it is mostly predictive.  That is to say, in the day-to-day operation of the classroom, I doubt that time should be spent predicting student performance on a criterion that will be evaluated at the next grade level at the earliest.  However, the statistician in me would love to see the correlations that could be built between a student’s performance on a summative assessment in grade X, Unit Y, and that student’s performance in grade X+1, Unit Z.  The amount of data which would need to be collected and then crunched and then compared would probably make for pretty expensive corroboration of the time-honored truths that “if you don’t do well at arithmetic, you won’t get algebra” and “if you don’t get algebra, you probably won’t get geometry or trigonometry or statistics, and don’t even think about calculus.”  Which is sad, because those subjects are all pretty different from each other, and learners can be quite diverse in their abilities or interests, which a test in arithmetic (basic math) can’t necessarily predict.*

As far as construct-related validity evidence goes, while it may be powerful as a concept, it is probably too much overhead for a classroom teacher to worry about in the day-to-day functioning of the classroom.  However, for as long as the debate over standardized tests is raging, I think a classroom teacher needs to be cognizant of the types of validity evidence and of what assumptions are being made by theorists and administrators that impact the functioning and procedures of day-to-day classrooms.

* I make this point based on some accounts in Mighton (2011), where he describes students he tutored in middle school who later went on to higher degrees in mathematics.

Mighton, J. (2011).  The End of Ignorance:  Multiplying our Human Potential.  Vintage Canada.


As I was reading this part of the chapter I created the following diagram to help me keep track of the difference between alignment and representativeness.


Popham, Chapter 1 Pondertime, Chapter 2 Pondertime, Due 1/11/2012


Chapter 1 Pondertime, pg. 26, #5

5.  Do you think the movement to discourage the use of the terms intelligence and aptitude is appropriate?  Why?

"We’ll look back on this as a dark age in education." So says Toronto playwright, math scholar and dabbler in the philosophy of education John Mighton. He’s also a math tutor who, with his non-profit organization, Junior Undiscovered Mathematical Prodigies (JUMP), has helped turn 1,000 one-time struggling kids into virtual math whizzes. Mighton, 45, attributes our pre-enlightened state to a school culture that neglects what he calls the "psychological aspect of learning" — that is, nurturing a child’s confidence and excitement about school. (Ferguson, 2003)

First, I would like to say that we don’t need another movement to change terminology.  Especially since, according to Popham (2011), “the tests…although they have been relabeled, haven’t really changed that much.”  What really needs to be discouraged is the false notion that some children simply cannot learn, though notion is almost not a strong enough word, since Mighton describes its impact as so widespread that it has cast a dark pall over the whole endeavor that is education.

Second, and to state a more positive action, I like being a proponent of change that starts spreading the sentiment that all children can learn mathematics, that all children can get to calculus, and that any failure to do so is a failure of the educator and not the student.  That seems to get at the root of the issue, but is of course, far harder and more far-reaching.

Come join me!  The challenge is to take prejudice couched in language like:

“Subitisation skill as a predictor of math ability”

and replace it with

“This student isn’t getting it right now, what are *you* going to do about it?”


Ferguson, S. (2003). The Math Motivator. (cover story). Maclean’s, 116(38), 20.

Popham, W.J. (2011). Classroom Assessment: What Teachers Need to Know. (6th ed.). Boston, MA: Pearson Education, Inc. p. 20.

Chapter 2 Pondertime, pg. 58, #1, #2, #3
1.  It was argued in the chapter that most classroom assessment tasks call for criterion-referenced approaches to measurement.  Do you agree?  Why or why not?

I like criterion-referenced approaches since, in day-to-day instruction, we are involved in imparting specific knowledge which is easily measured per student and per functional task or piece of information.  That assessment can be formative or summative, but in the end it must be criterion-referenced.  That is the only way the assessment will perform its most needed function, which is to be an immediate gauge of the effectiveness of instruction.

2.  Why should/shouldn’t classroom teachers simply teach toward the content standards isolated by national groups?  If teachers don’t, are they unpatriotic?

I assume the reference to patriotism is a thinly veiled reference to the performance of students in the USA relative to other countries when it comes to national mathematics exams.  If so, the assumption is that the content standards if adhered to for instruction would help students here do better.

Perhaps the reference to patriotism is merely that rank-and-file teachers should respect national authority.  Since when is blind allegiance to authority prized in this country?

Again, I would ask here what the stakes are.  If there is an exam at stake, then teach toward that exam; it may or may not agree with national content standards.  If you are averse to teaching to the exam, do that first anyway, and then, if your time or inclination allows, diversify your instruction with national content standards.

NOTE:  with the Common Core standards coming to WA, show your patriotism and get informed (webinars here: http://www.k12.wa.us/CoreStandards/updatesevents.aspx)

3.  If you discovered that your state’s educational ability tests (a) attempt[ed] to measure too many content standards, (b) are based on badly defined content standards, and (3 [c]) don’t supply teachers with per-standard results, how do you think such shortcomings would influence your classroom assessments?  How about your classroom instruction?

Let me be perfectly clear here.  As far as classroom assessments go, the gold standard is the test (assessment) so (a) Teach to the test.  (b) Teach to the test.  (3 [c]) Teach to the test.  When it comes to classroom instruction, I would endeavor to (a) Teach to the test.  (b) Teach to the test.  (3 [c]) Teach to the test.  And while you are at it, put the best minds in the country on designing the test (bias free, reliable, engaging).  This is the only way to ensure that everyone is being pulled to a higher standard.

My classroom assessment and instruction in all of these cases would be to teach to the test.  My Gedankenexperiment for (a) is as follows:  I as the teacher have limited time, the test is also administered in finite time, there thus cannot be “too many content standards” on the test, the test has as many content standards as it has.  Therefore, study the test (prior editions), learn how to assess like it is assessing, i.e. know how it is designed, how it is put together, how it is administered.  There is some statistical inference that needs to go on to decide what the most central content standards will be on the test, teach to those first, then try to cover the rest.  In all cases teach test-taking skills so that students don’t turn a cognitive assessment into an affective assessment.

Let’s think a moment about “badly defined content standards.”  That can really mean only a few different things:  the standard is either too narrow, too broad, or unrelated.  Here are examples:

1.  Students will be able to simplify 5x + 13 = 28. (too narrow)
2.  Students will be able to recognize and solve equations of one variable. (too broad?)
3.  Students will be able to walk & chew gum at the same time. (unrelated)

All it would take is one sample test, or gathering recollections of tests from students, to ascertain where each standard was lacking.  I also really do not think that too-narrow standards are the problem; otherwise teachers would teach the exact standards, the tests would cover those standards, and we wouldn’t be having this discussion!

The real problem must be that standards invariably fall somewhere between #2 and #3.  In that case we really have a subset of question (a), namely that there is too much being covered on exams relative to state standards.  But I counter that here again, information about the exam is crucial for determining where the instruction and assessment should tend.

I just don’t get what I consider to be the arrogance of a teacher saying they don’t teach to the standardized test.  Are they teaching something better?  Great, so teach enough that students ace the exams and then add your own flavor.  Or are they saying that the exams and standards are flawed?  Fine, design your own, but first cover the ones that exist; otherwise your students and parents will go elsewhere as they realize that you are more high-minded than the high-stakes exams that are bearing down on your students every day!
