We need to use numbers to evaluate our schools and our students. However, we need to understand what numbers can tell us and, more importantly, what they cannot tell us. Many academic measures exist, but the errors we make when reading them, and the many misleading conclusions we hear as a result, typically come from the same few incorrect notions.

• Unreasonable expectations - Most often, we simply expect numbers to mean more than they can. We believe that numbers contain information that they don't contain. This error is not intrinsic to the numbers themselves; it results from our desire to have a simple answer.
• Oversimplification - Any time we take a large amount of data and reduce it to a single number, we lose information. When we oversimplify information to a single number, we typically lose the very information we want to discuss.
• Accuracy & Precision Problems - Sometimes we cannot directly measure what we want to measure. This creates accuracy problems. Sometimes we cannot get the same result consistently from repeated measurements. This creates precision problems. Many testing factors contribute to both problems, including too many variables in a single measure, arbitrary scaling methods, arbitrary norming methods, and failure to make meaningful distinctions.

Below, we look at how these errors make most of the commonly reported numbers used to describe schools and students misleading to most people. We also suggest alternative measures that would be less misleading or would provide more information.

DRAFT: Last Updated: January 2011

Grades
Unreasonable Expectations: We want grades to tell us accurately how students have performed and to sort the students precisely by rank. But performance is made up of many vague variables, so grading always depends on arbitrary decisions. No single number can reasonably account for all those variables, and each classroom makes the arbitrary decisions differently.
Oversimplification: There is no mathematically correct way to reduce performance to a single meaningful number. Cognition, success, and achievement are too complex for that. Each test and each grading system defines 100% arbitrarily, and the definition is not founded in what can actually be achieved. It is intrinsically misleading to compare success to an arbitrary 100%. Most grading systems also fail to distinguish between high-level achievement and accelerated learning, which are distinctly different concepts.
Accuracy Problems: Grading systems involve many arbitrary choices. Many of those choices result from the demands of school administrators, politicians, and parents. As such, they reflect social pressures, not measures of academic success.
Alternative Measures: Separate out distinct aspects of performance: behavior, low-level skills, and high-level skills.

IQ & GPA
Unreasonable Expectations: We want a measure that tells us how well people will succeed in life and indicates how their mental strength will bring them success. But academic measures don't tell us a person's potential for success in business or relationships; they only tell us a person's potential for success in school and on tests. Many business leaders and political leaders were low academic performers. Many high achievers in school never rose to high levels in business. Success frequently has more to do with social skills and emotional fortitude than with either IQ or GPA.
Oversimplification: Human cognition has many dimensions. Potential for success depends on many social and cultural factors not included in cognitive measures. Because both cognition and success are made up of multiple independent dimensions, or factors, attempts to reduce them to a single dimension create meaningless, arbitrarily biased results.
Accuracy Problems: No consistent means of defining intellectual performance has ever been created. Neither intellectual nor academic performance can reasonably be reduced to a single number. GPA and IQ both depend on grouping information together using mathematically invalid methods. Cognition is not defined clearly enough to support a reliable measure of IQ. Success in classrooms is not defined consistently enough to support consistent scaling, and 100% cannot be defined consistently between classrooms. So the adding and averaging processes create meaningless results.
Alternative Measures: intelligence profile, temperament profile, goodness of fit, attitudes, and emotions.

Class Average Test Scores and Pass Rates
Unreasonable Expectations: We want a single number to report how well a class or even an entire school is doing. But cognition and academic success are both multifaceted. Achievement is not the same as test scores. Understanding is not the same as knowledge. Attitude and behavior usually matter more to success than academic background. Even worse, factors outside the schools have greater impacts on student success than teacher skill or curriculum. Studies repeatedly show that test scores correlate strongly with real estate values, and that test scores drop when the local economy slumps. Other studies have shown that student performance correlates strongly with nutrition, sleep, and noise levels. Before scores can be considered meaningful measures of school performance, they must be adjusted for these known factors.
Oversimplification: Reports usually tell us the average score or the pass rate. But the distribution of the scores tells us a lot more than either. A school may have a group of very high achievers even while it has a low average or pass rate. Another school might have a high pass rate while it still fails to support any students in reaching high-level achievement. When all the scores are averaged together, this information is lost and cannot be discerned from the result.
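To make the point concrete, here is a minimal sketch in Python. The two schools, their scores, the passing cutoff of 65, and the high-achievement threshold of 90 are all invented for illustration; they are assumptions, not data from any real school or test.

```python
from statistics import mean

PASS_MARK = 65   # assumed passing cutoff
HIGH_MARK = 90   # assumed threshold for high-level achievement

# Hypothetical scores, invented for illustration only.
school_a = [35, 40, 45, 50, 55, 95, 98, 100]  # a few very high achievers, many failing
school_b = [66, 67, 68, 69, 70, 70, 71, 72]   # everyone just above the cutoff

def summarize(scores):
    """Return the usual headline numbers plus a little of the distribution."""
    return {
        "average": mean(scores),
        "pass_rate": sum(s >= PASS_MARK for s in scores) / len(scores),
        "high_achievers": sum(s >= HIGH_MARK for s in scores),
        "lowest": min(scores),
        "highest": max(scores),
    }

for name, scores in [("School A", school_a), ("School B", school_b)]:
    print(name, summarize(scores))
```

On the headline numbers, the hypothetical School B looks better: its average is 69.125 against 64.75, and its pass rate is 100% against 37.5%. Yet School A has three students above the high-achievement threshold and School B has none. Only the extra fields, which describe the distribution, reveal this.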
Most testing is normed against (compared to) either large-group averages (norm-referenced) or arbitrary performance levels. Neither of these norming methods really tells us how well specific classes are doing. To evaluate learning, tests need to be growth-normed: the test must measure academic growth. Very few tests even attempt to measure growth. None do it well.
There is no simple means of keeping tests similar over the years. If the tests don't change, teachers will teach directly to the test and students will share answers, so the test results will no longer be valid. As the culture and the economy change, testing standards must change to match. Since the tests must change, the measure of the standards will change. Evidence suggests that high-stakes testing leads to lower standards.
Accuracy Problems: Many uncounted variables make the average test score or pass rate meaningless. When dropout rates rise, average test scores and pass rates rise also, even though this represents a decline in real school performance. Both cognition and curriculum are intrinsically nonlinear and multidimensional. Averages taken on nonlinear scales and multidimensional spaces are meaningless. Testing methods, particularly multiple-choice tests, result in false positives and false negatives. This lowers the precision of the test.
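The dropout effect is simple arithmetic, and a short Python sketch shows it. The scores and the passing cutoff of 65 are hypothetical, and the sketch assumes that the two lowest-scoring students are the ones who leave and that nobody else's score changes.

```python
from statistics import mean

PASS_MARK = 65  # assumed passing cutoff

def pass_rate(scores):
    """Fraction of the students tested who meet the cutoff."""
    return sum(s >= PASS_MARK for s in scores) / len(scores)

# Hypothetical class, invented for illustration only.
scores_before = [30, 40, 50, 60, 70, 80, 90]

# The two lowest scorers drop out; no remaining student learns anything new.
scores_after = sorted(scores_before)[2:]

print("before:", mean(scores_before), pass_rate(scores_before))
print("after: ", mean(scores_after), pass_rate(scores_after))
```

In this sketch the average rises from 60 to 70 and the pass rate rises from three in seven to three in five, although the school has lost two students and taught no one anything more.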
Alternative Measures: growth-normed assessment

Summation:
We have looked at just a few examples of how we let numbers mislead us. Typically, the numbers are not wrong; we simply expect the numbers to tell us more than they reasonably can. This usually results from accounting methods that oversimplify information into a single number, while washing out the information we really want to know.
Above, we gave just a few common examples. You should learn to ask the same questions of all reported numbers. How have I expected too much information from this number? How does this number oversimplify the concepts? What information about range and distribution was washed out in the averaging process? If you regularly ask these questions, you will be much harder to deceive with numbers.

Footnotes:

Testing Bias: All standardized tests are biased towards questions that can be answered quickly, and against large complex problems. But life's successes come from solving large complex problems, not from quick answers. Tests are also biased towards individual work, even though most life successes are derived from cooperation and communication. Most IQ tests, in addition to the biases already mentioned, are heavily biased towards verbal, spatial, and logical realms and against social, physical, and musical problems.

Accelerated learning vs. high level learning: A quick distinction is that high level accomplishments typically take more than one day. NCTM suggests that high school students should regularly solve problems that take a week. Tests made of questions that students can answer in two minutes or less address low level skills.

Good guessers vs. poor test takers (test anxiety): Some students are very skilled at test taking. They can guess the right answers even when they didn't really learn the material. Some are poor test takers. Even though they learn the material better than others, they do not get higher scores. Many factors contribute to poor test taking: anxiety, fatigue, not understanding the structure of the test, knowing the subject, but having trouble with the test language, etc.

Knowledge without understanding vs. understanding with minimal knowledge: We have all met people who are proud of their knowledge, or grades, even though they do not seem to really understand the material. We've also met people who can tell stories that demonstrate deep understanding even though their subject knowledge is extremely limited. Methods to test knowledge are different from the methods to test understanding.

Elementary learning skills: When a student starts elementary school, he has to learn how to function cooperatively and respectfully in group settings where the goals and needs of the group differ from his own desires. As he progresses through elementary school, he has to learn how to manage his own learning: taking notes, guiding his own studying, learning how to use resources, etc. If he has not learned how to manage his own learning by the end of sixth grade, he is limited to dependent learning styles. Regardless of his test scores, he is not a strong learner. Yet none of these things are scored on standardized tests. Few even can be. The tests do not measure the most important elements of learning.

Growth-normed assessment: If a student starts 8th grade reading at the 2nd grade level and ends 8th grade reading at the 5th grade level, he will have demonstrated 3 years' worth of growth in one year. Yet he will still fail the test, and both he and his teacher will be judged poorly. If another 8th grade student starts the year reading at a 9.0 level and ends the year reading at a 9.1 level, he will have demonstrated only one month's worth of growth in a whole year. Yet he will pass the test, and he and his teacher will be considered successful. Our current system of level testing judges both students and teachers without accounting for growth.
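The contrast between the two students can be written out as a small Python sketch. The function names are hypothetical, and the sketch assumes that passing means reading at or above the 8.0 grade level and that reading levels are grade equivalents in which 0.1 stands for one month.

```python
REQUIRED_LEVEL = 8.0  # assumption: passing means reading at or above grade level

def growth_in_years(start_level, end_level):
    """Growth over the year, in grade-equivalent years."""
    return round(end_level - start_level, 2)

def passes_level_test(end_level, required_level=REQUIRED_LEVEL):
    """What a level test reports: only where the student ended up."""
    return end_level >= required_level

# The two 8th grade students from the example above: (start level, end level).
students = {
    "first student": (2.0, 5.0),
    "second student": (9.0, 9.1),
}

for name, (start, end) in students.items():
    print(name,
          "| growth:", growth_in_years(start, end),
          "| passes:", passes_level_test(end))
```

Under these assumptions the first student shows 3.0 years of growth and fails, while the second shows 0.1 years of growth and passes. A level test reports only the second column; a growth-normed measure would report the first.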

Author's note: In college I was a member of a student club that required potential initiates to go through a probationary period in which they had to prove themselves. During that time, we created a point system to evaluate the performance of the potential new members. Before the point system was implemented, we discussed what the initiate was doing, how he was interacting with current members, and what specific issues we might have to deal with. After the point system was implemented, we discussed who did or didn't have enough points. We actually found ourselves having to reject an initiate with a high point score because he caused too many problems. I argued in a club meeting that with the point system we knew less about our initiates, not more. But I was the only one willing to abandon the point system.
Years later, I found myself working at a school that valued student-centered learning. We specialized in learning students' strengths and weaknesses, and helping students understand how their attitudes and group skills affected their learning. But we adopted a testing system. Administrators instructed us to understand students in terms of test scores. Test scores replaced strengths, weaknesses and attitudes in how we dealt with students. Our understanding of students decreased significantly. The top level of student accomplishment decreased also. Students became less cooperative and more dependent as learners. Once again, I argued that the scores replaced understanding. Once again, administrators refused to give up a point system, even though most teachers agreed that scores were not helpful.
I had imagined that as a numbers person I would spend my life encouraging people to use numbers. Instead, I've found myself encouraging people not to let numbers interfere with their understanding.