
Do Standardized Tests Improve Education in America?

History of Standardized Testing

Standardized tests have been a part of American education since the mid-1800s. Their use skyrocketed after 2002’s No Child Left Behind Act (NCLB) mandated annual testing in all 50 states. However, failures in the education system have been blamed on rising poverty levels, teacher quality, tenure policies, and, increasingly, on the pervasive use of standardized tests.

Standardized tests are defined as “any test that’s administered, scored, and interpreted in a standard, predetermined manner,” according to W. James Popham, former President of the American Educational Research Association. The tests often have multiple-choice questions that can be quickly graded by automated test scoring machines. Some tests also incorporate open-ended questions that require human grading.

Pro & Con Arguments

Pro 1: Standardized tests offer an objective measurement of education.

Teachers’ grading practices are naturally uneven and subjective. An A in one class may be a C in another, and teachers have conscious or unconscious biases, for a favorite student or against a rowdy student, for example. Standardized tests offer students a unified measure of their knowledge without these subjective differences. [56]

“At their core, standardized exams are designed to be objective measures. They assess students based on a similar set of questions, are given under nearly identical testing conditions, and are graded by a machine or blind reviewer. They are intended to provide an accurate, unfiltered measure of what a student knows,” says Aaron Churchill, Ohio Research Director for the Thomas B. Fordham Institute. [56]

States and local jurisdictions frequently employ psychometricians to ensure tests are fair across populations of students. Mark Moulon, CEO at Pythias Consulting and a psychometrician, offers an example: “What’s cool about psychometrics is that it will flag stuff that a human would never be able to notice. I remember a science test that had been developed in California and it asked about earthquakes. But the question was later used in a test that was administered in New England. When you try to analyze the New England kids with the California kids, you would get a differential item functioning flag because the California kids were all over the subject of earthquakes, and the kids in Vermont had no idea about earthquakes.” [57]

With problematic questions removed, or adapted for different populations of students, standardized tests offer the best objective measure of what students have learned, and schools can use that information to determine areas for improvement. As Bryan Nixon, former Head of School at Whitby, noted, “When we receive standardized test data at Whitby, we use it to evaluate the effectiveness of our education program. We view standardized testing data as not only another set of data points to assess student performance, but also as a means to help us reflect on our curriculum. When we look at Whitby’s assessment data, we can compare our students to their peers at other schools to determine what we’re doing well within our educational continuum and where we need to invest more time and resources.” [58]
Pro 2: Standardized tests help students in marginalized groups.

“If I don’t have testing data to make sure my child’s on the right track, I’m not able to intervene and say there is a problem and my child needs more. And the community can’t say this school is doing well, this teacher needs help to improve, or this system needs new leadership…. It’s really important to have a statewide test because of the income disparity that exists in our society. Black and Brown excellence is real, but… it is unfair to say that just by luck of birth that a child born in [a richer section of town] is somehow entitled to a higher-quality education… Testing is a tool for us to hold the system accountable to make sure our kids have what they need,” explains Keri Rodrigues, Co-founder of the National Parents Union. [59]

Advocates for marginalized groups of students, whether distinguished by race, learning disability, or other difference, can use testing data to prove a problem exists and to help solve it via more funding, program development, or other solutions. Civil rights education lawsuits, wherein a group sues a local or state government for better education, almost always use testing data. [61]

Sheryl Lazarus, Director of the National Center on Educational Outcomes at the University of Minnesota, states, “a real plus of these assessments is that… they have led to improvements in access to instruction for students with disabilities and English learners… Inclusion of students with disabilities and English learners in summative tests used for accountability allows us to measure how well the system is doing for these students, and then it is possible to fill in gaps in instructional opportunity.” [60]

A letter signed by 12 civil rights organizations, including the NAACP and the American Association of University Women, explains, “Data obtained through some standardized tests are particularly important to the civil rights community because they are the only available, consistent, and objective source of data about disparities in educational outcomes, even while vigilance is always required to ensure tests are not misused. These data are used to advocate for greater resource equity in schools and more fair treatment for students of color, low-income students, students with disabilities, and English learners… [W]e cannot fix what we cannot measure. And abolishing the tests or sabotaging the validity of their results only makes it harder to identify and fix the deep-seated problems in our schools.” [62]
Pro 3: Standardized test scores are good indicators of college and job success.

Standardized tests can promote and offer evidence of academic rigor, which is invaluable in college as well as in students’ careers. Matthew Pietrafetta, Founder of Academic Approach, argues that the “tests create gravitational pull toward higher achievement.” [65]

Elaine Riordan, senior communications professional at Actively Learn, states, “creating learning environments that lead to higher test scores is also likely to improve students’ long-term success in college and beyond… Recent research suggests that the competencies that the SAT, ACT, and other standardized tests are now evaluating are essential not just for students who will attend four-year colleges but also for those who participate in CTE [career and technical education] programs or choose to seek employment requiring associate degrees and certificates…. all of these students require the same level of academic mastery to be successful after high school graduation.” [66]

Standardized test scores have long been correlated with better college and life outcomes. As Dan Goldhaber, Director of the Center for Analysis of Longitudinal Data in Education Research, and Umut Özek, senior researcher at the American Institutes for Research, explain, “students who score one standard deviation higher on math tests at the end of high school have been shown to earn 12% more annually, or $3,600 for each year of work life in 2001.… Similarly… test scores are significantly correlated not only with educational attainment and labor market outcomes (employment, work experience, choice of occupation), but also with risky behavior (teenage pregnancy, smoking, participation in illegal activities).” [67]
Pro 4: Standardized tests are useful metrics for teacher evaluations.

While grades and other measures are useful for teacher evaluations, standardized tests provide a consistent measure across classrooms and schools. Individual school administrators, school districts, and the state can compare teachers using test scores to show how each teacher has helped students master core concepts. [63]

Timothy Hilton, a high school social studies teacher in South Central Los Angeles, states, “No self-respecting teacher would use a single student grade on a single assignment as a final grade for the entirety of a course, so why would we rely on one source of information in the determination of a teacher’s overall quality? The more data that can be provided, the more accurate the teacher evaluation decisions will end up being. Teacher evaluations should incorporate as many pieces of data as possible. Administration observation, student surveys, student test scores, professional portfolios, and on and on. The more data that is used, the more accurate the picture it will paint.” [64]
Con 1: Standardized tests only determine which students are good at taking tests.

Standardized test scores are easily influenced by outside factors: stress, hunger, tiredness, and prior teacher or parent comments about the difficulty of the test, among others. In short, the tests show only which students are best at preparing for and taking them, not what knowledge students might exhibit if their stomachs weren’t empty or they’d had a good night’s sleep. [68] [69]

Further, students are tested on grade-appropriate material, but they are not re-tested to determine whether they have learned information they tested poorly on the year before. Instead, as Steve Martinez, Superintendent of Twin Rivers Unified in California, and Rick Miller, Executive Director of CORE Districts, note, each “state currently reports yearly change, by comparing the scores of this year’s students against the scores of last year’s students who were in the same grade. Even though educators, parents and policymakers might think change signals impact, it says much more about the change in who the students are because it is not measuring the growth of the same student from one year to the next.” And, because each state develops its own tests, standardized tests are not necessarily comparable across state lines, leaving nationwide statistics shaky at best. [69] [71] [72]

Brandon Busteed, Executive Director of Education & Workforce Development at the time of the quote, stated, “Despite an increased focus on standardized testing, U.S. results in international comparisons show we have made no significant improvement over the past 20 years…. The U.S. most recently ranked 23rd, 39th and 25th in reading, math and science, respectively. The last time Americans celebrated being 23rd, 39th and 25th in anything was … well, never. Our focus on standardized testing hasn’t helped us improve our results!” [73] Busteed asks, “What if our overreliance on standardized testing has actually inhibited our ability to help students succeed and achieve in a multitude of other dimensions? For example, how effective are schools at identifying and educating students with high entrepreneurial talent? Or at training students to apply creative thinking to solve messy and complex issues with no easy answers?” [73]
Con 2: Standardized tests are racist, classist, and sexist.

American standardized tests trace back to those created by psychologist Carl Brigham, PhD, for the Army during World War I; his Army test was later adapted to become the SAT. The Army tests were created specifically to segregate soldiers by race, because at the time science inaccurately linked intelligence and race. [74]

Racial bias has not been stripped from standardized tests. “Too often, test designers rely on questions which assume background knowledge more often held by White, middle-class students. It’s not just that the designers have unconscious racial bias; the standardized testing industry depends on these kinds of biased questions in order to create a wide range of scores,” explains Young Whan Choi, Manager of Performance Assessments at Oakland Unified School District in Oakland, California. He offers an example from his own 10th grade class: “a student called me over with a question. With a puzzled look, she pointed to the prompt asking students to write about the qualities of someone who would deserve a ‘key to the city.’ Many of my students, nearly all of whom qualified for free and reduced lunch, were not familiar with the idea of a ‘key to the city.’” [76]

Wealthy kids, who would be more familiar with a “key to the city,” tend to have higher standardized test scores due to differences in brain development caused by factors such as “access to enriching educational resources, and… exposure to spoken language and vocabulary early in life.” Plus, as Eloy Ortiz Oakley, Chancellor of California Community Colleges, points out, “Many well-resourced students have far greater access to test preparation, tutoring and taking the test multiple times, opportunities not afforded the less affluent…. [T]hese admissions tests are a better measure of students’ family background and economic status than of their ability to succeed.” [77] [78]

Journalist and teacher Carly Berwick explains, “All students do not do equally well on multiple choice tests, however. Girls tend to do less well than boys and [girls] perform better on questions with open-ended answers, according to a [Stanford University] study, …which found that test format alone accounts for 25 percent of the gender difference in performance in both reading and math. Researchers hypothesize that one explanation for the gender difference on high-stakes tests is risk aversion, meaning girls tend to guess less.” [68]
Con 3: Standardized test scores are not predictors of future success.

At best, standardized tests can only evaluate rote knowledge of math, science, and English. They do not evaluate creativity, problem solving, critical thinking, artistic ability, or other knowledge areas that cannot be judged by scoring a sheet of bubbles filled in with a pencil.

Grade point averages (GPAs) are a five times stronger indicator of college success than standardized test scores, according to a study of 55,084 Chicago public school students. One of the authors, Elaine M. Allensworth, Lewis-Sebring Director of the University of Chicago Consortium, states, “GPAs measure a very wide variety of skills and behaviors that are needed for success in college, where students will encounter widely varying content and expectations. In contrast, standardized tests measure only a small set of the skills that students need to succeed in college, and students can prepare for these tests in narrow ways that may not translate into better preparation to succeed in college.” [83]

“Earning good grades requires consistent behaviors over time—showing up to class and participating, turning in assignments, taking quizzes, etc.—whereas students could in theory do well on a test even if they do not have the motivation and perseverance needed to achieve good grades. It seems likely that the kinds of habits high school grades capture are more relevant for success in college than a score from a single test,” explains Matthew M. Chingos, Vice President of Education Data and Policy at the Urban Institute. [84]
Con 4: Standardized tests are unfair metrics for teacher evaluations.

As W. James Popham, former President of the American Educational Research Association, notes, “standardized achievement tests should not be used to determine the effectiveness of a state, a district, a school, or a teacher. There’s almost certain to be a significant mismatch between what’s taught and what’s tested.” [81]

“An assistant superintendent… pointed out that in one of my four kindergarten classes, the student scores were noticeably lower, while in another, the students were outperforming the other three classes. He recommended that I have the teacher whose class had scored much lower work directly with the teacher who seemed to know how to get higher scores from her students. Seems reasonable, right? But here was the problem: The ‘underperforming’ kindergarten teacher and the ‘high-performing’ teacher were one and the same person,” explains Margaret Pastor, Principal of Stedwick Elementary School in Maryland. [82]

As a result, 27 states and D.C. have stopped using standardized tests in teacher evaluations. [79] [80] [88]
Did You Know?
1. The earliest known standardized tests were administered to government job applicants in 7th-century Imperial China.
2. The Kansas Silent Reading Test (1914–1915) is the earliest known published multiple-choice test, developed by Frederick J. Kelly, a Kansas school director.
3. In 1934, International Business Machines Corporation (IBM) hired a teacher and inventor named Reynold B. Johnson (best known for creating the world’s first commercial computer disk drive) to create a production model of his prototype test scoring machine.
4. The current use of No. 2 pencils on standardized tests is a holdover from the 1930s through the 1960s, when scanning machines scored answer sheets by detecting the electrical conductivity of graphite pencil marks.
5. In 2020, states were allowed to cancel standardized testing due to the COVID-19 (coronavirus) pandemic.


Standardized tests aren’t the problem, it’s how we use them

Andre M. Perry, Senior Fellow, Brookings Metro (@andreperryedu)

March 30, 2021

This piece originally appeared in The Hechinger Report; the version below has been lightly edited for style.

Education Secretary Miguel Cardona is refusing to back down on a federal requirement that states must administer standardized tests this year, although a letter to state leaders from the Department of Education last month said that states will have flexibility on how to apply results. States concerned about the safety of administering a test during a pandemic may implement shortened versions of assessments.

This relief from the hammer of accountability, if not from the tests themselves, has gotten a mixed reception from anti-testing advocates, school leaders, and teachers who are still trying to ready schools for face-to-face learning. They’re right: Greater accountability and standardized testing won’t give students the technology they need, give teachers the necessary PPE to stay safe, or give families the income to better house and feed themselves during the pandemic so that kids can focus on learning. And if there was ever a time to see how misguided our accountability systems are in relation to addressing root causes of achievement disparities, it’s now.

On its face, relieving students, teachers, and families from the grip of test-based accountability makes sense. We know student achievement, particularly in low-income schools and districts, will dip due to circumstances related to the pandemic and social distancing. We know the source of the decline.

And we currently use standardized tests well beyond what they were designed to do, which is to measure a few areas of academic achievement. Achievement tests were not designed for the purposes of promoting or grading students, evaluating teachers, or evaluating schools. In fact, connecting these social functions to achievement test data corrupts what the tests are measuring. In statistics, this is called Campbell’s Law. When a score has been connected to a teacher’s pay or job status, educators will inevitably be drawn toward teaching to the test, and schools toward hiring to the test and paying to the test, rather than making sure students get the well-rounded education they need and deserve.

However, there is still a role for testing and assessment. We need to know the full extent of the damage from the last 12 months beyond the impact on academics. For one, the federal government should have states take a roll call to see who hasn’t been in school. The government must also assess families’ technological needs if it is to properly support the states financially. In other words, states should be using multiple assessments to address the range of needs of students and their teachers. This is what the focus of academic and non-academic assessment should have always been, not a means to punish the people who are dealing with conditions that erode the quality of an education.

As many have said in different contexts, the pandemic exposed existing structural inequalities that are driving racial disparities. This is as true in education as it is in other sectors. Limited broadband and computer access, home and food insecurity, deferred maintenance on buildings, uneven employment benefits among non-teaching school staff, and fewer resources for schools that serve children of color were throttling academic achievement before the pandemic. They will certainly widen achievement gaps during and after.

As a condition for receiving a waiver, Cardona is requiring states to report on the number of chronically absent students and students’ access to computers and high-speed internet, a request that raised the ire of some Republican lawmakers. Sen. Richard Burr (R-N.C.) and Rep. Virginia Foxx (R-N.C.) objected in a March 25 letter that the requirements for information on chronic absenteeism and access technologies as conditions are “not permitted under ESEA as amended by ESSA.” The letter continued: “They are both outside the scope of what states are seeking to be waived and violate specific prohibitions on the Secretary requiring states to report new data beyond existing reporting requirements.”

Cardona is right in his effort to use tests properly. Gathering information is essential if we really care about closing gaps in educational opportunity and achievement. Information shines light on structural problems. When the effects of structural problems on student learning are ignored, teachers and school boards are blamed for any deficiencies in student performance. Racism ends up pointing a finger at Black education leaders, teachers, and kids for disparities that result from systemic racism.

This is why we should rethink how we use tests in the future.

States have historically found ways to starve majority-Black and -Brown districts of the resources they need to thrive. Let’s be clear: We need to hold racist policies and practices accountable.

Segregation and school financing systems that reinforce segregated housing arrangements reflect the application of racist attitudes about Black people and communities that show up in outcomes. And since No Child Left Behind ushered in an era of accountability in 2001, those accountability systems have largely failed to address those sources of inequality. Black districts in particular have felt as much pain from testing as from the negative conditions that surround schooling. School and district takeovers, mass firings, and the imposition of charter schools have not been applied fairly or evenly because testing didn’t identify the real problems.

Amid a pandemic, testing is a necessary inconvenience to help us understand how we can better address structural racism and other root causes of academic disparities. But if tests aren’t used as a way to support Black districts, students, and families by leading to solutions for structural inequities, then they will only facilitate the epidemic of racism that existed before the pandemic.


Effects of Standardized Testing on Students & Teachers: Key Benefits & Challenges


The use of standardized testing to measure academic achievement in US schools has fueled debate for nearly two decades. Understanding the effects of standardized testing—its key benefits and challenges—requires a closer examination of what standardized testing is and how it’s used in academic settings.

Developing ways to effectively and fairly measure academic achievement is an ongoing challenge for school administrators. For those inspired to promote greater equity in education, American University’s online Doctor of Education (EdD) in Education Policy and Leadership provides the knowledge and training to address such challenges.

What Are Standardized Tests?

Standardized tests are examinations administered and scored in a predetermined, standard manner. They typically rely heavily on question formats, such as multiple choice and true or false, that can be automatically scored. Though not limited to academic settings, standardized tests are widely used to measure academic aptitude and achievement.

The ACT and SAT, standardized tests used broadly for college admissions, assess students’ current educational development and their aptitude for completing college-level work. Standardized academic achievement tests are mandatory in primary and secondary schools in the US, where they’re designed and administered at the state or local level and used to assess requirements for federal education funding.

Standardized testing requirements are designed to hold teachers, students, and schools accountable for academic achievement and to incentivize improvement. They provide a benchmark for assessing problems and measuring progress, highlighting areas for improvement.

Despite these key benefits, standardized academic achievement tests in US public schools have been controversial since their inception. Major points of contention have centered on who should design and administer tests (federal, state, or district level), how often they should be given, and whether they place some school districts at an advantage or disadvantage. More critically, parents and educators have questioned whether standardized tests are fair to teachers and students.

Effects of Standardized Testing on Students

Some of the challenging potential effects of standardized testing on students are as follows:

  • Standardized test scores are often tied to important outcomes, such as graduation and school funding. Such high-stakes testing can place undue stress on students and affect their performance.
  • Standardized tests fail to account for students who learn and demonstrate academic proficiency in different ways. For example, a student who struggles to answer a multiple-choice question about grammar or punctuation may be an excellent writer.
  • By placing emphasis on reading, writing, and mathematics, standardized tests have devalued instruction in areas such as the arts, history, and electives.
  • Standardized tests are thought to be fair because every student takes the same test and evaluations are largely objective, but a one-size-fits-all approach to testing is arguably biased because it fails to account for variables such as language deficiencies, learning disabilities, difficult home lives, or varying knowledge of US cultural conventions.

Effects of Standardized Testing on Teachers

Teachers as well as students can be challenged by the effects of standardized testing. Common issues include the following:

  • The need to meet specific testing standards pressures teachers to “teach to the test” rather than providing a broad curriculum.
  • Teachers have expressed frustration about the time it takes to prepare for and administer tests.
  • Teachers may feel excessive pressure from their schools and administrators to improve their standardized test scores.
  • Standardized tests measure achievement against goals rather than measuring progress.
  • Achievement test scores are commonly assumed to have a strong correlation with teaching effectiveness, a tendency that can place unfair blame on good teachers if scores are low and obscure teaching deficiencies if scores are high.

Alternative Achievement Assessments

Critics of standardized testing often point to various forms of performance-based assessments as preferable alternatives. Known by various names (proficiency-based, competency-based), they require students to produce work that demonstrates high-level thinking and real-world applications. Examples include an experiment illustrating understanding of a scientific concept, group work that addresses complex problems and requires discussion and presentation, or essays that include analysis of a topic.

Portfolio-based assessments emphasize the process of learning over letter grades and normative performance. Portfolios can be made up of physical documents or digital collections. They can include written assignments, completed tests, honors and awards, art and graphic work, lab reports, or other documents that demonstrate either progress or achievement. Portfolios can provide students with an opportunity to choose work they wish to reflect on and present.

Performance-based assessments aren’t a practical alternative to standardized tests, but they offer a different way of evaluating knowledge that can provide a more complete picture of student achievement. Determining which systems of evaluation work best in specific circumstances is an ongoing challenge for education administrators.

Work for Better Student Outcomes with a Doctorate in Education

Addressing the most critical challenges facing educators, including fair and accurate assessment of academic achievement, requires administrators with exceptional leadership and policy expertise. Discover how the online EdD in Education Policy and Leadership at American University prepares educators to create equitable learning environments and effect positive change.


What Does the Research Say About Testing?

There’s too much testing in schools, most teachers agree, but well-designed classroom tests and quizzes can improve student recall and retention.

For many teachers, the image of students sitting in silence filling out bubbles, computing mathematical equations, or writing timed essays causes an intensely negative reaction.

Since the passage of the No Child Left Behind Act (NCLB) in 2002 and its 2015 update, the Every Student Succeeds Act (ESSA), every third through eighth grader in U.S. public schools now takes tests calibrated to state standards, with the aggregate results made public. In a study of the nation’s largest urban school districts, students took an average of 112 standardized tests between pre-K and grade 12.

This annual testing ritual can take time away from genuine learning, say many educators, and it puts pressure on the least advantaged districts to focus on test prep—not to mention adding airless, stultifying hours of proctoring to teachers’ lives. “Tests don’t explicitly teach anything. Teachers do,” writes Jose Vilson, a middle school math teacher in New York City. Instead of standardized tests, students “should have tests created by teachers with the goal of learning more about the students’ abilities and interests,” echoes Meena Negandhi, math coordinator at the French American Academy in Jersey City, New Jersey.

The pushback on high-stakes testing has also accelerated a national conversation about how students truly learn and retain information. Over the past decade and a half, educators have been moving away from traditional testing—particularly multiple choice tests—and turning to hands-on projects and competency-based assessments that focus on goals such as critical thinking and mastery rather than rote memorization.

But educators shouldn’t give up on traditional classroom tests so quickly. Research has found that tests can be valuable tools to help students learn, if designed and administered with format, timing, and content in mind—and a clear purpose to improve student learning.

Not All Tests Are Bad

One of the most useful kinds of tests is also the least time-consuming: quick, easy practice quizzes on recently taught content. Tests can be especially beneficial if they are given frequently and provide near-immediate feedback to help students improve. This retrieval practice can be as simple as asking students to write down two to four facts from the prior day or giving them a brief quiz on a previous class lesson.

Retrieval practice works because it helps students retain information better than simply studying the material does, according to research. While reviewing concepts can help students become more familiar with a topic, information is quickly forgotten without more active learning strategies like frequent practice quizzes.

But to reduce anxiety and stereotype threat—the fear of conforming to a negative stereotype about a group that one belongs to—retrieval-type practice tests also need to be low-stakes (with minor to no grades) and administered up to three times before a final summative effort to be most effective.

Timing also matters. Students are able to do fine on high-stakes assessment tests if they take them shortly after they study. But a week or more after studying, students retain much less information and will do much worse on major assessments—especially if they’ve had no practice tests in between.

A 2006 study found that students who had brief retrieval tests before a high-stakes test remembered 60 percent of material, while those who only studied remembered 40 percent. Additionally, in a 2009 study, eighth graders who took a practice test halfway through the year remembered 10 percent more facts on a U.S. history final at the end of the year than peers who studied but took no practice test.

Short, low-stakes tests also help teachers gauge how well students understand the material and what they need to reteach. This is effective when tests are formative—that is, designed for immediate feedback so that students and teachers can see students’ areas of strength and weakness and address areas for growth. Summative tests, such as a final exam that measures how much was learned but offers no opportunities for a student to improve, have been found to be less effective.

Testing Format Matters

Teachers should tread carefully with test design, however, as not all tests help students retain information. Though multiple choice tests are relatively easy to create, they can contain misleading answer choices that are ambiguous or vague, or they can offer the infamous all-, some-, or none-of-the-above choices, which tend to encourage guessing.


While educators often rely on open-ended questions, such as short-answer questions, because they seem to offer a genuine window into student thinking, research shows that there is no difference between multiple choice and constructed-response questions in terms of demonstrating what students have learned.

In the end, well-constructed multiple choice tests, with clear questions and plausible answers (and no all- or none-of-the-above choices), can be a useful way to assess students’ understanding of material, particularly if the answers are quickly reviewed by the teacher.

All students do not do equally well on multiple choice tests, however. Girls tend to do less well than boys on multiple choice questions and perform better on questions with open-ended answers, according to a 2018 study by Stanford University’s Sean Reardon, which found that test format alone accounts for 25 percent of the gender difference in performance in both reading and math. Researchers hypothesize that one explanation for the gender difference on high-stakes tests is risk aversion, meaning girls tend to guess less.

Giving more time for fewer, more complex or richer testing questions can also increase performance, in part because it reduces anxiety. Research shows that simply introducing a time limit on a test can cause students to experience stress, so instead of emphasizing speed, teachers should encourage students to think deeply about the problems they’re solving.

Setting the Right Testing Conditions

Test achievement often reflects outside conditions, and how students do on tests can be shifted substantially by comments they hear and what they receive as feedback from teachers.

When teachers tell disadvantaged high school students that an upcoming assessment may be a challenge and that challenge helps the brain grow, students persist more, leading to higher grades, according to 2015 research from Stanford professor David Paunesku. Conversely, simply saying that some students are good at a task, without including a growth-mindset message, or saying that it’s because they are smart, harms children’s performance—even when the task is as simple as drawing shapes.

Also harmful to student motivation are data walls displaying student scores or assessments. While data walls might be useful for educators, a 2014 study found that displaying them in classrooms led students to compare status rather than improve work.

The most positive impact on testing comes from peer or instructor comments that give the student the ability to revise or correct. For example, questions like “Can you tell me more about what you mean?” or “Can you find evidence for that?” can encourage students to improve engagement with their work. Perhaps not surprisingly, students do well when given multiple chances to learn and improve—and when they’re encouraged to believe that they can.

J Chiropr Educ, v.33(2), October 2019

A primer on standardized testing: History, measurement, classical test theory, item response theory, and equating

Objective: This article presents health science educators and researchers with an overview of standardized testing in educational measurement. The history, theoretical frameworks of classical test theory, item response theory (IRT), and the most common IRT models used in modern testing are presented.

Methods: A narrative overview of the history, theoretical concepts, test theory, and IRT is provided to familiarize the reader with these concepts of modern testing. Examples of data analyses using different models are shown using 2 simulated data sets. One set consisted of a sample of 2000 item responses to 40 multiple-choice, dichotomously scored items; this set was used to fit 1-parameter logistic (1PL), 2-parameter (2PL), and 3-parameter (3PL) IRT models. The other data set was a sample of 1500 item responses to 10 polytomously scored items and was used to fit a graded response model. (A minimal simulation sketch of such data follows the abstract.)

Results: Model-based item parameter estimates for the 1PL, 2PL, and 3PL models and the graded response model are presented, evaluated, and explained.

Conclusion: This study provides health science educators and education researchers with an introduction to educational measurement. The history of standardized testing, the frameworks of classical test theory and IRT, and the logic of scaling and equating are presented. This introductory article will aid readers in understanding these concepts.
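The dichotomous data described in the abstract can be made concrete with a short simulation. Below is a minimal Python sketch, not the article's actual code, that generates a response matrix of the same shape (2000 examinees by 40 items) under the 3PL model; all parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3PL parameters (not the article's simulated values):
# a = discrimination, b = difficulty, c = lower asymptote ("guessing").
n_examinees, n_items = 2000, 40
a = rng.lognormal(0.0, 0.3, n_items)        # discriminations near 1
b = rng.normal(0.0, 1.0, n_items)           # difficulties
c = rng.uniform(0.10, 0.25, n_items)        # guessing floor
theta = rng.normal(0.0, 1.0, n_examinees)   # latent abilities

# 3PL item response function: P(X=1 | theta) = c + (1 - c) / (1 + exp(-a(theta - b)))
logit = a * (theta[:, None] - b)            # 2000 x 40 matrix of a_j(theta_i - b_j)
p = c + (1 - c) / (1 + np.exp(-logit))

# Draw dichotomous (0/1) responses; fixing c = 0 gives the 2PL model,
# and additionally fixing a = 1 gives the 1PL (Rasch) model.
responses = (rng.uniform(size=p.shape) < p).astype(int)
print(responses.shape, responses.mean())
```

Fitting such data to recover the item parameters is typically done with specialized IRT software rather than by hand.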

INTRODUCTION

In the 20th century, the concept of public protection dictated implementation of licensing laws to those professions having a direct relationship to public health and safety. 1 A plethora of discipline-specific prelicensure standardized assessment instruments (tests) exists to ensure compliance with the disciplinary standards. In the chiropractic profession, every year thousands of students take the prelicensure Part I, II, III, and IV examinations of the National Board of Chiropractic Examiners. As with any examination, some students feel that these standardized tests are unfair and have little relevance to clinical practice. Even faculty members often understand little about the boards. This article aims to provide an introduction to the world of standardized assessment not only for chiropractic educators but also for any health sciences educator or educational researcher.

OVERVIEW AND SIMULATED ANALYSES

History of Standardized Testing

The early history of standardized testing goes back several centuries. In the 3rd century BCE in imperial China, to qualify for civil service, Chinese aristocrats were examined for their proficiency in music, archery, horsemanship, calligraphy, arithmetic, and ceremonial knowledge. Later, the examinations tested knowledge of civil law, military affairs, agriculture, geography, composition, and poetry. 2 , 3 Those who passed these exams were qualified to serve the Chinese emperor and his family. The exams were accompanied by an atmosphere of solemnity and attention to the young nobles who dared to be scrutinized for the prestigious positions. The topics of the exams were frequently provided by the emperor, and he often examined the applicants during the final stage of the competition.

In the late 1880s, Francis Galton was inspired by the work of his cousin, Charles Darwin, regarding the origin of species and became interested in the hereditary basis of intelligence and the measurement of human ability. Galton developed the theoretical bases of testing—the application of a series of identical tests to a large number of individuals and the statistical processing of the results. 4 In 1904, Alfred Binet, a Parisian with a doctorate in experimental psychology, was commissioned by the French ministry of education to study schoolchildren who were developmentally behind their peers. His task was to develop a method to identify children who were not benefiting from inclusion in regular classrooms and required special education. 5 For this purpose, Binet and his associate, Theodore Simon, designed and administered a 30-item instrument arranged by difficulty that tested ability for judgment, understanding, and reasoning. 1

The field of testing developed rapidly during World War I (1914–1918), when the problem of professional selection for the needs of the army and military production became a priority. During that time, leading psychologists organized the Army Alpha Examination to test army recruits. 6 Their success further inspired psychologists to advocate for civilian testing. During the 20th century, large-scale assessment in the United States became a necessity for college admissions and school accountability. The reliance on standardized tests for college admission was a response to the increasing number of students applying to colleges, and it became a tool to tighten the gates in the face of limited resources. 7

In the 21st century, standardized tests constitute an inseparable part of American culture. Assessment instruments are administered in a wide range of settings: K–12, college admission, academic progression, professional licensure, clinical credentialing, industrial, forensic, and many more. “Gatekeepers of America's meritocracy—educators, academic institutions, and employers—have used test scores to label people as bright or not bright, as worthy academically or not worthy.” 8 The study of measurement processes and the methods used to produce scores in testing evolved into a specialized discipline—psychometrics, a combination of education, psychology, and statistics. 9

Critique of Standardized Tests

As the use of standardized tests for high-stakes exams increased, so did the critique of their use. 10 Counsell 11 conducted a case study exploring the effect of the high-stakes accountability system on the lives of students and teachers. The findings revealed that the culture of testing introduces a continuum of fear and ethical and moral dilemmas related to the pressure experienced by instructors when schools use test scores as a measure of accountability. Often, instructors decontextualize the material for students with the intention of artificially inflating test scores. 12 Such a phenomenon is known to researchers as “teaching to the test” and is often controlled for by psychometric procedures. 13

Kohn 14 claimed that admission tests (such as the SAT and ACT) are “not very effective as predictors of future academic performance, even in the freshman year of college, much less as predictors of professional success.” Zwick and Himelfarb 15 predicted 1st-year undergraduate grade-point average (FYGPA) in 34 colleges from high school GPA (HSGPA) and SAT scores using linear regression models. The average R² for these regression models was .226 (this coefficient indicates the amount of variance in the regression outcome explained by the linear combination of the predictors). However, in most of the models, the HSGPA was the predictor that accounted for the majority of variance. Zwick and Himelfarb stated, “The only substantial increase in R² values occurred when SAT scores are added to a prediction equation that included self-reported HSGPA.”

Furthermore, the study highlighted the overprediction (the predicted outcomes were higher than actual) of FYGPA for African American and Latino students and the underprediction (the predicted outcomes were lower than actual) for Caucasian and Asian students when high school grades and SAT scores were used. Zwick and Himelfarb concluded that these errors in prediction were partially attributed to high school socioeconomic status—African American and Latino students are more likely than Caucasian students to attend high schools with fewer resources.
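As a hedged illustration of what an R² like .226 means mechanically, the sketch below fits the same kind of two-predictor linear regression on synthetic data; the distributions and coefficients are invented for the example and are not Zwick and Himelfarb's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000

# Invented predictor distributions: high school GPA and SAT total score.
hsgpa = rng.uniform(2.0, 4.0, n)
sat = rng.normal(1050.0, 150.0, n)

# Invented outcome: FYGPA driven mostly by HSGPA plus noise, mirroring
# the study's finding that HSGPA accounts for most of the explained variance.
fygpa = 0.8 * hsgpa + 0.0005 * sat + rng.normal(0.0, 0.45, n)

X = np.column_stack([hsgpa, sat])
model = LinearRegression().fit(X, fygpa)

# model.score returns R^2: the share of variance in FYGPA explained
# by the linear combination of the two predictors.
print(round(model.score(X, fygpa), 3))
```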

Measurement and Classification

Two processes are involved when a test is administered—measurement and classification. Measurement is the process of assigning numerical values to a phenomenon. This is a thorny process because numbers are used to categorize the phenomenon, and numerical scales hold qualities such as differentiation (1 is different from 2), order (2 is higher than 1), equality of intervals (the interval between 1 and 2 is equal to the interval between 2 and 3), and a 0 point, which is not always a true absence of value. By assigning numerical values to categories, the rules associated with numbers are carried over to the properties of the measured phenomenon and may not always correspond to the actual properties of the measured objects.

Stevens 16 developed a hierarchy of measurement scales: nominal, ordinal, interval, and ratio. The nominal scale is a system of measurement where numbers are used for the purpose of differentiation only. For example, the numerical part of a street address or apartment number is numbered on the nominal scale. The number on the jersey of a football player is used to differentiate the player from others, and it too is on the nominal scale. The categorical coding of most demographic variables, such as gender, ethnicity, and political party affiliation, constitutes nominal measures. 17 Since nominal enumeration is used only to distinguish categories, the numbers assigned to the categories do not follow any order or presume interval equality. The nominal scale is the most rudimentary form of measurement.

The ordinal scale is a measurement scheme where, in addition to simple differentiation (the attribute specified by the nominal scale), the numbers represent a rank order of the measured phenomenon. Examples of ordinal measures are rankings in the Olympic Games, progressions of the spiciness of a dish in a restaurant (mild, spicy, and very spicy), military rank, birth order, and class rank. Another example of an ordinal measure is the emoji-face pain scale commonly used in health care. An ordinal scale establishes the order of categories but lacks the ability of comparison between the categories' intervals.

The subsequent scale in Stevens's hierarchy is the interval scale, which, in addition to differentiation and rank order, establishes the property of interval equality. On this scale, the intervals between adjacent points are presumed to be equal. One example of the interval scale is a number line, where, going from left to right, each subsequent number is higher in rank, and the intervals between adjacent numbers are equal across the entire domain of the line. Another example is a temperature scale measured in Celsius or Fahrenheit. In the social sciences, items commonly measured on the Likert scale, ranging from “strongly disagree” to “strongly agree,” for the purposes of statistical analysis of opinions, are assumed to be on the interval scale.

The highest measurement scale in the hierarchy is the ratio scale. In addition to the properties established by the nominal, ordinal, and interval scales, a ratio scale has a true 0 point (complete absence of value). Neither the number line nor the Celsius or Fahrenheit temperature scales have an absolute 0 point. The 0 on the number line is nothing more than a separation between the negative and positive numbers and can be rescaled with a simple linear transformation. The 0 on the temperature scale (in Celsius) is also not an absence of value but rather a point at which water becomes ice. An example of a ratio scale is the Kelvin temperature scale, where 0 indicates a complete absence of temperature.
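As a compact restatement of this hierarchy, the sketch below encodes which properties each of Stevens's scales supports; it is a schematic summary of the text above, not code from any measurement package.

```python
# Properties each of Stevens's scales supports, per the hierarchy above:
# each scale adds one property to those of the scale below it.
SCALES = {
    "nominal":  {"differentiation": True, "order": False, "equal_intervals": False, "true_zero": False},
    "ordinal":  {"differentiation": True, "order": True,  "equal_intervals": False, "true_zero": False},
    "interval": {"differentiation": True, "order": True,  "equal_intervals": True,  "true_zero": False},
    "ratio":    {"differentiation": True, "order": True,  "equal_intervals": True,  "true_zero": True},
}

for scale, props in SCALES.items():
    supported = [name for name, ok in props.items() if ok]
    print(f"{scale:>8}: {', '.join(supported)}")
```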

Every assessment is designed to measure and classify the test takers' performance in a specific domain. Depending on the assessment design, the scores can be on the ordinal, interval, or even ratio scale. Then, depending on the score obtained on the test, a test taker can be classified into the mastery or nonmastery categories (in the case of professional testing) or into basic, proficient, or advanced levels of performance in the case of K–12. 18

When test takers present themselves at the test site for an exam administration, they arrive as members of a single population. The goal of the test designer and test administrator is to separate the test takers into subpopulations according to the intended users' objectives for the scores. Thus, each item on the test is a classification tool that helps make the categorization decision regarding each individual test taker. With each item that is answered correctly, a test taker is more likely to be classified into the higher category, while each incorrect response increases the likelihood of classification into a lower category.
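A minimal sketch of that mastery/nonmastery classification logic follows; the number-correct cut score is a hypothetical value for illustration, not any testing program's actual rule.

```python
import numpy as np

rng = np.random.default_rng(2)
responses = rng.integers(0, 2, size=(8, 40))   # 8 examinees x 40 scored items

totals = responses.sum(axis=1)                 # number-correct score per examinee
CUT_SCORE = 24                                 # hypothetical mastery threshold

# Each correct item pushes the total toward the mastery category,
# each incorrect item toward nonmastery.
labels = np.where(totals >= CUT_SCORE, "mastery", "nonmastery")
for total, label in zip(totals, labels):
    print(total, label)
```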

Reliability and Validity

The quality of a measurement instrument is expressed in terms of the reliability and validity of the scores collected by this instrument. Reliability is the consistency with which a measure, scale, or instrument assesses a given construct, while validity refers to the degree of relationship, or the “overlap” between an instrument and the construct it is intended to measure. 13 The traditional meaning of reliability is the degree to which respondents' scores on a given administration of a measure resemble their scores on the same instrument administered later within a reasonable time frame. Kerlinger and Lee 19 suggested 3 approaches to reliability: stability, lack of distortion, and being free of measurement error. The first 2 definitions are addressed in this section; the third definition requires an introduction to classical test theory 20 , 21 and is addressed later.

If a measurement instrument or a comparable form is administered multiple times to the same or a similar group of people, we should expect similar scores. This is called temporal stability—the degree to which data obtained in a given test administration resemble those obtained in following administrations. When an assessment is conducted, a score user expects assurance that scores are replicable if the same individuals are tested repeatedly under the same circumstances. 9 There are 2 techniques to assess temporal stability: the test–retest method and the parallel forms method.

In the test–retest method, a set of items is administered to a group of subjects, then the test is readministered later to the same group. The correlation of the 2 sets of scores is then measured. A higher correlation between the scores indicates higher reliability.

In the parallel forms method, 2 different forms of the same test are constructed, both measuring the same critical trait (knowledge base). Next, both forms are administered to the same group of test takers at the same test session. A higher relationship between the 2 sets of scores indicates higher reliability. However, it is very difficult to correctly construct equivalent test forms, and a weak relationship between the 2 sets of scores may actually reflect a lack of equivalence.

Another component of reliability is a scale's internal consistency. The lack of distortion or internal consistency of an instrument refers to the extent to which the individual components of a test are interrelated and thus produce the same or similar results. Items on the test should “hang together.” One of the earlier techniques to establish the internal consistency of a scale is known as the split-half reliability. 22 The test is randomly split in half, and the 2 sets of test scores are compared to each other. Once again, a closer relationship between the 2 sets of scores indicates a higher test reliability.
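A minimal sketch of the split-half idea on a simulated response matrix follows; the Spearman-Brown step-up at the end is the standard companion adjustment for halving the test length and is an addition beyond the text above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate 2000 examinees answering 40 items driven by a common ability,
# so that the two halves genuinely "hang together".
theta = rng.normal(size=(2000, 1))
difficulty = rng.normal(size=40)
p = 1 / (1 + np.exp(-(theta - difficulty)))
responses = (rng.uniform(size=p.shape) < p).astype(int)

# Randomly split the items in half and score each half.
order = rng.permutation(40)
half_a = responses[:, order[:20]].sum(axis=1)
half_b = responses[:, order[20:]].sum(axis=1)

r_half = np.corrcoef(half_a, half_b)[0, 1]   # closer to 1 => more reliable
r_full = 2 * r_half / (1 + r_half)           # Spearman-Brown step-up
print(round(r_half, 3), round(r_full, 3))
```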

Cronbach 6,23 developed the coefficient alpha, an alternative to the once common split-half technique, which has become the most universal technique for estimating internal consistency reliability. His coefficient alpha assesses reliability as a ratio of the summed variances of individual items and the total variance for the instrument, subtracted from 1 and adjusted for the number of items in the instrument. Cronbach's alpha coefficient is computed as follows:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right) \tag{1}$$

where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$, and $\sigma_X^2$ is the variance of the total scores.

Cronbach's alpha ranges from 0 to 1.0, with values closer to 1.0 indicating higher reliability. The internal consistency of a test is considered acceptable if the alpha coefficient is above .70. 24,25 An alternative interpretation of Cronbach's alpha is the mean of all interitem correlations. If a correlation coefficient is squared, it becomes a coefficient of determination, which indicates the proportion of variability shared between 2 variables. 19 Thus, when .70 is squared, it becomes .49. This means that at least half of the variability in the responses collected by the instrument is explained by the instrument's internal consistency.
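The alpha computation is a direct transcription of the ratio in equation 1. Below is a minimal numpy sketch on simulated data, not on any published scale.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    # Ratio of summed item variances to total variance, subtracted from 1,
    # adjusted by k/(k - 1) for the number of items, per equation 1.
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example: items driven by a common ability are highly interrelated,
# so alpha lands well above the .70 benchmark.
rng = np.random.default_rng(4)
ability = rng.normal(size=(500, 1))
scores = (rng.uniform(size=(500, 20)) < 1 / (1 + np.exp(-ability))).astype(int)
print(round(cronbach_alpha(scores), 2))
```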

Reliability alone is not sufficient to establish the quality of a test. A good test must also measure what it was designed to measure, which is often referred to as validity. The validity of a scale refers to the extent of correspondence between variations in the scores on the test and the variation among respondents on the underlying construct being tested. 13 The process of validation is closely related to the intended use of the scores. For example, scores collected on a test of general anatomy given in English ideally depict the knowledge of anatomy possessed by a test taker. Yet, if a test is given to a sample of English-language learners, a part of the variability in scores can be explained by English proficiency (or lack thereof). Therefore, the scores collected by the same test in an English-first population of test takers may have higher validity than scores collected from English-language learners.

Importantly, the validity of a test is a matter of degree, not all or none. Further, the existing evidence of validity may be challenged by new findings or by new circumstances. Unavoidably, validity becomes an evolving property, and test validation is a continuous process. 26 This process of validation requires ongoing empirical research efforts beyond those used for reliability. The methods employed for establishing the validity of a test include a thorough analysis of the content of the test during the phase of scale development and quantitative assessment of the relationship between the test scores and the criterion being tested. 2 The degree of accuracy with which test scores relate to their intended use may be established by studying predictive validity.

Reliability is a prerequisite for validity, but test scores with low validity can still be reliable. Establishing reliability is largely a technical matter, whereas validity requires much deeper thinking and consideration; it is much more than a statistical procedure. Continuous, vigilant consideration of each item in terms of content representation and statistical performance, as well as reflection on the populations of test takers, is essential for confirming a test score's validity.

Classical Test Theory

Any measurement is an inference, and any statistical inference is subject to error. All measurements are susceptible to random error and, if repeated, may vary. To comprehend the size and the origin of the error, ideally, the measurement should be repeated several times, as the average of a series of measurements is more precise than any individual measurement by a factor equal to the square root of the number of measurements. 27 Classical test theory (CTT) postulates that any observation is a linear combination of the true score and error. The fundamental equation of CTT states the following:

O_i = T_i + E_i \qquad (2)

where O_i is the observed score for an examinee i, T_i is the true score for that examinee, and E_i is the error in the measurement. Thus, every test score can be seen as a combination of 2 hypothetical components: the true score (true knowledge of the material tested) and the deviation from the true score due to random or systematic factors. Any systematic error in measurement becomes part of an individual's true score and affects the validity, since the score is no longer an estimate of the latent trait alone but also of the systematic variability. Random errors, on the other hand, affect the reliability of the score and create a distortion in the observed score's precision over repeated administrations of the test.

Test scores can be described as random variables. 9 A random variable X is an outcome of a process that is determined by a probability distribution. The term “expectation” or “expected value,” denoted as E(X), is used to signify the mean of the probability distribution. Assuming that all systematic variability in the observed score is accounted for by the true score and that the error component consists of only random error, we can specify the expected value of the errors as follows:

E(E_i) = 0 \qquad (3)

which means that if examinee i took the exam an infinite number of times, then, by the definition of randomness, error would fall above and below the true score in equal amounts and would average to 0. The relationship between the observed score and the true score can be clarified by taking the expectation of the observed score:

E(O_i) = E(T_i + E_i) = E(T_i) + E(E_i) \qquad (4)

Because the expectation of error is 0 (see equation 3) and the true score is a fixed value for a given examinee, the expected value of the observed score is the true score:

E(O_i) = T_i \qquad (5)

Then it follows from equations 2 and 5 that

E_i = O_i - E(O_i) \qquad (6)

There are 3 other fundamental assumptions made by CTT: it is assumed that the correlation between true score and error is 0, that the correlation between error score on test 1 and error score on test 2 is 0, and that the correlation between the true score on test 1 and the error score on test 2 is 0.

The definition of reliability can be formulated in the framework of CTT if the following extension is made to equation 2:

\mathrm{Var}(O_i) = \mathrm{Var}(T_i) + \mathrm{Var}(E_i) \qquad (7)

where Var(O_i), the observed score variability, is partitioned into the true score variability, Var(T_i), and the variability of error, Var(E_i). Reliability is the proportion of the true score variability to the observed score variability or, equivalently, the proportion of the error variability to the observed score variability subtracted from 1.0:

\rho_{O_1,O_2} = \frac{\mathrm{Var}(T_i)}{\mathrm{Var}(O_i)} = 1 - \frac{\mathrm{Var}(E_i)}{\mathrm{Var}(O_i)} \qquad (8)

with \rho_{O_1,O_2} being the reliability coefficient.

The variability of the scores, as viewed by CTT, provides the explanation for score stability. Test takers who are not satisfied with their exam scores may choose to repeat the test. While an examinee repeating a test is interested in an increase in the observed score, psychometricians consider any increase in the true score separately from an increase in the error component. If a test is reliable, it is very hard to increase the true score component when the assessment is repeated over a short period of time; only long-term learning is associated with an increase in the true score component. 28 , 29 At the same time, the scores for a repeat test taker will vary from 1 administration to another, and improved performance is usually seen on a second measurement occasion, even if different questions are used. 12 This is due to a known phenomenon called the practice effect, 30 defined as an increase in an examinee's test score from 1 administration of the same assessment to the next in the absence of learning, coaching, or other factors that are known to increase the score. 31

Other sources of measurement error may include temporary or momentary fatigue, fluctuations of memory or mood, or fortuitous conditions at a particular time that temporarily affect the outcomes measured by the test. 19 Test scores may also be influenced by the content of the material that appeared on the test, guessing, state of alertness, and even scoring errors.

Another likely explanation of the differences in scores from 1 measurement occasion to another is the phenomenon known as regression to the mean. 32 Each form of a test will tend to favor certain students but not others in a nonsystematic way. Students may get a test with items representing the material they are most familiar with or have studied the most. However, students who were favored by 1 form of the test are not likely to be favored by another when they retake the test. Therefore, the scores obtained on the second or third testing occasions will tend to be closer to the mean than the scores obtained on the first testing occasion. 33

Even though it is never possible to measure exactly how much an increase in the observed score is influenced by the error component, CTT allows for estimation of the standard error of measurement (SEM), which is a function of the standard deviation of the set of observed scores and the reliability of the test:

\mathrm{SEM} = SD_O\sqrt{1 - \hat{\rho}_{O_1,O_2}} \qquad (9)

where SD_O is the standard deviation of the set of observed scores and \hat{\rho}_{O_1,O_2} is an estimate of reliability. Estimates of the SEM can be helpful in interpreting increases in individual test scores.
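Equation 9 is straightforward to compute. A minimal sketch in R, assuming a vector of observed total scores and a reliability estimate of .85 (an assumed value, for illustration only):

    # Standard error of measurement and an approximate 95% band for one score.
    sem <- function(scores, reliability) sd(scores) * sqrt(1 - reliability)

    obs <- rowSums(responses)            # observed total scores from the earlier sketches
    s   <- sem(obs, reliability = 0.85)
    c(lower = 12 - 1.96 * s, upper = 12 + 1.96 * s)  # band around an observed score of 12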

Item Response Theory

Item response theory (IRT) is a collection of statistical and psychometric methods used to model test takers' item responses. 34 The initial development of IRT models took place in the second half of the 20th century. First, Rasch 35 developed a model for analyzing categorical data. Next, Lord and Novick 21 wrote chapters on the theory of latent trait estimation, which gave birth to a new way of data analysis in testing. Prior to the development of IRT, the testing industry relied on CTT methods for modeling test item responses. Since then, IRT has made its way into every aspect of the testing industry. IRT methods are used today in test development, item banking, data analysis, analysis of differential item functioning, adaptive testing, test equating, and test scaling. 36

The early IRT models were first developed for dichotomously scored item responses (eg, 0 = wrong, 1 = right). These models included the 1-parameter logistic model (1PL), the 2-parameter logistic model (2PL), and the 3-parameter logistic model (3PL). Common assumptions for the early IRT models include unidimensionality—only 1 latent trait is necessary to explain the pattern of item-level responses 37—and local independence—after accounting for the latent trait, there is no dependency among the items. 36 Later, models for polytomous responses were developed: the partial credit model 38 and the generalized partial credit model. 35

In the early 1990s, significant efforts were made to develop multidimensional IRT models 39 , 40 and models that were able to account for item dependency over and above the dependency explained by the common trait. 41 , 42 Due to the introductory nature of this article, I will present the mathematical logic and graphical examples of the 1PL, 2PL, and 3PL models only.

One advantage of IRT over traditional testing theories is that IRT defines a scale for the underlying latent variable that is being measured by the test items. 43 IRT assumes that responses on a unidimensional test are underlain by a single latent trait (θ), often called the test taker's “ability.” This latent trait cannot be observed directly; however, it can be constructed from observed responses to the items on a test. Under IRT, the probability of a response to an item on a test is conditional on θ:

P(X_{ij} = 1 \mid \theta_j) \qquad (10)

The student's ability and the item difficulty are on the same scale; therefore, θ_j = β_i corresponds to θ − β = 0, meaning that there is an exact match between an examinee's ability and the item difficulty; θ_j > β_i corresponds to θ − β > 0, which means that the item is easy for the examinee's ability level; and θ_j < β_i corresponds to θ − β < 0, which means that the item is difficult for the test taker. Thus, the probability that an examinee j provides a correct response to an item i is a function of the difference between theta and beta; formulaically,

P(X_{ij} = 1 \mid \theta_j) = f(\theta_j - \beta_i) \qquad (11)

where f is a function that relates ability to the probability of a correct response; its graph is the item characteristic curve (ICC).

In the 1PL model, the probability of a correct response to an item is a function only of the difference between the test taker's ability and the item's difficulty. The following is the equation for the 1PL:

P(X_{ij} = 1 \mid \theta_j) = \frac{e^{D(\theta_j - \beta_i)}}{1 + e^{D(\theta_j - \beta_i)}} \qquad (12)

where D is a scaling factor, set to D = 1.7 so that the values of P(θ) for the 2-parameter normal ogive model and the values for the 2PL differ by less than 0.01.
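Equation 12 can be written directly in R. A minimal sketch (the function name and plotting choices are mine, not from the article):

    # 1PL item response function with the conventional scaling factor D = 1.7.
    p_1pl <- function(theta, b, D = 1.7) plogis(D * (theta - b))

    theta <- seq(-4, 4, by = 0.1)
    plot(theta, p_1pl(theta, b = 0), type = "l",
         xlab = "Ability (theta)", ylab = "P(correct)")   # an ICC centered at b = 0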

Illustration

The computing language R (an open-source environment for statistical computing and graphics) is often used to fit IRT models to data and estimate item parameters. Presented here is an example using the “irtoys” package 44 to fit various IRT models to a set of simulated responses (n = 2000) to a 40-item test. The items were scored dichotomously. Table 1 presents estimates of model parameters and associated standard errors for the 1PL model. The item difficulty is the only parameter that was estimated; the item discrimination was fixed at 1. Figure 1a presents the ICCs for the 40 items. The curves differ by their location along the x-axis, which is a reference scale for both test-taker ability and item difficulty—more difficult items sit to the right, less difficult items to the left. The 1PL model assumes that all items relate to the latent trait (ability) equally and differ only in difficulty.
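The analysis just described can be reproduced along the following lines. This is a sketch rather than the article's actual script: sim(), est(), irf(), and iif() are functions documented in the irtoys package, but argument details vary across versions, and the simulated item parameters are my own.

    # install.packages("irtoys")   # est(engine = "ltm") also requires the ltm package
    library(irtoys)

    true_ip <- cbind(1, rnorm(40), 0)            # columns a, b, c: a = 1, random b, c = 0
    resp <- sim(ip = true_ip, x = rnorm(2000))   # simulate 2000 examinees on 40 items
    fit  <- est(resp, model = "1PL", engine = "ltm")  # estimate difficulty only

    # Recent irtoys versions return a list; older ones return the matrix itself.
    ip <- if (is.list(fit)) fit$est else fit
    plot(irf(ip))   # item characteristic curves, as in Figure 1a
    plot(iif(ip))   # item information functions, as in Figure 1b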

Item-Parameter Estimates, 1-Parameter Logistic Model (N/A = Not Applicable)

Item  a  b      c    SE(a)  SE(b)  SE(c)
1     1  −1.17  N/A  N/A    0.08   N/A
2     1  −1.66  N/A  N/A    0.09   N/A
3     1  −1.71  N/A  N/A    0.10   N/A
4     1  −1.24  N/A  N/A    0.09   N/A
5     1  −2.87  N/A  N/A    0.14   N/A
6     1  −3.34  N/A  N/A    0.17   N/A
7     1  −3.78  N/A  N/A    0.20   N/A
8     1  −3.32  N/A  N/A    0.16   N/A
9     1  −2.30  N/A  N/A    0.11   N/A
10    1  −3.15  N/A  N/A    0.15   N/A
11    1  −1.18  N/A  N/A    0.09   N/A
12    1  1.60   N/A  N/A    0.09   N/A
13    1  0.31   N/A  N/A    0.08   N/A
14    1  −0.60  N/A  N/A    0.08   N/A
15    1  −1.26  N/A  N/A    0.09   N/A
16    1  −3.82  N/A  N/A    0.20   N/A
17    1  −1.67  N/A  N/A    0.09   N/A
18    1  −3.17  N/A  N/A    0.15   N/A
19    1  −3.75  N/A  N/A    0.20   N/A
20    1  −1.67  N/A  N/A    0.09   N/A
21    1  0.32   N/A  N/A    0.08   N/A
22    1  0.48   N/A  N/A    0.08   N/A
23    1  −0.82  N/A  N/A    0.08   N/A
24    1  −1.49  N/A  N/A    0.09   N/A
25    1  −2.68  N/A  N/A    0.13   N/A
26    1  −0.58  N/A  N/A    0.08   N/A
27    1  −0.92  N/A  N/A    0.08   N/A
28    1  −1.56  N/A  N/A    0.09   N/A
29    1  0.00   N/A  N/A    0.08   N/A
30    1  −2.03  N/A  N/A    0.10   N/A
31    1  −1.67  N/A  N/A    0.09   N/A
32    1  −1.70  N/A  N/A    0.09   N/A
33    1  −1.27  N/A  N/A    0.09   N/A
34    1  −1.86  N/A  N/A    0.10   N/A
35    1  −2.62  N/A  N/A    0.12   N/A
36    1  −2.42  N/A  N/A    0.12   N/A
37    1  −1.45  N/A  N/A    0.09   N/A
38    1  −1.90  N/A  N/A    0.10   N/A
39    1  −0.64  N/A  N/A    0.08   N/A
40    1  −1.47  N/A  N/A    0.09   N/A

Figure 1. a) Item characteristic curves for the 40 items, 1-parameter logistic model. b) Item information functions for the 40 items, 1-parameter logistic model.

Figure 1b presents the item information functions (IIFs) for the 40 items. The IIF shows the point on the ability scale at which an item provides maximum information. Assuming that these curves are Gaussian, the range of ability for which an item provides the most information can be estimated using the 3-sigma empirical rule. 45 The IIF depends on the slope of the item response function as well as the conditional variance at each ability level: the greater the slope and the smaller the variance, the greater the information and the smaller the standard error of measurement. 32 In the 1PL, the slopes are held constant; therefore, there is no variability in the height of the curves.

The 2PL model estimates another parameter—the discrimination of an item, seen as the slope of the ICC. Discrimination describes how sharply an item separates test takers who know the right answer from those who do not demonstrate that knowledge. Items with better discriminating qualities have steeper slopes. The following equation represents the 2PL model:

P(X_{ij} = 1 \mid \theta_j) = \frac{e^{D a_i (\theta_j - \beta_i)}}{1 + e^{D a_i (\theta_j - \beta_i)}} \qquad (13)

where a_i is the discrimination parameter for item i. Table 2 presents the model parameter estimates and related standard errors for the 2PL model. Figure 2a presents the ICCs for the same 40 items as Figure 1a; it is now apparent that some items discriminate between the 2 populations better (have steeper slopes) than others.

Item-Parameter Estimates, 2-Parameter Logistic Model (N/A = Not Applicable)

Item  a     b      c    SE(a)  SE(b)  SE(c)
1     0.39  −2.77  N/A  0.09   0.67   N/A
2     0.70  −2.33  N/A  0.12   0.35   N/A
3     0.62  −2.66  N/A  0.11   0.45   N/A
4     0.81  −1.55  N/A  0.11   0.20   N/A
5     0.84  −3.47  N/A  0.18   0.62   N/A
6     1.01  −3.48  N/A  0.22   0.60   N/A
7     0.84  −4.57  N/A  0.25   1.17   N/A
8     1.35  −2.81  N/A  0.24   0.36   N/A
9     0.89  −2.65  N/A  0.15   0.37   N/A
10    1.39  −2.64  N/A  0.24   0.31   N/A
11    1.00  −1.26  N/A  0.12   0.14   N/A
12    0.45  3.30   N/A  0.11   0.74   N/A
13    0.88  0.37   N/A  0.11   0.09   N/A
14    0.78  −0.77  N/A  0.10   0.13   N/A
15    0.50  −2.36  N/A  0.10   0.46   N/A
16    0.64  −5.82  N/A  0.25   2.04   N/A
17    0.41  −3.75  N/A  0.11   0.93   N/A
18    0.87  −3.71  N/A  0.20   0.70   N/A
19    1.06  −3.76  N/A  0.26   0.73   N/A
20    0.77  −2.16  N/A  0.12   0.30   N/A
21    0.79  0.41   N/A  0.10   0.10   N/A
22    0.12  3.48   N/A  0.08   2.32   N/A
23    0.49  −1.57  N/A  0.09   0.31   N/A
24    0.57  −2.49  N/A  0.11   0.44   N/A
25    0.70  −3.75  N/A  0.16   0.75   N/A
26    0.75  −0.77  N/A  0.10   0.13   N/A
27    0.16  −5.10  N/A  0.09   2.73   N/A
28    1.08  −1.57  N/A  0.14   0.16   N/A
29    0.25  0.02   N/A  0.08   0.26   N/A
30    0.33  −5.66  N/A  0.12   1.94   N/A
31    0.92  −1.89  N/A  0.13   0.23   N/A
32    0.32  −4.90  N/A  0.11   1.58   N/A
33    0.66  −1.88  N/A  0.11   0.28   N/A
34    0.76  −2.45  N/A  0.13   0.35   N/A
35    1.64  −2.01  N/A  0.23   0.18   N/A
36    0.53  −4.33  N/A  0.14   1.04   N/A
37    0.81  −1.80  N/A  0.12   0.23   N/A
38    0.52  −3.45  N/A  0.12   0.72   N/A
39    0.39  −1.49  N/A  0.09   0.36   N/A
40    0.62  −2.26  N/A  0.11   0.36   N/A

Figure 2. a) Item characteristic curves for the 40 items, 2-parameter logistic model. b) Item information functions for the 40 items, 2-parameter logistic model.

The estimation of the slope relaxes the assumption of an invariant relationship between the items and the latent trait. This relationship can now be estimated, and it is similar to the factor loadings in factor analysis. 46 Items with higher discrimination coefficients are more responsive to small changes in the latent trait, whereas items with low discrimination coefficients require large changes in the latent trait to reflect a change in the probability of a correct response. Figure 2b presents the item information curves, which now show variability in the amount of information the items provide.
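The information curves in Figure 2b follow from the 2PL model: an item's information at ability θ is I(θ) = D² a² P(θ)[1 − P(θ)], a standard result for the 2PL. A minimal sketch using two items from Table 2 (the function name is mine):

    # 2PL item information; peaks at theta = b, height grows with a^2.
    info_2pl <- function(theta, a, b, D = 1.7) {
      p <- plogis(D * a * (theta - b))
      D^2 * a^2 * p * (1 - p)
    }

    theta <- seq(-4, 4, by = 0.1)
    plot(theta, info_2pl(theta, a = 1.64, b = -2.01), type = "l",
         xlab = "Ability (theta)", ylab = "Information")       # item 35: steep, informative
    lines(theta, info_2pl(theta, a = 0.12, b = 3.48), lty = 2) # item 22: flat, little information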

The 3PL model is a 2PL model with an additional parameter, γ_i, which is the lower asymptote of the ICC and represents the probability that a test taker with very low ability provides a correct answer to an item i. The inclusion of this parameter acknowledges that test takers who score low on the latent trait may still provide a correct response by chance; the parameter is therefore referred to as “guessing.” The following is the mathematical representation of the 3PL model:

P(X_{ij} = 1 \mid \theta_j) = \gamma_i + (1 - \gamma_i)\frac{e^{D a_i (\theta_j - \beta_i)}}{1 + e^{D a_i (\theta_j - \beta_i)}} \qquad (14)

where γ_i is the guessing parameter. Referring back to equation 14, if γ_i = 1, the probability of a correct response is explained entirely by guessing (the term after the plus sign disappears); if γ_i = 0, the model reduces to the 2PL. Table 3 presents model parameter estimates for the 3PL, while Figures 3a and 3b present the ICCs and IIFs, respectively, for the 40 items.
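Equation 14 in R, evaluated with the Table 3 estimates for item 8 (a minimal sketch; the function name is mine):

    # 3PL response function; gamma = 0 reduces to the 2PL.
    p_3pl <- function(theta, a, b, gamma, D = 1.7) {
      gamma + (1 - gamma) * plogis(D * a * (theta - b))
    }

    p_3pl(theta = -6, a = 1.32, b = -2.84, gamma = 0.04)  # very low ability: near the guessing floor
    p_3pl(theta =  1, a = 1.32, b = -2.84, gamma = 0.04)  # high ability: probability near 1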

Item-Parameter Estimates, 3-Parameter Logistic Model (N/A = Not Applicable)

Item  a     b      c     SE(a)  SE(b)  SE(c)
1     0.37  −2.68  0.05  0.97   0.46   0.04
2     0.71  −2.30  0.00  0.17   0.78   0.69
3     0.86  −0.84  0.49  0.71   0.76   0.11
4     0.79  −1.50  0.04  0.35   0.86   0.53
5     0.82  −3.42  0.09  0.11   0.69   0.10
6     1.01  −3.28  0.17  0.13   0.44   0.10
7     0.81  −4.60  0.07  0.37   0.11   0.47
8     1.32  −2.84  0.04  0.84   0.48   0.11
9     1.14  −1.18  0.56  0.37   0.83   0.31
10    2.98  −0.88  0.76  0.38   0.20   0.08
11    0.97  −1.26  0.01  0.15   0.41   0.17
12    0.77  2.82   0.08  0.41   0.15   1.06
13    0.89  0.37   0.00  0.43   0.82   0.24
14    0.85  −0.53  0.08  0.14   0.94   0.73
15    1.84  0.70   0.64  0.13   0.82   0.14
16    0.64  −5.75  0.06  2.20   0.34   0.04
17    0.92  0.34   0.67  1.00   0.29   0.06
18    0.90  −3.59  0.03  0.42   0.58   0.19
19    0.98  −3.89  0.09  0.39   0.50   0.15
20    0.88  −1.33  0.31  0.17   0.53   0.07
21    0.87  0.56   0.05  0.19   0.54   0.09
22    0.22  6.28   0.24  0.46   0.30   0.10
23    0.49  −1.48  0.02  0.12   0.66   0.35
24    0.57  −2.42  0.03  0.30   0.63   0.18
25    0.86  −2.02  0.55  0.41   0.59   0.13
26    0.82  −0.49  0.09  0.51   0.42   0.11
27    0.17  −3.94  0.10  0.09   0.60   0.10
28    1.07  −1.51  0.05  0.10   1.16   0.12
29    0.96  2.10   0.41  0.10   0.51   0.03
30    0.33  −5.40  0.07  0.11   0.10   0.02
31    1.25  −0.82  0.41  0.13   0.65   0.11
32    0.44  −1.28  0.52  0.49   0.53   0.15
33    0.67  −1.85  0.00  0.30   1.04   0.38
34    0.72  −2.48  0.04  0.35   0.44   0.15
35    1.63  −2.03  0.00  0.09   0.15   0.03
36    0.52  −4.34  0.02  0.45   0.68   0.14
37    1.54  −0.15  0.53  1.45   0.26   0.04
38    1.69  0.37   0.74  0.12   0.15   0.01
39    0.37  −1.54  0.01  0.71   0.95   0.34
40    5.72  0.60   0.70  1.16   0.42   0.06

Figure 3. a) Item characteristic curves for the 40 items, 3-parameter logistic model. b) Item information functions for the 40 items, 3-parameter logistic model.

Polytomous IRT Models

Various polytomous IRT models have been developed to account for ordered categorical responses. Samejima 47 developed a logistic model for graded responses in which the probability that an examinee j with a given level of ability responds to item i in category k is the difference between the cumulative probability of responding in that category or higher and the cumulative probability of responding in the next category or higher. Consider the following:

P(X_{ij} = k \mid \theta_j) = P^{*}_{ik}(\theta_j) - P^{*}_{i,k+1}(\theta_j), \quad \text{where } P^{*}_{ik}(\theta_j) = \frac{e^{a_i(\theta_j - b_{ik})}}{1 + e^{a_i(\theta_j - b_{ik})}} \qquad (15)

where b_{ik} is the difficulty parameter for category k of item i and a_i is the discrimination parameter for item i. 47
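Equation 15 can be evaluated in a few lines of R. A minimal sketch using the Table 4 estimates for item 9 (the function name is mine; thresholds must be in ascending order):

    # Graded response model: category probabilities are differences between
    # adjacent cumulative probabilities P*(X >= k).
    grm_probs <- function(theta, a, b) {
      p_star <- c(1, plogis(a * (theta - b)), 0)  # bounded by P* = 1 and P* = 0
      -diff(p_star)                               # P(X = 0), ..., P(X = 3)
    }

    grm_probs(theta = 0, a = 0.66, b = c(-0.62, 1.75, 2.94))  # probabilities sum to 1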

A different model for ordered categorical responses was developed by Masters. 33 In this partial credit model, the probability that an examinee j will provide a response in category x on an item i with M_i thresholds is a function of the student's ability and the difficulties of the M_i thresholds in item i:

P(X_{ij} = x \mid \theta_j) = \frac{\exp\sum_{k=0}^{x}(\theta_j - \delta_{ik})}{\sum_{h=0}^{M_i}\exp\sum_{k=0}^{h}(\theta_j - \delta_{ik})} \qquad (16)

Samejima's graded response model was fitted to a simulated data set of n = 1500 responses to 10 polytomous items scored in 4 ordered categories: 0, 1, 2, and 3. Table 4 presents the model-based parameter estimates. Figure 4a presents the ICCs for items 1–4 of the 10 polytomous items; Figures 4b and 4c present the ICCs for items 5–8 and for items 9 and 10, respectively.

Item-Parameter Estimates, Graded Response Model (b1–b3 = category thresholds, a = discrimination)

Item  b1     b2     b3     a
1     −0.97  −0.56  −0.13  5.51
2     −2.64  −1.84  −0.55  2.00
3     −2.94  −1.22  0.26   1.23
4     −1.27  −0.60  0.68   0.77
5     −3.77  −1.98  0.25   0.93
6     −1.61  −0.57  0.46   1.23
7     −2.14  −0.87  0.81   1.39
8     −1.96  −0.48  1.22   1.19
9     −0.62  1.75   2.94   0.66
10    −1.59  −0.47  1.07   1.76

Figure 4. a) Item characteristic curves for items 1-4, graded response. b) Item characteristic curves for items 5-8, graded response. c) Item characteristic curves for items 9 and 10, graded response.

Equating of Test Scores

Measurements of the same construct collected at different times or with different forms must be brought to the same scale to be comparable. When tests are used to make high-stakes decisions, the scores of examinees who took the test on 1 occasion using 1 test form should be comparable to the scores of examinees who took the test on another occasion using a different test form. To protect test security, it is common practice to administer different forms of the test on different testing occasions. However, it is hard to construct 2 truly parallel forms, and these test forms often differ in difficulty. Yet it is important to avoid a situation in which 1 group of test takers has an unfair advantage because they were administered an easier form of the exam. 48 Therefore, the test scores must be equated to account for possible differences in difficulty between the test forms or differences in ability between the groups of test takers.

Equating is a statistical process used to adjust scores on test forms so that scores on the forms can be used interchangeably. 36 After equating, alternate forms of the same test yield scaled scores that can be used interchangeably even though they are based on different sets of items. 49 It is important to point out that statistical adjustment is not possible for differences in content; the responsibility for content equivalence between 2 forms of a test lies entirely with test developers.

For the past 30 years, equating has received much deserved attention and research. Many new equating methods have been proposed and tested in both research and operational testing programs. I will introduce only general principles related to equating here, as my goal is to make the reader aware of the procedure. Those who wish to expand their knowledge of equating should turn to the literature published in the field of educational measurement.

The first step in the process of equating is to decide on an equating design. Test scores can be equated using either the same populations or the same items. Single-group design assumes that 2 test forms can be equated if they are given to the same population of examinees. Since the same examinees take both tests, the difficulty levels are not confounded by the ability of the examinees. 37 Equivalent-group design assumes that 2 test forms are given to similar but not the same populations of examinees. Reasonable group equivalence may be achieved through random assignment. 13

Common-item design requires that both forms of the test contain a set of the same items, usually called “anchor” items; the forms are then administered to different populations of examinees. Subsequently, a function that relates the statistics computed for each anchor set will account for the differences in difficulty. This mathematical function is then used to equate the nonanchor items on both forms. 36 , 37

An appropriate equating methodology must be chosen, depending on which theoretical framework is preferred by the testing program, to obtain the test-taker statistics and the item-level statistics. Equating methods have been developed under both CTT and IRT. When pairs of statistical values for the 2 forms have been obtained, a decision is made regarding the method used to relate the exams. Several methods can be selected from the framework of linear models, including regression methods, mean and sigma procedures, and characteristic curve methods.
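As a concrete example of the mean and sigma idea, linear equating under an equivalent-groups design maps a Form X score onto the Form Y scale by matching the 2 score distributions' means and standard deviations. A minimal sketch with simulated scores (all values are illustrative):

    # Linear equating: y = mean(Y) + sd(Y)/sd(X) * (x - mean(X)).
    set.seed(2)
    x_scores <- rnorm(1000, mean = 24, sd = 6)  # group that took the harder Form X
    y_scores <- rnorm(1000, mean = 27, sd = 5)  # group that took the easier Form Y

    equate_linear <- function(x, from, to) {
      mean(to) + sd(to) / sd(from) * (x - mean(from))
    }

    equate_linear(30, from = x_scores, to = y_scores)  # a Form X 30 expressed on the Y scale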

Equating is the strongest form of linking. Tests can be similar or even equivalent in content yet differ in difficulty, or they can differ in both content and difficulty. When tests differ in content, the scores obtained on them may still need to be placed on the same scale; in this case, the statistical process of adjusting the scores for difficulty is called linking. The term equating is reserved for the situation in which scores from 2 tests of the same content are linked; when equating is achieved, the relationship is invariant across different populations. 36 The statistical procedures used in linking may not differ from those used in equating; however, no linking procedure can adjust for differences in content.

This article has presented researchers and clinicians in the health sciences with an introduction to educational measurement—its history, the theoretical frameworks of CTT and IRT, and the most common IRT models used in modern testing.

ACKNOWLEDGMENTS

This article is dedicated to Dr Howard B. Lee, a mentor and friend.

FUNDING AND CONFLICTS OF INTEREST

No funding was received for this work, and the author has no conflicts of interest to declare relevant to this work.

Can Standardized Tests Predict Adult Success? What the Research Says


The use of standardized tests as a measure of student success and progress in school goes back decades, with federal policies and programs that mandated yearly assessments as part of state accountability systems significantly accelerating this trend in the past 20 years. But the tide has turned sharply in recent years.

Parents, advocates, and researchers have increasingly raised concerns about the role of testing in education. The shift in people’s attitudes about the use of tests and about the consequences of relying (or possibly over-relying) on test scores for both school and teacher accountability raises the question: What can tests tell us about the contributions of schools and teachers to student success in the future?

We think it is important to ask this foundational question: How much do we know about whether there is a causal link between higher test scores and success later in life? After all, that is the purpose of education—preparing students to be successful in the future. We explored this question and the role of tests in a recently published article in Educational Researcher. We conclude that any debate about the use of test scores in educational accountability should: (1) consider the significant evidence connecting test scores to later life outcomes; (2) take into account the difficulty of establishing causality between test achievement and later life outcomes; and (3) consider what alternative measures of success are out there and how reliable they are.


It is certainly reasonable to argue that we should hold schools and teachers accountable for the test performance of their students, but we likely care a whole lot more about tests if they reflect increased learning in school that translates into future success.

There is a vast research literature linking test scores and later life outcomes, such as educational attainment, health, and earnings. These observed correlations, however, do not necessarily reflect causal effects of schools or teachers on later life outcomes. Maybe students who do well on tests are the same students who wake up early in the morning, go to work on time, and work hard, and that’s the reason for their success, not necessarily what they learned in school. Also, differences in test scores could reflect differences in learning opportunities outside of school, including the supportiveness of families or the communities in which students live.

What we do know more definitively about the causality of this relationship comes from a limited number of studies that examine the effects of different educational inputs (for example, schools, teachers, classroom peers, special programs) on both student test scores and later life outcomes. For instance, if a study finds test-score impacts and adult-outcome impacts that are in the same direction, this could be regarded as evidence that test scores (and the learning they represent) have an impact on later life outcomes.

Our view is that studies that might be considered causal do tend to find alignment between effects on test scores and later life outcomes. Perhaps the most influential studies in this strand were published in 2014 by Raj Chetty, John Friedman, and Jonah Rockoff, who found that students who were assigned to teachers deemed highly effective learned more as measured by tests and also were more likely to have better adult outcomes, such as attending college and earning higher salaries.

Another study by Chetty and co-authors examines the long-term effects of peer quality in kindergarten (once again, as indicated by test scores) using the Tennessee Student/Teacher Achievement Ratio experiment. The 2011 study finds that students who are assigned to classrooms with higher achieving peers have higher college attendance rates and adult earnings. Similarly, using that same Tennessee STAR experiment, a study by Susan Dynarski and colleagues that same year looks at the effects of smaller classes in primary school and finds that the test-score effects at the time of the experiment are an excellent predictor of long-term improvements in postsecondary outcomes.

It is also important to recognize that we might not always expect test-score effects of educational interventions to align with adult outcomes. It is easy to make the case that interventions can improve later life outcomes without affecting the cognitive skills of children. Choice schools may, for instance, have stronger pipelines into college, leading to better college-going results while not affecting learning and test results, but we don’t know this conclusively.

Irrespective of one’s views on the degree to which tests predict later life outcomes, we need to think carefully about what abandoning the use of test scores altogether might mean for education policy and practice. From a practical perspective, we can’t wait many years to get long-term measures of what schools are contributing to students. This does not mean that test scores ought to be the exclusive or even primary short-term measures, but if one believes in some form of educational accountability, it is important to consider what alternative measures of success are out there and how reliable they are.

Lessening the weight of tests in accountability calculations is consistent with ESSA, but there are concerns about how “gameable” many of the alternative measures might be. And there is no doubt that we know less empirically about the causal connections between many of these alternative measures and long-term student prospects.

For example, are students assigned to teachers who get good classroom observation ratings likely to have better future prospects? Perhaps, but there is less evidence about this type of measure than there is about test-based measures. And if we do not use test scores in teacher evaluations at all, are we going back to the era of teacher accountability when 99 percent of all teachers across the country were rated satisfactory or better?

People clearly have strong feelings about the worth of—and the harm done by—testing. But whatever our personal feelings, we need to evaluate the power of test scores to predict the outcomes we want for our students and consider what the alternatives might be.

A version of this article appeared in the October 09, 2019 edition of Education Week as How Predictive Are Tests?

Sign Up for EdWeek Update

Edweek top school jobs.

research on standardized testing

Sign Up & Sign In

module image 9

Tests and Stress Bias

  • Posted February 12, 2019
  • By Grace Tatter

Chronic Stress

A new study suggests that changes in levels of cortisol, a hormone associated with stress, during weeks of standardized testing hurt how students in one New Orleans charter school network performed — and kids coming from more stressful neighborhoods, with lower incomes and more incidents of violence, were most affected.

Published in a recent working paper from the National Bureau of Economic Research, the first-of-its-kind study contributes to conversations about chronic stress and testing and helps clarify where those conversations intersect. It indicates that one reason family income tends to correlate with test scores may be that stress — both from the test and home environments — affects scores.

The Findings

The researchers — Jennifer Heissel of the Naval Postgraduate School, Emma Adam and David Figlio of Northwestern, and Jennifer Doleac and Jonathan Meer of Texas A&M University — measured the stress levels of children at the New Orleans charter school network, comparing the cortisol in their spit during weeks with high-stakes standardized tests — those that have implications for course placement, school sanctions or rewards, or education policy — and weeks without testing.

What they found is that, on average, students had 15 percent more cortisol in their systems during the homeroom period before a standardized test than on days with no high-stakes testing. Students who showed the largest variations in cortisol between testing and non-testing weeks tended to perform worse on tests than expected, given their classwork and performance on non-high-stakes tests, among other measures. Cortisol spikes weren’t the only culprit; some students’ cortisol dropped on testing days, which was also associated with lower performance.

“The decreases in cortisol is more a sign that your body is facing an overwhelming task and your body does not want to engage with the test,” Heissel says.

Students from the most disadvantaged neighborhoods, with both the highest rate of poverty and crime, saw the largest changes in cortisol in advance of testing, suggesting that their scores were the most affected — and therefore the least valid measures of what they actually knew.

Boys also tended to see more variation in cortisol, supporting pre-existing research that boys get more stressed about achievement-related tests, while girls are often more affected by social pressures.

Stress Bias?

More research needs to be done, Heissel warns. This study only included 93 students across three schools in New Orleans. Nearly all of the students were black and from low-income families, although there was variation in the violence and level of poverty in their neighborhoods. Future research would benefit from larger, more diverse samples, although, Heissel notes, it’s hard to find schools willing to let researchers visit during testing weeks.

It also raises questions about how to temper the effects of cortisol variation on testing day. “How can we reduce that stress response? There are lots of questions raised by this research and I hope other people pick up the baton,” Heissel says.

But considering the importance and frequency of high-stakes testing, the need for that research is urgent for anyone who interprets and makes decisions based on test scores.

 “It calls into question what we’re really measuring,” Heissel says.

  • Stress and its effect on the brain might be one reason that students from low-income neighborhoods tend to fare worse on high-stakes tests.
  • Children are affected by standardized testing, with some seeing their cortisol levels spike on testing days, and others seeing it drop, which might lead them to disengage.
  • Boys’ cortisol levels were more affected by standardized tests than girls’.


Research tells us standardized admissions tests benefit under-represented students

Wayne Camara and Michelle Croft, April 9, 2020


We continue to hear arguments against the use of ACT and SAT scores in admission decisions at California universities.


Such arguments, however, ignore important facts: ACT and SAT scores benefit under-represented students in particular and strengthen college admissions decisions in general, including at the University of California.

As some institutions, including the UC system, make temporary adjustments to their admissions criteria to mitigate the impact of the coronavirus on applications and enrollment, we’re reminding students and colleges of this fact.

In late January, the University of California Standardized Testing Task Force completed a yearlong review of testing as a college admissions tool. The comprehensive report made the following findings:

  • Standardized tests are the best predictor of a student’s first-year success, retention and graduation.
  • The value of admissions test scores in predicting college success has increased since 2007, while the value of grades has decreased, due in part to high school grade inflation and different grading standards.
  • In regard to equity, testing does not worsen disparities for under-represented minority applicants and low-income students; instead, large differences in high school grades and course-taking are responsible for much of the difference in admissions rates across groups.


These findings should have laid to rest any talk about making test scores optional in the UC college admission process. Those who are continuing to make such arguments are focusing on myths, not facts. Some also continue to suggest that the ACT and SAT should be replaced in admissions by tests from the Smarter Balanced Assessment Consortium, which are administered to 11th-graders in California’s public high schools and those in six other states.

The UC report specifically rejects the proposal to use Smarter Balanced as a replacement of the ACT and SAT, and we agree.

There are several problems with using Smarter Balanced for admissions, many of which revolve around access and test security. With only seven states participating in the Smarter Balanced Test Consortium, students in the other 43 states and the District of Columbia — plus California students who attend private schools or are home schooled — have no opportunity to take the exam. And even students who have access to Smarter Balanced tests have no opportunity to retest, which typically results in a moderate increase in test scores.

Additionally, the computer adaptive processes (tailored testing that is adaptive to each test-taker’s abilities) used in the Smarter Balanced tests are rudimentary, resulting in some students gaining unfair exposure to test questions in advance. The task force report points out other challenges in employing Smarter Balanced tests for high-stakes decisions and published data demonstrate that group differences on Smarter Balanced assessments mirror those on admissions tests and would have little to no impact on diversity.

Finally, there is the argument that the Smarter Balanced tests cover more content taught in California classrooms than the ACT or SAT. It’s worth noting that the testing time for the Smarter Balanced tests is 7.5 hours, requiring multiple days of testing and representing nearly double the time required by the ACT and SAT. With double the testing time, one can obviously cover more content.

But measuring lots of skills is less important than measuring the right skills — the ones that are most important in terms of college readiness, upon which the ACT focuses. What matters most, really, is the test’s ability to effectively predict college success. ACT and College Board research has yielded many studies that have demonstrated the validity and usefulness of the ACT and SAT for admissions decisions across all student groups. We believe that every student should have the tools, support and resources they need to succeed in college and their careers.

When properly used by considering the context of student experience, opportunities and other achievements, admissions tests lead to accurate and fair decisions. Our job as leaders in academia is to work and collaborate across all sectors and to level the playing field in college admissions so the dream of a higher education is within reach for all who seek it.

Wayne Camara is the Horace Mann Research Chair and Michelle Croft is a principal research scientist at ACT, which owns and administers the ACT test.

The opinions in this commentary are those of the authors.


Robert Anderson, 3 years ago:

What will be used, high school grades? What about the problem of poor schools with easy classes and lax grading?

Molly, 3 years ago:


Loud, dumb and wrong. Standardized Tests are the sibling of course work and grades. It’s all a meritocracy that benefits students with the most resources. Your article neglects to mention the already-existing racial disparities in American public universities. You claim that Black students will have a harder time without test scores as a requirement, but there are a multitude of peer reviewed journals and articles that negate your preposterous assumption.

This a terribly inadequate and misleading article, and the writers should be ashamed of such poorly reviewed research.


Center for American Progress

Future of Testing in Education: The Way Forward for State Standardized Tests


There are valid criticisms about the current structure of state standardized testing in schools; the solution is not to get rid of these assessments but rather to design them differently.



Part of a Series: The Future of Testing in Education

This series is about the future of testing in America's schools. Part one of the series presents a theory of action for the role assessments should play in schools. Part two reviews advancements in technology, with a focus on artificial intelligence that can powerfully drive learning in real time. And the third part—this report—looks at assessment designs that can improve large-scale standardized tests.

Introduction and summary

Federal law requires all public school students in grades three to eight to take an annual assessment in reading and math at the end of the year and requires students to take an assessment once during high school. The goal of this assessment is to measure the extent to which all students are meeting the state’s academic standards. These standards must align with the knowledge and skills in reading and math that students need to succeed in first-year college reading and math courses. Ensuring all students are held to rigorous standards is a key goal of equity in education.

Yet many question the value of yearly standardized testing in schools since the opportunity to receive a high-quality education and graduate high school adequately prepared for college-level academics is still wholly inequitable. Students who are Black, Indigenous, and Hispanic graduate high school at lower rates than their white peers, and they require catch-up coursework in college more often. 1 What is more, the costs and time associated with assessments, delayed results, and failure of tests alone to improve students’ academic results leave many to wonder if they are worth the effort at best, and at worst, if they harm students and punish teachers and schools. 2


Still, there are ways to design an assessment to reduce the amount of time it takes to administer, to ensure that it collects information about students throughout the year, or to base the test on performance tasks. This report describes the advancements in testing technology that make such assessments possible, and it concludes with recommended changes in federal testing policy to make the use of these designs effective. Apart from greater investments in research and development of new assessment designs, the federal government should also loosen regulations on the assessment pilot included in the recent reauthorization of the Elementary and Secondary Education Act. 3

The Center for American Progress’ companion report in this series, “Future of Testing in Education: Effective and Equitable Assessment Systems,” 4 separates fact from fiction regarding the criticisms against standardized testing. It also underscores CAP’s theory that, when well designed, tests can provide insights into what students know and do not know, allowing education stakeholders to drive student learning forward. This information is critical to teachers in the design of daily instruction, as well as to school administrators and policymakers who decide on and fund supports when students need them.

The yearly state standardized assessment alone cannot ensure a high-quality education for every child, but without it, educators do not know their progress toward meeting that goal. Despite the valid concerns that remain about standardized assessments and their role in education, standardized testing has a number of upsides. For instance, it is the only common measure of grade-level academic standards for all public school students, and as such, it is one measure to determine whether students are on track for college or career readiness when they graduate high school. The Every Student Succeeds Act (ESSA)—a federal law requiring all students to be held to the same high standards—is one way to help ensure an equitable opportunity to a high-quality education.

New ways to test students in the spring

Advances in technology—and even some decades-old assessment designs—can reduce testing time and improve the quality of the standardized tests themselves by addressing the drawbacks discussed in CAP’s issue brief in this series, “Future of Testing in Education: Artificial Intelligence.” 5

Long testing times, cultural bias, and limited usefulness to teachers are just some of the criticisms against today’s state standardized assessment. However, advances in technology can alleviate some of these concerns. Some tests, for example, can use sampling techniques to reduce testing time. Overall, this report discusses three new ways to assess students.

Matrix sampling cuts testing time

Today’s state standardized tests require students to sit for eight to nine hours total, in two-hour segments. Matrix sampling—which provides individual students with a representative sample of assessment questions—reduces testing time. Rather than all students taking all test items, one approach to matrix sampling involves selecting a limited set of test questions in a way that allows evaluators to estimate results for the entire test. In other words, no individual student takes the entire exam. In addition to decreasing testing time, the results produce group-level information rather than individual test scores. This is the approach used by the National Assessment of Educational Progress, or NAEP—a test given to students since the 1970s that takes about 90 minutes to complete. 6

Using another type of matrix sampling, test developers select specific test questions that will predict performance on the entire test. All students take these selected items. 7
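The logic of matrix sampling is easy to demonstrate. In the sketch below (written in R, with entirely hypothetical numbers), each student answers a random quarter of a 60-item pool, yet the group-level result for the full test is recovered accurately:

    # Matrix sampling: every student sees a random subset of items, but
    # aggregating across students still estimates full-test performance.
    set.seed(3)
    n_students <- 2000; n_items <- 60; items_each <- 15

    true_p <- runif(n_items, 0.3, 0.9)            # "true" proportion correct per item
    scores <- matrix(NA, n_students, n_items)
    for (s in 1:n_students) {
      taken <- sample(n_items, items_each)        # 15 of 60 items per student
      scores[s, taken] <- rbinom(items_each, 1, true_p[taken])
    }

    mean(colMeans(scores, na.rm = TRUE))  # group-level estimate from 25% of the items
    mean(true_p)                          # the target value it recovers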

State education departments already use matrix sampling in current state tests, often to pilot new questions and see how students interact with them. For widescale use, a matrix sampling design would provide similar information about student performance at the student group level as current state tests do.

Furthermore, experts advise that with additional innovative techniques, matrix sampling could measure growth for individual students by giving them enough test questions in common and applying additional statistical analysis to the questions. 8 Other statistical techniques could also produce individual student reports. These methods could also support results comparison between students and from one year to the next, allowing policymakers and administrators to identify trends.

In sum, a matrix sampling assessment design could give enough of the benefits of a full-length assessment without the significant drawbacks that long testing time has on students. However, without further innovation on current matrix sampling designs, the lack of individual test scores poses a significant barrier. The authors discuss the current policy barriers to using a matrix sampling design in a later section of this report.

Through-year assessments would eliminate a summative assessment in the spring

Through-year tests have been piloted in some states involved in the assessment pilot created by ESSA. The concept of this design is simple: Develop tests that are administered throughout the year, and aggregate the results of some questions into a summative score.

Two states involved in this pilot are using through-year assessments but have different theories of action for how the tests should support student learning. These differences lead to distinct test designs. 9 Louisiana, for example, wants to eliminate the gap between what students learn and what they are tested on. To do so, Louisiana bases its test on optional statewide curricula that are aligned with the state’s standards. Experts in the content of Louisiana’s curriculum create the tests, with the involvement of teachers. The state designed this approach to allow teachers to go deep on the standards as well as other skills such as critical thinking.

Georgia’s approach focuses on developing tests that meet students where they are, regardless of their grade level. When students are behind, the state standardized assessment tests knowledge and skills that these students cannot perform. Thus, the results have a limited ability to inform teachers and other education stakeholders of the path to help students catch up. For students who are ahead, testing only grade-level material provides a disincentive to push students further.

The idea in both states is that the tests will return data throughout the year that teachers can use to shape instruction while kids are still in the classroom, rather than getting scores after the school year is over. When provided with this information, teachers can intervene earlier and use data to meaningfully close gaps for students, work toward the goal of achieving grade-level proficiency throughout the year, and adjust their instruction if the gaps are not closing as they hoped. The potential to test information closer to students’ real-time learning and intervene sooner if students are not meeting critical benchmarks are significant upsides to through-year test designs.

Both states are very early in their development of these new tests, and it is not yet known how well these methods meet states’ goals for their new tests. For example, they are still working to develop a single, summative student score at the end of the year as federal law requires.

Currently, through-year assessments face three challenges. First, it is too early to know if these tests can still innovate despite the design constraints required by law, as the pilots are still in their early phases. Second, it is also too early to know if through-year tests help ease the anxiety around testing or if they amplify it, because instead of just one test that will be used for accountability, there will be three. Test designers hope that by giving more frequent tests that are more tightly connected to what students learn and are closer to students’ academic level, this practice, over time, may reduce test anxiety. Third, through-year tests work best in states where students are learning the same things at the same time. However, decisions about what students learn and when they learn it happen at the district level, not the state level. The only state that currently has symmetry of curriculum and instructional materials across some school districts is Louisiana. 10

Performance-based assessments use hands-on tasks where students can demonstrate their knowledge and ability

Researchers consider performance tasks to be authentic measures of standards because the tasks are extended performances of student work, showing multiple stages of the thought process and how students arrived at the solution. 11 Students do not just learn the specific academic content; they develop a range of skills in the process of completing complex tasks, such as presenting and defending their work, leading or participating in individual or group projects, and performing other multifaceted tasks.

There are two approaches to designing tests based on performance tasks: (1) standardize the tasks, meaning all students perform the same tasks, or (2) standardize the scoring rubric, meaning students perform different tasks that are scored using a common scoring tool.

New Hampshire uses the first approach in its program called Performance Assessment of Competency Education, or PACE. 12 The New York Performance Standards Consortium uses the second approach. 13 Participating educators come together a few times a year to develop the scoring rubric, a process that also serves as professional development for teachers in standards-based grading. To do this, educators review a sample student work product, such as an essay, to produce a common scoring rubric for all participating schools. 14

Proponents argue that performance-based tests are more motivating to students and allow for more holistic review of student work. 15 As a result, the performance-based assessments can analyze students’ skills at a deeper level and are better suited to measuring crosscutting skills such as critical thinking and teamwork. 16 Thus, performance-based tests can be tools for learning as well as measures of learning.

But despite the advantages of allowing students creativity in demonstrating their work, performance tasks suffer from some drawbacks. Current performance-based assessments may not serve as a complete replacement for summative tests. The complexity of the tasks themselves and the sheer number of academic standards may make it difficult for tasks alone to measure the full range of standards without significant effort to group them. As a result, New Hampshire uses a combination of performance tasks, local tests, and the statewide test to measure the full range of standards. 17

Performance tasks also have challenges with scalability. For example, a state would need to create a process to norm the scoring rubrics statewide, based on a large set of sample work; it would then need a process for certifying educators as graders for their schools. If the state chose to standardize the tasks as well, it would repeat a similar process. There are also challenges in hand-scoring and in observing student demonstrations that occur in real time, such as an essay defense: not only the time such scoring takes, but also the work of ensuring that no bias creeps in. One common consistency check is illustrated in the sketch below.
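
For example, a state norming its rubrics might double-score a sample of student work and measure inter-rater agreement. The Python sketch below computes Cohen's kappa, a standard agreement statistic; the essay scores are invented, and neither New Hampshire nor New York publishes its process in this form.

```python
# One way a state might check that certified graders apply a common rubric
# consistently: Cohen's kappa on double-scored work (illustrative data only).
from collections import Counter

def cohens_kappa(rater1: list, rater2: list) -> float:
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)   # chance agreement
    return (observed - expected) / (1 - expected)

# Rubric levels 1-4 assigned to the same 12 essays by two graders:
grader_a = [4, 3, 3, 2, 4, 1, 2, 3, 4, 2, 3, 1]
grader_b = [4, 3, 2, 2, 4, 1, 2, 3, 3, 2, 3, 1]
print(f"Cohen's kappa: {cohens_kappa(grader_a, grader_b):.2f}")  # ~0.77 here
```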

The performance tasks also have implications for how teachers teach and how schools are organized to deliver instruction. That is, schools must redesign their approach to teaching in order to help students build the prerequisite skills and allow for hands-on learning. For example, schools would need to invest significant time and resources in creating learning experiences designed to improve skills such as critical thinking, time management, collaboration, and communication, in addition to teaching content based on the state’s academic standards.

Some of these factors explain the slow-growth approach New Hampshire intentionally takes in its PACE program. In its first 10 years, four districts out of 167 total joined the PACE effort. 18 During a 2016 visit to New Hampshire schools involved in the program, then-state assessment director Paul Leather said the program would be optional for districts, given the complexities involved in implementing it successfully. 19

Finally, it is unclear how much more effective performance-based assessments are at improving student outcomes. Studies of New Hampshire’s PACE program show small observed differences and less improvement among low-income students, students with disabilities, and male students. 20

How the law can allow states to test and use these innovations in testing

Technical requirements in existing laws and regulations could prevent states from trying some of the innovative designs highlighted above. However, this report does not advocate for wholesale waivers of these requirements. It offers points of consideration for the U.S. Department of Education and states as they develop policies that allow for innovation.

This section discusses the following standardized testing requirements:

  • Validity, reliability, and comparability
  • Grade-level measurement
  • Individual score reports

Defining validity, reliability, and comparability, and why they matter

Validity refers to how accurately and fully a test measures the skills it intends to evaluate. 21 For example, if an algebra test includes some geometry questions, the test would not be a valid measure of algebra. Reliability refers to the consistency of the test scores across different testing sessions, different editions of the test, and different people scoring the exam. 22 Reliability informs how consistently the test measures the knowledge and skills it is intended to measure. Comparability allows for comparing of test scores, even if students took the test at different times, in different places, and under different conditions. 23 For example, test developers will design a test that may be given via computer or paper and pencil to account for these differences so results can be compared. 24
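
To make the reliability concept concrete, here is a minimal Python sketch that computes Cronbach's alpha, one widely used internal-consistency statistic. The simulated item responses are invented for illustration, and nothing in the report prescribes this particular metric.

```python
# Illustrative reliability check using Cronbach's alpha:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: students x items matrix of item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))                      # latent student ability
items = (ability + rng.normal(scale=0.8, size=(200, 10)) > 0).astype(float)
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")  # values near 1 = more reliable
```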

These three technical qualities apply to the ESSA pilot program for state assessments as well. Piloted designs must meet the same high standards for validity, reliability, and comparability that state tests do. Lawmakers imposed these requirements not only because piloted tests must still fulfill the yearly student and school performance requirements in federal law, but also because the three qualities are fundamental to test quality, fairness, and equity. 25

While these technical requirements help ensure a high-quality test, they also constrict states’ abilities to try new approaches to testing. As a result, the requirements confine the pilots to look and behave like today’s standardized tests. 26

Where can policymakers be flexible on test reliability and comparability?

State tests should always be valid measures of student knowledge and skills of interest. But state tests can be reliable and comparable enough while still serving as high-quality measures of what students know and can do. For instance, to provide flexibility on how reliable scores must be from one student to the next, policymakers could allow scores to be less consistent as long as the test still measures the same skills. Test developers would call this maintaining comparability of the standards.

Test makers can compare scores even when they are not 100 percent equivalent. In fact, test developers compare different tests all the time through concordance tables, which allow scores from different kinds of tests to be equated. Universities use these, for example, when they accept scores from both the ACT and the SAT as part of freshman application requirements; a concordance table indicates which scores on each test are roughly equivalent. 27 States do the same when they change from one yearly summative test to another, creating concordance tables that compare the same construct across the old and new tests. While these tables do not perfectly convert scores from one test to another, they make the closest comparison possible. Rather than insisting on the 100 percent comparability of scores that state tests currently provide, policymakers could set a minimum threshold for comparability.
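
As a rough illustration of the underlying idea, the Python sketch below builds a concordance mapping by matching percentile ranks across two simulated score distributions. This is a simplified stand-in for the formal equipercentile equating that test vendors actually use; the scales and data are invented.

```python
# Sketch of a concordance table built by matching percentiles between two
# score distributions (simulated data on ACT-like and SAT-like scales).
import numpy as np

rng = np.random.default_rng(1)
test_a = rng.normal(loc=21, scale=5, size=5000).clip(1, 36).round()           # ACT-like
test_b = rng.normal(loc=1050, scale=200, size=5000).clip(400, 1600).round()   # SAT-like

def concordance(score_a: float) -> float:
    """Map a Test A score to the Test B score at the same percentile rank."""
    pct = (test_a <= score_a).mean() * 100
    return float(np.percentile(test_b, pct))

for s in (15, 21, 28, 33):
    print(f"Test A {s} ~ Test B {concordance(s):.0f}")
```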

How do tests’ technical requirements affect the future of testing?

Each of the new ways to test students highlighted in this report may be constrained by the requirements discussed above. As a result, the U.S. Department of Education may need to grant states some flexibility to try these designs.

For example, states would need a waiver from ESSA section 1111(b)(2)(B)(x)—which requires individual student score reports—in order to try a matrix design. Although matrix design assessments do not typically provide individual scores, test developers may be able to apply additional statistical analysis to approximate student results for the full range of standards, allowing them to provide insights into individual students’ performance.

Additionally, states could include information in an individual student report about the specific part of the domain of standards, or the specific knowledge and skills, on which the student was tested. For example, administrators could let parents know that their child will be randomly assigned an end-of-year assessment covering one of these areas or domains. Parents would then receive a report on how their child performed in the tested area, along with information about how the school performed as a whole. However, this approach requires flexibility in reporting and, potentially, in the provision that all students take the same test.

Recommendations: The path forward for states and ways the federal government can help

Regarding state assessments, states can choose between two divergent paths. First, they could reduce the footprint of their annual assessment by making it shorter, through a matrix design, and leaving more time for instruction. Proponents of this approach also say that integrating summative test questions into tests given throughout the year, as with through-year exams, reduces the footprint by normalizing the testing experience, especially since the assessment covers content students have just learned. Second, they could increase the assessment’s footprint through a performance-based design, but in a way that would provide a broader array of data about what students know and can do.

Congress, the U.S. Department of Education, and states can put the following policies into place to allow for more innovation within state summative assessments.

Congress should:

  • Revise the innovative assessment pilot policy. The pilot requires states to use their piloted designs statewide within five years. Rather than instituting an arbitrary deadline, Congress should allow states to determine the length of time for their pilot. It should also recognize that not all pilots will be successful and allow states to abandon their pilots if they are not working well. Effective research in testing requires money, so Congress should also fund this pilot. Finally, pilot results should inform the next authorization of ESSA. In addition to using the pilot evaluation for this purpose, Congress should hold a hearing with pilot participants to gather their perspectives.
  • Give additional funding to states for testing and related research and development to support cutting-edge technology. Congress can do so by increasing funding for three programs. First is the Competitive Grant for State Assessments (CGSA) program, which currently provides about $8 million every year to states wanting to develop additional assessments. In order to reach the full potential of these new technologies to improve the teaching and learning process, the CGSA program needs far more funding. 28 Second is the Grants for State Assessments and Related Activities program, which helps fund states’ yearly assessment systems and the activities needed to carry them out; unlike under current law, any increase should not be reallocated to the competitive grant program. 29 Third is the Small Business Innovation Research program, which provides up to $1.1 million in awards to develop education-related learning technologies. 30 Congress could also orient this program more toward assessment rather than general education technology. As technologies evolve, there could be ways to better use them to make assessments more effective, and these innovations could challenge the norms in which today’s test developers and researchers are trained. Congress should therefore also fund doctoral training for the psychometricians who will build the assessments of tomorrow, emphasizing education in new and emerging technologies.

States should:

  • Design their pilot and respond to the existing technical requirements for assessments according to their theory of action. For example, if proposed designs have less strict requirements for score comparability, states should describe how this improves student learning, such as through higher-quality interactions between students and teachers based on student work.
  • Develop their application in collaboration with testing experts and the pilot site communities. This helps piloted designs reflect what local communities want and need.
  • Propose a realistic timeline for the pilot and its analysis, revisions, and rollout statewide.
The U.S. Department of Education should:

  • Improve the evaluation of the assessment pilot. Current plans for the evaluation include reviewing existing documents from piloting states and asking about participant experiences through a survey. 31 It is not clear that these methods will provide the most useful information about the technical merits and challenges of the tested designs or about the policy and design choices and their impacts. The final regulations for this evaluation should include a much more robust plan for gathering this kind of information, evaluating the extent to which states realized their theories of action and their intended outcomes.
  • Fund research and development that produces the following:
  • A bank of test questions suitable for diagnostic tests, as well as formative and summative exams.
  • A bank of test questions that align with the highest quality textbooks and curriculum, as rated by experts. The test questions could also take cultural relevancy and cultural representation into account.
  • Large datasets to be used for machine learning applications, such as automated essay scoring. For example, the Automated Student Assessment Prize awarded scientists funding to train machines to score essays similarly to how human experts would. 32 Datasets like these are key to the training of new tools and models.
  • Easier to understand and more actionable student and school score reports. 33 About half of states still do not report federally required data such as levels of teacher experience and student college attendance rates. 34 Innovation in this area could make it easier for states to produce consumable data.
  • Training for educators and education leaders in states, districts, and schools to use test results in ways that are more useful to student-teacher interactions and ensure that students get the support they need.
  • More tools and funding for formative assessments and the ways that formative assessment can be rolled up into summative assessments.
  • Revisit the assessment peer review process to ensure it provides flexibility while maintaining a high bar—for example, by evaluating grade-level mastery based on as few items as necessary. Regulations could also allow states to show comparability through concordance tables, allowing for greater differences in scores from one student to the next.
  • Provide technical assistance to states. Specifically, the Education Department can guide states on how to write requests for proposals in ways that result in the creation of tests that are more innovative.

Advancements in assessment technology can make state standardized tests more streamlined and capable of providing better information about what students know and can do. If states are encouraged and funded to take the new approaches described in this report, they can increase the value that testing data provide to educators, parents, and policymakers.

About the authors

Laura Jimenez is the director of standards and accountability on the K-12 Education team at the Center for American Progress.

Ulrich Boser is a nonresident senior fellow at the Center.

Acknowledgments

The authors would like to thank the following people for their advice in writing this report:

Abby Javurek, Northwest Evaluation Association

Alina Von Davier, Duolingo

Ashley Eden, New Meridian

Bethany Little, Education Counsel

Edward Metz, Institute of Education Sciences

Elda Garcia, National Association of Testing Professionals

Jack Buckley, Roblox and the American Institutes for Research

James Pellegrino, University of Illinois at Chicago

John Whitmer, Institute of Education Sciences

Krasimir Staykov, Student Voice

Kristopher John, New Meridian

Laura Slover, Centerpoint Education Solutions

Margaret Hor, Centerpoint Education Solutions

Mark DeLoura, Games and Learning Inc.

Mark Jutabha, WestEd

Michael Rothman, Independent consultant formerly of Eskolta School Research Design

Michael Watson, New Classrooms

Mohan Sivaloganathan, Our Turn

Neil Heffernan, Worcester Polytechnic Institute

Osonde Osoba, RAND Corp.

Roxanne Garza, UnidosUS

Sandi Jacobs, Sean Worley and Scott Palmer of Education Counsel

Terra Wallin, The Education Trust

Tim Langan, National Parents Union

Vivett Dukes, National Parents Union

  • National Center for Education Statistics, “Public High School Graduation Rates,” available at https://nces.ed.gov/programs/coe/indicator/coi (last accessed July 2021); National Center for Education Statistics, “Remedial Coursetaking at Public 2- and 4-year Institutions: Scope, Experience and Outcomes” (Washington: U.S. Department of Education, 2016), available at https://nces.ed.gov/pubs2016/2016405.pdf .
  • Kirwan Institute for the Study of Race and Ethnicity, “Standardized Testing and Stereotype Threat,” March 12, 2013, available at https://kirwaninstitute.osu.edu/article/standardized-testing-and-stereotype-threat ; Richard J. Shavelson and others, “Problems with the use of student test scores to evaluate teachers” (Washington: Economic Policy Institute, 2010), available at https://www.epi.org/publication/bp278/ ; Christian Barnard, “To Ensure Equitable Funding for Low-Income Students, Fixing Title I Isn’t Good Enough — It Needs to Be Rebuilt From Scratch,” The 74, June 18, 2019, available at https://www.the74million.org/article/barnard-to-ensure-equitable-funding-for-low-income-students-fixing-title-i-isnt-good-enough-it-needs-to-be-rebuilt-from-scratch/ .
  • Every Student Succeeds Act, Public Law 114-95, 114th Cong., 1st sess., December 10, 2015, available at https://www.govinfo.gov/content/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf .
  • Laura Jimenez and Jamil Modaffari, “Future of Testing in Education: Effective and Equitable Assessment Systems” (Washington: Center for American Progress, 2021), available at https://americanprogress.org/?p=502607 .
  • Laura Jimenez and Ulrich Boser, “Future of Testing in Education: Artificial Intelligence” (Washington: Center for American Progress, 2021), available at https://americanprogress.org/?p=502663 .
  • Emmanuel Sikali and Cadelle Hemphill, “Focus on NAEP: National Assessment of Educational Progress Sampling,” Nation’s Report Card, available at https://www.nationsreportcard.gov/focus_on_naep/files/sampling_infographic.pdf (last accessed July 2021).
  • Ibid.; Edward Roeber, “What does it mean to use matrix sampling in student assessment?”, Michigan Assessment Consortium, available at http://michiganassessmentconsortium.org/wp-content/uploads/ThinkPoint_MatrixSampling3.pdf.pdf (last accessed July 2021).
  • Summary of interview with Jack Buckley, head of assessment and learning, Roblox, interview via video conference, October 7, 2020, on file with author; Summary of interview with James Pellegrino, Liberal Arts and Sciences distinguished professor and distinguished professor of education at the University of Illinois at Chicago, interview via video conference, December 29, 2020, on file with author.
  • Abby Javurek, Northwest Evaluation Association (a testing company helping states carry out the pilots), interview via video conference, October 5, 2020, and December 18, 2020, on file with author.
  • A few years ago, Louisiana provided incentives to encourage districts to adopt curriculum from a small handful of high-quality curricula, reviewed and vetted by educational experts in the state. Louisiana Department of Education, “Curriculum,” available at https://www.louisianabelieves.com/academics/curriculum (last accessed March 2021).
  • Rosario Martinez Arias, “Performance Assessment,” Papeles de Psicologo 31 (1) (2010): 85–96, available at http://www.psychologistpapers.com/English/1799.pdf ; Emily R. Lai, “Performance-based Assessment: Some New Thoughts on an Old Idea,” Pearson Research Bulletin 20 (2011): 1–4, available at http://images.pearsonclinical.com/images/tmrs/Performance-based-assessment.pdf .
  • New Hampshire Department of Education, “Performance Assessment of Competency Education,” available at https://www.education.nh.gov/who-we-are/division-of-learner-support/bureau-of-instructional-support/performance-assessment-for-competency-education (last accessed March 2021).
  • New York Performance Standards Consortium, “The Assessment System,” available at http://www.performanceassessment.org/how-it-works (last accessed March 2021).
  • Lai, “Performance-based Assessment.”
  • New Hampshire Department of Education, “Performance Assessment of Competency Education.”
  • Ballotpedia, “List of school districts in New Hampshire,” available at https://ballotpedia.org/List_of_school_districts_in_New_Hampshire (last accessed July 2021).
  • Information conveyed to the author during a visit to New Hampshire in 2016. Notes on file with authors.
  • Carla M. Evans, “Effects of New Hampshire’s Innovative Assessment and Accountability System on Student Achievement Outcomes After Three Years,” Education Policy Analysis Archives 27 (10) (2019), available at https://www.education.nh.gov/sites/g/files/ehbemt326/files/files/inline-documents/effectnhpace3years.pdf .
  • Fiona Middleton, “The four types of validity,” Scribbr, September 6, 2019, available at https://www.scribbr.com/methodology/types-of-validity/ .
  • Samuel A. Livingston, “Test reliability—basic concepts” (Princeton, NJ: Educational Testing Service, 2018), available at https://www.ets.org/Media/Research/pdf/RM-18-01.pdf .
  • Amy I. Berman, Edward H. Haertel, and James W. Pellegrino, “Comparability of Large-Scale Educational Assessments: Issues and Recommendations” (Washington: National Academy of Education, 2020), available at  https://naeducation.org/wp-content/uploads/2020/06/Comparability-of-Large-Scale-Educational-Assessments.pdf .
  • Phoebe C. Winter, “Evaluating the Comparability of Scores from Achievement Test Variations” (Washington: Council of Chief State School Officers, 2010), available at https://files.eric.ed.gov/fulltext/ED543067.pdf .
  • Office of Elementary and Secondary Education, “A State’s Guide to the U.S. Department of Education’s Assessment Peer Review Process” (Washington: U.S. Department of Education, 2018), available at https://www2.ed.gov/admins/lead/account/saa.html#Standards_and_Assessments_Peer_Review ; Office of Elementary and Secondary Education, “Application for New Authorities under the Innovative Assessment Demonstration Authority” (Washington: U.S. Department of Education, 2020), available at https://www2.ed.gov/admins/lead/account/iada/iadaapplication2020.pdf .
  • Office of Elementary and Secondary Education, “Application for New Authorities under the Innovative Assessment Demonstration Authority” (Washington: U.S. Department of Education, 2020), available at https://www2.ed.gov/admins/lead/account/iada/iadaapplication2020.pdf .
  • Megan Stubbendeck, “New SAT/ACT Concordance Tables,” ArborBridge, June 14, 2018, available at https://blog.arborbridge.com/new-sat-act-concordance-tables .
  • U.S. Department of Education Office of Elementary and Secondary Education, “Competitive Grants for State Assessments,” available at https://www2.ed.gov/programs/cgsa/index.html (last accessed July 2021).
  • Ibid.; U.S. Department of Education, “School Improvement Programs: Fiscal Year 2021 Budget Request” (Washington: 2021), available at https://www2.ed.gov/about/overview/budget/budget21/justifications/d-sip.pdf ; U.S. Department of Education, “President’s FY 2021 Budget Request for the U.S. Department of Education,” available at https://www2.ed.gov/about/overview/budget/budget21/index.html (last accessed July 2021).
  • Institute of Education Sciences, “ED/IES Small Business Innovation Research,” available at https://ies.ed.gov/sbir/ (last accessed July 2021).
  • U.S. Department of Education, “Agency Information Collection Activities; Comment Request; Evaluation of the Innovative Assessment Demonstration Authority Pilot Program-Survey Data Collection,” Federal Register 85 (171) (2020): 54541-54542, available at https://www.govinfo.gov/content/pkg/FR-2020-09-02/pdf/2020-19421.pdf .
  • Kaggle, “The Hewlett Foundation: Automated Essay Scoring,” available at https://www.kaggle.com/c/asap-aes (last accessed July 2021).
  • Data Quality Campaign, “Show Me the Data: DQC’s Annual Analysis of Report Cards,” available at https://dataqualitycampaign.org/resource/show-me-the-data-reports/ (last accessed July 2021).
  • Data Quality Campaign, “Show Me the Data 2020,” available at https://dataqualitycampaign.org/showmethedata-2020/ (last accessed July 2021).

The positions of American Progress, and our policy experts, are independent, and the findings and conclusions presented are those of American Progress alone. A full list of supporters is available here. American Progress would like to acknowledge the many generous supporters who make our work possible.


Explore The Series

In this series, the Center for American Progress examines how assessments in public schools can become effective instruments that help to measure whether schools and educators are meeting the goals of education. It considers how assessments are designed and how their results are used and understood, and it emphasizes that, when done purposefully, these tests can be part of the solution in creating a high-quality education for every child. This series is designed to be useful to federal, state, and local policymakers, as well as to practitioners, by challenging the norms on which current assessment policy and practice are based in order to present new and fresh thinking on this issue.


Are Standardized Tests Racist, or Are They Anti-racist?

They’re making their lists, checking them twice, trying to decide who’s in and who’s not. Once again, it’s admissions season, and tensions are running high as university leaders wrestle with challenging decisions that will affect the future of their schools. Chief among those tensions, in the past few years, has been the question of whether standardized tests should be central to the process.

In 2021, the University of California system ditched the use of all standardized testing for undergraduate admissions. California State University followed suit last spring, and in November, the American Bar Association voted to abandon the LSAT requirement for admission to any of the nation’s law schools beginning in 2025. Many other schools have lately reached the same conclusion. Science magazine reports that among a sample of 50 U.S. universities, only 3 percent of Ph.D. science programs currently require applicants to submit GRE scores, compared with 84 percent four years ago. And colleges that dropped their testing requirements or made them optional in response to the pandemic are now feeling torn about whether to bring that testing back.

Proponents of these changes have long argued that standardized tests are biased against low-income students and students of color, and should not be used. The system serves to perpetuate a status quo, they say, where children whose parents are in the top 1 percent of income distribution are 77 times more likely to attend an Ivy League university than children whose parents are in the bottom quintile. But those who still endorse the tests make the mirror-image claim: Schools have been able to identify talented low-income students and students of color and give them transformative educational experiences, they argue, precisely because those students are tested.

These two perspectives—that standardized tests are a driver of inequality, and that they are a great tool to ameliorate it—are often pitted against each other in contemporary discourse. But in my view, they are not oppositional positions. Both of these things can be true at the same time: Tests can be biased against marginalized students and they can be used to help those students succeed. We often forget an important lesson about standardized tests: They, or at least their outputs, take the form of data; and data can be interpreted—and acted upon—in multiple ways. That might sound like an obvious statement, but it’s crucial to resolving this debate.

I teach a Ph.D. seminar on quantitative research methods that dives into the intricacies of data generation, interpretation, and application. One of the readings I assign—Andrea Jones-Rooy’s article “I’m a Data Scientist Who Is Skeptical About Data”—contains a passage that is relevant to our thinking about standardized tests and their use in admissions:

Data can’t say anything about an issue any more than a hammer can build a house or almond meal can make a macaron. Data is a necessary ingredient in discovery, but you need a human to select it, shape it, and then turn it into an insight.

When reviewing applications, admissions officials have to turn test scores into insights about each applicant’s potential for success at the university. But their ability to generate those insights depends on what they know about the broader data-generating process that led students to get those scores, and how the officials interpret what they know about that process. In other words, what they do with test scores—and whether they end up perpetuating or reducing inequality—depends on how they think about bias in a larger system.

First, who takes these tests is not random. Obtaining a score can be so costly—in terms of both time and money—that it’s out of reach for many students. This source of bias can be addressed, at least in part, by public policy. For example, research has found that when states implement universal testing policies in high schools, and make testing part of the regular curriculum rather than an add-on that students and parents must provide for themselves, more disadvantaged students enter college and the income gap narrows. Even if we solve that problem, though, another—admittedly harder—issue would still need to be addressed.

The second issue relates to what the tests are actually measuring. Researchers have argued about this question for decades, and continue to debate it in academic journals. To understand the tension, recall what I said earlier: Universities are trying to figure out applicants’ potential for success. Students’ ability to realize their potential depends both on what they know before they arrive on campus and on being in a supportive academic environment. The tests are supposed to measure prior knowledge, but the nature of how learning works in American society means they end up measuring some other things, too.

In the United States, we have a primary and secondary education system that is unequal because of historic and contemporary laws and policies. American schools continue to be highly segregated by race, ethnicity, and social class, and that segregation affects what students have the opportunity to learn. Well-resourced schools can afford to provide more enriching educational experiences to their students than underfunded schools can. When students take standardized tests, they answer questions based on what they’ve learned, but what they’ve learned depends on the kind of schools they were lucky (or unlucky) enough to attend.

This creates a challenge for test-makers and the universities that rely on their data. They are attempting to assess student aptitude, but the unequal nature of the learning environments in which students have been raised means that tests are also capturing the underlying disparities; that is one of the reasons test scores tend to reflect larger patterns of inequality. When admissions officers see a student with low scores, they don’t know whether that person lacked potential or has instead been deprived of educational opportunity.

So how should colleges and universities use these data, given what they know about the factors that feed into them? The answer depends on how colleges and universities view their mission and broader purpose in society.

From the start, standardized tests were meant to filter students out. A congressional report on the history of testing in American schools describes how, in the late 1800s, elite colleges and universities had become disgruntled with the quality of high-school graduates, and sought a better means of screening them. Harvard’s president first proposed a system of common entrance exams in 1890; the College Entrance Examination Board was formed 10 years later. That orientation—toward exclusion—led schools down the path of using tests to find and admit only those students who seemed likely to embody and preserve an institution’s prestigious legacy. This brought them to some pretty unsavory policies. For example, a few years ago, a spokesperson for the University of Texas at Austin admitted that the school’s adoption of standardized testing in the 1950s had come out of its concerns over the effects of Brown v. Board of Education. UT looked at the distribution of test scores, found cutoff points that would eliminate the majority of Black applicants, and then used those cutoffs to guide admissions.

These days universities often claim to have goals of inclusion. They talk about the value of educating not just children of the elite, but a diverse cross-section of the population. Instead of searching for and admitting students who have already had tremendous advantages and specifically excluding nearly everyone else, these schools could try to recruit and educate the kinds of students who have not had remarkable educational opportunities in the past.

A careful use of testing data could support this goal. If students’ scores indicate a need for more support in particular areas, universities might invest more educational resources into those areas. They could hire more instructors or support staff to work with low-scoring students. And if schools notice alarming patterns in the data—consistent areas where students have been insufficiently prepared—they could respond not with disgruntlement, but with leadership. They could advocate for the state to provide K–12 schools with better resources.

Such investments would be in the nation’s interest, considering that one of the functions of our education system is to prepare young people for current and future challenges. These include improving equity and innovation in science and engineering, addressing climate change and climate justice, and creating technological systems that benefit a diverse public. All of these areas benefit from diverse groups of people working together—but diverse groups cannot come together if some members never learn the skills necessary for participation.

But universities—at least the elite ones—have not traditionally pursued inclusion, through the use of standardized testing or otherwise. At the moment, research on university behavior suggests that they operate as if they were largely competing for prestige. If that’s their mission—as opposed to advancing inclusive education—then it makes sense to use test scores for exclusion. Enrolling students who score the highest helps schools optimize their marketplace metrics—that is, their ranking.

Which is to say, the tests themselves are not the problem. Most components of admissions portfolios suffer from the same biases. In terms of favoring the rich, admissions essays are even worse than standardized tests; the same goes for participation in extracurricular activities and legacy admissions. Yet all of these provide universities with usable information about the kinds of students who may arrive on campus.

None of those data speak for themselves. Historically, the people who interpret and act upon this information have conferred advantages to wealthy students. But they can make different decisions today. Whether universities continue on their exclusive trajectories or become more inclusive institutions does not depend on how their students fill in bubble sheets. Instead, schools must find the answers for themselves: What kind of business are they in, and whom do they exist to serve?


Research Shows What State Standardized Tests Actually Measure


A new paper from Jamil Maroun and Christopher Tienken sets out to determine whether a state’s big standardized test measures student learning, teacher effectiveness, or something else. The answer, it turns out, is something else.

“The tests are not measuring how much students learned or can learn,” says Tienken. “They are predominately measuring the family and community capital of the student.”

Tienken has studied this territory before. He started his career as an elementary teacher, and served at various levels of school administration before entering his current work as an associate professor of leadership, management, and policy at Seton Hall University. In 2016 he published a study that showed how, with some census data, he and his team could predict what percentage of students at a school would score proficient on the state standardized test. The results held true for several different states, whether the district was rich or poor.

In other words, one could, with a high degree of accuracy, predict the results of the annual test of student learning and teacher effectiveness without actually giving students a single test.
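
A bare-bones version of this kind of analysis might look like the Python sketch below, which fits a regression from community demographic variables to district proficiency rates. The features and numbers are invented for illustration; the variables and models Tienken actually used are described in the cited studies.

```python
# Illustrative sketch: predict a district's percent-proficient from
# census-style community features alone, with no student test data.
# Features and values are invented, purely to show the technique.
import numpy as np
from sklearn.linear_model import LinearRegression

# Per-district features: [% households in poverty,
# % adults with a bachelor's degree, % single-parent households]
X = np.array([
    [25, 18, 38], [8, 45, 15], [15, 30, 25], [30, 12, 45],
    [5, 55, 10], [20, 22, 33], [12, 35, 20], [28, 15, 40],
])
y = np.array([42, 78, 60, 35, 86, 50, 68, 38])   # % scoring proficient

model = LinearRegression().fit(X, y)
# Fitting and scoring on the same eight districts is purely illustrative.
print(f"R^2 on these districts: {model.score(X, y):.2f}")
print(f"Predicted proficiency for a new district: {model.predict([[18, 28, 30]])[0]:.0f}%")
```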

Maroun and Tienken’s new publication, “The Pernicious Predictability of State-Mandated Tests of Academic Achievement in the United States,” focuses on standardized state math tests in New Jersey, finding the same result and offering an explanation for the effect.

The key concept is background knowledge. We’ve long known that background knowledge is directly related to reading comprehension. The classic study from way back in 1987 found that student ability to comprehend a reading passage about baseball was heavily influenced by their prior knowledge (or lack thereof) of baseball. It’s not a shocking insight. The more you know about a topic, the better you can understand writing about that topic. But background knowledge can apply in a broader sense as well. As Maroun and Tienken write:

Students utilize their background knowledge to establish connections, infer meanings, and aid their overall comprehension of the text.

Who has the best chance to build up a wealth of background knowledge (especially the type likely to turn up on a standardized test)? Children raised around rich resources and social capital. As the writers note, “students from impoverished backgrounds often encounter barriers that limit their access to life experiences that build background knowledge often found on standardized tests, such as travel, leisure activities, and extracurricular pursuits.”

Wealthier families don’t just have more resources in their homes (including their own educations). They have a wealth of social capital, what Robert Putnam in Our Kids, a powerful study of children and social capital, calls “informal ties to family, friends, neighbors and acquaintances involved in civic associations, religious institutions, athletic teams, volunteer activities, and so on.” Social capital provides easier access to varied experiences, which in turn build background knowledge.

The connection to scores on standardized reading tests may seem obvious, but why did the researchers find the effect in standardized math scores?

If you have seen a standardized math test in the last twenty years, you know the answer. Math tests now involve lots of reading. While earlier generations may have only dealt with the occasional “story problem,” those are now frequent and routine. This can skew students’ math test results. Says Tienken, “They understand the mathematics, but they don’t comprehend what the question is asking them to compute.”

What conclusions can we draw from this addition to the research on the subject of test scores?

First, we can once again recognize that the standardized tests used to make definitive statements about student learning and teacher effectiveness, to assess the quality of administrators, to declare a school “failing,” and to pinpoint student academic weaknesses and strengths are in fact simply reflecting the demographics of the students’ families. Maroun and Tienken write:

The results from the state-mandated standardized tests used in New Jersey have not been independently validated for all of the ways the results are used, yet some education leaders rely on them for decision-making purposes.

That holds true for every state where big standardized test data is used in these ways.

Second, if policymakers insist that big standardized test scores be used for this wide variety of policy purposes, research like this suggests that the best way to improve test scores for students from less resource-rich backgrounds might be to provide them with wider and deeper experiences aimed at building background knowledge, rather than bombarding them with test prep exercises and workbooks.

Education centered around high-stakes testing has been pushing schools down the wrong road for twenty-some years. This study is a reminder that the big standardized test generates data that actually says far more about a school’s demographics than its effectiveness.

Peter Greene


Psychological and Brain Sciences

CLAS PBS professor receives NIH grant to research how to improve lifeguard training using virtual reality

Cathleen Moore, professor in the Department of Psychological and Brain Sciences in the College of Liberal Arts and Sciences and Starch Faculty Fellow, received a grant from the National Institutes of Health for $413,267 to study how lifeguard training can be improved using virtual reality.

Moore and her team will research the limitations and impact of attention and perception on lifeguarding. They will use virtual reality to test various methods of training. 

Understanding the limitations will allow for development of better safety training and increased injury prevention, Moore said.

“The basic problem is that the surveillance component of lifeguarding requires that lifeguards actively monitor a complex and constantly changing scene for poorly specified critical events,” Moore said.  

“For example, they have to notice if one swimmer goes under for too long while other swimmers are simultaneously going under for variable durations. The attentional and perceptual demands of the task are enormous but are rarely considered when identifying safety vulnerabilities at aquatics facilities.” 

Moore co-leads the Visual Perception Research Group at the University of Iowa where she researches the strengths and limitations of human perception. She also received a UI Injury Prevention Research Center pilot grant in 2022 for studying a swimming pool lifeguarding environment.  

Moore focuses on how perception interacts with cognitive processing and how this can affect our experience of the physical world. For lifeguarding, this means finding what causes something to stick out in the observed environment, how that is then processed, and how or why action or inaction follows.

Assuming a lifeguard will notice all critical events if they are paying attention is misguided, Moore said. The environment is complex and simply “focusing” won’t guarantee active processing and reactions. Carefully focusing can even be the cause of missing incidents elsewhere, she added.  

 Studying lifeguarding scenarios can be challenging because complex environments and unique critical events are hard to control and keep standardized. There is also the issue of studying events that endanger human lives.  

Virtual reality will allow for precise control over the environment and what events are being studied with easy replicability, Moore said. The research team can introduce specific “critical events” at controlled times to see how perception is impacted. 

“Given our simplified controlled environment, we can compare what simulated lifeguards are doing differently than real-life lifeguards,” Moore said. “Then, we can test specific impacts of cognitive and perceptual limitations in a simulated lifeguarding task. This will allow us to identify what vulnerabilities are greatest and what kinds of mitigating factors can be introduced to reduce surveillance failures.”   

Once research is complete, the hope is to make the training system available to the public for future lifeguard training. In the lab, the team uses commercially available VR equipment, so the system will be accessible to other organizations once it is released.

Moore told the Injury Prevention Research Center the team will apply for longer-term funding in the future to test alternative training programs for local pools. 

“We hope to develop customized environments that simulate real pools. For example, the City Park pool in Iowa City or parts of the water park at Adventure Land in Des Moines,” Moore said.



July 3, 2024

Experts discuss new screening tool developed for lipoprotein(a) detection

by Rachel Martin, Yale University


Many patients receive a standardized lipid panel as part of a yearly physical that includes testing of their "good" cholesterol (high-density lipoprotein) and "bad" cholesterol (low-density lipoprotein). However, most people are unfamiliar with another type of cholesterol, lipoprotein(a) or Lp(a). This type of lipoprotein is not included in the standard lipid panel but is an independent risk factor for cardiovascular disease.

In the following Q&A, Yale clinicians and researchers share background about Lp(a), guidance for caring for patients with elevated levels, and new approaches to improve testing.

What is lipoprotein(a)?

Lp(a)—which is pronounced as 'lp little a'—is a lipoprotein whose serum level is genetically determined without much fluctuation after early childhood.

"We've known for many years that it is an independent risk factor for cardiovascular disease ," said Erica Spatz, MD, MHS, associate professor ( cardiovascular medicine ) and associate professor of epidemiology (chronic diseases).

"It contributes to atherosclerosis (plaque buildup in the heart arteries), thrombosis ( blood clotting ), and early aortic stenosis (narrowing of the valve that connects the main chamber of the heart to the rest of the body). It can also lead to major adverse cardiovascular events, like heart attacks."

How is Lp(a) tested?

"It is checked with a simple blood test that is often covered by insurance," said Spatz. "However, today, only about 0.5% of people in the United States are tested for Lp(a). This is because we haven't had therapies to reduce Lp(a). It is not lowered by lifestyle interventions and statins, which are first-line agents for high cholesterol. So, it hasn't been on most physicians' radar to check."

How can providers care for patients with elevated Lp(a) levels?

Although no targeted therapies are available, providers with expertise in preventive cardiology can work with patients to reduce risk.

"In our practice, we double down on all the other risk factors considered modifiable, like lifestyle and other LDL cholesterol," said Spatz, who directs the Yale Preventive Cardiovascular Health Program, which provides care for patients at higher risk of developing cardiovascular conditions.

"If a person has cardiovascular disease and elevated Lp(a) levels, we may use non-statin medications, like PCSK9 inhibitors, because clinical trials showed greater benefit for these patients. We also recommend family members get tested."

What research is in progress to help us better understand Lp(a)?

"We are excited about research in the pipeline that may give us new knowledge about how to treat patients with elevated Lp(a)," said Spatz. "I hope we will soon have some targeted treatment options. But the field of medicine needs to be prepared for a potential change in caring for these patients," said Spatz.

"Unfortunately, enrollment in these clinical trials is slow," added Rohan Khera, MD, MS, assistant professor of medicine (cardiovascular medicine), assistant professor of biostatistics ( health informatics ), and director of the Cardiovascular Data Science (CarDS) Lab.

"Only 1 in 8 people have elevated levels, so researchers need to test many people just to identify one who may be eligible for enrollment in the trial. To speed up research, we need a more efficient way to determine who would most benefit from testing."

To meet this goal, the researchers recently conducted a study published in Nature Cardiovascular Research to find an approach to determine who would benefit most from testing. This is part of the strategy to create algorithmic diagnostics to improve health promotion at the population level.

How was the new screening tool developed?

The team of researchers developed a machine learning model that uses structured clinical elements found in the electronic health records to determine patients who most likely have elevated Lp(a) levels.

"While Lp(a) is an independent risk factor for cardiovascular disease, it does have company that it tends to keep," Khera explained. "For example, we found that people with high Lp(a) were more likely to be women, Black, and to have hypertension or premature heart disease (presenting before the age of 60). They were also more likely to have a family history of atherosclerotic cardiovascular disease (ASCVD)."

The algorithmic model is based on the analysis of large, representative data sets, including the UK Biobank, ARIC, CARDIA, and MESA studies.
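
In outline, the screening approach can be sketched as follows: train a classifier on routine record fields, rank patients by predicted probability of elevated Lp(a), and test only the top-ranked half. Everything in this Python sketch (the features, the simulated one-in-eight prevalence, the choice of gradient boosting) is an assumption for illustration, not the published model.

```python
# Hedged sketch of the screening idea: rank patients by predicted risk of
# elevated Lp(a) from structured EHR-style fields, then test the top half.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
n = 2000
# Columns: age, female (0/1), hypertension (0/1), premature CAD (0/1), family hx ASCVD (0/1)
X = np.column_stack([
    rng.integers(30, 80, n),
    rng.integers(0, 2, n), rng.integers(0, 2, n),
    rng.integers(0, 2, n), rng.integers(0, 2, n),
])
# Simulated outcome: elevated Lp(a) in roughly 1 in 8, likelier with risk factors
risk = 0.06 + 0.04 * X[:, 2] + 0.05 * X[:, 3] + 0.05 * X[:, 4]
y = rng.random(n) < risk

model = GradientBoostingClassifier().fit(X, y)
# Ranking the same cohort the model was trained on is purely illustrative.
ranked = np.argsort(model.predict_proba(X)[:, 1])[::-1]
half = ranked[: n // 2]                    # test only the top-ranked half
print(f"Elevated cases found testing half the cohort: {y[half].sum()} of {y.sum()}")
```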

"We also evaluated the model to ensure that it was not introducing any bias in terms of social determinants of health, age, sex, race or ethnicity," said Arya Aminorroaya, MD, MPH, postdoctoral associate in the CarDS Lab, and first author of the paper.

"The good news is that the model's performance is consistent across all clinical subgroups. This tool is not going to replace the universal screening. But in this rapidly evolving landscape, this tool can help accelerate trial progress and help determine which patients will most benefit from screening."

How can the new screening tool be used?

"This tool is not diagnostic. You still need to test the person," said Khera. "But we have found that we can conduct half the number of tests to get the same number of patients with elevated levels."

Khera, Spatz, Aminorroaya, and the team hope this tool will help researchers leading clinical trials determine how to select trial participants and help hospitals or health systems decide which patients may benefit most from screening.

"I also see the potential to take this tool directly to patients," said Spatz. "Patients are aware of Lp(a) and wonder if they are at risk. This tool allows us to reach patients directly to help gauge their risk and prioritize testing in those at higher risk of elevated Lp(a)."

"This tool is not going to replace the universal screening, which is recommended by the European guideline and the National Lipid Association, and will likely be adopted more broadly soon," said Aminorroaya. "But in this rapidly evolving landscape, this tool can help accelerate trial progress and help determine which patients will most benefit from screening."



