How are research-based assessment instruments developed and validated?

posted April 10, 2021 and revised September 29, 2023
by Adrian Madsen, Sarah B. McKagan and Eleanor C. Sayre

How are research-based assessment instruments of content and beliefs developed and validated?

Good research-based assessment instruments (RBAIs) differ from typical exams in that their creation involves extensive research and development by experts in Physics Education Research (PER) and/or Astronomy Education Research (AER). This work ensures that the questions represent concepts that faculty think are important, that the responses represent real student thinking and make sense to students, and that students’ scores reliably tell us something about their understanding. The typical process of developing a research-based assessment of content or beliefs includes the following steps (Adams and Wieman 2011; Engelhardt 2009):

  1. Gathering students’ ideas about a given topic, usually with interviews or open-ended written questions.
  2. Using students’ ideas to write multiple-choice conceptual questions where the incorrect responses cover the range of students’ most common incorrect ideas using the students’ actual wording.
  3. Testing these questions with another group of students. Usually, researchers use interviews where students talk about their thinking for each question.
  4. Testing these questions with experts in the discipline to ensure that they agree on the importance of the questions and the correctness of the answers.
  5. Revising questions based on feedback from students and experts.
  6. Administering the assessment to large numbers of students. Checking the reproducibility of results across courses and institutions. Checking the distributions of answers. Using various statistical methods to ensure the reliability of the assessment.
  7. Revising again.

Beichner described a similar process for developing RBAIs that might also be of interest to a new RBAI developer (Beichner 1994). This rigorous development process produces valid and reliable assessments that can be used to compare instruction across classes and institutions.
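As an illustration of the statistical checks in step 6 above, the following minimal sketch computes two statistics that are commonly reported for multiple-choice instruments: item difficulty (the fraction of students answering each question correctly) and Cronbach's alpha (a measure of internal consistency). The choice of these two statistics, the function names, and the response data are all assumptions made for illustration; individual RBAI developers may use different or additional methods.

```python
from statistics import pvariance


def item_difficulty(responses):
    """Return the fraction of students who answered each item correctly.

    `responses` is a list of rows, one per student, where each row holds
    1 (correct) or 0 (incorrect) for every item on the assessment.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students
            for i in range(n_items)]


def cronbach_alpha(responses):
    """Return Cronbach's alpha for a students-by-items score matrix."""
    n_items = len(responses[0])
    # Variance of each item's scores across students.
    item_vars = [pvariance([row[i] for row in responses])
                 for i in range(n_items)]
    # Variance of the students' total scores.
    total_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 students, 4 items (not from any real assessment).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(item_difficulty(scores))
print(round(cronbach_alpha(scores), 3))
```

With real data, developers would typically look for items that almost everyone or almost no one answers correctly, and for an alpha value high enough to suggest that the items measure a coherent construct.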

How are observation protocols developed and validated?

Observation protocols on PhysPort are all developed using research-based approaches, but they differ from assessments of content and beliefs in how they are developed, structured, and used. Further, there is much more variety in how individual observation protocols are developed and validated. Here are some commonalities in that process:

  1. Developers draft protocol items based on existing protocols, on unstructured observations of students or faculty that focus on a certain aspect of their behavior, or on a particular theoretical construct that they want to capture.
  2. Initial items are often reviewed by experts and revised (often several times).
  3. Initial draft of the protocol is used by more than one observer in the same classroom.
  4. Results of observations are compared between multiple observers. Discrepancies are discussed and protocol items are revised.
  5. Protocol is used in more classroom observations (and possibly in different disciplines or institutions and by different observers), and revised until observers have strong agreement with one another and are confident that the observation protocol is capturing what the developers intended. Often an inter-rater reliability metric is calculated.
  6. Developers examine some measure of validity, e.g., by comparing the observation results to another measure or by looking at how well the observation results predict the results of another measure.
  7. Developers create training materials to help new users use the protocol correctly and revise them as appropriate. 
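The inter-rater reliability metric mentioned in step 5 can take several forms. One widely used option for two observers is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The sketch below is an illustration only: the choice of Cohen's kappa, the function name, the observation codes, and the data are assumptions, and a given protocol's developers may have used a different metric.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two observers coding the same intervals."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both observers must code the same non-empty set of intervals.")
    n = len(rater_a)
    # Observed agreement: fraction of intervals given the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: based on how often each observer used each code.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / n ** 2
    if expected == 1:
        # Both observers used one identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for 10 class intervals: "L" = lecture, "G" = group work.
observer_1 = ["L", "L", "L", "L", "L", "G", "G", "G", "G", "G"]
observer_2 = ["L", "L", "L", "L", "G", "G", "G", "G", "G", "L"]

print(round(cohen_kappa(observer_1, observer_2), 2))
```

Here the two observers agree on 8 of 10 intervals, but because half of that agreement would be expected by chance, kappa comes out at about 0.6 rather than 0.8.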

How are research validation summaries determined on PhysPort?

Based on the steps for developing a good research-based assessment, we have created a list of seven categories of research validation for assessments of content and beliefs (Table 1). Each of these categories says something different about the research validation behind the instrument. “Studied with student interviews” and “questions based on research into student thinking” are two different ways of connecting test questions with students’ ideas. “Studied with expert review” ensures that the questions are relevant to physics educators. “Appropriate use of statistical analysis” compares students’ performance on the questions in a robust way. “Administered at multiple institutions” ensures that the RBAI is applicable to more than one institution. “Research published by someone other than developers” and “at least one peer-reviewed publication” are two different ways of measuring community buy-in about the research behind the RBAI. Different members of the research community value these different methods in different ways. Several articles discuss the affordances and constraints of these categories in more depth (Adams and Wieman 2011; Engelhardt 2009; Lindell, Peak, and Foster 2006).

We have developed separate categories of research validation for observation protocols because the development process for these is substantially different from that for the other kinds of assessments. Because faculty, and not students, use protocols, it does not make sense to look at student thinking or do student interviews. Instead, when developing observation protocols, it is vital to ensure that the categories of observation are grounded in real classrooms, that the protocol is iteratively developed through use in real classrooms, that there is a high level of inter-rater reliability (which means that the observers can interpret and apply the protocol similarly), and that the training materials for using the protocol have been tested and refined. To reflect the differences between observation protocols and other types of RBAIs, we developed a parallel set of research validation categories for observation protocols.

Table 1. Research validation categories for different types of assessments

| Categories for content, belief, and reasoning RBAIs | Categories for observation protocols | Categories for rubrics |
| --- | --- | --- |
| Questions based on research into student thinking | Categories based on research into classroom behavior | Items based on relevant theory and/or data |
| Studied with student interviews | Studied using iterative observations | Tested and refined through iterative use of rubric |
| Studied with expert review | Tested using inter-rater reliability | Tested using inter-rater reliability |
| Appropriate use of statistical analysis | Training materials are tested | Studied with expert review |
| Research conducted at multiple institutions | Research conducted at multiple institutions | Research conducted at multiple institutions |
| Research conducted by multiple research groups | Research conducted by multiple research groups | Research conducted by multiple research groups |
| At least one peer-reviewed publication | At least one peer-reviewed publication | At least one peer-reviewed publication |

We determine the level of research validation for an assessment based on how many of the research validation categories apply to the RBAI (Table 2). RBAIs have gold-level validation when they have been rigorously developed and recognized by a wider research community. Silver-level RBAIs are also well-validated, but are missing one or two categories of research validation. In many cases, silver RBAIs have been validated by the developers but not the larger community, often because these assessments are new. Bronze-level assessments are those where developers have done some validation but are missing pieces. Finally, the research-based level means that an assessment is likely still in the early stages of validation. While the research validation level given for each assessment in this paper is informative, you may be interested in knowing exactly which categories of research validation were completed for a particular assessment. To do this, go to the research tab on the PhysPort page for the assessment you are interested in. There you will find a list of the validation categories indicating which have been completed, and a short description of the research done for that assessment (Figure 1).

Table 2. Determination of the level of research validation for an assessment on PhysPort.

| Number of categories | Research validation level |
| --- | --- |
| All 7 | Gold |
| 5-6 | Silver |
| 3-4 | Bronze |
| 1-2 | Research-based |
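
The mapping in Table 2 can be written as a short lookup. The sketch below is only an illustration of the table: the function name is hypothetical, and the decision to raise an error for counts outside 1 to 7 is an assumption, since Table 2 does not list a level for zero categories.

```python
def validation_level(n_categories):
    """Map the number of satisfied validation categories to a level (Table 2)."""
    if not 1 <= n_categories <= 7:
        # Assumption: Table 2 only defines levels for 1 through 7 categories.
        raise ValueError("Number of categories must be between 1 and 7.")
    if n_categories == 7:
        return "Gold"
    if n_categories >= 5:
        return "Silver"
    if n_categories >= 3:
        return "Bronze"
    return "Research-based"


for n in range(1, 8):
    print(n, validation_level(n))
```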


Figure 1. Example of a research validation summary for an assessment of content, the FMCE, from PhysPort.

Figure 2. Example of a research validation summary for an observation protocol, the TDOP, from PhysPort.