Measurements for the success of assessment tools part II: validity

August 4, 2019

For an admission tool to be useful in assessing applicants, it must first be reliable. However, reliability alone is not sufficient: the tool must also be valid. In a previous blog post on “reliability”, we discussed the various types of reliability and how they are useful in examining the replicability of test scores.

But just because a score is replicable does not mean it is measuring what we want to measure. Therefore, along with high reliability, high validity is also necessary for a good assessment tool.

The next step, then, is to determine the validity of the test: that is, to evaluate whether the assessment tool actually measures what it claims to measure.

Let’s say you want to measure your height with a measuring tape you randomly found in the office. You measure yourself a few times to make sure that you get a consistent measurement (test-retest reliability). To be extra sure, you ask your coworkers to measure you with the measuring tape to see if they get the same height (inter-rater reliability). To be even more sure, you check that the measurement you get in centimetres aligns with the measurement you get in inches on the measuring tape (internal consistency). However, all you’ve done so far is demonstrate that the measuring tape is a reliable measure of height. Nothing yet indicates that the measuring tape is measuring your actual height.

There are a number of possible reasons why the measuring tape could be invalid. The measuring tape might be defective: the measurement intervals may not be evenly spaced, or the manufacturer may have made a printing error and skipped some numbers when printing the measurements on the tape. Someone could even be pulling a prank on you by cutting a section out of the measuring tape and piecing it back together, so you might actually be taller than what the measuring tape tells you. In other words, the measuring tape may be providing reliable measurements without accurately measuring your true height. You need to make sure that the measuring tape is not only reliable but also valid in measuring an individual’s true height.
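To make the distinction concrete, here is a minimal Python sketch of a tape that is reliable but not valid. Everything in it is an assumption made for illustration: the true height of 170 cm, the 5 cm shortfall caused by the prank, the small amount of reading noise, and the function name simulate_tape.

```python
import random
import statistics


def simulate_tape(true_height_cm, bias_cm, noise_sd_cm, n_readings, seed=0):
    """Simulate repeated readings from a hypothetical measuring tape.

    bias_cm is a constant error built into the tape (its lack of validity);
    noise_sd_cm is the random reading-to-reading variation (its lack of reliability).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        true_height_cm + bias_cm + rng.gauss(0, noise_sd_cm)
        for _ in range(n_readings)
    ]


# Hypothetical numbers: a 170 cm person and a tampered tape that reads 5 cm short.
true_height = 170.0
readings = simulate_tape(true_height, bias_cm=-5.0, noise_sd_cm=0.1, n_readings=10)

spread = statistics.stdev(readings)            # small spread -> reliable
bias = statistics.mean(readings) - true_height  # large offset -> not valid

print(f"spread of readings: {spread:.2f} cm")
print(f"average error:      {bias:.2f} cm")
```

The readings agree closely with one another, so every reliability check from the example would pass, yet the average reading is still about 5 cm away from the true height. Only a comparison against a trusted reference reveals the problem.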

Like reliability, there are many types of validity, depending on what is being measured and how it is being measured. When dealing with admission tools, the most applicable type of validity is test validity. Test validity is the degree to which the test measures what it intends to measure, and can be further broken down into categories. These forms of validity are:

Construct validity

Construct validity is a measure of how well our assessment tool is measuring the targeted constructs, and not the irrelevant constructs. An example would be how well an IQ test actually measures intelligence, and not just memorization skills. Whether a test does or does not demonstrate construct validity depends on whether it is assessing the construct that you want to measure. So for instance, a thermometer is a valid tool when its intended use is to assess the temperature of a room, but it would not be a valid tool if the intention is to measure the brightness of a room.

Criterion validity

Criterion validity is measured by seeing how well scores on a test predict the outcomes that the test is designed to predict. An example is seeing how well an IQ test predicts academic performance among students. If the test scores and the outcome measure are both assessed at the same time, it is known as concurrent validity (e.g. taking an IQ test and taking the MCAT in the same time period). If the test is being correlated with a future measure, it is known as predictive validity (e.g. how well your undergraduate GPA will predict your grades in medical school).
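In practice, criterion validity is commonly summarized with a correlation coefficient between test scores and the outcome. The sketch below computes a Pearson correlation in plain Python. The two score lists are made-up numbers for six hypothetical applicants, and pearson_r is a helper written for this illustration, not part of any particular admissions tool.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: admission test scores and a later outcome for six applicants.
admission_scores = [55, 60, 65, 70, 75, 80]
later_outcomes = [62, 58, 70, 66, 78, 74]

r = pearson_r(admission_scores, later_outcomes)
print(f"r = {r:.2f}")  # prints "r = 0.83"
```

The calculation is the same for concurrent and predictive validity. What differs is when the outcome is collected: at the same time as the test, or at a later date.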

In medical schools, two main constructs are being taught and measured: knowledge of the medical field and clinical skills. Admitting applicants to medical schools requires measuring aptitude in both of these constructs. Academic performance measures, such as GPA and the MCAT, have been shown to have high predictive validity for medical knowledge on a national medical licensing exam. However, admission tools for measuring the personal and professional characteristics that are important for clinical practice are not standardized. Commonly used methods, such as personal statements, reference letters, and traditional interviews, have demonstrated little reliability, much less any predictive validity.

Two newer tools are available in academic admissions for assessing personal and professional characteristics: the multiple mini-interview (MMI) and the Computer-Based Assessment of Personal Characteristics (Casper). The MMI has been adopted by many schools, as it has been shown to be useful in predicting successful performance on objective structured clinical examinations (OSCEs). However, the MMI requires a significant investment of time and resources from the admissions department, so it is only feasible to invite a subset of applicants to participate.

In comparison, Casper is cost-effective, administered online, and accessible to all applicants, while also demonstrating predictive validity, as the test predicts performance on the MMI (r = 0.4). In addition, Casper has demonstrated construct validity: it is essentially uncorrelated with GPA (r between -0.04 and 0.08), which indicates that it measures a trait distinct from academic performance. Casper’s predictive validity also extends to licensing exams: performance on Casper prior to medical school admission predicts later performance on both Part I (taken at the end of medical school) and Part II (taken 18 months into specialty training) of the Medical Council of Canada Qualifying Examination (MCCQE) (r = 0.3 to 0.5).
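One common way to read correlations like these is to square them. Under the assumption of a linear relationship, r squared is the proportion of variance that two measures share. The sketch below applies that reading to the coefficients quoted above. The labels and the shared_variance helper are ours, added for illustration, and this interpretation is a general statistical convention rather than an analysis taken from the studies themselves.

```python
def shared_variance(r):
    """Proportion of variance two measures share, assuming a linear relationship."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlations quoted in the text above; the labels are our own shorthand.
reported = [
    ("Casper vs. MMI", 0.4),
    ("Casper vs. GPA (low end)", -0.04),
    ("Casper vs. GPA (high end)", 0.08),
    ("Casper vs. MCCQE (low end)", 0.3),
    ("Casper vs. MCCQE (high end)", 0.5),
]

for label, r in reported:
    print(f"{label}: r = {r:+.2f}, shared variance = {shared_variance(r):.1%}")
```

Read this way, the near-zero correlations with GPA correspond to well under 1% shared variance, which is consistent with the claim that Casper captures something other than academic performance.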

So next time you’re thinking about incorporating a new tool into the admissions process, make sure to consider both the reliability and the validity of the test. After all, we want our tests to provide accurate and consistent results in the specific areas we aim to assess.

By: Patrick Antonacci, M.A.Sc., Data Scientist

Photo by tumsasedgars on iStock