Title: A Scoping Review Conducted to Identify Interrater Reliability Studies of Various Techniques Used in Detecting Thoracic Somatic Dysfunction
Authors: Angie K. Maxson, OMS IV; Kristen P. W. Gavin, OMS IV; Phillip G. Munoz, OMS III; Tianfu Shang, OMS I; Crystal Martin, DO
Introduction
Osteopathic medical schools teach osteopathic manipulative medicine to standards outlined by the Educational Council on Osteopathic Principles in A Teaching Guide for Osteopathic Manipulative Medicine (Teaching Guide), and student somatic dysfunction (SD) diagnostic abilities are frequently assessed to encourage competency beyond pre-graduate years. We therefore ask: what evidence exists about interrater reliability (IRR) for visual and palpatory techniques designed to detect SD? As part of a larger review, we limit our scope here to the thorax.
Methods
Each PubMed query began with (interrater[tiab] OR interobserver[tiab] OR interexaminer[tiab] OR intertester[tiab]) AND (humans[Filter]) and ended with NOT (“diagnostic imaging”[mesh] OR radiography[mesh]). Between these two parts, joined by AND, were Teaching Guide key terms and synonyms, each followed by [tiab], relating to tissue texture change, asymmetry, restriction, and SD of the cervicothoracic and thoracolumbar junctions and thorax. Cochrane Reviews, Journal of Osteopathic Medicine (JOM), and OSTMED.DR® were searched with the terms interrater, interobserver, interexaminer, and intertester, which were limited to titles in Cochrane Reviews and OSTMED.DR®. Database results were combined and deduplicated, and abstracts were screened by 2 investigators independently against the inclusion criteria: available in English, healthcare practitioners/students as raters, human patient(s) or proxy examined, and IRR data published for visual and/or palpatory technique(s) assessing thoracic SD. Investigators reviewed qualifying articles and examined the IRR results of those meeting inclusion criteria. A third investigator resolved any discrepancies.
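As an illustration only, the query structure described above could be assembled as in the following Python sketch. The fixed opening and closing clauses are taken from the Methods text; the function name `build_query` and the example key terms are hypothetical placeholders, since the actual Teaching Guide terms are not listed here, and combining synonyms with OR is an assumption.

```python
# Fixed clauses quoted from the Methods text.
RELIABILITY = ("(interrater[tiab] OR interobserver[tiab] OR "
               "interexaminer[tiab] OR intertester[tiab])")
HUMANS = "(humans[Filter])"
EXCLUDE = 'NOT ("diagnostic imaging"[mesh] OR radiography[mesh])'


def build_query(key_terms):
    """Assemble one PubMed query string from a group of key terms.

    Assumption: synonyms within a group are combined with OR and the
    group is joined to the fixed clauses with AND.
    """
    def tag(term):
        # Quote multi-word phrases, then restrict to title/abstract.
        return f'"{term}"[tiab]' if " " in term else f"{term}[tiab]"

    block = "(" + " OR ".join(tag(t) for t in key_terms) + ")"
    return f"{RELIABILITY} AND {HUMANS} AND {block} {EXCLUDE}"


# Hypothetical key terms, for illustration only.
query = build_query(["thoracic", "tissue texture"])
print(query)
```

Keeping the fixed clauses as constants makes it easier to confirm that every query shares the same reliability terms, human filter, and imaging exclusion.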
Results
A total of 191 items were returned, with counts by source as follows: Cochrane (8), JOM (7), OSTMED.DR® (14), and PubMed (162). One hundred eleven unique abstracts were screened. Of the 38 articles reviewed, 14 met inclusion criteria. Five studies related exclusively to the thorax, assessing IRR of palpation of the thoracic spine (Cronbach alpha scores ranged from -0.38 to 0.28; kappa values from 0.01 to 0.65) or the superior thoracic inlet (kappa values of 0.59 to 0.70). All others examined other body regions in addition to the thorax and/or thoracolumbar junction, reporting a vast range of kappa values, percent dis/agreement, and/or correlation coefficients.
Discussion
This investigation highlights the small number of IRR studies conducted with regard to visual and palpatory techniques used to detect thoracic SD. However, this review was limited by its queries, search/key terms, and databases. How best to assess the IRR of visual and palpatory skills has yet to be defined, and their application has yet to be deemed reliable, as alpha scores were weak and kappa values ranged across the scale. Nevertheless, existing literature provides a pathway for expansion of this work, including to other body regions.
Thank you for your presentation, Student Dr. Maxson. Looking at your Results tables, not all articles have a reported K. Can you tell me what is considered a “good” or “high” level of interrater reliability? What would you propose to improve interrater reliability in thoracic dysfunction?
Whoops, I am also supposed to let you know that I was assigned as a judge for your project. Thanks again!
Thank you for your questions and comments, Dr. Briggs Early. Interpretation of a kappa value depends on the number of raters and the statistics used. For measuring agreement between 2 raters, Cohen’s kappa values and their interpretations are as follows: 0.00 = no agreement; 0.01 – 0.20 = slight agreement; 0.21 – 0.40 = fair agreement; 0.41 – 0.60 = moderate agreement; 0.61 – 0.80 = substantial agreement; 0.81 – 0.99 = near perfect agreement; 1.00 = perfect agreement. Fleiss’ kappa is used when there are more than 2 raters: < 0.00 = poor agreement; 0 – 0.20 = slight agreement; 0.21 – 0.40 = fair agreement; 0.41 – 0.60 = moderate agreement; 0.61 – 0.80 = substantial agreement; 0.81 – 1.00 = almost perfect agreement. A weighted kappa allows for disagreements to be weighted differently: < 0.20 = poor strength; 0.21 – 0.40 = fair strength; 0.41 – 0.60 = moderate strength; 0.61 – 0.80 = good strength; 0.81 – 1.00 = very good strength. As you can see, there is not a prescribed method for investigating the interrater reliability of techniques utilized to detect thoracic somatic dysfunction. Furthermore, the interpretation of one study does not necessarily correlate with that of another. Therefore, studies investigating the interrater reliability of any one method should be replicated in a standardized fashion. Additionally, it is important to note that palpation of somatic dysfunction in a human model may alter the experience of the next rater. The use of a nonhuman model for skills assessment could be useful in determining a student's abilities. Again, thank you.
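To make the two-rater case concrete, here is a minimal Python sketch of how Cohen's kappa is computed from two raters' findings and then mapped to the bands quoted above. The ratings are invented for illustration and do not come from any study in the review; the function names are hypothetical, and treating negative values as "no agreement" is an assumption.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each label the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set")
    n = len(rater_a)
    # Observed agreement: share of subjects given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def interpret_cohen(kappa):
    """Map kappa to a band using the upper bounds quoted above.

    Assumption: values at or below 0.00 are labelled "no agreement".
    """
    if kappa <= 0.0:
        return "no agreement"
    if kappa >= 1.0:
        return "perfect agreement"
    bands = [(0.20, "slight agreement"), (0.40, "fair agreement"),
             (0.60, "moderate agreement"), (0.80, "substantial agreement")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "near perfect agreement"


# Invented findings for 10 subjects: "SD" = dysfunction found, "none" = not.
rater_a = ["SD"] * 6 + ["none"] * 4
rater_b = ["SD"] * 5 + ["none"] * 4 + ["SD"]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2), interpret_cohen(kappa))
```

In this invented example the raters agree on 8 of 10 subjects, yet kappa is only about 0.58 because roughly half of that agreement would be expected by chance.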
Two additional points of clarification, Dr. Briggs Early. When I referred to the use of a nonhuman model, I was referring to a manufactured model, such as a plastic model. Additionally, I realized that you may have been looking for a “cut off” with respect to kappa values. In the article “Interrater reliability: the kappa statistic,” McHugh (2012) argues that, for healthcare-related studies, “any kappa below 0.60 indicates inadequate agreement among the raters and little confidence should be placed in the study results.” Again, thank you for your attention.
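As a rough illustration, the 0.60 cutoff can be applied to the kappa ranges reported in our Results for the thorax-only studies. This is a sketch only: the dictionary labels and the function name `meets_cutoff` are hypothetical, and only the endpoints of each reported range are checked.

```python
MCHUGH_CUTOFF = 0.60  # kappa below this is described as inadequate agreement

# Endpoints of the kappa ranges reported in the Results section.
reported = {
    "thoracic spine palpation, lowest": 0.01,
    "thoracic spine palpation, highest": 0.65,
    "superior thoracic inlet, lowest": 0.59,
    "superior thoracic inlet, highest": 0.70,
}


def meets_cutoff(kappa, cutoff=MCHUGH_CUTOFF):
    """Return True when kappa is not below the cutoff."""
    return kappa >= cutoff


for label, kappa in reported.items():
    verdict = "meets cutoff" if meets_cutoff(kappa) else "below cutoff"
    print(f"{label}: kappa = {kappa:.2f} ({verdict})")
```

By this criterion, the low end of both reported ranges falls below the cutoff, with the superior thoracic inlet minimum of 0.59 missing it only narrowly.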
Great, thank you so much for your thorough response, and for your presentation!
Thank you for your presentation. As a judge, I was wondering if you found an overall trend in interrater reliability and if you found one method of evaluation helpful.
Thank you for your interest and question, Dr. Habecker. Within our results table (slides 7 and 8), there does appear to be a trend, with many studies reporting unfavorable results. However, replication of these studies will assist in the exploration of their validity and will provide future direction. Overall, it was interesting to learn how each of these studies attempted to assess its chosen technique. For example, Hutchinson et al. (2017) developed a standardized study protocol for assessing the superior thoracic inlet and diagnosing myofascial somatic dysfunction motion patterns using a tri-planar diagnosis (including digit placement for the diagnosing clinician), with results for each of the three planes ranging from fair (translational plane with a kappa of 0.59) to excellent (sagittal plane with a kappa of 0.70). Given their results, they argue that this protocol is reproducible and will result in consistent diagnoses of superior thoracic inlet somatic dysfunction. Again, it would be worth replicating this study to further explore the execution and application of their protocol. Thanks again for your time and interest.
Thanks for this presentation, I am one of your judges. I am wondering if your team had any ideas on how interrater reliability might be improved when detecting thoracic somatic dysfunction.
Thank you for your question, Dr. Mapes. Our team discussed several options for improving interrater reliability when detecting thoracic somatic dysfunction, including the use of standardized protocols as discussed in my response to Dr. Habecker and the use of nonhuman or fabricated models as proposed in my reply to Dr. Briggs Early. Another idea drawn from current literature involves investigating the amount of pressure applied with palpation to ensure that raters are in fact assessing the desired structures. It is quite possible that raters are palpating with too little pressure, assessing structures superficial to the intended target, or that they are applying excessive pressure, palpating deeper than necessary. Again, thank you for your question and interest in our project.