Results: Nineteen original studies involving 19,421 patients were included. Experienced clinicians showed good agreement on the diagnosis of TIA (κ = 0.71, 95% confidence interval [CI] = 0.62-0.81). Agreement between the clinician's diagnosis and administrative data was also good (κ = 0.68, 95% CI = 0.62-0.74). Agreement between referring clinicians and the TIA clinic clinicians who received the referrals was only moderate (κ = 0.41, 95% CI = 0.22-0.61). Sixty percent of the 748 patients referred to TIA clinics were TIA mimics.

Although the formulas for positive and negative percent agreement are identical to the formulas for sensitivity and specificity, it is important to distinguish between them because their interpretation differs. Another option is to check whether some raters are so biased that they consistently give higher or lower ratings than the others. One could also identify which images attract most of the disagreement and then try to pinpoint the specific characteristics of those images that cause it. Nor can these statistics be used to show that one test is better than another.
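As a concrete illustration of the κ statistics quoted above, here is a minimal sketch (not taken from the studies reviewed; the data and the 85% raw-agreement rate are hypothetical) that computes Cohen's kappa for two raters with a bootstrap confidence interval, plus a crude check for systematic rater bias:

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical TIA / not-TIA calls by two clinicians on the same 200 patients;
# rater B agrees with rater A about 85% of the time (an arbitrary choice).
rater_a = rng.integers(0, 2, size=200)
rater_b = np.where(rng.random(200) < 0.85, rater_a, 1 - rater_a)

kappa = cohen_kappa_score(rater_a, rater_b)

# Percentile bootstrap for a 95% confidence interval around kappa.
n = len(rater_a)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    boot.append(cohen_kappa_score(rater_a[idx], rater_b[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")

# Crude bias check: a large gap between the raters' positive-call rates
# suggests one rater systematically calls more cases positive than the other.
print(f"positive rate, rater A: {rater_a.mean():.2f}, rater B: {rater_b.mean():.2f}")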
Recently, a British national newspaper published an article about a PCR test developed by Public Health England, reporting that it disagreed with a new commercial test on 35 of the 1144 samples (3%). Of course, for many journalists, this was proof that the PHE test was inaccurate. In fact, there is no way to know which test is right and which is wrong in any of these 35 disagreements: in an agreement study, we simply do not know the true state of the subject. Only by examining the disagreements further can the reason for the discrepancies be determined. For data with repeated measures, agreement can instead be assessed with the intraclass correlation coefficient and the concordance correlation coefficient (Chen CC, Barnhart HX. Assessing agreement with intraclass correlation coefficient and concordance correlation coefficient for data with repeated measures. Comput Stat Data Anal 2013;60:132-45). The CLSI EP12 protocol, User Protocol for Evaluation of Qualitative Test Performance, defines the terms positive percent agreement (PPA) and negative percent agreement (NPA); if you need to compare two binary diagnostic tests, you can run an agreement study and calculate these statistics.
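A minimal sketch of the PPA/NPA calculation under the CLSI EP12 framing, using hypothetical 2x2 counts; the formulas match sensitivity and specificity, but the comparator is treated as just another test, not as truth:

# 2x2 table of hypothetical counts: rows = new test (+/-), columns = comparator (+/-).
a, b = 95, 5     # new + / comparator +, new + / comparator -
c, d = 7, 893    # new - / comparator +, new - / comparator -

ppa = a / (a + c)                    # same formula as sensitivity
npa = d / (b + d)                    # same formula as specificity
overall = (a + d) / (a + b + c + d)
# The difference is interpretation: the comparator is just another imperfect test,
# so PPA/NPA measure agreement with it, not accuracy against the true state.
print(f"PPA = {ppa:.1%}, NPA = {npa:.1%}, overall agreement = {overall:.1%}")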
Conclusions: Overall agreement among experienced clinicians was good for the diagnosis of TIA, although they still disagreed on a significant proportion of cases. Diagnostic agreement for TIA was lower among non-specialists. The large number of patients referred to TIA clinics with other (often neurological) diagnoses was striking, suggesting that TIA clinics should be run by clinicians familiar with TIA and its mimics.

Very often, agreement studies are an indirect attempt to validate a new scoring system or instrument. That is, in the absence of a definitive criterion variable or "gold standard", the accuracy of a scale or instrument is assessed by comparing its results when it is used by different raters. Here you may want to use methods that address the question of real concern: to what extent do the ratings reflect the true trait you want to measure? While we cannot "prove the null" that there is no difference between test results, we can use equivalence tests to determine whether the mean difference between test results is small enough to be considered clinically insignificant. Bland and Altman's limits of agreement (LOA) approach the problem this way by estimating the range within which 95% of the differences between test results are expected to fall (assuming those differences are roughly normally distributed).2,3 The LOA are calculated as \( \bar{d} \pm 1.96 \cdot s_d \), where \( \bar{d} \) is the sample mean of the differences and \( s_d \) is their sample standard deviation. If the LOA range contains differences that would be considered clinically significant, the agreement between the tests is not satisfactory. The LOA are also often displayed graphically by plotting, for each subject, the average of the two results against the difference between them; Figure 1 illustrates this tool with a hypothetical example. Bland and Altman warn that the LOA are only meaningful if the mean and variance of the differences are constant across the range of test results.3 In other words, the LOA should not be used if the agreement between the tests varies with the magnitude being measured. Such a situation can occur if the tests give similar results for subjects whose values lie within the normal range but agree poorly for subjects outside that range.
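A minimal sketch, assuming paired results from two tests on the same subjects (the data are hypothetical): it computes the Bland-Altman limits of agreement and adds a TOST-style equivalence test of the mean difference against a margin delta, which must be chosen on clinical grounds beforehand:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
test1 = rng.normal(100, 15, size=60)          # hypothetical results, test 1
test2 = test1 + rng.normal(0.5, 4, size=60)   # test 2: small bias, modest scatter

diff = test2 - test1
d_bar = diff.mean()
s_d = diff.std(ddof=1)
loa_low, loa_high = d_bar - 1.96 * s_d, d_bar + 1.96 * s_d
print(f"mean difference = {d_bar:.2f}, 95% LOA = ({loa_low:.2f}, {loa_high:.2f})")
# For the Bland-Altman plot, graph (test1 + test2) / 2 against diff per subject.

# TOST equivalence test: reject both one-sided nulls to conclude the mean
# difference lies within +/- delta; delta = 2.0 is an assumed clinical margin.
delta = 2.0
n = len(diff)
se = s_d / np.sqrt(n)
p_lower = 1 - stats.t.cdf((d_bar + delta) / se, df=n - 1)  # H0: mean <= -delta
p_upper = stats.t.cdf((d_bar - delta) / se, df=n - 1)      # H0: mean >= +delta
print(f"TOST p-value = {max(p_lower, p_upper):.4f}")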
In their work, Dunet et al1 use both Bland and Altman's limits of agreement and Lin's concordance correlation coefficient to evaluate the agreement between software packages. The two methods provide complementary information.
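For completeness, here is a sketch of Lin's concordance correlation coefficient itself (hypothetical data; the function follows the standard formula from Lin's 1989 paper, not any particular software package):

import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    # Lin (1989) uses the biased (1/n) variance and covariance.
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

rng = np.random.default_rng(2)
x = rng.normal(100, 15, size=60)     # hypothetical results, software package A
y = x + rng.normal(0.5, 4, size=60)  # hypothetical results, software package B
print(f"CCC = {lins_ccc(x, y):.3f}")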