Too Many Thyroid Biopsies Result in Too Many Thyroid Surgeries: What To Do About It?
- The prevalence of thyroid nodules in the general population is increasingly high.
- At least half of those biopsied prove to be benign.
- Ultrasound systems are being proposed as “rule-out” tests to identify nodules that do not require FNA.
- The study enrolled 477 patients (358 females, 75.2%); mean (SD) age was 55.9 (13.9) years.
- Applying the FNA criteria of the ultrasound (US) “rule-out” systems would have reduced the number of biopsies performed by 17.1% to 53.4%.
- The ACR Thyroid Imaging Reporting and Data System (TIRADS) allowed the largest reduction (268 of 502 biopsies).
- Internationally endorsed sonographic risk stratification systems vary widely in their ability to reduce the number of unnecessary thyroid nodule FNAs. The ACR TIRADS outperformed the others, classifying more than half the biopsies as unnecessary with a FNR of 2.2%.
- This is why patients should seek a second opinion before surgery: if the ultrasound pattern shows that the biopsy was not needed, they may be spared the surgery.
- Call me for a second opinion on the necessity for the biopsy.
- 310-393-8860 or secure email at [email protected]
- Ask for Alicia
Reducing the Number of Unnecessary Thyroid Biopsies While Improving Diagnostic Accuracy: Toward the “Right” TIRADS
Context: The prevalence of thyroid nodules in the general population is increasingly high, and at least half of those biopsied prove to be benign. Sonographic risk-stratification systems are being proposed as “rule-out” tests that can identify nodules that do not require fine-needle aspiration (FNA) cytology.
Objective: To comparatively assess the performances of five internationally endorsed sonographic classification systems [those of the American Thyroid Association, the American Association of Clinical Endocrinologists, the American College of Radiology (ACR), the European Thyroid Association, and the Korean Society of Thyroid Radiology] in identifying nodules whose FNAs can be safely deferred and to estimate their negative predictive values (NPVs).
Design: Prospective study of thyroid nodules referred for FNA.
Setting: Single academic referral center.
Patients: Four hundred seventy-seven patients (358 females, 75.2%); mean (SD) age, 55.9 (13.9) years.
Outcome Measures: Number of biopsies classified as unnecessary, false-negative rate (FNR), sensitivity, specificity, predictive values, and diagnostic ORs for each system.
Results: Application of the systems’ FNA criteria would have reduced the number of biopsies performed by 17.1% to 53.4%. The ACR Thyroid Imaging Reporting and Data System (TIRADS) allowed the largest reduction (268 of 502) with the lowest FNR (NPV, 97.8%; 95% CI, 95.2% to 99.2%). Except for the Korean Society of Thyroid Radiology TIRADS, all other systems exhibited significant discriminatory performance but produced significantly smaller reductions in the number of procedures.
Conclusions: Internationally endorsed sonographic risk stratification systems vary widely in their ability to reduce the number of unnecessary thyroid nodule FNAs. The ACR TIRADS outperformed the others, classifying more than half the biopsies as unnecessary with a FNR of 2.2%.
The number of individuals harboring sonographically detected thyroid nodules continues to rise, with an estimated 219 million in the United States alone. The challenge for clinicians is to identify those rare nodules harboring a clinically relevant malignancy (1). Fine-needle aspiration (FNA) has traditionally been used for this purpose (2–4). However, at least half of all biopsied nodules prove to be benign (5), and up to one third have cytological findings that are inconclusive (6). Strategies for the clinical management of thyroid nodule patients have therefore evolved (6): ultrasonography now plays a larger role (1), informing decisions on the need for FNA (7, 8) and plans for long-term follow-up (7, 8). To improve the accuracy of this guidance, ultrasound-based risk-stratification systems have now been developed by many national and international thyroid societies (7–10) and by the American College of Radiology (ACR) (11).
Robust evidence is lacking on the relative strengths and weaknesses of the various systems. Some systems have been validated in multicenter studies (12, 13). Independent validation and comparison studies (generally involving two to three of the systems) have mostly been retrospective (14–19). In a recent study involving central re-evaluation of nodules in a multi-institutional database, the 2017 ACR Thyroid Imaging Reporting and Data System (TIRADS) was found to compare favorably with the American Thyroid Association (ATA) guidelines and the Korean Society of Thyroid Radiology TIRADS (K-TIRADS), primarily because it more effectively reduced the number of biopsies performed on benign nodules (20). The largest prospective study of this type (21) compared the systems developed by the British Thyroid Association, the ATA, and the American Association of Clinical Endocrinologists (AACE)/American College of Endocrinology (ACE)/Associazione Medici Endocrinologi (AME) and found no significant differences between their overall diagnostic accuracy parameters.
To obtain a broader picture of the merits and demerits of currently available sonographic risk-stratification systems, we conducted a prospective, observational study of 502 thyroid nodules referred to our center for FNA. During the real-time pre-FNA ultrasound examinations, each nodule was classified using five internationally endorsed systems, and the recommendation for FNA was analyzed in light of the nodule’s pathologic diagnosis. Because a major aim of all of these systems is to eliminate unnecessary thyroid biopsies without jeopardizing the detection of clinically significant malignancies, our aims were to determine (1) the proportion of nodules whose biopsy would have been considered unnecessary by each system, and (2) the reliability of these exclusions, as reflected by multiple parameters of diagnostic accuracy.
Materials and Methods
The study was conducted in the Thyroid Cancer Unit of a large academic referral center. All patients consecutively referred to the unit for FNA cytology of a thyroid nodule between 1 November 2015 and 30 May 2018 were eligible for enrollment. The referring physicians included primary care physicians and secondary health care providers (e.g., endocrinologists, surgeons, otolaryngologists, nuclear medicine specialists). The study was conducted with Institutional Review Board approval and written informed patient consent.
Pre-FNA ultrasound examination of the nodules
Prior to each biopsy, each nodule was carefully examined with a HI VISION Avius® ultrasound system (Hitachi Medical Corporation, Tokyo, Japan) and a 13-MHz linear-array transducer. During this examination, two clinicians experienced in thyroid sonography (G.G. and L.L.) recorded their consensus judgment on the sonographic features of each nodule on a standardized rating form (22), internally developed and based on published recommendations (23, 24). Judgments were made jointly to eliminate the problem of interobserver variability, which has been documented during assessments of the single sonographic features of thyroid nodules (25, 26). The nodule features recorded by the readers were: diameters (anteroposterior, transverse, and longitudinal); margins (peripheral halo, well defined, ill defined, microlobulated, or irregular); structure/composition (solid, cystic, or mixed); echogenicity (hyperechoic, isoechoic, hypoechoic—all relative to the perinodular parenchyma—or markedly hypoechoic, i.e., less echoic than the adjacent strap muscle); calcifications (absent, microscopic, or macroscopic, with the latter including eggshell calcifications); other hyperechoic foci (comet-tail artifacts or indeterminate, with the latter including areas of fibrosis); and suspected extrathyroidal extension (loss of the echogenic thyroid border, abutment, or contour bulging) (see Table 1). For mixed-content nodules, the location of the solid component (nonnodular, eccentric, or central) was also rated. The shape was considered taller than wide when the anteroposterior diameter exceeded the transverse diameter.
Table 1. Nodule characteristics; values are No. (%)
|Hashimoto’s thyroiditis||No||470 (93.6)|
|Multinodular gland||No||157 (31.3)|
|Right lobe||209 (41.6)|
|Left lobe||253 (50.4)|
|Single sonographic features|
|Extrathyroidal extension||No||498 (99.2)|
|Including spongiform||13 (2.6)|
|Markedly hypoechoic||17 (3.4)|
|Foci||Comet tail||25 (5)|
|Lymph nodes||Suspicious||6 (1.2)|
|Shape||Taller than wide||84 (16.7)|
|Sonographic classification systems|
|ACR TIRADS||TR1: Benign||16 (3.2)|
|TR2: Not suspicious||134 (26.7)|
|TR3: Mildly suspicious||99 (19.7)|
|TR4: Moderately suspicious||181 (36.1)|
|TR5: Highly suspicious||72 (14.3)|
|Very low suspicion||249 (49.6)|
|Low suspicion||69 (13.7)|
|Intermediate suspicion||28 (5.6)|
|High suspicion||60 (12)|
|Not classifiable||90 (17.9)|
|Low risk||270 (53.8)|
|Intermediate risk||73 (14.5)|
|High risk||155 (30.9)|
|Low suspicion||316 (62.9)|
|Intermediate suspicion||139 (27.7)|
|High suspicion||38 (7.6)|
Classification of nodules using five sonographic risk-stratification systems
For each nodule, the consensus ratings of each ultrasound feature were used to classify the risk of malignancy according to five widely used ultrasound risk-stratification systems—namely, those published by the AACE/ACE/AME (8), the ACR (ACR-TIRADS) (11), the ATA (7), the European Thyroid Association [European TIRADS (EU-TIRADS)] (9), and the K-TIRADS (10). With each system, the likelihood that a nodule is malignant is indicated by its risk class, which is in turn defined by a set of ultrasound features. Additionally, within each risk class, the advisability of FNA is indicated based on lesion size. For nodules in low-risk categories, the size thresholds for FNA range from 1.5 to 3.0 cm, depending on the system. For those in high-risk classes, FNA is usually indicated when the maximum diameter is ≥1 cm. None of the systems we tested routinely recommends the FNA of subcentimeter thyroid nodules, although some make exceptions in the presence of particular high-risk clinical features: for this reason, the nodules with a maximum diameter <1 cm were excluded. Using each system, we identified the nodules for which FNA was suggested based on the size threshold for the assigned risk class.
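The size-threshold rule described above can be sketched in a few lines. This is only an illustrative sketch, not the software used in the study: the function and dictionary names are hypothetical, and the ACR TIRADS thresholds used here (TR3 ≥25 mm, TR4 ≥15 mm, TR5 ≥10 mm, no FNA for TR1 and TR2) are the commonly cited values, taken as an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Assumed ACR TIRADS size thresholds in mm (None = FNA not recommended at any size).
# These values are an assumption for illustration, not taken from the study protocol.
ACR_FNA_THRESHOLD_MM: Dict[str, Optional[float]] = {
    "TR1": None,
    "TR2": None,
    "TR3": 25.0,
    "TR4": 15.0,
    "TR5": 10.0,
}


def fna_recommended(risk_class: str, max_diameter_mm: float) -> bool:
    """Return True when the nodule meets the size threshold of its risk class (test positive)."""
    threshold = ACR_FNA_THRESHOLD_MM[risk_class]
    return threshold is not None and max_diameter_mm >= threshold


def count_deferrable(nodules: List[Tuple[str, float]]) -> int:
    """Count nodules whose biopsy would be classified as unnecessary (test negative)."""
    return sum(not fna_recommended(risk_class, size) for risk_class, size in nodules)


# Hypothetical nodules: (risk class, maximum diameter in mm)
example_nodules = [("TR2", 32.0), ("TR3", 18.0), ("TR4", 16.0), ("TR5", 10.5)]
print(count_deferrable(example_nodules))
```

In this hypothetical list, the TR2 and TR3 nodules fall below their thresholds and would be counted as FNA deferrable, whereas the TR4 and TR5 nodules would be test positive.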
Reference standard diagnosis
The biopsies were done under ultrasound guidance by clinicians (endocrinologists trained in thyroid sonography) (G.G. and L.L.) using 23- to 25-gauge needles. The nonaspiration technique was used in most cases (one to four needle passes). Direct smears of each specimen were analyzed by experienced thyroid cytopathologists (V.A. and D.B.) and classified according to the criteria published in the Italian Consensus for Thyroid Cytopathology (27, 28). When surgery had been performed, the reference standard diagnosis (malignant vs benign) was based on histological examination of the resected nodule. When the nodule had been managed nonsurgically, the reference standard was FNA cytology: nodules were considered malignant when they had been classified as TIR4 or TIR5 [suspected malignancy or malignancy, corresponding to Bethesda classes V and VI (29)], and benign when they had been classified as TIR2, corresponding to Bethesda class II. Patients with cytologically benign nodules were advised to undergo repeat sonography in 12 to 36 months, depending on their baseline risk status, as suggested by the ATA guidelines (7). Nodules without histologic diagnoses that had been cytologically classified as TIR1 (nondiagnostic), TIR3A, or TIR3B (indeterminate) (similar to Bethesda classes I, III, and IV) were excluded from the final analysis, unless a repeat FNA had yielded conclusive results.
For each sonographic classification system, we calculated the number of nodules that did and did not meet the criteria for FNA (test positivity and test negativity, respectively). For the purposes of the study, the biopsies ordered for test-negative nodules were considered “unnecessary.” The sonographic recommendation regarding FNA was then compared with the reference-standard diagnosis (benign vs malignant) to estimate its sensitivity, specificity, positive predictive value, negative predictive values (NPVs), diagnostic OR (DOR), and area under the receiver operating characteristic (AUROC) curve (each with 95% CI). The proportions of biopsies that would have been considered unnecessary by the various systems were compared using the McNemar test, and the reliability of these indications (i.e., whether the recommended deferral involved a nodule that was indeed benign) was assessed by calculation of the NPV and false-negative rate (FNR). Data were analyzed with the IBM SPSS Statistics package, version 25.0 (IBM Corp., Armonk, NY) and Microsoft Office Suite.
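As a worked illustration of these accuracy parameters, the sketch below recomputes them from a 2×2 table, using the ACR TIRADS counts reported in the Results (502 nodules, 36 malignant, 268 test negative, 6 false negatives). The class and function names are hypothetical. Two points are worth noting: the article reports the FNR as false negatives divided by test-negative nodules (i.e., 1 − NPV), and the AUROC of a single binary decision reduces to the mean of sensitivity and specificity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # malignant, FNA recommended
    fn: int  # malignant, FNA deferred
    fp: int  # benign, FNA recommended
    tn: int  # benign, FNA deferred


def from_reported(total: int, malignant: int, test_negative: int, fn: int) -> TwoByTwo:
    """Rebuild the 2x2 table from the counts given in the text and Table 2."""
    tp = malignant - fn
    tn = test_negative - fn
    fp = total - test_negative - tp
    return TwoByTwo(tp=tp, fn=fn, fp=fp, tn=tn)


def metrics(t: TwoByTwo) -> Dict[str, float]:
    """Accuracy parameters; assumes no cell needed as a denominator is zero."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "fnr": t.fn / (t.tn + t.fn),  # as reported in the article: FN / test negatives
        "dor": (t.tp * t.tn) / (t.fp * t.fn),
        "auc": (sensitivity + specificity) / 2,  # single binary decision
    }


acr = from_reported(total=502, malignant=36, test_negative=268, fn=6)
for name, value in metrics(acr).items():
    print(f"{name}: {value:.3f}")
```

With these counts, the output should agree, after rounding, with the ACR TIRADS row of Table 2.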
As shown in Fig. 1, a total of 832 thyroid nodules were subjected to sonographic risk stratification prior to FNA. The ultrasound examination identified 79 subcentimeter nodules, which were excluded from our analysis because, in the absence of particular clinical features, FNA is not indicated by any of the five systems for nodules of this size. Of the 753 nodules measuring ≥1 cm, 251 (33.3%) were also excluded from the analysis because their reference standard diagnosis was inconclusive. The final cohort thus comprised 502 thyroid nodules identified in 477 patients (mean age, 55.9 ± 13.9 years; female/male ratio, 3.0) (Fig. 1). Thirty-six (7.2%) lesions met the reference standard criteria for malignancy. In 34 cases, the diagnosis was based on histologic findings: 27 papillary thyroid cancers, 1 follicular thyroid cancer, 2 medullary thyroid cancers, 1 anaplastic thyroid cancer, and 3 thyroid metastases from other malignancies (30). The remaining two were classified cytologically as TIR4/Bethesda V and managed (in accordance with patient preferences) with active surveillance alone.
Strict application of the indications furnished by each of the five ultrasound systems would have appreciably reduced the number of FNAs performed (Table 2). The percentages of nodules identified as FNA deferrable varied widely (from 17.1% to 53.4%), and with each system, some of the exclusions were false negatives with reference standard diagnoses of malignancy (FNRs, 2.2% to 4.1%). The most effective system in our cohort was the ACR TIRADS, which would have eliminated more than half of the biopsies ordered (268, 53.4%), with a FNR of only 2.2% (NPV, 97.8%; 95% CI, 95.2% to 99.2%). With the exception of the K-TIRADS, the other systems’ discriminatory capacities (as reflected by their DOR and AUROC) were similar to that of the ACR TIRADS, but their impact on the number of procedures performed was significantly smaller (Table 2).
Eleven nodules definitively diagnosed as malignant would have been misclassified as not requiring FNA by at least one of the TIRADS systems (Table 3). The three cancers missed by all five systems were either isoechoic or hyperechoic relative to the surrounding parenchyma and had no other features considered anomalous by any of the systems. (These were the only malignancies misclassified by the K-TIRADS system.) The ATA system also failed to identify six other malignant nodules, five of which emerged as false negatives with this system alone. These five nodules were isoechoic, but they also had irregular margins, which are considered suspicious by the other systems. Unfortunately, this “set of features” did not allow them to be allocated into any of the ATA system’s risk classes.
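Table 2. Diagnostic performance of the five sonographic risk-stratification systems
|System||Avoided Biopsies, No. (%)^a||FN (FNR)||TN (TNR)||Sensitivity, % (95% CI)||Specificity, % (95% CI)||PPV, % (95% CI)||NPV, % (95% CI)||AUC (95% CI)||DOR (95% CI)^b|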
|ACR TIRADS||268 (53.4)||6/268 (2.2%)||262/268 (97.8%)||83.3 (67.2–93.6)||56.2 (51.6–60.8)||12.8 (8.8–17.8)||97.8 (95.2–99.2)||0.7 (0.62–0.78)||6.42 (2.62–15.72)|
|ATA||220 (43.8)||9/220 (4.1%)||211/220 (95.9%)||75 (57.8–87.9)||45.3 (40.7–49.9)||9.6 (6.4–13.6)||95.9 (92.4–98.1)||0.6 (0.51–0.69)||2.48 (1.14–5.39)|
|AACE/ACE/AME||175 (34.9)||5/175 (2.9%)||170/175 (97.1%)||86.1 (70.5–95.3)||36.5 (32.1–41.0)||9.5 (6.5–13.2)||97.1 (93.5–99.1)||0.61 (0.53–0.7)||3.56 (1.36–9.33)|
|EU-TIRADS||154 (30.7)||5/154 (3.2%)||149/154 (96.8%)||86.1 (70.5–95.3)||32 (27.8–36.4)||8.9 (6.1–12.4)||96.7 (92.6–98.9)||0.59 (0.5–0.68)||2.91 (1.11–7.64)|
|K-TIRADS||86 (17.1)||3/86 (3.5%)||83/86 (96.5%)||91.7 (77.5–98.2)||17.8 (14.4–21.6)||7.9 (5.5–11)||96.5 (90.2–99.3)||0.55 (0.46–0.64)||2.38 (0.71–7.96)|
Abbreviations: AUC, area under the receiver operating characteristic curve; FN, false negative; PPV, positive predictive value.
^a The rate of avoided biopsies differs significantly between the US classification systems (McNemar test: ACR TIRADS vs ATA, P = 0.002; ATA vs AACE/ACE/AME, P < 0.001; AACE/ACE/AME vs EU-TIRADS, P < 0.001; EU-TIRADS vs K-TIRADS, P < 0.001).
^b The DOR measures the discriminatory power of a diagnostic test as compared with that of the reference standard. The value ranges from 0 to infinity, with higher values indicating better performance.
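To make the DOR and its confidence interval concrete, the following sketch computes both from the four cells of a 2×2 table. The article does not state how the intervals were obtained; the sketch assumes the standard log (Woolf) method, and the function name is hypothetical. The cell counts are derived from the table and the cohort size (502 nodules, 36 malignant).

```python
import math
from typing import Tuple


def dor_with_ci(tp: int, fn: int, fp: int, tn: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Diagnostic odds ratio with a Woolf (log-scale) confidence interval.

    Assumes all four cells are non-zero; z = 1.96 gives an approximate 95% interval.
    """
    dor = (tp * tn) / (fp * fn)
    se_log = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    lower = math.exp(math.log(dor) - z * se_log)
    upper = math.exp(math.log(dor) + z * se_log)
    return dor, lower, upper


# ACR TIRADS: TP 30, FN 6, FP 204, TN 262 (derived from the table)
acr_dor = dor_with_ci(tp=30, fn=6, fp=204, tn=262)
# K-TIRADS: TP 33, FN 3, FP 383, TN 83 (derived from the table)
k_dor = dor_with_ci(tp=33, fn=3, fp=383, tn=83)

print("ACR TIRADS: %.2f (%.2f to %.2f)" % acr_dor)
print("K-TIRADS:   %.2f (%.2f to %.2f)" % k_dor)
```

Under this assumption, the K-TIRADS interval extends below 1, which is how a DOR that is not statistically significant shows up in practice.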
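Table 3. Malignant nodules misclassified as not requiring FNA by at least one system
|Pathological Diagnosis||US Description||Maximum Diameter (mm)||Missed by (one X per system)|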
|cPTC||Hyperechoic (hypoechoic parenchyma due to thyroiditis)||11.0||X||X||X||X||X|
|cPTC||Solid, mildly hypoechoic||13.0||X||X||X|
|MTC||Solid, mildly hypoechoic||12.1||X||X||X|
|cPTC||Solid, isoechoic, irregular margins||10.5||X|
|cPTC||Solid, isoechoic, irregular margins||11.1||X||X|
|cPTC||Solid, isoechoic, irregular margins||20.3||X|
|fvPTC||Solid, isoechoic, irregular margins||11.9||X|
|fvPTC||Mixed, isoechoic nodule||13.6||X||X||X||X||X|
|Suspected PTCb||Mixed, isoechoic nodule||11.4||X||X||X||X||X|
|cPTC||Mixed, isoechoic nodule, irregular margins||15.0||X|
|cPTC||Mixed, isoechoic nodule, irregular margins||17.2||X|
Abbreviations: cPTC, classic papillary thyroid cancer; fvPTC, follicular-variant papillary thyroid cancer; MTC, medullary thyroid cancer; PTC, papillary thyroid cancer.
If irregular margins (or other worrisome sonographic features), even in the context of isoechoic nodules, were considered high risk, the number of malignancies missed by the ATA system would decrease.
^b Classified cytologically as TIR4/Bethesda V and managed (in accordance with patient preferences) with active surveillance alone.
Thyroid nodule FNAs play key roles in ruling out the presence of thyroid cancer. However, the costs of sample collection and analysis are relatively high, and the aspiration itself can be a source of discomfort and anxiety for patients. Furthermore, in roughly one third of cases, cytomorphologic analysis of the aspirate yields inconclusive results (5) that prompt repeat biopsies or additional, more expensive testing (1). The sonographic risk-stratification tools we assessed are basically “rule-out” tests, designed to identify nodules with low risks of malignancy whose cytologic assessment can safely be deferred. Our goal was to compare how well five widely used systems of this type perform this task. Each system assigns differential weights to the individual sonographic features evaluated to establish a nodule’s risk of malignancy, and the weight assigned to a given feature varies substantially from one system to another. The systems also differ from one another in terms of the size thresholds for identifying nodules within a given risk class that require FNA.
In our cohort, the number of biopsies performed would have been reduced to some extent if the decision had been based on strict application of any of the five internationally endorsed systems tested. However, the safest and most substantial reduction would have been achieved with the ACR TIRADS: it classified more than half of the biopsies ordered as unnecessary (268 of 502, 53.4%) and had the lowest FNR of all systems tested (6 of 268, 2.2%). Its abilities to exclude malignancy and to discriminate between benign and malignant nodules were substantially greater than those of its competitors (NPV, 97.8%; DOR, 6.42; 95% CI, 2.62 to 15.72). The high number of planned biopsies identified by this system as unnecessary reflects the higher size thresholds it sets for recommending biopsy of nodules classified as low risk (Table 4).
Table 4. FNA size thresholds and ROM for each risk class, by system (highest to lowest risk)
|AACE/ACE/AME||ACR TIRADS||ATA||K-TIRADS||EU-TIRADS|
|FNA >10 mm||FNA ≥10 mm||FNA ≥10 mm||FNA ≥10 mm||FNA >10 mm|
|ROM 50%–90%||ROM ≥20%||ROM 70%–90%||ROM >60%||ROM 26%–87%|
|FNA >20 mm||FNA ≥15 mm||FNA ≥10 mm||FNA ≥10 mm||FNA >15 mm|
|ROM 5%–15%||ROM 5%–20%||ROM 10%–20%||ROM 15%–50%||ROM 6%–17%|
| ||FNA ≥25 mm||FNA ≥15 mm||FNA ≥15 mm||FNA >20 mm|
| ||ROM 5%||ROM 5%–10%||ROM 3%–15%||ROM 2%–4%|
|Low||TR2||Very low suspicion||TR2||TR2|
|FNA >20 mm||No FNA||FNA ≥20 mm||FNA ≥20 mm||No FNA|
|ROM ≈1%||ROM 2%||ROM <3%||ROM 1%–3%||ROM 0%|
| ||No FNA||No FNA|| || |
| ||ROM 2%||ROM <1%|| || |
Differences in the size threshold with respect to ACR TIRADS are highlighted in italic bold.
AACE/ACE/AME guidelines, including only three classes, are not directly comparable to the other four- or five-tiered systems.
Abbreviation: ROM, risk of malignancy.
Importantly, the ACR TIRADS assigned all the thyroid nodules in the cohort to a risk class, a clear advantage over the widely used ATA system, which failed to classify a significant number of the nodules we studied (90, 17.9%) and of those analyzed by others (19, 31). In our hands, the K-TIRADS performance was disappointing: the number of biopsies it would have eliminated was quite modest (86 of 502, 17.1%), and its discriminatory capacity was not statistically significant, as reflected by the lower end of the confidence interval of the AUROC (<0.5) and the DOR (<1). One small nodule classified as FNA deferrable by three of the five systems (the AACE/ACE/AME guidelines, the EU-TIRADS, and the ACR TIRADS) proved to be a medullary thyroid cancer. The sonographic features of these cancers are known to differ significantly from those of papillary thyroid cancer (32), which are the basis of sonographic risk stratification. One third of all medullary thyroid cancers have sonographic features (solid, round/ovoid shapes, smooth margins, mild hypoechogenicity) (33) considered “low suspicion” in many systems; in the ATA system, however, these features are sufficient to place the nodule at least in the intermediate suspicion pattern, which requires FNA when the nodule measures >1 cm (34).
Importantly, however, note that the choice of a TIRADS system cannot be based solely on the number of biopsies it flags as unnecessary and its diagnostic accuracy. Interobserver variability (25) and consistency over time (35) are also important considerations, as are the setting in which it will be used (e.g., equipment, operator experience). Additionally, the US features being evaluated must also be defined in a manner that is clear and unambiguous to the operators using the system, an outcome favored by specific training and experience (25). The system must also be easy to use in routine clinical practice. For example, the ACR TIRADS differs from the other four systems tested in that it is based on a point scale rather than on pattern recognition. Points are assigned for five individual ultrasound features, and their sum determines the nodule’s risk class. This approach may be considered excessively time-consuming for use in daily practice.
Our study also has several limitations that must be considered when interpreting our findings. First, ours was a selected cohort of thyroid nodules, all of which had already been flagged for FNA by another physician (e.g., endocrinologists, oncologists, general practitioners, clinicians from other fields, pathologists), and the criteria supporting these requests were not known. This cohort’s 7.2% malignancy rate, however, was similar to those reported for unselected nodule series (6), and all sonographic risk classes were represented, including low-risk categories. Second, the composite reference standard used in our study is not error-free. For example, a benign cytology report was considered sufficient for classifying the nodule as benign. However, FNA cytology can yield false-negative results. Such outcomes are uncommon, with estimated frequencies of 3.7% emerging from a recent meta-analysis (5) and even lower frequencies (<1%) in prospective series of cytologically benign nodules with no high-suspicion ultrasound features (36). As for our reference standard diagnosis of malignancy, the risk of error (false positivity) is limited to two nodules whose classification was based on “suspicious” (TIR4, Bethesda class V) cytology alone: the other 34 reference standard diagnoses of malignancy were all histologically confirmed. Third, it is conceivable that our exclusion of 251 (33%) nodules with nondiagnostic or indeterminate cytology reports caused a selection bias; however, the proportion of nodules with these cytological reports is consistent with those reported in other cytological series (5). Additionally, as shown in Fig. 1, the rates of deferrable biopsies in this subgroup were similar to those in the study cohort itself.
The major strength of our study is its prospective nature: the US features of each nodule were evaluated during real-time examinations carried out before aspirates were collected. In this setting, each of the five internationally endorsed TIRADS approaches we evaluated identified multiple thyroid nodules for which the request for FNA was probably unnecessary. Four of the five (AACE/ACE/AME, ATA, ACR TIRADS, and EU-TIRADS) showed a significant diagnostic value. The best overall performance was that of the ACR TIRADS, which classified more than half of the requested biopsies as unnecessary, with a NPV of 97.8%.