Abstract
Recent studies have shown that limited reproducibility in radiomics research hinders the translation of radiomics-based models into clinical practice. It is therefore important to understand how radiomics results depend on each component of the radiomics workflow and whether harmonization methods can reduce this dependence.
First, radiomic features were extracted from CT scans of a cadaveric liver acquired and reconstructed with 17 different sets of imaging parameters. Feature values from one reference scan were compared with those from each of the remaining 16 modified scans. Reducing the field of view or using coronal instead of axial slices produced the largest proportions of features with significant differences (67.6% and 35.9%, respectively), whereas slight changes in tube voltage, pitch, or slice interval produced the smallest (0.7% each). To mitigate these differences, five harmonization methods were implemented: histogram normalization, pixel size resampling, Butterworth filtering, combined resampling and filtering, and ComBat harmonization. Histogram normalization maintained or reduced the number of features with significant differences for every scan, whereas ComBat harmonization reduced this number to zero for all imaging parameters.
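As an illustration of the filtering step, the sketch below applies the standard low-pass Butterworth transfer function, H(f) = 1 / (1 + (f / f_c)^(2n)), to a one-dimensional intensity profile in the frequency domain. It assumes a low-pass filter and uses a naive discrete Fourier transform to stay self-contained; the cutoff f_c and order n are hypothetical parameters, because the abstract does not report the filter settings. A CT slice would be filtered in 2D, with f replaced by the radial spatial frequency.

```python
import cmath
import math


def butterworth_gain(freq: float, cutoff: float, order: int) -> float:
    """Low-pass Butterworth transfer function H(f) = 1 / (1 + (|f| / f_c)^(2n))."""
    return 1.0 / (1.0 + (abs(freq) / cutoff) ** (2 * order))


def dft(x):
    """Naive O(N^2) discrete Fourier transform (for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT matching dft() above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def butterworth_lowpass(signal, cutoff, order=2):
    """Filter a 1D intensity profile; cutoff is in cycles per sample (0 to 0.5).

    Assumption: a low-pass design with hypothetical cutoff/order values.
    """
    n = len(signal)
    spectrum = dft(signal)
    filtered = []
    for k, value in enumerate(spectrum):
        # Map DFT index k to a signed frequency: 0, 1/n, ..., 0.5, ..., -1/n.
        freq = (k if k <= n // 2 else k - n) / n
        filtered.append(value * butterworth_gain(freq, cutoff, order))
    # The gain is symmetric in |f|, so the result is real up to rounding error.
    return [v.real for v in idft(filtered)]
```

A constant profile passes through unchanged (the gain at zero frequency is 1), while a pixel-to-pixel alternating pattern, the highest representable frequency, is almost entirely suppressed; this smoothing of fine-grained noise texture is what makes such a filter a candidate harmonization step.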
The dependence of radiomic features on the feature calculation process was also investigated. Five radiomics software packages (A1, A2, IBEX, MaZda, and Pyradiomics) were used to calculate 12 common features from databases of mammograms, head and neck (HN) CT scans, and breast MRI scans. For the mammography and HN CT databases, 11 of the 12 features showed significant differences among packages, compared with 9 of 12 for the breast MRI database. When agreement in feature values among packages was assessed with the intraclass correlation coefficient (ICC), 5, 4, and 5 of the 12 features showed excellent agreement for the mammography, HN CT, and breast MRI databases, respectively.
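Agreement among packages can be quantified with an ICC computed from a cases-by-packages table of feature values. The abstract does not state which ICC form or which cut-offs define "excellent" agreement, so the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single measurement) and the commonly used Koo and Li thresholds; `icc_2_1` and `agreement_label` are hypothetical helper names.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    ratings[i][j] is one feature's value for case i as computed by package j.
    """
    n = len(ratings)     # cases (images)
    k = len(ratings[0])  # raters (software packages)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares: total = rows + columns + residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def agreement_label(icc):
    """Assumed Koo and Li cut-offs; the study's own thresholds are not stated."""
    if icc > 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"
```

On the widely reproduced six-target, four-rater example from Shrout and Fleiss, `icc_2_1` returns approximately 0.29. Because this form penalizes systematic offsets between raters, two packages that rank cases identically but report values on different scales still receive a low ICC.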
The effect of differences in radiomics software was quantified by extracting feature values with packages A1, IBEX, and Pyradiomics from CT scans of patients undergoing radiation therapy (RT). Following RT, 19% of patients developed radiation pneumonitis (RP). The changes in eight feature values between pre- and post-RT CT scans were calculated with each package and used to classify patients by RP status. Based on analysis of variance (ANOVA), the feature changes associated with RP development differed among the three packages for 2 of the 8 features. In terms of classification ability, first-order features showed greater agreement among packages, whereas higher-order features began to deviate.
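The change-based classification can be sketched as follows: compute each patient's feature change between the pre- and post-RT scans, then measure how well that change separates patients with and without RP using the area under the ROC curve (AUC), computed here with the Mann-Whitney pairwise formulation. Defining the change as a simple post-minus-pre difference and scoring a single feature directly are assumptions; the abstract specifies neither the change definition nor the classifier.

```python
def delta_feature(pre, post):
    """Per-patient change in one feature (assumed: simple post - pre difference)."""
    return [b - a for a, b in zip(pre, post)]


def auc(scores, labels):
    """AUC via the Mann-Whitney statistic.

    labels: 1 = developed RP, 0 = no RP. A tied pair counts as half a correct ordering.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Running both functions on the values produced by each package yields one AUC per package and feature, so differences between packages show up directly as differences in these AUC values. For example, `auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` returns 0.75.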
The potential of ComBat harmonization to mitigate differences among software packages was then assessed. The models described in the previous chapter (M_Avg models) were compared with three models in which ComBat was applied at different stages of the feature calculation process. ComBat harmonized feature values among packages A1, IBEX, and Pyradiomics, and the changes in the harmonized features were used to classify patients by RP status. Based on ANOVA, 5 of the 8 features showed significant differences among packages for the M_Avg models, whereas the ComBat-based models reduced this number to 0-2 features. The M_Avg models showed moderate agreement in AUC values (ICC: 0.727), whereas the ComBat-based models showed lower agreement (ICC: 0.637-0.677). When features were normalized before ComBat was applied, the ICC was comparable to that of the M_Avg models (ICC: 0.733).
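The core idea of ComBat can be sketched as a location/scale adjustment in which each software package is treated as a batch. This simplified version is for illustration only: it omits the empirical Bayes shrinkage and covariate terms of the full ComBat method, ignores the fact that every package processes the same patients, and is not the implementation used in the study. The function name `combat_location_scale` is hypothetical.

```python
from statistics import mean, variance


def combat_location_scale(values, batches):
    """Simplified ComBat for one feature: no covariates, no empirical Bayes shrinkage.

    values[i] is the feature value of observation i; batches[i] is its batch label
    (here, the software package that computed it). Each batch needs at least two
    non-identical values so that its variance is nonzero.
    """
    labels = sorted(set(batches))
    batch_means = {b: mean(v for v, g in zip(values, batches) if g == b) for b in labels}

    # Grand mean (alpha) and pooled residual standard deviation around batch means (sigma).
    alpha = mean(values)
    sigma = mean((v - batch_means[g]) ** 2 for v, g in zip(values, batches)) ** 0.5

    # Standardize, then estimate each batch's additive (gamma) and multiplicative (delta) effect.
    z = [(v - alpha) / sigma for v in values]
    gamma = {b: mean(zi for zi, g in zip(z, batches) if g == b) for b in labels}
    delta = {b: variance([zi for zi, g in zip(z, batches) if g == b]) ** 0.5 for b in labels}

    # Remove the batch effects and map back to the original feature scale.
    return [sigma * (zi - gamma[g]) / delta[g] + alpha for zi, g in zip(z, batches)]
```

Because each batch's standardized values are re-centered by its own gamma and re-scaled by its own delta, every package ends up with the same mean and variance for the feature after adjustment, which is why this kind of harmonization removes between-package differences in the ANOVA sense.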