On patient-level splitting, contrast-free claims and unsupported comparators in “Machine learning-based classification of multiple sclerosis lesion activity using multi-sequence MRI radiomics”

Stefania Galassi

doi:10.5114/pjr/213873

Dear Editor,

I read with interest the article by Elhaie and colleagues describing a machine-learning model that uses multi-sequence magnetic resonance imaging radiomics to classify active and inactive multiple sclerosis lesions [1]. The topic is clinically important and the manuscript is clearly presented, but several aspects would benefit from clarification.

The first point concerns how patients and lesions were allocated between model development and evaluation. Because multiple lesions from the same person share acquisition conditions and biological context, distributing lesions from a single patient across both the training and test sets can inflate performance estimates relative to what would be achieved in previously unseen patients. The safest approach is to assign each patient entirely to a single data split or to use cross-validation that groups by patient, so that no individual contributes data to more than one fold [2].
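As a minimal illustration of the grouping described above, the following sketch assigns whole patients to folds using only the Python standard library. The function name `patient_grouped_folds`, the patient identifiers and the fold count are hypothetical and are not taken from the article; scikit-learn's `GroupKFold` offers comparable behaviour for readers who prefer a library routine.

```python
import random
from collections import defaultdict


def patient_grouped_folds(patient_ids, n_folds=5, seed=0):
    """Return (train_idx, test_idx) pairs in which each patient's lesions
    fall entirely on one side of the split."""
    # Collect lesion indices per patient.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not lesions) reproducibly, then deal them into folds.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    folds = [patients[k::n_folds] for k in range(n_folds)]

    splits = []
    for held_out in folds:
        held = set(held_out)
        test_idx = sorted(i for p in held_out for i in by_patient[p])
        train_idx = sorted(
            i for p in patients if p not in held for i in by_patient[p]
        )
        splits.append((train_idx, test_idx))
    return splits


# Hypothetical lesion table: one patient identifier per lesion.
lesion_patient = ["P1", "P1", "P2", "P3", "P3", "P3", "P4", "P5", "P5", "P6"]
splits = patient_grouped_folds(lesion_patient, n_folds=3, seed=42)
```

Shuffling at the patient level, rather than at the lesion level, is the design choice that matters here: a lesion-level shuffle would be free to place two lesions from the same person on opposite sides of a split.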

There also appears to be a discrepancy between the accrual dates reported in the abstract and those in the methods. Reconciling the study window would help readers understand the timeline and any scanner or protocol changes that might affect radiomic features [1].

The work is framed as “contrast-free,” yet T1-weighted imaging (T1W) was acquired both before and after gadolinium administration, and the manuscript does not state which version was used for feature extraction. If post-contrast images entered the model, the “contrast-free” claim should be moderated; if only pre-contrast T1W was used (alongside T2-weighted imaging, FLAIR, DWI and SWI), stating this plainly would avoid confusion.

The abstract also states that performance was comparable to that of radiologists, but no reader benchmark is shown. Unless a human-reader analysis was actually performed, that phrasing should be removed or supported with data. Given the small internal test set, it would be more informative to pre-specify a single operating threshold on the validation set and carry it forward unchanged to the test set, then report the precision–recall area under the curve and predictive values at plausible prevalences to match deployment decisions [3,4]. As recently argued in related correspondence, clarity on thresholds and calibration reduces optimistic interpretations and improves reproducibility [5].
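The threshold and prevalence points can be made concrete with a short sketch. It assumes, purely for illustration, that the threshold is chosen by maximising Youden's J on validation data (any pre-specified rule would serve) and uses invented scores and labels; none of the numbers or function names come from the article. Predictive values follow from Bayes' rule applied to sensitivity, specificity and an assumed prevalence. The precision–recall area under the curve is not computed here.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(val_scores, val_labels):
    """Choose the threshold on validation data only (assumed rule: max Youden's J)."""
    candidates = sorted(set(val_scores))
    return max(
        candidates,
        key=lambda t: sum(sens_spec(val_scores, val_labels, t)) - 1.0,
    )


def predictive_values(sens, spec, prevalence):
    """PPV and NPV at an assumed prevalence, via Bayes' rule."""
    ppv_den = sens * prevalence + (1 - spec) * (1 - prevalence)
    npv_den = spec * (1 - prevalence) + (1 - sens) * prevalence
    ppv = sens * prevalence / ppv_den if ppv_den else float("nan")
    npv = spec * (1 - prevalence) / npv_den if npv_den else float("nan")
    return ppv, npv


# Invented validation and test data (1 = active lesion, 0 = inactive).
val_scores, val_labels = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9], [0, 0, 1, 0, 1, 1]
test_scores, test_labels = [0.2, 0.4, 0.8, 0.5, 0.3, 0.95], [0, 1, 1, 0, 1, 0]

threshold = pick_threshold(val_scores, val_labels)  # fixed before touching test data
test_sens, test_spec = sens_spec(test_scores, test_labels, threshold)

# Report predictive values across a range of assumed prevalences.
for prevalence in (0.05, 0.10, 0.20):
    ppv, npv = predictive_values(test_sens, test_spec, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Because the test set never enters `pick_threshold`, the reported sensitivity and specificity are not tuned to the data on which they are evaluated.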

Finally, a note on preprocessing. Intensity normalisation and similar transformations should be fitted on training data only and then applied, unchanged, to the validation and test sets. Stating this explicitly – and, if helpful, providing a sensitivity analysis – would rule out inadvertent information leakage and strengthen confidence in the results [2,4].
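A minimal sketch of this fit-then-apply pattern is given below, using z-score normalisation as an assumed example of such a transformation; the article does not specify which normalisation was used, and the function names and feature values are hypothetical.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Estimate normalisation parameters from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # guard against constant features
    return mu, sigma


def apply_zscore(values, params):
    """Apply previously fitted parameters without re-estimating them."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical values of a single radiomic feature.
train_feature = [10.0, 12.0, 14.0]
test_feature = [12.0, 16.0]

params = fit_zscore(train_feature)                 # fitted once, on training data
train_scaled = apply_zscore(train_feature, params)
test_scaled = apply_zscore(test_feature, params)   # same parameters, unchanged
```

Fitting the parameters on the pooled data instead would let the test distribution influence the training inputs, which is the leakage the paragraph above warns against.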