Human factors in the clinical implementation of deep learning-based automated contouring of pelvic organs at risk for MRI-guided radiotherapy

Med Phys. 2023 Oct;50(10):5969-5977. doi: 10.1002/mp.16676. Epub 2023 Aug 30.

Abstract

Purpose: Deep neural networks have revolutionized auto-segmentation and hold great promise for treatment planning automation. However, few data exist regarding clinical implementation and human factors. We evaluated the performance and clinical implementation of a novel deep learning-based auto-contouring workflow for 0.35T magnetic resonance imaging (MRI)-guided pelvic radiotherapy, focusing on automation bias and objective measures of workflow savings.

Methods: An auto-contouring model was developed using a UNet-derived architecture for the femoral heads, bladder, and rectum in 0.35T MR images. Training data were taken from 75 patients treated with MRI-guided radiotherapy at our institution. The model was tested against 20 retrospective cases outside the training set and was subsequently implemented clinically. Usability was evaluated on the first 30 clinical cases by computing the Dice coefficient (DSC), the Hausdorff distance (HD), and the fraction of slices that were used unmodified by planners. Final contours were retrospectively reviewed by an experienced planner, and the clinical significance of deviations was graded as negligible, low, moderate, or high probability of leading to actionable dosimetric variations. To assess whether the use of auto-contouring led to final contours more or less in agreement with an objective standard, 10 pre-treatment and 10 post-treatment blinded cases were re-contoured from scratch by three expert planners to obtain expert consensus (EC) contours. EC contours were compared to clinically used (CU) contours using DSC. Student's t-test and Levene's statistic were used to test the statistical significance of differences in mean and standard deviation, respectively. Finally, the dosimetric significance of the contour differences was assessed by comparing the difference in bladder and rectum maximum point doses between EC and CU before and after the introduction of automation.
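
As an illustration of the three usability metrics named above, the following is a minimal sketch and not the authors' implementation. It assumes each contour is represented as a set of (slice, row, column) voxel indices, and it assumes that the "fraction of slices used unmodified" is counted over slices that contain an auto-contour; the abstract does not specify either choice. The function names and the `spacing` parameter are hypothetical.

```python
import math
from collections import defaultdict


def dice(a, b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty contours agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between two non-empty voxel sets.

    Brute force, O(|A| * |B|); adequate only for small illustrative inputs.
    `spacing` is a placeholder voxel size (slice, row, column), not a value
    taken from the study.
    """
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty contour")

    def to_physical(voxels):
        return [tuple(c * s for c, s in zip(v, spacing)) for v in voxels]

    def directed(p, q):
        # Largest distance from any point in p to its nearest point in q.
        return max(min(math.dist(u, v) for v in q) for u in p)

    pa, pb = to_physical(a), to_physical(b)
    return max(directed(pa, pb), directed(pb, pa))


def unmodified_slice_fraction(auto, final):
    """Fraction of auto-contoured slices that are voxel-identical in `final`.

    Assumption: the denominator is the number of slices with an auto-contour.
    Slices the planner added where no auto-contour existed are not counted.
    """
    def by_slice(voxels):
        slices = defaultdict(set)
        for z, y, x in voxels:
            slices[z].add((y, x))
        return slices

    auto_slices, final_slices = by_slice(auto), by_slice(final)
    if not auto_slices:
        raise ValueError("auto-contour is empty")
    same = sum(1 for z, s in auto_slices.items() if final_slices.get(z) == s)
    return same / len(auto_slices)


# Toy example: a 2x2 square on two slices; the planner adds one voxel on slice 1.
auto = {(z, y, x) for z in (0, 1) for y in (0, 1) for x in (0, 1)}
final = auto | {(1, 0, 2)}

print(dice(auto, final))                       # 16/17, about 0.941
print(hausdorff(auto, final))                  # 1.0 with unit spacing
print(unmodified_slice_fraction(auto, final))  # 0.5
```

In this toy case the edited contour still scores a DSC above 0.94 even though half of the slices were touched, which is why an overlap metric and a per-slice edit count are reported side by side.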

Results: Median (interquartile range) DSC for the retrospective test data were 0.92 (0.02), 0.92 (0.06), 0.93 (0.06), and 0.87 (0.04) for the post-processed contours of the right and left femoral heads, bladder, and rectum, respectively. Post-implementation median DSC were 1.0 (0.0), 1.0 (0.0), 0.98 (0.04), and 0.98 (0.06), respectively. For the same organs, 96.2, 95.4, 59.5, and 68.21 percent of slices, respectively, were used unmodified by the planner. DSC between EC and pre-implementation CU contours were 0.91 (0.05*), 0.91* (0.05*), 0.95 (0.04), and 0.88 (0.04) for the right and left femoral heads, bladder, and rectum, respectively. The corresponding DSC for post-implementation CU contours were 0.93 (0.02*), 0.93* (0.01*), 0.96 (0.01), and 0.85 (0.02) (asterisks indicate statistically significant differences). In a retrospective review of contours used for planning, a total of four deviating slices in two patients were graded as having low potential clinical significance. No deviations were graded as moderate or high. Mean differences between EC and CU rectum maximum doses were 0.1 ± 2.6 Gy and -0.9 ± 2.5 Gy pre- and post-implementation, respectively. Mean differences between EC and CU bladder/bladder wall maximum doses were -0.9 ± 4.1 Gy and 0.0 ± 0.6 Gy pre- and post-implementation, respectively. These differences were not statistically significant according to Student's t-test.

Conclusion: We have presented an analysis of the clinical implementation of a novel auto-contouring workflow. Substantial workflow savings were obtained. The introduction of auto-contouring into the clinical workflow changed the contouring behavior of planners. Automation bias was observed, but it had little deleterious effect on treatment planning.

Keywords: auto-contouring; automation bias; deep-learning.