# Convolutional neural network for automatic segmentation of the maxillary sinus on cone-beam CT images

This study was conducted in accordance with the standards of the Declaration of Helsinki on medical research. Institutional ethics board approval was obtained from the Ethical Review Board of Leuven University Hospitals (reference number: S57587). Informed consent was not required, as all patient-specific information was anonymized. The study design and report followed the recommendations of Schwendicke et al.23 for reporting on artificial intelligence in dental research.

### Database

A sample of 132 CBCT scans (264 sinuses; 75 women and 57 men; mean age 40 years) acquired between 2013 and 2021 with varying scan parameters was collected (Table 1). Inclusion criteria were patients with permanent dentition and maxillary sinuses with or without mucosal thickening (shallow > 2 mm, moderate > 4 mm) and/or with a semi-spherical membrane on one of the sinus walls24. Scans with dental restorations, orthodontic brackets, and implants were also included. Exclusion criteria were a history of trauma, previous sinus surgery, and the presence of pathologies affecting the sinus contour.

DICOM (Digital Imaging and Communications in Medicine) files of the CBCT images were exported in anonymized form. The dataset was then randomly divided into three subsets: (1) a training set (n = 83 scans) for ground-truth-based CNN model training; (2) a validation set (n = 19 scans) for evaluation and selection of the best model; and (3) a test set (n = 30 scans) to test model performance against the ground truth.
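The three-way split described above can be sketched as follows; the helper below is a minimal illustration (function name, seed, and ID handling are hypothetical, not taken from the study):

```python
import random

def split_dataset(scan_ids, n_train=83, n_val=19, n_test=30, seed=42):
    """Randomly partition scan IDs into training, validation, and test subsets."""
    ids = list(scan_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

train, val, test = split_dataset(range(132))
print(len(train), len(val), len(test))  # 83 19 30
```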

### Ground truth labeling

Ground truth datasets for CNN model training and testing were labeled by semi-automatic sinus segmentation using the Mimics Innovation Suite (version 23.0, Materialise NV, Leuven, Belgium). Initially, a custom threshold range of −1024 to −200 Hounsfield units (HU) was set to create an air mask (Fig. 1a). Subsequently, the region of interest (ROI) was isolated from the surrounding structures. Manual delineation of the bone contours was performed using the ellipse and livewire functions, and all contours were checked in the coronal, axial, and sagittal orthogonal planes (Fig. 1b). To avoid inconsistency in the ROI across images, the segmentation region was restricted to the start of the sinus ostium on the sinus side, before it continues into the infundibulum (Fig. 1b). Finally, the edited mask of each sinus was exported separately as a Standard Tessellation Language (STL) file. Segmentation was performed by a dentomaxillofacial radiologist (NM) with seven years of experience and subsequently reassessed by two other radiologists (KFV and RJ) with 15 and 25 years of experience, respectively.

### Architecture and training of the CNN model

Two 3D U-Net architectures were used25, each consisting of 4 encoder blocks and 3 decoder blocks, with 2 convolutions per block using a 3×3×3 kernel, each followed by rectified linear unit (ReLU) activation and group normalization with 8 feature maps26. Max pooling with a 2×2×2 kernel and a stride of 2 was applied after each encoder block, reducing the resolution by a factor of 2 in all dimensions. Both networks were trained as binary classifiers (0 or 1) with a weighted binary cross-entropy loss:

$${L}_{BCE}=-\left[{y}_{n}\log\left({p}_{n}\right)+\left(1-{y}_{n}\right)\log\left(1-{p}_{n}\right)\right]$$

for each voxel $n$, where ${y}_{n}\in\{0,1\}$ is the ground truth value and ${p}_{n}$ is the probability predicted by the network.
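For illustration, the per-voxel binary cross-entropy term can be computed as below (a plain-Python sketch of the standard definition; the study's class weighting scheme is not specified, so no weight is applied here):

```python
import math

def bce(y, p, eps=1e-7):
    """Per-voxel binary cross-entropy: y is the ground truth (0 or 1),
    p the predicted probability for that voxel."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(1, 0.9), 4))  # 0.1054 — small loss for a confident correct prediction
```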

A two-step pre-processing of the training dataset was applied. First, all scans were resampled to the same voxel size. Subsequently, to overcome graphics processing unit (GPU) memory limitations, the full-size scan was downsampled to a fixed size.
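The fixed-size downsampling step could be approximated with nearest-neighbour resampling; the sketch below is illustrative only (the target shape and interpolation method are assumptions, as the study does not specify them):

```python
import numpy as np

def downsample_to(volume, target_shape):
    """Nearest-neighbour resampling of a 3D scan to a fixed grid size."""
    idx = [np.round(np.linspace(0, s - 1, t)).astype(int)
           for s, t in zip(volume.shape, target_shape)]
    # np.ix_ builds an open mesh so each axis is indexed independently
    return volume[np.ix_(idx[0], idx[1], idx[2])]

scan = np.arange(97 * 97 * 80, dtype=np.float32).reshape(97, 97, 80)
small = downsample_to(scan, (64, 64, 64))
print(small.shape)  # (64, 64, 64)
```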

The first 3D U-Net provided a rough low-resolution segmentation used to generate 3D patches and crop only those belonging to the sinus. These relevant patches were then passed to the second 3D U-Net, where they were individually segmented and recombined into the full-resolution segmentation map. Finally, the output was binarized and only the largest connected component was kept, after which a marching cubes algorithm was applied to the binary image. The resulting mesh was smoothed to generate a 3D model (Fig. 2).
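Keeping only the largest connected component can be sketched as a breadth-first search over 6-connected voxels (a simplified illustration; a production pipeline would typically use an optimized library routine such as connected-component labeling from an image-processing package):

```python
import numpy as np
from collections import deque

def largest_component(mask):
    """Keep only the largest 6-connected component of a binary 3D mask."""
    labels = np.zeros(mask.shape, dtype=int)
    sizes, current = {}, 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # voxel already assigned to a component
        current += 1
        labels[start] = current
        queue, size = deque([start]), 1
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                n = (z + dz, y + dy, x + dx)
                if all(0 <= c < s for c, s in zip(n, mask.shape)) \
                        and mask[n] and not labels[n]:
                    labels[n] = current
                    queue.append(n)
                    size += 1
        sizes[current] = size
    if not sizes:
        return mask.copy()
    best = max(sizes, key=sizes.get)
    return labels == best

m = np.zeros((5, 5, 5), bool)
m[0:3, 0:3, 0:3] = True   # large blob of 27 voxels
m[4, 4, 4] = True          # isolated single voxel
print(largest_component(m).sum())  # 27
```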

Model parameters were optimized with Adam27 (an optimization algorithm for training deep learning models) with an initial learning rate of 1.25e−4. During training, random spatial augmentations (rotation, scaling, and elastic deformation) were applied. The validation dataset was used to determine the early stopping point, i.e., the point at which model performance saturates and further training would only overfit the training data. The CNN model was deployed on a cloud-based online platform, Virtual Patient Creator (creator.relu.eu, Relu BV, October 2021 release), where users can upload a DICOM dataset and automatically obtain a segmentation of the desired structure.
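Early stopping on the validation loss can be sketched as follows (the patience value and loop structure are illustrative, not reported in the study):

```python
def train_with_early_stopping(train_step, validate, patience=10, max_epochs=500):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = validate(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best model checkpoint
        elif epoch - best_epoch >= patience:
            break  # saturation: no improvement within the patience window
    return best_epoch, best_loss

# Toy validation curve: improves until epoch 30, then plateaus.
losses = [1.0 / (e + 1) if e < 30 else 1.0 / 30 for e in range(500)]
epoch, loss = train_with_early_stopping(lambda e: None, lambda e: losses[e])
print(epoch)  # 29 — the last epoch that improved the validation loss
```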

### Testing the AI pipeline

CNN model testing was performed by uploading DICOM files from the test set to the Virtual Patient Creator platform. The resulting automatic segmentation (Fig. 3) could then be downloaded in DICOM or STL file format. For the clinical evaluation of automatic segmentation, the authors developed the following classification criteria: A—perfect segmentation (no refinements necessary); B—very good segmentation (refinements of no clinical relevance, slight over- or under-segmentation in regions other than the maxillary sinus floor); C—good segmentation (refinements of some clinical relevance, slight over- or under-segmentation in the maxillary sinus floor region); D—deficient segmentation (considerable over- or under-segmentation, independent of sinus region, requiring repetition); and E—negative (the CNN model could not predict anything). Two observers (NM and KFV) assessed all cases, followed by expert consensus (RJ). In cases where improvements were needed, the STL file was imported into the Mimics software and edited using the 3D Tools tab. The resulting segmentation was referred to as the refined segmentation.

### Evaluation metrics

Evaluation metrics28,29 are described in Table 2. The comparison between the ground truth and the automatic and refined segmentations was performed by the main observer on the entire test set. A pilot of 10 scans was tested first, which showed a Dice Similarity Coefficient (DSC) of 0.985 ± 0.004, an Intersection over Union (IoU) of 0.969 ± 0.007, and a 95th-percentile Hausdorff Distance (HD) of 0.204 ± 0.018 mm. Based on these results, the sample size of the test set was increased to 30 scans in accordance with the Central Limit Theorem (CLT)30.
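The DSC and IoU reported above can be computed from two binary masks as follows (a minimal numpy sketch of the standard definitions; the mask shapes and values are invented for illustration):

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

def iou(a, b):
    """Intersection over union (Jaccard index): |A∩B| / |A∪B|."""
    inter = np.logical_and(a, b).sum()
    return inter / np.logical_or(a, b).sum()

a = np.zeros((4, 4, 4), bool); a[:2] = True   # 32 voxels
b = np.zeros((4, 4, 4), bool); b[1:3] = True  # 32 voxels, 16 overlapping
print(dice(a, b))  # 0.5
```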

#### Time efficiency

The time required for semi-automatic segmentation was measured from opening the DICOM files in the Mimics software until exporting the STL file. For automatic segmentation, the platform automatically records the time needed to produce the full-resolution segmentation. The refined segmentation time was measured in the same way as the semi-automatic segmentation time and added to the initial automatic segmentation time. The average time for each method was calculated over the test set.

#### Accuracy

A voxel-wise comparison between the ground truth, automatic, and refined segmentations of the test set was performed by applying a four-variable confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) voxels. Based on these variables, the accuracy of the CNN model was evaluated according to the parameters listed in Table 2.
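The four confusion-matrix variables and the derived accuracy can be computed voxel-wise as follows (a minimal numpy sketch; the toy masks are invented for illustration):

```python
import numpy as np

def confusion_counts(gt, pred):
    """Voxel-wise TP/TN/FP/FN between ground truth and predicted binary masks."""
    tp = np.sum(gt & pred)    # sinus voxels correctly labeled
    tn = np.sum(~gt & ~pred)  # background voxels correctly labeled
    fp = np.sum(~gt & pred)   # background wrongly labeled as sinus
    fn = np.sum(gt & ~pred)   # sinus voxels missed
    return tp, tn, fp, fn

gt = np.array([1, 1, 0, 0], bool)
pred = np.array([1, 0, 1, 0], bool)
tp, tn, fp, fn = confusion_counts(gt, pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 1 1 1 1 0.5
```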

#### Consistency

Once trained, the CNN model is deterministic; therefore, a formal assessment of its consistency was not required. As an illustration, however, a scan was uploaded twice to the platform and the resulting STL files were compared. Intra- and inter-observer consistency was calculated for the semi-automatic and refined segmentations. The intra-observer reliability of the primary observer was calculated by resegmenting 10 scans from the test set with different protocols. For inter-observer reliability, two observers (NM and KFV) independently performed the necessary refinements, after which their STL files were compared with each other.

### Statistical analysis

Data were analyzed with RStudio: Integrated Development Environment for R, version 1.3.1093 (RStudio, PBC, Boston, MA). The mean and standard deviation were calculated for all evaluation metrics. A paired-samples t-test was performed with a significance level of p < 0.05.
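The t statistic of a paired-samples t-test is computed from the per-scan differences, as sketched below (the timing values are invented for illustration and are not the study's measurements; the p-value lookup against the t distribution is omitted):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """t statistic of a paired-samples t-test: mean difference divided by
    the standard error of the differences."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical per-scan segmentation times (minutes) for two methods.
before = [40, 42, 38, 41, 44]
after = [1, 1, 1, 1, 1]
t = paired_t_statistic(before, after)
print(t)  # 40.0
```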