aGastroenterology and Hepatology, University of Utah Health, Salt Lake City, Utah (Babu P. Mohan); bGastroenterology, Rush University Medical Center, Chicago, Illinois (Shahab R. Khan); cInternal Medicine, Mayo Clinic, Rochester, Minnesota (Lena L. Kassab); dInternal Medicine, Roanoke Medical Center, Roanoke, Virginia (Suresh Ponnada); eGastroenterology and Hepatology, Allegheny Health Network, Pittsburgh, Pennsylvania (Nabeeha Mohy-Ud-Din, Gursimran S. Kochhar); fGastroenterology and Hepatology, University of Nebraska Medical Center, Omaha, Nebraska (Saurabh Chandan); gGastroenterology and Hepatology, University of California, San Diego, California (Parambir S Dulai), USA
Background Helicobacter pylori (H. pylori) infection, if left untreated, can cause gastric cancer, among other serious morbidities. In recent years, a growing body of evidence has evaluated a type of artificial intelligence (AI) known as “deep learning”, implemented via convolutional neural networks (CNN), for the computer-aided diagnosis of H. pylori infection. We conducted this meta-analysis to evaluate the pooled performance of CNN-based AI in the diagnosis of H. pylori infection.
Methods Multiple databases were searched (from inception to June 2020) and studies that reported on the performance of CNN in the diagnosis of H. pylori infection were selected. A random-effects model was used to calculate the pooled rates. In cases where multiple 2×2 contingency tables were provided for different thresholds, we assumed the data tables were independent of each other.
Results Five studies were included in our final analysis. Images used were from a combination of white-light, blue laser imaging, and linked color imaging. The pooled accuracy for detecting H. pylori infection with AI was 87.1% (95% confidence interval [CI] 81.8-91.1), sensitivity was 86.3% (95%CI 80.4-90.6), and specificity was 87.1% (95%CI 80.5-91.7). The corresponding performance metrics for physician endoscopists were 82.9% (95%CI 76.7-87.7), 79.6% (95%CI 68.1-87.7), and 83.8% (95%CI 72-91.3), respectively. Based on non-causal subgroup comparison methods, CNN seemed to perform equivalently to physicians.
Conclusion Based on our meta-analysis, CNN-based computer-aided diagnosis of H. pylori infection demonstrated pooled accuracy, sensitivity and specificity of approximately 87%.
Keywords Convolutional neural networks, Helicobacter pylori, meta-analysis
Ann Gastroenterol 2021; 34 (1): 20-25
Helicobacter pylori (H. pylori) infection is a well-known risk factor for gastric cancer [1,2]. Left untreated, the disease can result in chronic gastritis, gastroduodenal ulceration, mucosal atrophy, intestinal metaplasia and mucosa-associated lymphoid tissue (MALT) lymphoma. Treatment is therefore of paramount importance and is directed at complete eradication of H. pylori infection [3]. Central to this strategy is an accurate diagnostic methodology aimed at effectively ruling in, ruling out, and confirming the eradication of the infection.
Various endoscopic and non-endoscopic diagnostic tests are currently available. Non-endoscopic tests include the urea breath test, fecal H. pylori antigen test, urine anti-H. pylori immunoglobulin G (IgG) assay, and serum H. pylori IgG assay. Endoscopic diagnosis relies on targeted biopsies, evaluated by histology, rapid urease testing or culture. Areas targeted, using a standard endoscope, are usually chosen based on gastric mucosal redness and swelling. Current guidelines favor a repeat endoscopy to document eradication by negative biopsy results [3].
A growing body of evidence has evaluated the use of a type of artificial intelligence (AI) known as “deep learning” in the computer-aided diagnosis of health-related conditions based on medical imaging [4]. A convolutional neural network (CNN) is a data-driven system trained on datasets containing large numbers of images with their corresponding labels. A CNN can be seen as a system that first extracts relevant features from the input images and subsequently uses those learned features to classify a given image: the network applies convolutions to the input image to extract the information most relevant to distinguishing the target entities. Based on the accumulated data features, machines can then diagnose newly acquired clinical images prospectively [5-7].
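The convolution step described above can be illustrated with a minimal sketch. This is not the architecture of any included study; the 5×5 patch and the hand-written edge kernel are purely hypothetical, whereas a trained CNN learns many such kernels from labeled images and feeds their responses through nonlinearities, pooling and a final classification layer.

```python
def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel across the image to build a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[i + r][j + c] * kernel[r][c]
                 for r in range(kh) for c in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

# Hypothetical 5x5 grayscale patch (values 0-24); in a real CNN the kernels are
# learned from labeled endoscopic images rather than hand-crafted.
patch = [[5 * i + j for j in range(5)] for i in range(5)]
vertical_edge = [[1, 0, -1]] * 3          # one illustrative "learned" filter
feature_map = conv2d(patch, vertical_edge)  # 3x3 map of local edge responses
```

Stacking many such feature maps, layer after layer, is what lets the final layer map an input image to class probabilities (e.g., infected vs. non-infected).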
In this systematic review and meta-analysis, we aimed to quantitatively appraise the current reported data on the diagnostic performance of CNN-based computer-aided diagnosis of H. pylori infection and, if possible, compare the results to the diagnostic performance of physician endoscopists.
A medical librarian searched the literature for the concepts of AI with endoscopy for gastrointestinal (GI) conditions. The search strategies were created using a combination of keywords and standardized index terms. Searches were run in November 2019 and an additional updated search was performed in June 2020 in ClinicalTrials.gov, Ovid EBM Reviews, Ovid Embase (1974+), Ovid Medline (1946+ including Epub ahead of print, in-process & other non-indexed citations), Scopus (1970+), and Web of Science (1975+). Results were limited to English language publications. All results were exported to Endnote X9 (Clarivate Analytics) where obvious duplicates were removed, leaving 4245 citations. The search strategy is provided in Appendix 1. The MOOSE checklist was followed and is provided as Appendix 2 [8]. Reference lists of evaluated studies were examined to identify other studies of interest.
In this meta-analysis, we included studies that developed or validated a deep learning CNN model for the detection and/or diagnosis of H. pylori infection. Study selection was restricted to studies that used CNN-based machine learning models. The search terms “Helicobacter pylori” and “convolutional neural network” were used to filter the studies in the EndNote file. Studies were included irrespective of inpatient/outpatient setting, study sample size, optics of endoscopic imaging, follow-up time, abstract/manuscript status and geography, as long as they provided the data needed for the analysis.
Our exclusion criteria were as follows: 1) studies that used non-CNN based machine learning algorithms; and 2) studies not published in the English language. In cases of multiple publications from a single research group reporting on the same patient cohort and/or overlapping cohorts, all reported contingency tables were treated as being mutually exclusive. When necessary, authors were contacted via email for clarification of data and/or study cohort overlap.
Data on study-related outcomes from the individual studies were abstracted independently onto a predefined standardized form by at least 2 authors (BPM, SRK). Disagreements were resolved by consultation with a senior author (GK). Diagnostic performance data were extracted and contingency tables were created at the reported thresholds. Contingency tables consisted of reported accuracy, sensitivity, specificity, positive predictive value and negative predictive value.
We used meta-analysis techniques to calculate the pooled estimates in each case, following a random-effects model [9]. We assessed heterogeneity between study-specific estimates using the Cochran Q test, the 95% prediction interval, which deals with the dispersion of the effects, and the I2 statistic [10,11], where a value <50% was considered to indicate low heterogeneity. A formal publication bias assessment was not performed because of the nature of the pooled results derived from the studies.
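The random-effects pooling and heterogeneity statistics referenced above follow the DerSimonian-Laird method [9] and the I2 measure [10]. The sketch below shows those textbook formulas in plain Python; it is illustrative only (the authors performed the analysis in CMA software), and the example inputs are invented, not study data. Proportions such as accuracy are typically pooled on a transformed (e.g., logit) scale and back-transformed.

```python
def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate with Cochran's Q and I^2 (%)."""
    w = [1.0 / v for v in variances]            # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * ei for wi, ei in zip(w, effects)) / sw
    q = sum(wi * (ei - fixed) ** 2 for wi, ei in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)               # between-study variance estimate
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * ei for wi, ei in zip(w_star, effects)) / sum(w_star)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, tau2, q, i2

# Invented example: three study effects with equal within-study variance
pooled, tau2, q, i2 = dersimonian_laird([1.0, 2.0, 6.0], [1.0, 1.0, 1.0])
```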
To compare the diagnostic performance of CNN to physician endoscopists, we did a subgroup analysis comparing the pooled performances of these 2 groups of datasets. All analyses were performed using Comprehensive Meta-Analysis (CMA) software, version 3 (BioStat, Englewood, NJ).
The literature search resulted in 4245 study hits (study search and selection flowchart: Supplementary Fig. 1). All 4245 studies were screened and 106 full-length articles and/or abstracts were assessed. Five studies were included in the final analysis [12-16] (Table 1).
Table 1 Study characteristics
The following diagnostic tests were used to confirm the presence or absence of H. pylori infection: 1) H. pylori density by histology; 2) serum H. pylori IgG assay; 3) fecal H. pylori antigen test; and 4) urine H. pylori IgG assay. Further information is provided in Table 1.
From all the included studies, we were able to extract a total of 9 contingency table datasets for CNN and 4 for physician endoscopists’ performance in detecting/diagnosing H. pylori infection. Values available for analysis were for accuracy, sensitivity and specificity. None of the studies reported data on positive or negative predictive value.
The pooled accuracy of CNN in the computer-aided diagnosis of H. pylori infection was 87.1% (95% confidence interval [CI] 81.8-91.1), the pooled sensitivity was 86.3% (95%CI 80.4-90.6) and specificity was 87.1% (95%CI 80.5-91.7). Forest plots are shown in Fig. 1-3.
Figure 1 Forest plot comparing the accuracy of artificial intelligence (AI) vs. physicians in the detection of Helicobacter pylori infection. CI, confidence interval
Figure 2 Forest plot comparing the sensitivity of artificial intelligence (AI) vs. physicians in the detection of Helicobacter pylori infection. CI, confidence interval
Figure 3 Forest plot comparing the specificity of artificial intelligence (AI) vs. physicians in the detection of Helicobacter pylori infection. CI, confidence interval
The pooled accuracy of physician endoscopists in the diagnosis of H. pylori infection was 82.9% (95%CI 76.7-87.7), the pooled sensitivity was 79.6% (95%CI 68.1-87.7) and specificity was 83.8% (95%CI 72-91.3) (Fig. 1-3). Based on a non-causal subgroup method of comparison, CNN appeared to perform comparably to physicians in terms of accuracy, sensitivity and specificity (P=0.2, P=0.2, and P=0.6, respectively) (Table 2).
Table 2 Summary of results Pooled results (95% confidence intervals)
To assess whether any one study had a dominant effect on the meta-analysis, we excluded one study at a time and analyzed its effect on the main summary estimate. In this analysis, no single study significantly affected the outcome or the heterogeneity.
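The leave-one-out influence check described above can be sketched as follows. This is illustrative only (the authors used CMA software), and a simple fixed-effect inverse-variance pool stands in for the full random-effects model; the example inputs are invented.

```python
def inverse_variance_pool(effects, variances):
    """Simple inverse-variance pooled estimate (stand-in for the random-effects model)."""
    w = [1.0 / v for v in variances]
    return sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)

def leave_one_out(effects, variances):
    """Recompute the pooled estimate with each study excluded in turn."""
    return [inverse_variance_pool(effects[:k] + effects[k + 1:],
                                  variances[:k] + variances[k + 1:])
            for k in range(len(effects))]

# Invented example: if no single exclusion moves the pooled estimate far from
# the all-studies value, no study dominates the meta-analysis.
influence = leave_one_out([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```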
A large degree of between-study heterogeneity was expected, given the broad nature of machine learning algorithms and endoscopic optics included in this study. This is reflected in our I2 values (Table 2). Prediction interval statistics were not calculated because of the expected large degree of heterogeneity and the fact that the goal was not to provide precise point estimates.
Publication bias assessment largely depends on the sample size and the effect size. A publication bias assessment was deferred in this study because the final number of studies included in the analysis was fewer than 10.
The quality of evidence was rated for results from the meta-analysis according to the GRADE working group approach [17]. Observational studies begin with a low-quality rating; based on the risk of bias, indirectness, heterogeneity and publication bias, this meta-analysis would be considered low-quality evidence.
To the best of our knowledge, this is the first systematic review and meta-analysis to assess CNN-based computer-aided diagnosis of H. pylori infection. Based on our analysis, CNN-based deep machine learning demonstrated a pooled accuracy of 87.1%, a sensitivity of 86.3% and a specificity of 87.1% in the computer-aided diagnosis of H. pylori infection based on endoscopic images.
The pooled accuracy, sensitivity and specificity of CNN in the detection of H. pylori infection were each approximately 87%. The corresponding pooled values for physician endoscopists were 83%, 80% and 84%, respectively. Based on a non-causal subgroup comparison, the pooled accuracy, sensitivity and specificity seemed comparable between AI and physicians. Although these estimates seem to support the claim that deep learning algorithms can match physician-level diagnostic performance, several methodological limitations need to be kept in mind.
The included studies evaluated the performance of CNN under experimental conditions rather than in real-life clinical practice. Prospective studies are lacking. Only high-quality images were used to train the CNN. Procedural limitations such as insufficient air insufflation, post-biopsy bleeding, halation, blur, defocus or mucus can all affect the accuracy of computer-aided diagnosis in a real clinical setting. Not all studies reported comparison outcomes with physician endoscopists. There was variability in the choice of thresholds used to report sensitivity and specificity. There was a lack of uniformity in the validation of the algorithm’s training process before it was used for testing.
How does our study compare to currently published data? Although there are no other current meta-analyses evaluating the use of CNN in the diagnosis of H. pylori infection, a recently published review by Liu et al [4] evaluated the performance of deep machine learning and compared it to healthcare professionals in detecting diseases from medical imaging. They found the diagnostic performance of deep learning models to be equivalent to that of healthcare professionals and the pooled results were comparable to this study.
The strength of this review lies in the careful selection of studies reporting on machine-based learning, limited solely to CNN-based algorithms and avoiding redundant studies. There are limitations to this study, most of which are inherent to any meta-analysis. The included studies were not entirely representative of the general population or community practice, as most were performed in an experimental environment. Our analysis included retrospective studies, contributing to selection bias.
Our analysis has the limitations of non-causal comparison and heterogeneity. We were unable to statistically ascertain a cause for the observed heterogeneity. However, we hypothesize that it was primarily due to the following variables: the threshold cutoffs used, the different training algorithms and methodologies employed, and the variability in endoscopic optics (standard white light, blue laser imaging, linked color imaging). There remains a considerable degree of uncertainty owing to the small dataset, as only 5 studies were available for analysis. Nevertheless, this study is the best available pooled evaluation of the diagnostic performance of CNN in the computer-aided assessment of H. pylori infection thus far.
In conclusion, based on our meta-analysis, deep machine learning by means of convolutional neural network-based algorithms demonstrates pooled accuracy, sensitivity and specificity of approximately 87% in the computer-aided diagnosis of H. pylori infection based on endoscopic imaging. CNN seems to demonstrate accuracy, sensitivity and specificity comparable to those of physician endoscopists. Deep learning in gastroenterology is in its infancy and is witnessing rapid growth in terms of learning as well as technological development. Future studies are needed to streamline the machine-learning process and define its role in the computer-aided diagnosis of H. pylori infection in real-life clinical scenarios.
What is already known:
There is no other meta-analysis evaluating the use of artificial intelligence (AI) based on convolutional neural networks (CNN) in the endoscopic image-based diagnosis of Helicobacter pylori (H. pylori)
A handful of studies have reported equivalent diagnostic performance between AI and physician endoscopists
What the new findings are:
Diagnostic accuracy, sensitivity and specificity are >85% with AI based on CNN in the diagnosis of H. pylori
Diagnostic accuracy of CNN seemed comparable to physician endoscopists based on non-causal subgroup comparison
The authors would like to thank Dana Gerberi, MLIS, Librarian, Mayo Clinic Libraries, for help with the systematic literature search, and Unnikrishnan Pattath, BTECH, MBA, Artificial intelligence solutions, Bangalore, India, for help with technical details on convolutional neural network algorithms.
1. Sugano K. Effect of Helicobacter pylori eradication on the incidence of gastric cancer: a systematic review and meta-analysis. Gastric Cancer 2019;
2. Uemura N, Okamoto S, Yamamoto S, et al. Helicobacter pylori infection and the development of gastric cancer. N Engl J Med 2001;
3. Chey WD, Leontiadis GI, Howden CW, Moss SF. ACG Clinical Guideline: treatment of Helicobacter pylori infection. Am J Gastroenterol 2017;
4. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 2019;
5. Byrne MF, Chapados N, Soudan F, et al. Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model. Gut 2019;
6. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. Neural Information Processing Systems Foundation 2012;1269.
7. Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
8. Stroup DF, Berlin JA, Morton SC, et al. Meta-analysis of observational studies in epidemiology: a proposal for reporting. Meta-analysis of observational studies in epidemiology (MOOSE) group. JAMA 2000;
9. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;
10. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;
11. Mohan BP, Adler DG. Heterogeneity in systematic review and meta-analysis: how to read between the numbers. Gastrointest Endosc 2019;
12. Shichijo S, Endo Y, Aoyama K, et al. Application of convolutional neural networks for evaluating Helicobacter pylori infection status on the basis of endoscopic images. Scand J Gastroenterol 2019;
13. Itoh T, Kawahira H, Nakashima H, Yata N. Deep learning analyzes Helicobacter pylori infection by upper gastrointestinal endoscopy images. Endosc Int Open 2018;
14. Nakashima H, Kawahira H, Kawachi H, et al. Artificial intelligence diagnosis of Helicobacter pylori infection using blue laser imaging-bright and linked color imaging: a single-center prospective study. Ann Gastroenterol 2018;
15. Shichijo S, Nomura S, Aoyama K, et al. Application of convolutional neural networks in the diagnosis of Helicobacter pylori infection based on endoscopic images. EBioMedicine 2017;
16. Zheng W, Zhang X, Kim JJ, et al. High accuracy of convolutional neural network for evaluation of Helicobacter pylori infection based on endoscopic images: preliminary experience. Clin Transl Gastroenterol 2019;
17. Puhan MA, Schünemann HJ, Murad MH, et al; GRADE Working Group. A GRADE Working Group approach for rating the quality of treatment effect estimates from network meta-analysis. BMJ 2014;