Machine learning in predicting treatment response and remission in inflammatory bowel disease: a systematic review

Sheza Malik, Renisha Redij, Dushyant Singh Dahiya, Chengu Niu, Douglas G. Adler

Emory University Hospital, Atlanta, Georgia; Trinity Health Livonia Hospital, Michigan; University of Kansas, Kansas City; University of Nebraska, Omaha, Nebraska, USA; Advent Health, Denver, Colorado, USA

Gastroenterology and Hepatology, Emory University Hospital, Atlanta, Georgia, USA (Sheza Malik, Chengu Niu); Internal Medicine, Trinity Health Livonia Hospital, Michigan, USA (Renisha Redij); Gastroenterology and Hepatology, University of Kansas, Kansas City, Kansas, USA (Dushyant Singh Dahiya); Gastroenterology and Hepatology, University of Nebraska, Omaha, Nebraska, USA (Chengu Niu); Gastroenterology and Hepatology, Center for Advanced Therapeutic Endoscopy, Centura Health, Denver, Colorado, USA (Douglas G. Adler)

Correspondence to: Douglas G. Adler, MD, FACG, AGAF, FASGE, Director, Center for Advanced Therapeutic Endoscopy, Advent Health, Denver, CO, USA, e-mail: dougraham2001@gmail.com
Received 13 April 2025; accepted 24 December 2025; published online 12 February 2026
DOI: 10.20524/aog.2026.1041
© 2026 Hellenic Society of Gastroenterology

Abstract

Background The heterogeneity of inflammatory bowel disease (IBD) and its unpredictable course make it challenging for gastroenterologists to predict treatment response using endoscopic techniques. Machine learning (ML) models have shown early promise in predicting treatment response in IBD patients.

Methods We conducted a systematic review of studies investigating the application of ML to predict treatment response and remission in IBD patients. We used the CHARMS checklist for data extraction. Bias was assessed with the PROBAST tool.

Results We included in our review 6 studies that evaluated cohorts of 67-3004 IBD patients. ML models demonstrated low to moderate predictive accuracy for treatment response and remission (area under the receiver operating characteristic curve: 0.489-0.811; sensitivity: 0.46-0.96; specificity: 0.56-0.98). Studies whose ML models used more input variables performed better. Furthermore, only 2 studies performed external validation, and half of the studies demonstrated a substantial risk of bias because of missing data, overfitting, and variability in outcome definitions.

Conclusions ML models show considerable promise in predicting treatment outcomes and remission in IBD. However, given the substantial bias in studies to date, future studies should use standardized methodology, external validation, and a broader, interpretable set of input variables.

Keywords Machine learning, inflammatory bowel disease, treatment, monitoring, response

Ann Gastroenterol 2026; 39 (2): 247-253


Introduction

Inflammatory bowel disease (IBD), comprising Crohn’s disease (CD), ulcerative colitis (UC), and unclassified IBD, has a heterogeneous nature, making it very challenging for gastroenterologists to predict treatment remission and response on the basis of endoscopic scores [1,2]. The unpredictable response to various commonly used pharmacological therapies further complicates clinical management [3]. Machine learning (ML) models have shown early promise in overcoming these challenges.

ML models utilize large and diverse datasets to recognize meaningful patterns in clinical, biochemical, and imaging data [4]. The methods include supervised, unsupervised, and ensemble learning algorithms. ML models have shown potential in predicting treatment response and remission in IBD patients [5]. These models have utilized multiple input variables (electronic health records, serum markers, genomics, radiographic, histological, and endoscopic data) to predict the disease activity and response to the treatment.

Our study aimed to evaluate the performance of ML models in predicting treatment response and remission in patients with IBD. By examining the methodological quality and predictive accuracy of these models, we assessed the current state of ML for IBD.

Materials and methods

In our systematic review, we followed the Preferred Reporting Items for Systematic Reviews and Meta-analysis (PRISMA) statement [6]. The details of the PRISMA checklist are provided in Supplementary Table 1.

Literature search

We conducted a comprehensive database search in February 2025, including Ovid Medline, Ovid EMBASE, Scopus, and Web of Science. With input from our team, an experienced medical librarian designed the search strategy, using controlled vocabulary and specific keywords for ML, artificial intelligence, IBD, CD, UC, treatment response prediction, clinical remission, and biologic therapy.

Eligibility criteria

Studies were selected based on predefined inclusion and exclusion criteria. No a priori exclusions were applied based on the type of predictors. Studies incorporating clinical, biochemical, endoscopic, imaging, or genetic variables were eligible, provided that these features were available at or before the treatment decision. Only peer-reviewed journal articles published in English were considered. In addition, in our review, we considered any predictive model utilizing data-driven optimization, regularization, or non-linear pattern recognition beyond standard statistical regression (e.g., penalized regression methods) as ML models.

Exclusion criteria included studies that used traditional statistical models and studies that focused only on IBD diagnosis. Additionally, case reports, conference abstracts, and review articles without primary data were excluded.

Selection process

Two authors, SM and RR, independently reviewed the titles and abstracts of studies returned by the primary search. Studies that did not address the research question were excluded. The full texts of the remaining articles were then examined. Any discrepancies in selecting articles were resolved through consensus and discussion with another co-author, DSD.

Data extraction and quality assessment

In accordance with the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS checklist) [7], 2 authors, SM and RR, independently extracted data. Baseline characteristics were gathered, including year of publication, study design, patient population, and sample size. Data on ML methods, model validation strategies, performance metrics, and input variables—including clinical markers, biochemical parameters, endoscopic scores, genetic features, and imaging findings—were also gathered.

Risk of bias assessment

The study co-authors (SM, RR) assessed each study for risk of bias according to the TRIPOD recommendations (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) [8], using the PROBAST tool (Prediction model Risk Of Bias Assessment Tool) [9]. Studies were assessed across 4 key areas: (i) participants; (ii) predictors; (iii) outcome; and (iv) analysis. For each domain, the risk of bias and applicability to the intended clinical setting were evaluated. Discrepancies were resolved through consensus among the coauthors.

Statistical analysis

Because of the heterogeneity in ML models, input variables, outcome definitions, and validation strategies, a meta-analysis could not be conducted. Instead, a descriptive synthesis was performed. We also separated the studies based on validation type (internal vs. external), allowing for an assessment of model generalizability [10].

Results

Study selection

A total of 29 studies were identified by our search strategy, of which 6 were included in the final analysis [11-16]. There was a high degree of agreement between the 2 reviewers (SM, RR) regarding the inclusion of studies (Cohen’s κ: 0.977, 95% confidence interval: 0.81-1.00). Fig. 1 shows the PRISMA flowchart for study identification and selection.
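For illustration, the chance-corrected agreement statistic reported above can be computed as follows. This is a minimal Python sketch using hypothetical screening decisions, not the review’s actual data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters had decided independently
    pa, pb = Counter(rater_a), Counter(rater_b)
    expected = sum((pa[c] / n) * (pb[c] / n) for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Two reviewers screening 10 hypothetical abstracts (1 = include, 0 = exclude)
reviewer_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
reviewer_2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
print(cohens_kappa(reviewer_1, reviewer_2))  # -> 0.8 (substantial agreement)
```

A κ near 1, as observed here, indicates agreement well beyond what overlapping inclusion rates alone would produce.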


Figure 1 PRISMA flowchart

From: Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. doi: 10.1136/bmj.n71

For more information, visit: http://www.prisma-statement.org/

Characteristics of studies

Our review encompasses 6 studies that investigated ML for predicting treatment response and remission in IBD [11-16]: 3 retrospective analyses and 3 post hoc ML analyses of phase III multicenter, randomized placebo-controlled clinical trials. Participant numbers ranged from 67 to 3004, and the average age was 31 years. The baseline characteristics of these studies are summarized in Table 1.

Table 1 Study characteristics and validation details


ML approaches

The included studies employed a range of supervised learning algorithms, with a predominance of ensemble methods, particularly random forest (RF) and extreme gradient boosting (XGBoost). None of the included studies used deep learning models, probably because of the limited availability of large-scale datasets.

Input features used in ML models

The included studies varied in their selection of predictive input features. Commonly used biochemical markers included C-reactive protein, fecal calprotectin and white blood cell count. Clinical variables such as age, disease duration, prior treatment failure and medication history were frequently incorporated into machine-learning models. Some studies included endoscopic findings, using scoring systems such as the Simple Endoscopic Score for Crohn’s Disease, Ulcerative Colitis Endoscopic Index of Severity scores, and the Mayo Endoscopic Score.

Performance of ML models

In our review, ML models showed variable performance across studies (area under the receiver operating characteristic curve [AUROC] 0.489-0.811), with AUROC 0.70-0.79 interpreted as acceptable/moderate and ≥0.80 as strong/high [17].
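The AUROC has a useful probabilistic reading: it equals the chance that a randomly chosen responder receives a higher model score than a randomly chosen non-responder. A minimal Python sketch with hypothetical model scores (not data from the included studies):

```python
def auroc(scores, labels):
    """AUROC = P(score of a random positive > score of a random negative);
    ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of remission; 1 = remission achieved
scores = [0.9, 0.8, 0.6, 0.35, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0]
print(round(auroc(scores, labels), 3))  # -> 0.667, low-to-moderate discrimination
```

By this reading, the lowest AUROC in our review (0.489) is essentially a coin flip, while the highest (0.811) crosses the conventional threshold for strong discrimination.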

The ensemble learning algorithms (e.g., Random Forest and XGBoost) consistently performed better than simpler models, probably because of their ability to capture complex, nonlinear relationships. In our review, sensitivity ranged from 0.462 to 0.964 and specificity from 0.56 to 0.98. The details of the various ML models are summarized in Table 2.
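The sensitivity and specificity ranges above follow directly from each model’s confusion matrix at its chosen decision threshold. A brief sketch with made-up responder labels and predictions, for illustration only:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical responders (1) vs. non-responders (0) and model predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(sensitivity_specificity(y_true, y_pred))  # -> (0.75, 0.75)
```

Because both metrics depend on the threshold, reported values are not directly comparable across studies unless the threshold selection strategy is also reported.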

Table 2 Performance of ML models


Internal validation was performed in 4 studies, whereas only 2 studies implemented external validation. The handling of missing data was inconsistent across studies. The patient populations (CD vs. UC) and the definitions of treatment response/remission also varied across the included studies, as did the therapeutic agents assessed.

Risk of bias and methodological limitations

In the Participants domain of the PROBAST tool, all studies (6/6, 100%) were considered to have a low risk of bias. Concerns regarding applicability, however, were unclear in 4/6 studies (67%), because of insufficient information on inclusion and exclusion criteria.

In the Predictors domain of the PROBAST tool, all studies clearly defined predictor variables. However, 2 studies (33%) [11,12] had an unclear risk of bias, as they lacked sufficient information on whether predictors were assessed independently of outcome data. The concern regarding the applicability of the predictor domain was low across all 6 studies.

The Outcomes domain showed a low risk of bias and low applicability concerns in all 6 studies (100%). For the Analysis domain, substantial concerns were identified. Three of 6 studies (50%) showed a high risk of bias, due to unclear handling of continuous and categorical variables, and a lack of information on how missing data, class imbalance, and potential non-linearity/time dependence were managed. None of the studies addressed potential overfitting in their models. The PROBAST details are provided in Table 3.

Table 3 PROBAST scoring


Discussion

Our study highlights the potential role of ML in predicting treatment response and remission in people with IBD. Across 6 studies, ML models showed moderate predictive performance, especially when multivariate clinical and biochemical input variables, such as biomarkers, endoscopic scores, and genetic markers, were used. However, the substantial heterogeneity across studies still limits the current clinical applicability of these models and the potential for their widespread adoption.

Previous systematic reviews in IBD have examined the potential of ML as a tool to aid in the diagnosis of IBD, with limited investigation of ML’s role in predicting treatment response and remission in patients already diagnosed with IBD [18,19]. We assessed ML models designed to predict therapy response and remission in IBD. Of all the algorithms evaluated, the ensemble ML models, Extreme Gradient Boosting (XGBoost) and Random Forest, outperformed simpler linear models, consistent with their track record in prognostic modeling in oncology and other chronic disorders [20]. Nevertheless, none of the studies used explainable artificial intelligence (XAI) methods, which show how complex models arrive at their predictions. A popular scheme, SHapley Additive exPlanations (SHAP), quantifies how much each variable contributes to each individual prediction, making model outputs more transparent. In other areas, such as cardiovascular risk estimation, XAI has been associated with greater clinician confidence [21], highlighting its significance for complex conditions like IBD.
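To make the SHAP idea concrete, the exact Shapley attribution that SHAP approximates can be computed by brute force for a tiny model. The linear "risk model," feature names, and baseline below are purely illustrative assumptions, not taken from any included study:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley attributions for one prediction: features outside a
    coalition are held at their baseline value (a common SHAP convention)."""
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(d - size - 1) / factorial(d)
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(d)]
                without = [x[j] if j in subset else baseline[j] for j in range(d)]
                phi[i] += weight * (model(with_i) - model(without))
    return phi

# Toy "risk model" over 2 features (e.g., a biomarker level and disease duration)
model = lambda f: 2 * f[0] + 3 * f[1] + 1
phi = shapley_values(model, x=[1, 2], baseline=[0, 0])
print(phi)  # -> [2.0, 6.0]; attributions sum to model(x) - model(baseline)
```

This exhaustive computation scales exponentially in the number of features; the SHAP library exists precisely to approximate these attributions efficiently for high-dimensional clinical models.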

In our review, approximately half of the included studies had a high risk of bias, mainly due to inadequate statistical methodology. This finding is consistent with previous studies, which have highlighted the importance of transparent model reporting as outlined in the TRIPOD statement [22]. Most studies in our review were retrospective, single-center studies, raising concerns about overfitting and generalizability [23]. Additionally, heterogeneity in the definitions of clinical response and remission (such as clinical vs. biochemical), and in outcome measures, further complicates cross-study comparisons [24].

Future research should prioritize 3 key directions. First, prospective multicenter trials following the TRIPOD framework [8] are essential for validating ML models in real-world IBD populations. Second, reaching consensus on core biomarkers and remission criteria will harmonize predictors and endpoints across studies, facilitating external validation and meta-analytic synthesis [25]. Third, integrating explainability tools should enhance clinicians’ understanding and build confidence in ML-generated outputs, rather than substituting for methodological rigor [26]. Despite moderate accuracy, existing models are largely limited to predicting response to individual biologic agents (e.g., vedolizumab or infliximab). The next generation of models must move beyond single-drug predictions to enable comparative forecasting across multiple therapeutic options, thereby supporting precision therapy selection in routine clinical practice.

In conclusion, ML represents a potentially transformative opportunity to advance precision medicine in IBD. Its clinical integration will depend on rigorous prospective validation, adoption of standardized reporting frameworks, and close interdisciplinary collaboration among data scientists, clinicians, and methodologists to bridge the gap between algorithmic innovation and tangible patient benefit.

Summary Box

What is already known:

  • Predicting treatment response in inflammatory bowel disease (IBD) is difficult, in view of the variability of the disease

  • Traditional tools like endoscopy and biomarkers have limited predictive power

  • Machine learning (ML) has shown promise in IBD diagnosis, but its role in predicting remission is less explored

  • Existing studies lack consistency in methods and validation

What the new findings are:


  • ML models, especially Random Forest and XGBoost, show moderate-to-good accuracy in predicting IBD treatment outcomes

  • Multivariate data sets improve model performance

  • Only a third of studies used external validation; half had a high risk of bias

  • Standardization and prospective validation are key for clinical use

References

1. Vasudevan A, Gibson PR, van Langenberg DR. Time to clinical response and remission for therapeutics in inflammatory bowel diseases: what should the clinician expect, what should patients be told? World J Gastroenterol 2017;23:6385-6402.

2. D'Incà R, Sturniolo G. Biomarkers in IBD: what to utilize for the diagnosis? Diagnostics (Basel) 2023;13:2931.

3. Plevris N, Lees CW. Disease monitoring in inflammatory bowel disease: evolving principles and possibilities. Gastroenterology 2022;162:1456-1475.

4. Pinto-Coelho L. How artificial intelligence is shaping medical imaging technology: a survey of innovations and applications. Bioengineering (Basel) 2023;10:1435.

5. Kraszewski S, Szczurek W, Szymczak J, Reguła M, Neubauer K. Machine learning prediction model for inflammatory bowel disease based on laboratory markers. Working model in a discovery cohort study. J Clin Med 2021;10:4745.

6. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst Rev 2021;10:89.

7. Moons KG, de Groot JA, Bouwmeester W, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med 2014;11:e1001744.

8. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.

9. Wolff RF, Moons KGM, Riley RD, et al; PROBAST Group. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 2019;170:51-58.

10. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22:276-282.

11. Chen J, Girard M, Wang S, Kisfalvi K, Lirio R. Using supervised machine learning approach to predict treatment outcomes of vedolizumab in ulcerative colitis patients. J Biopharm Stat 2022;32:330-345.

12. Harun R, Lu J, Kassir N, Zhang W. Machine learning-based quantification of patient factors impacting remission in patients with ulcerative colitis:insights from etrolizumab phase III clinical trials. Clin Pharmacol Ther 2024;115:815-824.

13. Miyoshi J, Maeda T, Matsuoka K, et al. Machine learning using clinical data at baseline predicts the efficacy of vedolizumab at week 22 in patients with ulcerative colitis. Sci Rep 2021;11:16440.

14. Morikubo H, Tojima R, Maeda T, et al. Machine learning using clinical data at baseline predicts the medium-term efficacy of ustekinumab in patients with ulcerative colitis. Sci Rep 2024;14:4386.

15. Qiu Y, Hu S, Chao K, et al. Developing a machine-learning prediction model for infliximab response in Crohn's disease: integrating clinical characteristics and longitudinal laboratory trends. Inflamm Bowel Dis 2025;31:1334-1343.

16. Waljee AK, Wallace BI, Cohen-Mekelburg S, et al. Development and validation of machine learning models in prediction of remission in patients with moderate to severe Crohn disease. JAMA Netw Open 2019;2:e193721.

17. Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regression. 3rd ed. Wiley; 2013.

18. Zulqarnain F, Rhoads SF, Syed S. Machine and deep learning in inflammatory bowel disease. Curr Opin Gastroenterol 2023;39:294-300.

19. Pei J, Wang G, Li Y, et al. Utility of four machine learning approaches for identifying ulcerative colitis and Crohn's disease. Heliyon 2024;10:e23439.

20. Stidham RW, Takenaka K. Artificial intelligence for disease assessment in inflammatory bowel disease: how will it change our practice? Gastroenterology 2022;162:1493-1506.

21. Al-Droubi SS, Jahangir E, Kochendorfer KM, et al. Artificial intelligence modelling to assess the risk of cardiovascular disease in oncology patients. Eur Heart J Digit Health 2023;4:302-315.

22. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.

23. Aliferis C, Simon G. Overfitting, underfitting and general model overconfidence and under-performance pitfalls and best practices in machine learning and AI. In: Simon GJ, Aliferis C, editors. Artificial intelligence and machine learning in health care and medical sciences: best practices and pitfalls. Cham (CH): Springer; 2024.

24. Ma C, Panaccione R, Fedorak RN, et al. Heterogeneity in definitions of endpoints for clinical trials of ulcerative colitis: a systematic review for development of a core outcome set. Clin Gastroenterol Hepatol 2018;16:637-647.

25. Bova G, Domenichiello A, Letzen JE, et al. Developing consensus on core outcome sets of domains for acute, the transition from acute to chronic, recurrent/episodic, and chronic pain: results of the INTEGRATE-pain Delphi process. EClinicalMedicine 2023;66:102340.

26. Agrawal R, Gupta T, Gupta S, Chauhan S, Patel P, Hamdare S. Fostering trust and interpretability: integrating explainable AI (XAI) with machine learning for enhanced disease prediction and decision transparency. Diagn Pathol 2025;20:105.

Notes

Conflict of Interest: DGA: Consultant, Boston Scientific. All other authors: No conflict of interest