R/Medicine Abstracts

Abstracts are listed by day. Click a talk title on the Program page to jump directly to its abstract.


Meeting Day 1 — Thursday, May 7, 2026

Bridging the Gap Between Spreadsheets and Shiny: Interactive Clinical Data Entry with esq.handsontable

Anastasiia Kostiv | ESQlabs

Regular Talk — Thursday, May 7, 2026

Clinical researchers frequently rely on spreadsheets for data entry due to their familiar, interactive editing experience. Yet spreadsheets lack validation enforcement, audit trails, and integration with analytical pipelines. Existing R/Shiny table widgets — such as DT (read-only) and rhandsontable (largely unmaintained) — leave a gap: there is no actively maintained, feature-rich solution for structured data entry within Shiny applications.

We present esq.handsontable, an open-source R package developed at esqlabs GmbH that brings Excel-like editing to Shiny with capabilities designed for clinical and life-science workflows. The package wraps the Handsontable JavaScript library via a React frontend and provides validated dropdowns, multi-select fields with drag-and-drop sorting, conditional cell disabling based on other cell values, dynamic option updates at runtime, and row-level action buttons — all configured declaratively from R.


MetaInsight v7: A Comprehensive and Reproducible Shiny App for Network Meta-analysis

Simon Smart | University of Leicester

Regular Talk — Thursday, May 7, 2026

Network meta-analysis (NMA) enables comparisons between treatments in different studies, even when no direct comparisons have been made, thereby determining the ‘best’ intervention for patients. Researchers may lack the statistical knowledge or programming expertise to conduct NMA in R, limiting uptake of cutting-edge methods. MetaInsight is an R Shiny web app that was developed to enable NMA to be conducted through a user-friendly interface and has been cited over 300 times.

We present a major update (https://crsu.shinyapps.io/MetaInsight_Scholar/) focussed on enabling reproducibility by refactoring the app using the shinyscholar template. Analyses can now be stored permanently as a Quarto or HTML document, and can also be saved and restored from disk. Additionally, we have improved the quality of plot downloads, added the option to upload risk-of-bias data, developed a new interface for excluding studies from sensitivity analyses, made models refit automatically, and enabled data exports to the CINeMA app (Confidence in Network Meta-Analysis). As a by-product of enabling reproducibility, the functionality of the app is now available as an R package, providing a simple interface to the various R packages used to conduct NMA.

The talk will demonstrate the new features whilst conducting an NMA in MetaInsight and discuss some of the challenges associated with refactoring an existing codebase and developing apps that analyse data uploaded by users.


NIH: Not Invented Here: balancing unique software needs vs out-of-box (almost) perfect generic solution.

Dror Berel | Consultant

Regular Talk — Thursday, May 7, 2026

Biomedical researchers face persistent tension between building bespoke software solutions and adopting existing frameworks. This “Not Invented Here” (NIH) syndrome often leads to costly reinvention, yet unique regulatory and scientific demands sometimes justify custom approaches. This presentation examines this balance through Teal, the Pharmaverse’s open-source framework for clinical trial reporting. Teal’s “module-first” architecture provides a validated foundation while enabling domain-specific customization through modular components. The Module-First approach features four key characteristics: Reusability through study-agnostic analytical modules; Configuration-Driven customization via YAML rather than code changes; Rapid Deployment by combining existing modules; and Unified Experience through a meta-UI framework providing linked data filtering, standardized reporting, and consistent user experience. This significantly reduces development time while maintaining quality and consistency. Through real-world examples, I demonstrate how this approach addresses regulatory compliance, data traceability, and cross-functional collaboration requirements. The presentation explores three critical decision points for developers: when to build versus adopt; how to evaluate “almost perfect” solutions; and strategies for extending generic frameworks without compromising validation or maintainability.


CougarStats: An Interactive R-Based Platform for Applied Health Data Science

Ashok Krishnamurthy & Darren Law Yan Lun | Mount Royal University

Lightning Talk — Thursday, May 7, 2026

CougarStats (www.cougarstats.ca) is a web-based analytical platform designed to support applied health data science workflows using the R ecosystem. The application enables biomedical researchers and health analysts to explore datasets, fit statistical and predictive models, and generate publication-ready visualizations through an intuitive browser interface. Emphasis is placed on transparent computation, reproducibility, and interactive model exploration so that users can move efficiently from raw clinical or population health data to interpretable insights.

By lowering technical barriers while exposing the underlying R logic, CougarStats also serves as a practical bridge for attendees seeking to strengthen their R proficiency in real-world health analytics contexts. This presentation demonstrates how the platform can be used to accelerate exploratory analysis, generate data visualizations, and enhance methodological understanding in health data science settings.


Toggling Between Traditional and Chat-Based Data Filters in Shiny

Alex Zajichek | Cleveland Clinic

Lightning Talk — Thursday, May 7, 2026

Packages like {ellmer} and {querychat} have opened new doors for how users can interact with data within Shiny applications. In this talk, we’ll provide a brief introduction to these packages and explain their clever advantages for data security and statistical integrity. We’ll then walk through an example of implementing a toggle in the application’s user interface (UI) that lets the user switch between traditional data filters and an LLM-based chat interface for filtering map, table, and plot components, in the context of the Hospital Readmissions Reduction Program (HRRP).
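As a sketch of the toggle idea, the snippet below switches between a conventional filter panel and a free-text prompt using only base Shiny (`radioButtons()` plus `conditionalPanel()`). The dataset `hrrp_data` and the chat wiring are placeholders for illustration, not the speaker's implementation; in practice the chat branch would be handed to {querychat}/{ellmer}.

```r
library(shiny)

# Minimal toggle sketch; hrrp_data is a hypothetical data frame with a
# State column -- replace with your own data and chat backend.
ui <- fluidPage(
  radioButtons("mode", "Filter mode",
               choices = c("Traditional" = "ui", "Chat" = "chat")),
  # Shown only when the traditional mode is selected
  conditionalPanel(
    condition = "input.mode == 'ui'",
    selectInput("state", "State", choices = c("OH", "PA", "NY"))
  ),
  # Shown only when the chat mode is selected
  conditionalPanel(
    condition = "input.mode == 'chat'",
    textAreaInput("prompt", "Describe the subset you want")
  ),
  tableOutput("tbl")
)

server <- function(input, output, session) {
  filtered <- reactive({
    if (input$mode == "ui") {
      subset(hrrp_data, State == input$state)
    } else {
      # Placeholder: pass input$prompt to an LLM-backed filter here
      hrrp_data
    }
  })
  output$tbl <- renderTable(head(filtered()))
}

# shinyApp(ui, server)
```

The same `reactive()` feeds the map, table, and plot outputs, so only the filtering step changes between modes.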


From Prompt to Proof: Governing LLM Intelligence in Pharma with Snowflake and Shiny

Tanya Cashorali | TCB Analytics

Regular Talk — Thursday, May 7, 2026

This talk presents a production-ready AI architecture for generating defensible competitive and clinical intelligence in the pharmaceutical industry. While large language models can summarize biomedical content, healthcare decisions demand traceability, source attribution, and reproducibility. We demonstrate how to transform LLM outputs into governed, auditable, and reusable insights aligned with real-world regulatory and strategic requirements.

Our approach separates probabilistic AI reasoning from deterministic data execution. The intelligence pipeline lives natively in Snowflake and is exposed through REST API endpoints, making it fully language agnostic. R, Python, or other clients can interface with the same services while keeping the user interface decoupled from orchestration and core logic.

We showcase LLM-based question classification, structured entity extraction, semantic caching with embeddings, knowledge graph execution for grounded retrieval, and source-aware integration with PubMed and ClinicalTrials.gov. A Shiny interface deployed with Posit Team demonstrates how human-in-the-loop validation, audit trails, and versioned knowledge objects can support institutional memory and continuous monitoring. Attendees will learn reproducible design patterns for building trustworthy, scalable AI systems in regulated healthcare environments.


Report Agent: An LLM-Powered Framework for Automated Shell-to-Code Report Generation

Lucie Zhang | Genentech

Lightning Talk — Thursday, May 7, 2026

The development of clinical Tables, Listings, and Graphs (TLGs) from Statistical Analysis Plan (SAP) shells remains a labor-intensive and error-prone step in statistical programming workflows. Although the pharmaverse ecosystem has improved clinical reporting infrastructure, programmers still manually translate narrative shell specifications into executable R code, introducing inefficiencies, variability across programmers, and increased quality-control burden.

To address this challenge, we developed TLG-Agent, an LLM-driven framework for clinical research that automates the conversion of shell specifications into reproducible R code. TLG-Agent parses shell content—including titles, footnotes, table structure, column definitions, and summary statistics—and converts these elements into structured prompts for a Large Language Model. The generated code is executed within a controlled environment to produce TLG outputs, while an interactive R Shiny interface allows users to review prompts, inspect AI-generated code, and preview results in a transparent workflow.

In a proof-of-concept evaluation using a baseline demographics shell, TLG-Agent reproduced expected table layouts and summary statistics. Preliminary results suggest reduced development time, improved programming consistency, and strong potential for integrating LLM-based agents into clinical reporting workflows.


AI-Assisted Authoring of an Open-Access Portuguese Biostatistics Book in Quarto and R

Audrei Pavanello & Leonardo Pestillo de Oliveira | Cesumar Institute of Science, Technology and Innovation

Lightning Talk — Thursday, May 7, 2026

Teaching biostatistics to healthcare doctoral students presents a persistent challenge: students need rigorous statistical foundations, practical R skills, and real-world health data applications, but few open-access resources address all three simultaneously, particularly in Portuguese. This talk describes the development of “Bioestatística Aplicada à Saúde usando R,” a free, open-access biostatistics book built with Quarto and R, covering tidyverse data wrangling, ggplot2 visualization, hypothesis testing, regression modeling, psychometric analysis, and text mining with LLMs. All examples use real hospital admission data from DATASUS, Brazil’s public health information system. The book was developed using an AI-assisted authoring workflow powered by Claude Code, which substantially accelerated the production of reproducible code examples, chapter structure, and Quarto formatting. We discuss how this workflow was implemented, where it helped most, and its limitations for technical educational content requiring domain expertise. The resulting book is published via Zenodo (DOI: 10.5281/zenodo.18458781), hosted on GitHub Pages, and used in a PhD-level Health Promotion program. We argue that AI-assisted authoring, when anchored in subject-matter expertise, offers a reproducible model for developing open educational resources in R, particularly for non-English-speaking research communities underserved by existing materials.


Care and Feeding of Your Biostats Team: Scaling best practices in a large hybrid SAS/R team

John Ehrlinger | Cleveland Clinic

Regular Talk — Thursday, May 7, 2026

The Cardiovascular Outcomes Registries and Research (CORR) group at Cleveland Clinic supports observational cardiovascular outcomes research, producing 75–100 peer-reviewed manuscripts per year from 19 core staff — and scaling to that size exposes every shortcut that worked when one person held the workflow in their head. Our core deliverable is the Analysis Report: each hypothesis maps to a statistical approach, each approach to a SAS or R job, and outputs are manually assembled into a document researchers use to write papers. At this scale, inconsistency in report structure is a production problem, not an aesthetic one. We describe a framework mapping SAS conventions to R equivalents: batch jobs to Quarto documents, reducing the manual collation step; a shared macro library to internal R packages (ggRandomForests, hvtiPlotR, hvtiRutilities) standardizing ggplot2 themes and utilities; informal environment tracking to RStudio Projects with renv; and a 25-year social contract of naming conventions and folder standards we are reinforcing with automated checks — moving toward “trust, but verify.” The underlying motivation is reducing manual and cognitive load so biostatisticians can focus on science, making good practice the natural path for SAS and R users alike. We share what is working, where we have gaps, and how framing these investments as production reliability rather than methodological idealism changed conversations with leadership and colleagues.


Streamlined Access to NHANES Data: Fast, Harmonized, and Reliable

Natalie Neugaard & Amrit Baral | University of Miami

Lightning Talk — Thursday, May 7, 2026

The National Health and Nutrition Examination Survey (NHANES) offers nationally representative health, examination, and laboratory data on US adults aged 18 years and older spanning more than 25 years. Despite its value, working with NHANES data is often tedious. Researchers must reconcile inconsistent dataset structures and correctly specify complex survey weights when combining cycles. Unreliable CDC server access can also break reproducible pipelines, and the cycle suffix system turns data discovery into a scavenger hunt. The nhanesdata R package delivers fast, harmonized, analysis-ready NHANES tables. Public datasets are hosted on reliable cloud storage and pre-merged across survey cycles. The read_nhanes() function returns multi-cycle data with a cycle/year variable, consistent column classes, and a clean structure. Built-in search tools help users locate variables across the NHANES catalog without navigating suffix tables such as DEMO, DEMO_B through DEMO_L. The create_design() function constructs survey designs and recalculates multi-cycle weights following NHANES guidance. Data loads in seconds, allowing researchers to focus on analysis. The package is available on CRAN, and datasets are publicly hosted without requiring user authentication. The data can be used with any Arrow-compatible language. The presentation will demonstrate how researchers can retrieve harmonized NHANES datasets and construct valid survey designs with a few lines of code.
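The two functions named above might be used along the following lines; `read_nhanes()` and `create_design()` are the names given in the abstract, but the argument names shown here are assumptions for illustration, not the package's documented signature.

```r
library(nhanesdata)  # assumed installed from CRAN
library(survey)

# Pull demographics across several cycles; the 'years' argument is an
# assumption -- check ?read_nhanes for the actual interface.
demo <- read_nhanes("DEMO", years = c(2013, 2015, 2017))

# create_design() is described as constructing a survey design with
# recalculated multi-cycle weights following NHANES guidance.
des <- create_design(demo)

# From here, standard {survey} estimators apply, e.g. a weighted mean:
# svymean(~RIDAGEYR, des, na.rm = TRUE)
```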


Infodengue intramunicipal - An R package under development

Ayrton Gouveia & Thais Riback | Fundação Oswaldo Cruz

Lightning Talk — Thursday, May 7, 2026

Infodengue is a pioneering Brazilian project that uses R to monitor and forecast arboviruses (dengue, chikungunya, Zika) nationwide, informing data-driven decisions by the Ministry of Health. Infodengue is a successful surveillance system at the city (municipality) level; however, capturing transmission dynamics at the intramunicipal scale—crucial for targeted interventions—has been a major challenge, because sparse case notification data at this level makes robust analysis difficult. This work presents an R pipeline developed to tackle this challenge and discusses its successful implementation in generating Infodengue’s outputs (climate receptivity, alert level, and Rt) at the intramunicipal level for two Brazilian cities: Recife and Joinville. This pipeline is being consolidated into an R package to facilitate use and adaptation by other municipalities and researchers.

The pipeline operates by aggregating individual case notification data to the neighborhood level using two key variables: neighborhood ID and name. This requires a table containing these identifiers for all official neighborhoods in a municipality, along with population estimates to calculate incidence rates. Once neighborhood-level data is prepared, the pipeline can aggregate cases into larger spatial units that align with each city’s administrative structure—health districts in Joinville and administrative regions in Recife—enabling comparisons across different governance scales.
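The aggregation step described above can be sketched with standard {dplyr} verbs; the column names (`neigh_id`, `neigh_name`, `week`, `population`) and the input tables are hypothetical stand-ins, not the project's actual schema.

```r
library(dplyr)

# notifications: one row per case, tagged with neighborhood ID and name
# neighborhood_table: official neighborhoods with population estimates
cases_by_neigh <- notifications |>
  count(neigh_id, neigh_name, week, name = "cases")

incidence <- cases_by_neigh |>
  left_join(neighborhood_table, by = c("neigh_id", "neigh_name")) |>
  mutate(incidence_100k = 1e5 * cases / population)

# Neighborhoods can then be rolled up to larger administrative units
# (health districts in Joinville, administrative regions in Recife)
# with a further group_by()/summarise() over a unit lookup column.
```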


REDCapDM: Managing REDCap Data

João Carmezim | IGTP

Lightning Talk — Thursday, May 7, 2026

REDCap (Research Electronic Data Capture) is widely used in clinical research for data collection, but preparing REDCap exports for statistical analysis is often a laborious and time-consuming task. The R package REDCapDM was introduced at R/Medicine 2023 to streamline this process by providing tools for data import, transformation, query generation, and query follow-up. Since then, the package has been formally published in BMC Medical Research Methodology (https://doi.org/10.1186/s12874-024-02178-6) and has undergone a substantial redesign.

The most notable change in this release is a redesigned data transformation pipeline. Previously handled by a single function, the workflow has now been divided into a step-by-step process. This allows users to apply each transformation independently, inspect intermediate results at every stage, and transparently track how REDCap-exported data are converted into analysis-ready datasets. The new design also makes it easier to customize the pipeline to accommodate different REDCap project structures. This release also includes improved error handling, with more informative messages, and resolves previous inconsistencies.

These updates make REDCapDM a more robust, flexible, and transparent tool for preparing analysis-ready REDCap data within reproducible R workflows, supporting both researchers and developers working with clinical data. The package is available on CRAN and actively developed on GitHub.


Meeting Day 2 — Friday, May 8, 2026

bulkreadr: Fast, Reproducible Bulk Data Import for R-Based Research Workflows

Ezekiel Ogundepo | African Institute for Mathematical Sciences

Regular Talk — Friday, May 8, 2026

Research and public health data workflows often begin with fragmented inputs: multiple spreadsheets, delimited text files, and labelled survey extracts collected across sites, teams, or time points. Importing these files manually is inefficient, difficult to reproduce, and prone to error. Although the R ecosystem offers strong tools for reading individual file types, researchers still lack a simple, unified approach for bulk import across heterogeneous sources. bulkreadr is an open-source R package that streamlines this process through a consistent, vectorised interface built on packages such as readxl, readr, googlesheets4, and haven. It enables users to import all sheets in Excel workbooks or Google Sheets, read whole directories of CSV and related text files, preserve variable labels from SPSS and Stata files, and automatically generate data dictionaries and missing-data summaries. Outputs are returned as tibbles, making the package easy to integrate into tidyverse-based analytical pipelines. In a Multidimensional Poverty Research Project spanning 19 northern Nigerian states, bulkreadr reduced data-loading time from about 40 minutes of manual processing to under 3 minutes of scripted execution. Automated dictionary generation also improved questionnaire harmonisation across survey rounds and reduced downstream coding errors. By reducing friction at the data-ingestion stage, bulkreadr supports more reproducible, scalable, and transparent research workflows.
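A typical bulk-import session might look like the following; the function names match the package documentation as I recall it, but treat the details (especially argument names) as a sketch to verify against the current bulkreadr reference.

```r
library(bulkreadr)

# Import every sheet of an Excel workbook as one combined tibble
all_sheets <- read_excel_workbook(path = "site_data.xlsx")

# Read a whole directory of CSV extracts in a single call
survey_rounds <- read_csv_files_from_dir(dir_path = "data/rounds/")

# Generate a data dictionary (variable names, labels, value labels)
# to support questionnaire harmonisation across rounds
dict <- generate_dictionary(survey_rounds)
```

Because results come back as tibbles, the output drops straight into a tidyverse pipeline.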


From Chaos to Consistency: Building a Long-Lasting Polish Arthroplasty Registry Data Pipeline

Dominik Żabiński | Gruca Orthopaedic and Trauma Teaching Hospital

Lightning Talk — Friday, May 8, 2026

This talk examines the systematic transformation of Polish Arthroplasty Registry data analysis from a fragmented, manual practice into a coordinated and sustainable analytical infrastructure. Initially, analytical work relied on ad hoc R scripts circulated informally among analysts, resulting in inconsistent methods, limited reproducibility, and substantial manual effort. Collaboration with clinicians was similarly constrained by static Excel outputs that were difficult to maintain and update. To address these limitations, we introduced structured version control to formalize collaboration within the analytics team and ensure transparent, traceable development of analytical workflows. We then replaced static reporting with Quarto documents and interactive Shiny dashboards deployed on the Posit platform, enabling clinicians to access timely, reproducible, and dynamically updated results. To further enhance consistency and reduce duplication of effort, we developed a suite of in-house R packages that standardize data preparation, modeling procedures, and reporting formats across projects. Together, these interventions established a stable, multi-user pipeline that has supported reliable arthroplasty analytics for several years. The talk outlines the technical architecture, organizational changes, and methodological principles that enabled this transition, and reflects on the broader implications for data-driven collaboration in clinical settings.


Encapsulated Test Coverage: test.assessr – a package for R unit testing

Edward Gillian | Sanofi

Lightning Talk — Friday, May 8, 2026

The test.assessr package has been developed to address the growing need for a dedicated, flexible, and fully isolated unit‑testing infrastructure for R package validation. As validation needs evolved, the Sanofi Validation team identified a distinct requirement: a standalone package capable of generating robust, reproducible unit‑test coverage across diverse and sometimes highly unconventional testing environments. The motivation for test.assessr stems from both business and technical considerations. A new validation business use case requires the ability to assess R packages to remediate test coverage. Also, R package unit testing requires support for numerous non‑standard frameworks such as data.table, tinytest, RUnit, and test-cran. CRAN reviewers have further emphasized the need for isolated testing environments for unit test coverage. By externalizing testing functionality to test.assessr, we enable modularity, maintainability, and long‑term sustainability. Although cross‑package coordination introduces some complexity, these trade‑offs are outweighed by the clarity and stability gained through a dedicated testing architecture. This talk presents the design motivations, practical implementation insights, and lessons learned, contributing to ongoing conversations in the validation community about reproducible validation workflows and sustainable software engineering practices.


blockr: Clinical Data Analysis without Code

Christoph Sax | cynkra / University of Basel

Regular Talk — Friday, May 8, 2026

We present a clinical data analysis tool built with blockr, an open-source block-based visual programming framework for R, built on Shiny. The tool operates on CDISC ADaM datasets and provides multiple views for safety review: patient profile timelines with adverse events, laboratory panels, vital sign trends, and cohort-level summary tables with cross-filtering. An AI assistant allows users to ask natural-language questions about the data. Clinicians and medical monitors can conduct safety reviews without programming and without waiting for custom reports. blockr is a group of R packages that provides reusable analytical blocks covering data import, dplyr transformations, ggplot2 visualizations, and clinical displays. Users assemble these blocks into pipelines through a point-and-click interface. Because blocks are modular, the same tool can be reconfigured for different trials, endpoints, or therapeutic areas. Every interaction maps to generated R code, so analyses are reproducible and auditable. New blocks can be added by writing R functions. The tool has been developed in collaboration with Bristol Myers Squibb.


sumExtras: Streamlining Table Workflows in Clinical Research with R

Kyle Grealis | University of Miami

Lightning Talk — Friday, May 8, 2026

Creating publication-quality summary tables is a core task in clinical research. However, the repetitive formatting code required by standard R packages adds friction to reproducible workflows. Analysts often apply the same sequence of functions to every table by adding overall columns, calculating p-values, and incorporating stylized labels and headers. This repetition increases the risk of inconsistencies and adds a significant maintenance burden when project requirements change.

The sumExtras package addresses this by consolidating gtsummary formatting steps into a single, robust function call. By replacing multiple boilerplate functions with the extras() call, users can reduce styling code by nearly 50% while ensuring a standardized aesthetic. This package provides a labeling system that integrates data dictionaries with R attributes to ensure label consistency across gtsummary and gt tables and ggplot2 figures.
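To illustrate the kind of boilerplate being collapsed: the first pipeline below is a standard {gtsummary} sequence (real API, using its built-in `trial` dataset); the second shows the single-call style the abstract describes, where `extras()` is the entry point named by the package but any arguments are assumptions.

```r
library(gtsummary)
library(sumExtras)

# Conventional gtsummary workflow: several chained formatting calls
trial |>
  tbl_summary(by = trt) |>
  add_overall() |>
  add_p() |>
  bold_labels()

# The consolidated style described in the abstract -- one call
# replacing the formatting boilerplate above
trial |>
  tbl_summary(by = trt) |>
  extras()
```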

Key features include a warn-and-continue design to enable table rendering even when individual formatting steps fail. It also includes JAMA-style compact themes and project-level options for global labeling behavior. This talk will demonstrate how sumExtras integrates into clinical trial and academic reporting pipelines to reduce maintenance burden. Attendees will see how the package allows researchers to focus on analysis decisions rather than formatting semantics.


An R-Based Pipeline for Resource-Optimized Participant Selection in High-Cost Clinical Research

Aparna Bhattacharyya | Johns Hopkins University

Lightning Talk — Friday, May 8, 2026

Omics assays such as metabolomics and proteomics can provide valuable biological insight, but their high cost often makes full-cohort profiling impractical. In heterogeneous clinical datasets, random sample selection may miss meaningful subgroup structure and overrepresent noisy or ambiguous cases. We present a reproducible R workflow implemented in the SciDataReportR package for prioritizing participants for downstream omics, using a public dataset as a stand-in for private clinical data. Using laboratory variables from the 2013-2014 NHANES cycle retrieved with the nhanesdata package, we used SciDataReportR for exploratory data analysis and applied its pipeline to select variables of interest, apply dimension reduction with self-organizing maps, cluster participants with Gaussian mixture models, and identify high-confidence representatives of each cluster, using their posterior probability of cluster membership to exclude ambiguous cases before costly follow-up assays are performed. Attendees will learn a practical, reusable strategy for combining exploratory reporting, unsupervised learning, and probability-based filtering to support budget-conscious sample selection in heterogeneous clinical datasets. This presentation will also help us refine the workflow through community feedback and strengthen its use in future collaborative clinical research.
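The described pattern can be approximated with off-the-shelf tools such as {kohonen} for self-organizing maps and {mclust} for Gaussian mixtures. This is a sketch of the general SOM-then-GMM-then-filter pattern, not SciDataReportR's internal code, and `lab_vars` is a placeholder numeric matrix.

```r
library(kohonen)  # self-organizing maps
library(mclust)   # Gaussian mixture models

# lab_vars: hypothetical numeric matrix of laboratory variables
X <- scale(na.omit(lab_vars))

# 1. Dimension reduction: fit a SOM and extract prototype vectors
som_fit <- som(X, grid = somgrid(6, 6, "hexagonal"))
codes   <- getCodes(som_fit)

# 2. Model-based clustering of the SOM codes
gmm <- Mclust(codes)

# 3. Posterior-probability filtering: keep high-confidence members,
#    dropping ambiguous cases before costly follow-up assays
post_max  <- apply(gmm$z, 1, max)
confident <- which(post_max > 0.9)  # threshold is illustrative
```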


medicalcoder: A Unified and Longitudinally Aware Framework for ICD-Based Comorbidity Assessment in R

Peter DeWitt | University of Colorado Anschutz

Regular Talk — Friday, May 8, 2026

Comorbidity algorithms derived from International Classification of Diseases (ICD) codes are central to risk adjustment and cohort characterization in clinical research. However, existing implementations often fragment across packages, inconsistently handle mixed ICD-9 and ICD-10 data, and typically rely on encounter-level aggregation that may under-ascertain chronic disease burden. We present medicalcoder, an R package providing a unified, longitudinally aware framework for applying multiple variants of the Charlson, Elixhauser, and Pediatric Complex Chronic Conditions algorithms. The package includes an internal ICD database, supports full and compact codes, accommodates mixed ICD versions within a dataset, and integrates present-on-admission and primary diagnosis indicators. Unlike encounter-level approaches that simply aggregate flags, medicalcoder implements cumulative longitudinal methods that propagate qualifying diagnoses forward in time, increasing sensitivity and improving detection of disease severity. The package is self-contained (R ≥ 3.5.0) and designed for portability in restricted computing environments while dynamically leveraging modern R workflows when available. This talk will demonstrate longitudinal sensitivity gains, mixed-version handling, and practical workflows for reproducible comorbidity assessment in real-world EHR data.


Modeling Fall Risk with Staged Trees and Expert Judgement

Rachel Wilkerson & Anika Dachiraju | Baylor University

Lightning Talk — Friday, May 8, 2026

Falls among the elderly are a major public health concern in the US, especially with the rapidly aging population. In addition to evaluating fall-risk pathways, there is a great need to customize these pathways to specific locations and areas. The aim of this study was to perform an expert elicitation protocol to predict the ways health insurance affordability could impact the fall risk of an individual in McLennan County. We constructed the initial staged event tree with the “stagedtrees” package and deployed a Shiny app to display the changes to probabilities of events occurring based on expert judgement. We also show how we can check the forecasts from the model with prequential diagnostic monitors for the staged tree, using a new set of functions in a “cegmonitor” package (https://github.com/rachwhatsit/cegmonitor).
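A minimal {stagedtrees} workflow along these lines might look as follows; `falls_data` and the variable ordering are hypothetical, and the function names should be checked against the package reference before use.

```r
library(stagedtrees)

# falls_data: hypothetical data frame of categorical variables
# Fit the fully stratified staged event tree with a chosen ordering
mod_full <- full(falls_data, order = c("insurance", "mobility", "fall"))

# Learn the staged tree by merging stages (backward hill-climbing)
mod <- stages_bhc(mod_full)

plot(mod)     # visualize the staging
summary(mod)
```

Expert-elicited probabilities would then replace or update the fitted stage parameters displayed in the Shiny app.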


Using ggregions for building intuitive coding interfaces for anatomical visualizations

Evangeline (Gina) Reynolds | Consultant

Lightning Talk — Friday, May 8, 2026

Among chart types, maps are some of the most compelling and intuitive. There is great support for mapping in R and in the data visualization library {ggplot2}. However, producing a simple map visualization such as a choropleth can feel much less intuitive than producing other chart types in ggplot2, like a scatterplot, bar chart, or histogram. The new ggregions package makes it easy to write new region-specific layer functions, i.e. geom_*() and stat_*(), resulting in user-friendly code interfaces for mapping. The talk will feature demonstrations of geographic applications as well as anatomical applications, with region objects from the gganatogram and teethr packages.


Beyond Sequential: Scaling Existing Medical Pipelines with ‘futurize’

Henrik Bengtsson | University of California San Francisco

Regular Talk — Friday, May 8, 2026

In medical research, we rely on trusted R packages such as ‘lme4’, ‘survival’, and ‘boot’. However, as datasets and simulations grow, sequential analyses become a bottleneck. Parallelizing code typically requires a “refactoring tax”, forcing researchers to adopt complex, package-specific APIs that can obscure the original analytical logic and deter parallel adoption.

I introduce ‘futurize’ (https://futurize.futureverse.org/), a universal adapter for the R parallel ecosystem. The futurize() function allows researchers to parallelize existing code with minimal friction. Leveraging the Futureverse framework, it works with common workflows like purrr::map(), lapply(), and domain-specific tools like ‘boot’, ‘caret’, ‘glmnet’, and Tidymodels. Parallelization becomes as simple as ys <- map(xs, slow_fcn) |> futurize() or b <- boot(data, statistic, R = 1000) |> futurize(), fully preserving original logic.

futurize() separates the declaration of what to parallelize from the execution environment. I will demonstrate how the same script can scale from a notebook to an HPC cluster without changing any of the code. I will also showcase how this powerful technology injects progress-reporting via progressify(), providing essential feedback for long-running medical simulations. By lowering the entry barrier to large-scale computing, ‘futurize’ empowers life-science researchers to scale their analyses while maintaining the original pipeline as-is.
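The two one-liners quoted above can be sketched in context. `xs`, `slow_fcn`, `data`, and `statistic` are placeholders carried over from the abstract, and the backend selection via `plan()` follows general Futureverse conventions.

```r
library(purrr)
library(boot)
library(future)
# library(futurize)   # https://futurize.futureverse.org/

# Declare the execution environment once; everything piped into
# futurize() below then runs on that backend (local cores here,
# but the same script scales to an HPC cluster by changing plan()).
plan(multisession)

# Sequential code is written as usual, then piped into futurize();
# the original analytical logic is preserved as-is.
ys <- map(xs, slow_fcn) |> futurize()
b  <- boot(data, statistic, R = 1000) |> futurize()
```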


Parallel computing in R to analyze big health data

Ariel Mundo Ortiz | Centre de recherche du CHUM

Lightning Talk — Friday, May 8, 2026

Large biomedical datasets present interesting research opportunities but also pose major computational challenges. Many statistical analyses require repeated calculations that become slow or impractical when run sequentially in R. Relatively recent developments in the R ecosystem, such as the future and mirai packages, can be integrated to parallelise code in a flexible manner and perform analyses that would otherwise be virtually impossible to execute. This talk will present a use case of parallel computing in R using a real-world big health database (~9 million observations from >250,000 patients). Bootstrap analyses using random-effects models were performed on this database using future and mirai. This talk can help biomedical researchers get acquainted with the current parallel computing capabilities of R and how they can be harnessed to analyze health data.
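
The pattern described here, bootstrap replicates farmed out to parallel workers, can be sketched with nothing but base R's `parallel` package; the talk itself uses future and mirai, and the toy data and statistic below are invented for illustration.

```r
library(parallel)

set.seed(2026)
# Toy stand-in for a large health dataset
dat <- data.frame(age   = rnorm(1000, 60, 10),
                  event = rbinom(1000, 1, 0.3))

# One bootstrap replicate: resample rows, refit, return one coefficient
boot_once <- function(i, data) {
  idx <- sample(nrow(data), replace = TRUE)
  coef(glm(event ~ age, family = binomial, data = data[idx, ]))[["age"]]
}

# Sequential version would be: lapply(seq_len(200), boot_once, data = dat)
# Parallel version: same call shape, distributed over forked workers
cores <- if (.Platform$OS.type == "unix") 2L else 1L  # forks need unix
res <- mclapply(seq_len(200), boot_once, data = dat, mc.cores = cores)

ci <- quantile(unlist(res), c(0.025, 0.975))  # percentile bootstrap CI
```

With future or mirai the same replicate function is dispatched to a backend chosen at run time, which is what makes the approach scale from a laptop to a cluster.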


The MBCproject R Package: Enabling reproducible analysis of multi-omic data in metastatic breast cancer

Diana Garcia Cortes | Dana Farber Cancer Institute

Lightning Talk — Friday, May 8, 2026

The Metastatic Breast Cancer Project (MBCproject) is a patient-partnered research initiative launched in October 2015 by Count Me In, a nonprofit research program of the Broad Institute and the Dana-Farber Cancer Institute. This study collected clinical, genomic, and survey data through online enrollment to better understand the genomic and clinical landscape of metastatic breast cancer by capturing the diverse experiences of patients treated across various clinical settings. The cohort includes whole exome sequencing (WES) from 379 tumor biopsies with matched germline from 301 patients and RNA-seq from 200 biopsies (141 patients). In addition to confirming prior observations, this multi-omics dataset has enabled the discovery of novel MBC characteristics and is one of the largest publicly available multi-omic resources for MBC. To facilitate access and reproducibility, we developed an R package to share analysis-ready clinical and genomic data from the MBCproject. The package provides an easy-to-use interface for researchers, ensuring reproducibility of results and enabling the broader research community to accelerate discoveries in MBC genomics. We hope this resource will serve as a valuable tool for advancing precision medicine in metastatic breast cancer.


Synthetic by Design: Pairing toysurveydata and NSSP Synthetic Data Generator for Privacy-Safe Public Health Training and Development

Abigail Stamm & Eric Kvale | Minnesota Department of Health

Regular Talk — Friday, May 8, 2026

Access to realistic healthcare data for training and tool validation is a persistent challenge in public health informatics. Privacy regulations and data sensitivity create barriers for educators and developers who need representative data to learn or test against. We present two complementary open-source R tools addressing this gap across common public health data paradigms.

The toysurveydata package (Stamm) generates synthetic survey data from a priori response proportions via a single settings table. The NSSP Synthetic Healthcare Data Generator (Kvale) is a Shiny dashboard producing HIPAA-safe ED visit records modeled after the CDC NSSP BioSense Platform, including all five core BioSense tables, LOINC-coded lab results, and ten configurable public health disruption scenarios.
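
The idea of driving synthetic survey generation from a single settings table can be sketched in base R; toysurveydata's actual settings-table format and function names are not shown in this abstract, so everything below is an invented illustration of the concept, not the package's API.

```r
# Illustrative only: each row of `settings` names a question, its response
# options, and the a priori proportions responses are drawn from.
settings <- data.frame(
  question = c("smoker", "flu_shot"),
  options  = I(list(c("yes", "no"), c("yes", "no", "unsure"))),
  props    = I(list(c(0.2, 0.8), c(0.55, 0.35, 0.10)))
)

# Draw n synthetic respondents, one column per question
simulate_survey <- function(settings, n) {
  cols <- lapply(seq_len(nrow(settings)), function(i) {
    sample(settings$options[[i]], n, replace = TRUE,
           prob = settings$props[[i]])
  })
  names(cols) <- settings$question
  as.data.frame(cols)
}

set.seed(1)
svy <- simulate_survey(settings, n = 500)
```

Because the generator is driven entirely by the table, adding a question or changing its marginal distribution requires no code changes, only a new settings row.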

Together these tools enable agencies to onboard analysts, validate pipelines, and demonstrate methods without touching real patient records. We demonstrate combined use in a Minnesota Department of Health context and invite community feedback on gaps and improvements.

  • toysurveydata source: https://github.com/ajstamm/toysurveydata
  • toysurveydata documentation: https://ajstamm.github.io/toysurveydata/
  • NSSP app: https://0v0ins-ekvale.shinyapps.io/nssp-synthetic-data/
  • NSSP source: https://github.com/ekvale/nssp-synthetic-data


DIVINE: Curated Datasets and Tools for Medical Data

Natàlia Pallarès & João Carmezim | IDIBAPS/IGTP

Lightning Talk — Friday, May 8, 2026

DIVINE is an R package that provides 14 curated datasets and 6 data management functions to streamline medical research workflows. It helps researchers access clean, structured data and perform essential tasks such as cleaning, summarizing, visualizing, and exporting data with minimal effort.

The datasets originate from a multicenter COVID-19 study conducted in 2020-2021 in the south metropolitan area of Barcelona (Spain), including data from 5813 patients across four pandemic waves. They contain information on baseline characteristics (e.g. demographics and comorbidities), follow-up during hospitalisation (e.g. intensive care unit admission and treatments received), and clinical outcomes (e.g. complications and in-hospital mortality).

The 6 functions included in DIVINE follow a data curation pipeline: data_overview() for data inspection, impute_missing() for handling missing values, multi_join() to merge datasets, stats_table() to create descriptive tables, multi_plot() for visualization (e.g., boxplots and histograms), and export_data() to export results to formats such as CSV and RDS.

The datasets have been made publicly available to support reproducible research and enable their reuse in research applications, such as validating existing COVID-19 predictive models or studying outcome incidence and prognostic factors. They can also be used as a teaching resource in biostatistics, epidemiology, or data science courses.

The DIVINE package is available on CRAN and GitHub.


Preventing Data Leakage in Clinical Machine Learning: Guarded Resampling Workflows with fastml in R

Fikret Bartu Yurdacan | Trakya University

Lightning Talk — Friday, May 8, 2026

Data leakage is one of the most common yet underrecognized pitfalls in clinical machine learning. It occurs when information from outside the training set — often introduced during preprocessing — inadvertently influences model evaluation, leading to inflated performance metrics. In medical research, where predictions can affect patient care, overly optimistic models are not just misleading — they can be harmful.

This talk introduces fastml, an R package that addresses data leakage through guarded resampling workflows. Unlike conventional pipelines where preprocessing is applied before splitting, fastml re-estimates all preprocessing and fitting steps within each resampling fold. This ensures no information from held-out data contaminates training, producing honest and reproducible performance estimates.
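
The guard described here can be illustrated in miniature with base R: in a leaky pipeline, centering and scaling statistics are computed from all rows before splitting; in the guarded version below, each fold's preprocessing is re-estimated from its training rows only. This is a conceptual sketch with invented data, not fastml's API.

```r
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100)
folds <- sample(rep(1:5, length.out = 100))  # 5-fold assignment

guarded_fold_error <- function(k) {
  train <- folds != k
  # Preprocessing statistics estimated from the training rows ONLY
  mu   <- mean(x[train])
  sdev <- sd(x[train])
  x_tr <- (x[train]  - mu) / sdev
  x_te <- (x[!train] - mu) / sdev   # held-out rows reuse training statistics
  fit  <- lm(y[train] ~ x_tr)
  pred <- coef(fit)[1] + coef(fit)[2] * x_te
  mean((y[!train] - pred)^2)        # honest out-of-fold error
}

cv_mse <- mean(sapply(1:5, guarded_fold_error))
```

Scaling is the mildest case; the same fold-wise discipline matters far more for imputation, feature selection, and resampling-based class balancing, which fastml likewise keeps inside each fold.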

Built on tidymodels, fastml provides a lightweight AutoML interface that automates training, tuning, and comparison across multiple algorithms — including XGBoost, random forest, and logistic regression — while keeping evaluation transparent and user-controlled.

We demonstrate a complete workflow using a clinical dataset: from raw data through guarded cross-validation to model selection, integrating explainability tools (SHAP, PDP) and fairness diagnostics for publication-ready analyses. Attendees will leave with practical knowledge to build safer ML pipelines for biomedical research.


When Quant Meets Qual: Extending the R Ecosystem for Qualitative Data

Abiraahmi Shankar | New York University

Regular Talk — Friday, May 8, 2026

Quantitative research has benefited enormously from open, reproducible tools such as R, while qualitative analysis often remains confined to proprietary software and less transparent workflows. The gap is not in rigor, but in interoperability, reproducibility, and integration with modern data science pipelines.

In this talk, I introduce DedooseR, an open-source R package published on CRAN that bridges qualitative coding conducted in Dedoose with the R ecosystem. DedooseR standardizes raw exports, generates code summaries, computes and visualizes saturation, and maps code co-occurrence through matrices and network plots. It also supports interactive excerpt review and text exploration, enabling user-friendly inspection of coded material alongside quantitative analyses within a unified workflow.
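
Code co-occurrence of the kind DedooseR maps can be computed from an excerpt-by-code incidence matrix with a single cross-product. The tiny matrix and code names below are invented for illustration and are not DedooseR's data format.

```r
# Rows are coded excerpts, columns are qualitative codes (1 = code applied)
inc <- matrix(c(1, 1, 0,
                1, 0, 1,
                0, 1, 1,
                1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("access", "cost", "trust")))

# Co-occurrence: how often each pair of codes appears on the same excerpt
cooc <- crossprod(inc)   # equivalent to t(inc) %*% inc
diag(cooc) <- 0          # zero the diagonal to keep pairwise counts only
```

A matrix like `cooc` is the natural input for both the heatmap-style matrices and the network plots the package produces, with edge weights given by the pairwise counts.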

Reproducible pipelines and clear documentation of analytic decisions are especially pertinent in qualitative research, given its iterative and interpretive nature. By bringing coded qualitative data into R, DedooseR supports transparent reporting, flexible exploration, and allows qualitative research to leverage the richness and breadth of quantitative resources available through R.

Attendees will gain practical tools for importing and analyzing qualitative datasets in R and a framework for extending reproducible research practices to mixed-methods and qualitative health and social science research.

Package details: https://cran.r-project.org/web/packages/DedooseR/index.html


Rapid and comprehensive power analysis in R, with a focus on planned error control for multiple outcomes

Kristen Hunter & Luke Miratrix | University of New South Wales

Regular Talk — Friday, May 8, 2026

Researchers planning to assess impacts on multiple outcomes in a randomized trial need to account for multiple testing adjustments when controlling error rates. These adjustments affect both whether an experiment is adequately powered and what sample size is needed to achieve a desired level of power. With multiple testing, even the definition of power becomes more complex: are you powering to detect all effects? To detect some fraction of effects? This talk introduces PUMP, an R package designed to calculate power when planning to use either false discovery rate control or family-wise error control methods such as Bonferroni or Westfall-Young. With PUMP, researchers can easily calculate how small an effect could reliably be detected, or how large a sample would need to be obtained, under a variety of scenarios. The scenarios supported by PUMP include different definitions of power, several commonly used multiple testing adjustment procedures, a range of experimental designs, and choices of planned analytic models, along with the usual design parameters. PUMP supports a variety of common multilevel experimental designs, including multisite (blocked) randomized experiments, cluster-randomized experiments, and blocked, cluster-randomized experiments. The talk will also showcase PUMP’s ability to rapidly explore ranges of scenarios and design parameters for sensitivity checks on all results.
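
The interplay between power definitions and multiplicity adjustment can be sketched with base R's power.t.test(). PUMP itself handles correlated outcomes, multilevel designs, and other adjustment procedures; this Bonferroni sketch assumes independent outcomes and uses invented effect sizes.

```r
M     <- 3                      # number of outcomes
alpha <- 0.05                   # family-wise error rate target
n     <- 100                    # per-arm sample size
delta <- c(0.30, 0.25, 0.20)    # standardized effect sizes (invented)

# Individual power for each outcome at the Bonferroni-adjusted threshold
ind_power <- sapply(delta, function(d)
  power.t.test(n = n, delta = d, sd = 1, sig.level = alpha / M)$power)

# Two different "definitions of power", here assuming independent outcomes:
power_min1 <- 1 - prod(1 - ind_power)  # detect at least one effect
power_all  <- prod(ind_power)          # detect all effects
```

Even this toy version shows why the choice of definition matters: the same design can be well powered to detect at least one effect while being badly underpowered to detect all of them.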