GSDBench Intake

Submission Governance

Contributor name or initials optional

GitHub username required

Contributor role

Organization type optional

Source type required

Confidentiality confirmations all required

I confirm this submission contains no PHI, patient-level data, confidential protocol text, trade secrets, or proprietary company information.

I confirm I have permission to share this example in a public GitHub issue, or I have sufficiently generalized/de-identified it.

I understand that maintainers may request revisions before accepting this case.

Case Identity

Short case title required

Case ID required

Example: gsdb-20260430-pfs-os-alpha-split
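A case ID can be sanity-checked before submission. The sketch below assumes a pattern inferred only from the single example above (the prefix `gsdb-`, an eight-digit `YYYYMMDD` date, then a lowercase hyphen-separated slug); the form does not state an official rule, and the function name `looks_like_case_id` is hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern, inferred from the one example in the form; not an official rule.
CASE_ID_RE = re.compile(r"gsdb-(\d{8})-([a-z0-9]+(?:-[a-z0-9]+)*)")


def looks_like_case_id(case_id: str) -> bool:
    """Return True if case_id matches the assumed gsdb-YYYYMMDD-slug shape."""
    match = CASE_ID_RE.fullmatch(case_id)
    if match is None:
        return False
    try:
        # Reject impossible calendar dates such as month 13.
        datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    return True
```

For instance, `looks_like_case_id("gsdb-20260430-pfs-os-alpha-split")` accepts the example given in the form, while an uppercase prefix or an invalid date is rejected under these assumptions.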

Benchmark project name

Suggested benchmark split

Difficulty

Case priority

Benchmark Prompt to Give the AI Agent

User-facing design request required

Write the prompt exactly as the evaluated AI agent should see it. Include only information that should be available to the agent.

Optional prior conversation/context

Expected deliverable type select at least one

Options: Design summary, Executable R code, Boundary table, Event/sample size/timing outputs, Multiplicity strategy, Simulation verification, NPH evaluation, Protocol-ready language, Error diagnosis / critique only.

Trial Design Content

Disease/setting optional

Phase

Endpoint structure required

Endpoint names required

Randomization ratio optional

Population/hypothesis structure

Advanced design details

Alpha/multiplicity strategy

Number of analyses / information fractions

Spending functions / futility rules

Survival assumptions

Enrollment/dropout assumptions

Operational constraints

Examples: minimum follow-up, minimum gap between analyses, max N, feasibility range, data-cleaning buffer assumptions.

NPH assumptions optional

Other design assumptions optional

Reference Ground Truth

Reference truth type select at least one

Options: Locked reference R script, Expected numerical outputs, Expert textual standard, Rubric-only open-ended assessment, Simulation verification, To be developed by maintainers.

Reference script summary or link/path

Reference R code snippet optional

Textual ground truth / expected reasoning

Simulation requirements

Rubric Criteria

At least four criteria are required. Include numerical/statistical correctness and at least one penalty or fatal gate.
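The rubric rule above can be checked mechanically. The following is a minimal sketch, assuming each criterion carries a `kind` tag; the `Criterion` class, the tag values (`"numerical"`, `"reasoning"`, `"penalty"`, `"fatal_gate"`), and the function `rubric_problems` are hypothetical and are not defined by this form.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    label: str
    # Hypothetical tags: "numerical", "reasoning", "penalty", "fatal_gate".
    kind: str


def rubric_problems(criteria: list[Criterion]) -> list[str]:
    """Return a list of violated rubric rules; empty means the rubric passes."""
    problems = []
    if len(criteria) < 4:
        problems.append("fewer than four criteria")
    kinds = {c.kind for c in criteria}
    if "numerical" not in kinds:
        problems.append("no numerical/statistical correctness criterion")
    if not kinds & {"penalty", "fatal_gate"}:
        problems.append("no penalty or fatal gate")
    return problems
```

A rubric with four criteria that include one `"numerical"` entry and one `"fatal_gate"` entry returns an empty list; dropping either of those entries yields the corresponding message.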

Reviewer and Curation Notes

Suggested reviewer expertise

Options: Survival/GSD, Multiplicity, Oncology, Non-inferiority, NPH methods, Simulation, Regulatory/statistical review.

Why this case matters

Expected failure modes for current AI agents

Additional notes to maintainers

