GSDBench
pharma-skills
Collect one standardized group sequential design benchmark example per GitHub issue. The form prepares a human-readable GitHub issue body and a machine-readable JSON block. Nothing leaves your browser.
Example: gsdb-20260430-pfs-os-alpha-split
Write the prompt exactly as the evaluated AI agent should see it. Include only information that should be available to the agent.
Examples: minimum follow-up, minimum gap between analyses, maximum sample size (N), feasibility range, data-cleaning buffer assumptions.
Power and Type I error tolerances are usually stated in percentage points (absolute difference). Event and timing checks may instead use a relative percent of the target or an absolute number of months.
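The difference between an absolute tolerance (percentage points, months) and a relative one (percent of the target) can be sketched as follows. This is an illustration only: the function name `within_tolerance` and the `mode` values are hypothetical and are not part of the form or of any grading tool.

```python
def within_tolerance(observed, target, tol, mode="abs"):
    """Return True if observed is within tol of target.

    mode="abs": tol is in the same units as the values, e.g. percentage
                points when both values are percentages, or months.
    mode="rel": tol is a percent of the target value.
    (Both mode names are assumptions made for this sketch.)
    """
    diff = abs(observed - target)
    if mode == "abs":
        return diff <= tol
    if mode == "rel":
        if target == 0:
            raise ValueError("relative tolerance is undefined for a zero target")
        return diff / abs(target) * 100.0 <= tol
    raise ValueError(f"unknown mode: {mode!r}")


# Power of 89.6% against a 90% target, tolerance 0.5 percentage points.
power_ok = within_tolerance(89.6, 90.0, 0.5, mode="abs")

# 320 events against a target of 300, tolerance 5% relative.
events_ok = within_tolerance(320, 300, 5.0, mode="rel")
```

Under these assumptions the power check passes (0.4 points off) and the event check fails (about 6.7% off), which is why the unit of each tolerance should be stated explicitly.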
At least four criteria are required. Include numerical/statistical correctness and at least one penalty or fatal gate.
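A minimal sketch of how the rubric requirements above could be checked before submission. The field names `category` and `kind`, and the values `"correctness"`, `"penalty"`, and `"fatal_gate"`, are assumptions chosen for illustration; the form's actual JSON layout is not shown here.

```python
def validate_rubric(criteria):
    """Return a list of problems; an empty list means the rubric passes.

    Each criterion is a dict. The keys "category" and "kind" are
    hypothetical names used only in this sketch.
    """
    problems = []
    if len(criteria) < 4:
        problems.append("at least four criteria are required")
    if not any(c.get("category") == "correctness" for c in criteria):
        problems.append("missing a numerical/statistical correctness criterion")
    if not any(c.get("kind") in ("penalty", "fatal_gate") for c in criteria):
        problems.append("missing a penalty or fatal gate")
    return problems


rubric = [
    {"name": "Power within tolerance", "category": "correctness", "kind": "score"},
    {"name": "Alpha split documented", "category": "reporting", "kind": "score"},
    {"name": "Timing is feasible", "category": "feasibility", "kind": "score"},
    {"name": "Type I error inflated", "category": "correctness", "kind": "fatal_gate"},
]
issues = validate_rubric(rubric)  # empty list for this example
```

Returning a list of problems rather than raising on the first one lets a form show every missing requirement at once.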