From Ranking to Matching: Accuracy Metrics in Precision Clinical Trial Matching
Adam Blum
Aug 22, 2025
Until recently, most open-source AI-based clinical trial matching systems have not attempted true precision matching of patients to clinical trials. Previous attempts at AI-assisted matching (such as TrialGPT, TrialMatchAI, and MatchMiner) instead rank the fitness of trials for patients. The rankings are then compared with a clinician-labeled “gold standard” (the “best trial,” which may or may not be a true match) drawn from a small set of potential trials. The ranking is considered “correct” if the “gold standard” trial appears in the “top n” of the candidates, regardless of whether the patient exactly matches the trial’s eligibility attributes.
Because these systems operate on “patient vignettes” (brief descriptions of a patient’s disease) rather than full patient records, they do not determine whether the patient is truly eligible for a trial based on its inclusion/exclusion criteria, potentially eligible (pending more patient data), or ineligible. Instead, the doctor is asked to mark the most relevant trial for the patient, often from a small dataset (such as TREC 2022, which has 50 trials across all symptoms), so a simple disease match is often the most relevant result (there may be only a single disease match available in the dataset).
This paper examines in more detail how existing open-source AI-based trial ranking systems measure accuracy, and discusses the problems this approach creates for real-world matching against thousands of potentially matching trials. We then propose a more useful way to measure accuracy in clinical trial matching for systems that try to assess the true eligibility of patients with structured electronic health records (EHRs) against trials with complex eligibility attributes.
Measuring Based on Ranking: A Survey
Clinical trial recommendation systems that rely on ranking rather than true eligibility matching have generally borrowed their evaluation metrics from the broader information retrieval (IR) and recommender systems literature. Rather than asking whether a patient meets inclusion/exclusion criteria, these systems measure how closely their output ordering of trials aligns with a clinician-defined “gold standard” trial list. The most common metrics include:
1. Top-n Accuracy (Hit Rate @n)
Definition: A prediction is considered “correct” if the clinician-labeled best trial appears within the top n ranked trials.
Example: If the correct trial is in the top-3 recommendations, the system scores a hit under “Top-3 accuracy.”
Usage: Both TrialGPT and TrialMatchAI reported Top-n metrics (most often @1, @3, and @5), reflecting how frequently the system surfaced the “gold standard” trial among its highest-ranked options.
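For concreteness, here is a minimal sketch in Python of how Top-n accuracy is typically computed across a set of test patients (variable names and data are hypothetical, not code from any of the surveyed systems):

```python
def top_n_accuracy(rankings, gold, n=3):
    """Fraction of patients whose clinician-labeled 'gold standard' trial
    appears within the system's top-n ranked trials."""
    hits = sum(1 for pid, ranked in rankings.items() if gold[pid] in ranked[:n])
    return hits / len(rankings)

# Hypothetical example: the gold trial is in the top-3 for one of two patients.
rankings = {"p1": ["NCT01", "NCT02", "NCT03"], "p2": ["NCT09", "NCT08", "NCT07"]}
gold = {"p1": "NCT02", "p2": "NCT05"}
print(top_n_accuracy(rankings, gold, n=3))  # 0.5
```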
2. Mean Reciprocal Rank (MRR)
Definition: The average of the reciprocal ranks of the correct trial across all test patients.
Example: If the correct trial is ranked first, the score contribution is 1.0; if second, 0.5; if third, 0.33, etc.
Usage: MRR was emphasized in TREC 2022 Clinical Trials Track evaluations, which TrialGPT and MatchMiner adapted, as it penalizes correct matches placed lower in the ranking.
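A minimal sketch of the MRR computation, under the same hypothetical data layout as the Top-n example above:

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold trial across patients (0 if it is not retrieved)."""
    scores = []
    for pid, ranked in rankings.items():
        if gold[pid] in ranked:
            scores.append(1.0 / (ranked.index(gold[pid]) + 1))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

rankings = {"p1": ["NCT01", "NCT02", "NCT03"], "p2": ["NCT05", "NCT08", "NCT07"]}
gold = {"p1": "NCT02", "p2": "NCT05"}
print(mean_reciprocal_rank(rankings, gold))  # (0.5 + 1.0) / 2 = 0.75
```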
3. Normalized Discounted Cumulative Gain (nDCG)
Definition: A graded relevance metric that discounts the value of a correct match based on its position in the ranked list.
Usage: Applied in TrialGPT benchmarking, especially when multiple “acceptable” trials were labeled by experts, not just a single “gold standard.”
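A minimal sketch of nDCG@k over graded relevance labels (the grading scheme below is illustrative only):

```python
import math

def ndcg_at_k(relevances, k=5):
    """relevances: graded relevance of the trial at each ranked position
    (e.g., 2 = eligible, 1 = possibly relevant, 0 = not relevant)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# A highly relevant trial ranked second and a partially relevant one third
# still earn partial credit.
print(ndcg_at_k([0, 2, 1, 0, 0], k=5))  # ~0.67
```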
4. Precision and Recall at k (P@k, R@k)
Definition: Precision measures how many of the trials in the top-k were correct matches; recall measures how many of all possible correct matches were retrieved within the top-k.
Usage: MatchMiner incorporated these metrics when clinicians identified more than one potentially relevant trial per vignette, allowing recall to capture broader coverage.
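A minimal sketch of P@k and R@k for a single patient, where the clinician has labeled a set of relevant trials (hypothetical data):

```python
def precision_recall_at_k(ranked_trials, relevant_trials, k=5):
    """Precision@k: fraction of the top-k that are clinician-labeled relevant.
    Recall@k: fraction of all relevant trials retrieved within the top-k."""
    top_k = ranked_trials[:k]
    hits = sum(1 for t in top_k if t in relevant_trials)
    precision = hits / k
    recall = hits / len(relevant_trials) if relevant_trials else 0.0
    return precision, recall

ranked = ["NCT01", "NCT02", "NCT03", "NCT04", "NCT05"]
relevant = {"NCT02", "NCT05", "NCT11"}
print(precision_recall_at_k(ranked, relevant, k=5))  # (0.4, ~0.67)
```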
5. Area Under the Receiver Operating Characteristic Curve (AUROC)
Definition: A measure of the ability of the system to rank relevant trials higher than irrelevant ones across all thresholds.
Usage: Less common, but used in TrialMatchAI experiments where “relevant” vs. “irrelevant” trials were annotated in larger datasets.
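A short illustration using scikit-learn's roc_auc_score (labels and scores below are made up for illustration):

```python
from sklearn.metrics import roc_auc_score

# 1 = clinician-annotated "relevant" trial, 0 = "irrelevant";
# scores are the matcher's ranking scores for the same trials.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.92, 0.40, 0.73, 0.80, 0.10, 0.61]
print(roc_auc_score(labels, scores))  # ~0.78: chance a relevant trial outranks an irrelevant one
```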
6. F1 Score (at top-k cutoff)
Definition: The harmonic mean of precision and recall at a given cutoff.
Usage: Occasionally applied in academic papers benchmarking on synthetic or crowdsourced relevance labels (e.g., “disease-matched” vs. not).
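For completeness, a minimal sketch combining the precision and recall values from the P@k/R@k example above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall at a chosen top-k cutoff."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.4, 0.67))  # ~0.5
```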
Limitations of These Metrics
They assume there exists a single “best” trial for a patient, ignoring the fact that a patient may be eligible (or ineligible) for many trials simultaneously.
The reliance on vignette-level “relevance judgments” means the system is rewarded for surfacing trials that look superficially disease-appropriate, even if the patient would be excluded based on comorbidities, biomarkers, prior lines of therapy, or other critical eligibility factors.
Metrics like Top-n Accuracy and MRR measure ranking quality but do not assess whether the match reflects true eligibility. True eligibility, not the search-engine-inspired concept of “high relevance,” should be the focus.
In addition, it is valuable to identify potentially eligible trials. The focus on relevance obscures the true question: is the patient eligible, potentially eligible, or simply ineligible?
In sum, clinical trial matching is not a rating or relevance problem from information retrieval. It is an attribute extraction problem, which requires a different paradigm both for the work to be done and for how to measure its success.
Towards Exact Eligibility Matching: Measuring Precision and Recall of Attribute Extraction
To achieve true precision in clinical trial matching — going beyond ranking to identify whether a patient genuinely satisfies inclusion and exclusion criteria — it’s essential to accurately extract structured eligibility attributes from unstructured trial documents. It is somewhat understandable that this has been avoided, since attribute-by-attribute extraction is an enormous leap in difficulty. Nevertheless, it is what is necessary to determine true eligibility or potential eligibility. Evaluating this attribute extraction process with metrics such as precision, recall, and F1 score therefore becomes critical.
Defining the Task
In this paradigm, the system is responsible for recognizing eligibility criteria in free-text sources — such as trial protocols, registry entries, or unstructured eligibility sections — and transforming them into structured, machine-interpretable formats (e.g., age ≥ 18, no prior chemotherapy, specific biomarker thresholds). True matching then hinges on correctly mapping these structured elements to patient data in EHRs.
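As an illustration, a criterion block like “age ≥ 18; no prior chemotherapy; HER2-positive” might be normalized into records along these lines (a hypothetical schema for illustration only, not the internal format of any particular system):

```python
from dataclasses import dataclass

@dataclass
class EligibilityAttribute:
    category: str     # e.g., "demographics", "treatment", "biomarker"
    attribute: str    # normalized attribute name
    operator: str     # e.g., ">=", "=", "absent"
    value: object     # threshold or expected value
    inclusion: bool   # True = inclusion criterion, False = exclusion

criteria = [
    EligibilityAttribute("demographics", "age", ">=", 18, inclusion=True),
    EligibilityAttribute("treatment", "prior_chemotherapy", "absent", None, inclusion=True),
    EligibilityAttribute("biomarker", "HER2", "=", "positive", inclusion=True),
]
```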
Evaluation Metrics for Attribute Extraction
Precision
Measures the proportion of extracted attributes that are actually correct.
High precision ensures the model avoids introducing faulty or extraneous eligibility constructs.
Recall
Captures the proportion of true eligibility attributes present in the text that are successfully extracted.
High recall ensures the model does not miss important criteria.
F1 Score
The harmonic mean of precision and recall; useful when a single balanced measure is needed.
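A minimal sketch of how these three metrics can be computed for a single trial, treating extracted and gold attributes as sets of normalized tuples (a hypothetical representation):

```python
def extraction_metrics(extracted, gold):
    """Set-based precision, recall, and F1 for one trial's extracted attributes,
    compared as normalized (category, attribute, operator, value) tuples."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("demographics", "age", ">=", "18"), ("treatment", "prior_chemotherapy", "absent", "")}
extracted = {("demographics", "age", ">=", "18"), ("biomarker", "HER2", "=", "positive")}
print(extraction_metrics(extracted, gold))  # (0.5, 0.5, 0.5)
```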
Illustrative Examples from the Literature
AutoCriteria (GPT‑4 based extraction)
Achieved an overall F1 score of 89.42 across nine disease domains for extracting eligibility entities, with scores ranging from 95.44 (non‑alcoholic steatohepatitis) to 84.10 (breast cancer).
Its overall accuracy (i.e., proportion of fully correct extractions including context) was 78.95%.
SEETrials system
Reported exceptionally high performance: precision = 0.964, recall = 0.988, and F1 = 0.974 in safety-related criteria extraction.
Att‑BiLSTM for COVID‑19 trial parsing
Achieved precision = 0.942, recall = 0.810, and F1 = 0.871 using an attention‑based LSTM for extracting variables from COVID‑19 trial criteria.
TAES prototype (Heart Failure prescreening example from Meystre et al.)
Demonstrated recall up to 0.778 and precision up to 1.000 in detecting eligible patients by extracting and matching eligibility criteria from clinical notes.
Applying These Metrics: The EXACT Clinical Trial Matching System
In this analysis of CancerBot’s EXACT Clinical Trial Matching System, a similar approach applies:
Manual Annotation of Criterion Attributes
A random subset of all trials in a particular disease (generally sized for a 90 percent confidence level with a 5% margin of error) is manually annotated with ground-truth attributes (e.g., numerical thresholds, therapies/therapy components/therapy classes, genes/mutations/origins/interpretations).
Systematic Extraction via LLMs
The system parses all trial texts to extract structured eligibility components: entities (like disease subtype), numeric thresholds, temporal qualifiers, and inclusion/exclusion labels.
Compute Precision
For a given attribute (for example, “treatment” with the requirement “no prior chemotherapy”), what fraction of the values the system extracts match the manually labeled ground truth?
Compute Recall
Of all manually annotated attributes in the protocol, how many did the system successfully extract?
F1 Score as Summary
Useful when needing one metric to reflect the trade-off between precision and recall.
We report the attribute metrics individually, by category (e.g., disease markers, bloods, labs, patient behavior), and overall, allowing the system to be assessed globally while the LLM prompt engineering is tuned locally.
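A rough sketch of how such per-category and overall reporting might be aggregated (hypothetical function and data layout, not the EXACT system's actual code):

```python
from collections import defaultdict

def metrics_by_category(attribute_results):
    """attribute_results: (category, true_positives, false_positives, false_negatives)
    tuples from comparing extracted attributes against manual annotations."""
    counts = defaultdict(lambda: [0, 0, 0])
    for category, tp, fp, fn in attribute_results:
        for bucket in (category, "overall"):
            counts[bucket][0] += tp
            counts[bucket][1] += fp
            counts[bucket][2] += fn
    report = {}
    for bucket, (tp, fp, fn) in counts.items():
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        report[bucket] = {"precision": round(p, 3), "recall": round(r, 3), "f1": round(f1, 3)}
    return report

print(metrics_by_category([("disease markers", 40, 5, 3), ("labs", 25, 2, 6)]))
```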
Why This Matters for Exact Eligibility Matching
Trustworthy Matching: Incorrect or missing attributes can lead to false-positive matches (ineligible patients recommended) or false negatives (eligible patients missed).
Fine-Grained Eligibility Logic: Many trials involve nested logic (e.g., “(HER2-positive OR ER-positive) AND no prior systemic therapy within 6 months”). Accurate extraction is key to interpreting these correctly; a minimal sketch of such a nested representation follows this list.
Scalability & Generalizability: Systems like AutoCriteria show that models can generalize across diverse diseases with high extraction performance.
Attribute-Level Metrics: Reporting criteria at the attribute-by-attribute level helps guide the prompt engineering needed to induce the LLM to accurately extract criteria. It also provides more context to users of the system (doctors or patients) when examining trials and the extracted attributes to determine eligibility.
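Referring back to the nested-logic example above, here is one minimal way such a criterion could be represented and evaluated against a structured patient record (a hypothetical tree structure in which only equality tests are implemented, not the EXACT system's internal representation):

```python
# Hypothetical tree for "(HER2-positive OR ER-positive) AND no prior systemic
# therapy within 6 months".
criterion = {
    "AND": [
        {"OR": [
            {"attribute": "HER2", "value": "positive"},
            {"attribute": "ER", "value": "positive"},
        ]},
        {"attribute": "prior_systemic_therapy_within_6_months", "value": False},
    ]
}

def evaluate(node, patient):
    """Recursively evaluate a nested criterion tree against a structured patient record."""
    if "AND" in node:
        return all(evaluate(child, patient) for child in node["AND"])
    if "OR" in node:
        return any(evaluate(child, patient) for child in node["OR"])
    return patient.get(node["attribute"]) == node["value"]

patient = {"HER2": "positive", "ER": "negative", "prior_systemic_therapy_within_6_months": False}
print(evaluate(criterion, patient))  # True: the patient satisfies the nested criterion
```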
Summary
Our hope is to stimulate a discussion on how to truly measure the efficacy of clinical trial matchers. We believe this will spur LLM-based matchers (both open source and closed source) to do the hard work of attribute-by-attribute criteria extraction and create more useful systems for patients.
If CancerBot’s EXACT Clinical Trial Matching System ends up with many competitors doing true precision clinical trial matching as a result, we would deem that a success, and of course it would be one for patients in general. Moving from ranking to true eligibility assessment is just an important first step. In a follow-up post, we plan to discuss how such systems can rank and rate the “goodness” of trials for patients, but drawn from the subset of trials that truly match.
References
TREC 2022 Clinical Trials Track. Text Retrieval Conference. National Institute of Standards and Technology (NIST); 2022.
Jin Q, Wang Z, Floudas CS, Chen F, Gong C, Bracken-Clarke D, Xue E, Yang Y, Sun J, Lu Z. Matching patients to clinical trials with large language models. Nat Commun. 2024;15:9074. doi:10.1038/s41467-024-53081-z (introduces TrialGPT)
Abdallah M, Nakken S, Bierkens M, Galvis J, Groppi A, Karkar S, Meiqari L, Rujano MA, Canham S, Dienstmann R, Fijneman R, Hovig E, Meijer G, Nikolski M. TrialMatchAI: An end-to-end AI-powered clinical trial recommendation system to streamline patient-to-trial matching. arXiv. 2025. doi:10.48550/arXiv.2505.08508 (introduces TrialMatchAI)
Cerami E, Trukhanov P, Paul MA, Hassett MJ, Riaz IB, Lindsay J, Mallaber E, Klein H, Gungor G, Galvin M, Van Nostrand SC, Yu J, Mazor T, Kehl KL. MatchMiner-AI: An open-source solution for cancer clinical trial matching. arXiv. 2024. doi:10.48550/arXiv.2412.17228 (introduces MatchMiner-AI)
Klein H, Marriott E, Hansel J, Yu J, Albayrak A, Barry S, Keller RB, MacConaill LE, Lindeman N, Johnson BE, Rollins BJ, Do KT, Beardslee B, Shapiro G, Hector-Barry S, Methot J, Sholl L, Lindsay J, Hassett MJ, Cerami E. MatchMiner: an open-source platform for cancer precision medicine. npj Precis Oncol. 2022;6:69. doi:10.1038/s41698-022-00312-5 (introduces MatchMiner)
Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A survey of approaches for ranking clinical trial eligibility criteria. J Biomed Inform. 2014;52:112–120. doi:10.1016/j.jbi.2014.01.006
Ni Y, Krumholz HM, Xu J, et al. SEETrials: a system for automatically extracting and structuring safety information from clinical trial protocols. J Biomed Inform. 2024;155:104512. doi:10.1016/j.jbi.2024.104512
Wong A, Zhang R, Xu J, et al. AutoCriteria: large language model–based automatic eligibility criteria extraction. Sci Rep. 2024;14:11712. doi:10.1038/s41598-024-77447-x
Wang X, Zhang Y, Xu H. Automatic extraction of clinical trial eligibility criteria using an attention-based BiLSTM model. Proc Mach Learn Res. 2020. arXiv:2012.10063
Meystre SM, Sarnikar S, Knoll B, et al. Trial Eligibility Screening (TAES): automated extraction of eligibility criteria for prescreening in heart failure. BMC Med Res Methodol. 2023;23:165. doi:10.1186/s12874-023-01916-6
Luo Y, Xin Y, Joshi I, et al. Natural language processing for EHR-based clinical trial eligibility: a review of methods. BMC Med Inform Decis Mak. 2019;19(Suppl 3):123. doi:10.1186/s12911-019-0829-0
Warner JL, Zhang P, Liu J, Alterovitz G. Trial matching and patient recruitment for precision oncology: current status and future directions. JCO Clin Cancer Inform. 2021;5:1032–1042. doi:10.1200/CCI.21.00022
Blum A. The EXACT Clinical Trial Matching System. CancerBot; 2025.
Blum A. The EXACT System for Precision Clinical Trial Matching: Case Study of Accuracy with Follicular Lymphoma Trials. CancerBot; 2025.
Turning frustration into innovation
After being diagnosed with follicular lymphoma, AI tech entrepreneur Adam Blum assumed he could easily find cutting-edge treatment options. Instead, he faced resistance from doctors and an exhausting search process. Determined to fix this, he built CancerBot—an AI-powered tool that makes clinical trials more accessible, helping patients find potential life-saving treatments faster.