Accuracy of Precision Clinical Trial Matching with Follicular Lymphoma Trials on CancerBot

Adam Blum

Aug 23, 2025

In our last post, we discussed how clinical trial matching systems should evaluate their accuracy. Using the techniques recommended there, we now turn our attention to evaluating CancerBot’s trial matching accuracy.

CancerBot’s underlying EXACT system is a novel approach to clinical trial matching that converts unstructured eligibility criteria into structured attribute-based logic. This post evaluates the precision and recall of EXACT’s Attribute-Criteria Extraction (ACE) engine in extracting eligibility criteria from clinical trials for follicular lymphoma. We analyze performance across all of the trial attributes we track for follicular lymphoma. Results show high overall precision (80%) and recall (85%), with performance varying by attribute type and complexity.

1. Introduction

Despite numerous AI-driven efforts, clinical trial eligibility matching remains a critical bottleneck in oncology. Most commercial systems still make little effective use of structured lab, diagnostic, and biomarker data. Open source efforts such as TrialGPT and TrialMatchAI improve the trial screening process only incrementally (TrialGPT reports a 42% increase in screening speed). The open source MatchMiner’s CTML relies on trial researchers authoring structured criteria in their trials (rather than extracting structure from existing trials) and has not seen significant uptake among researchers running trials.

The EXACT (EXtracting Attributes from Clinical Trials) system from CancerBot offers a different approach by extracting and structuring criteria directly from trial text using LLM-driven attribute-criteria extraction. This post focuses on evaluating the performance of EXACT’s Attribute-Criteria Extraction (ACE) engine in accurately identifying and extracting eligibility constraints from unstructured trial descriptions.

2. System Overview

2.1 Attribute-Criteria Extraction (ACE) Engine

The ACE engine uses prompts refined by SMEs via a Prompt Workbench (PW) to extract logical constraints from trial text. Each attribute’s prompt can include variables (e.g., $therapy_type, $brca1_mutation) and domain-specific constraints (e.g., unit normalizations and adherence to enumerated types). The ACE engine has built-in knowledge of the therapy components within therapies, and the therapy types for each component. Deep knowledge of treatment constituents and types is crucial to accurate criteria extraction (and later, outside the scope of this analysis, to accurate trial-to-patient matching).
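
As a concrete illustration, here is a minimal sketch of what a variable-driven extraction prompt might look like. The wording, variable names, and enumerated values are assumptions for illustration; the production prompts are SME-authored in the Prompt Workbench and are not published.

```python
from string import Template

# Hypothetical ACE-style prompt template. $therapy_type ranges over all
# therapy types, so one template covers many attribute criteria.
THERAPY_TYPE_PROMPT = Template(
    "Read the eligibility criteria below. Is prior treatment with a "
    "$therapy_type REQUIRED, EXCLUDED, or NOT_MENTIONED for this trial? "
    "Answer with exactly one of those enumerated values.\n\n"
    "Eligibility criteria:\n$criteria_text"
)

prompt = THERAPY_TYPE_PROMPT.substitute(
    therapy_type="proteasome inhibitor",
    criteria_text="...",  # unstructured trial eligibility text goes here
)
```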

ACE also has knowledge of many “compound alternatives,” which combine individual criteria into a named group representing the OR of several individual attributes. These include terms such as “renal insufficiency” and “hypercalcemia” (both of which have several underlying attributes that determine their truth value), as well as disease-specific compound alternative labels such as MeetsCRAB (a boolean OR of calcium elevation, renal insufficiency, anemia, and bone lesions), MeetsSLiM (another OR combination of underlying attributes), and MeetsGELF (which combines various criteria to determine active follicular lymphoma).
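
A minimal sketch of how such a registry might be represented, assuming hypothetical attribute names (the actual EXACT schema is not published). Note that compounds can nest: renal insufficiency is itself a member of MeetsCRAB.

```python
# Each named compound alternative is the OR of its member attributes.
# Attribute names below are illustrative, not the real EXACT schema.
COMPOUND_ALTERNATIVES = {
    "renalInsufficiency": ["creatinineHigh", "creatinineClearanceLow"],
    "MeetsCRAB": ["calciumElevated", "renalInsufficiency", "anemia", "boneLesions"],
}

def evaluate_compound(name: str, patient: dict) -> bool:
    """OR semantics: true if any member attribute (or nested compound) holds."""
    return any(
        evaluate_compound(member, patient)
        if member in COMPOUND_ALTERNATIVES
        else patient.get(member, False)
        for member in COMPOUND_ALTERNATIVES[name]
    )

print(evaluate_compound("MeetsCRAB", {"creatinineHigh": True}))  # True
```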

The breadth of named OR combinations, and the ability to extract the inverse of each boolean, allow even the most complex trial logic to be expressed consistently in conjunctive normal form: a set of ANDed criteria at the top level that is easy to inspect for missing patient values when determining eligibility, and that can be executed performantly by EXACT’s PATCH patient-trial matching engine.
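
To make the conjunctive normal form concrete, here is a small sketch of evaluating trial logic as an AND of clauses, each clause an OR of possibly negated criteria. The criterion names and the default-to-false handling of missing values are illustrative assumptions, not PATCH’s actual behavior.

```python
Literal = tuple[str, bool]   # (criterion name, negated?)
Clause = list[Literal]       # OR of literals
CNF = list[Clause]           # AND of clauses

def eligible(cnf: CNF, patient: dict[str, bool]) -> bool:
    # A clause is satisfied when at least one literal matches the patient;
    # missing patient values default to False here for simplicity.
    return all(
        any(patient.get(name, False) != negated for name, negated in clause)
        for clause in cnf
    )

# Example: MeetsGELF AND (NOT priorCART OR relapsedAfterCART)
trial_cnf: CNF = [
    [("MeetsGELF", False)],
    [("priorCART", True), ("relapsedAfterCART", False)],
]
patient = {"MeetsGELF": True, "priorCART": True, "relapsedAfterCART": True}
print(eligible(trial_cnf, patient))  # True
```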

Note: this post is not a direct assessment of the accuracy of the patient-trial matching engine itself. Patient-trial matching accuracy is a function of both attribute-criteria extraction accuracy and patient attribute extraction (whether through patient record cleanup or through direct patient questions, as CancerBot uses). If you accept the assumption of accurate patient attribute extraction, however, then attribute-criteria extraction accuracy can be used as a proxy for overall patient-to-trial matching accuracy.

2.2 Prompt Workbench (PW)

SMEs interactively refine prompts based on LLM responses and cross-check them against labeled ground truth. Prompts can use the built-in variables, which range over all specific values, avoiding the need for separate prompts for every possible mutation or therapy type (as examples). Prompts are tested and validated on either a single trial or groups of trials. Once the prompt engineer is satisfied with the results, the changed prompts are saved and marked ready for global trial extraction. On a scheduled basis, bulk extraction is then performed by ACE across the entire trial corpus.
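
A minimal sketch of the test-against-labels step in that loop, assuming a hypothetical extract function standing in for the LLM call and trials keyed by NCT ID; none of these names come from the EXACT codebase.

```python
def validation_score(prompt: str, trials: list, labels: dict, extract) -> float:
    """Fraction of labeled trials where the prompt's extraction matches."""
    hits = sum(
        extract(prompt, trial["text"]) == labels[trial["nct_id"]]
        for trial in trials
    )
    return hits / len(trials)
```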

Figure 1: ACE Engine Architecture

2.3 Ground Truth Annotation

A representative sample of trials (specifically, the last 100 follicular lymphoma trials on clinicaltrials.gov) was labeled by SMEs for a set of target attributes (e.g., ECOG, HER2 status, prior therapies, bilirubin thresholds). These annotations were used as ground truth to evaluate the micro-averaged precision, recall, and F1 of attribute-criteria extraction.

3. Methods

3.1 Dataset

The CancerBot system operates on all clinicaltrials.gov, WHO ICTRP, and EUCTR follicular lymphoma and multiple myeloma trials; we currently have 1518 trials available for follicular lymphoma. For this evaluation, using the latest 100 trials on clinicaltrials.gov keeps the labeling effort scoped while serving as a proxy for a random sample of all trials. The set of trials currently labeled is here, and the individual attribute criteria labels themselves are available here. This sample size should achieve 88 percent confidence with a 5 percent margin of error, given an assumed true accuracy of 81.8% (based on measured F1).
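
For readers who want to reproduce the sampling arithmetic, here is a back-of-the-envelope sketch using the normal approximation for a binomial proportion. The confidence level implied by the resulting z depends on the interval convention used (e.g., Wald versus Wilson, one- versus two-sided), so treat this as a rough check rather than the exact calculation behind the 88 percent figure.

```python
import math

n, p = 100, 0.818                # labeled trials, assumed true accuracy
se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
z = 0.05 / se                    # z implied by a 5-point margin of error
print(f"standard error = {se:.3f}, implied z = {z:.2f}")
```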

For follicular lymphoma, we track 84 distinct patient attributes with 533 individual attribute criteria, listed here. Each attribute generally has multiple criteria associated with it. An example is therapyTypesRequired: for each therapy type (such as proteasome inhibitors, immunomodulatory drugs, or corticosteroids), the criterion is whether that specific type is required for participation. For lab values, there will typically be criteria such as a minimum or a maximum.
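
An illustrative (hypothetical) shape for an extracted attribute criterion; the real EXACT schema is not published, so the field names here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AttributeCriterion:
    attribute: str           # e.g. "therapyTypesRequired" or "bilirubin"
    operator: str            # e.g. "requires", "excludes", "min", "max"
    value: object            # a therapy type name, a threshold, etc.
    unit: str | None = None  # e.g. "mg/dL" for lab thresholds

criteria = [
    AttributeCriterion("therapyTypesRequired", "requires", "proteasome inhibitor"),
    AttributeCriterion("bilirubin", "max", 1.5, "mg/dL"),
]
```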

Note that we only extract and track attribute criteria that occur in more than one trial for a disease. The accuracy metrics are computed within the defined universe of attributes we track for each specific disease.

3.2 Metrics

For the metrics on each attribute, we focus on the micro-averaged precision and recall of the attribute criteria we extract. The attributes we prepare criteria for are exactly the set of attributes the system tracks. In cancer clinical trial matching, a trial’s text often expresses criteria that appear infrequently (sometimes only once) and are rarely used again; the system does not attempt to match such rare criteria, so we do not measure efficacy on these outliers. We also calculate these metrics with Partial Labeling, which creates an intrinsic downward bias for both precision and recall versus Full Labeling. See Appendix I for a discussion of micro-averaging and partial labeling.
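
As a sketch of the computation, these are the standard micro-averaged definitions; nothing here is specific to EXACT.

```python
def micro_metrics(counts: list[tuple[int, int, int]]):
    """Micro-averaged precision/recall/F1 from per-attribute (tp, fp, fn).

    Pooling the counts before dividing is what makes the average "micro":
    every extraction decision weighs equally, regardless of attribute.
    Under partial labeling, extractions without a corresponding label are
    scored as errors, biasing both precision and recall downward.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```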

3.3 Evaluation Process

The evaluation process is as follows:

  1. Ground truth attribute values are labeled by SMEs for the target set of trials (in this case, the 100 most recent FL trials)

  2. Extracted attribute criteria are compared to the ground truth labeled values (a comparison sketch follows this list)

  3. Accuracy metrics are updated and analyzed
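
A minimal sketch of step 2, scoring one trial by set comparison; the exact-match criterion and the shape of the sets are illustrative assumptions. The per-trial counts then feed the micro-averaged metrics shown earlier.

```python
def compare_trial(extracted: set, labeled: set) -> tuple[int, int, int]:
    """Count agreement between extracted and labeled criteria for one trial."""
    tp = len(extracted & labeled)  # extracted and labeled agree
    fp = len(extracted - labeled)  # extracted but not in the labels
    fn = len(labeled - extracted)  # labeled but missed by extraction
    return tp, fp, fn
```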

Note that the attribute criteria in CancerBot are never set manually by the SMEs. They are set by the LLM via attribute-criteria extraction prompts, engineered in the Prompt Workbench. If the attribute criteria were set manually by the SMEs, the accuracy would be virtually 100%, since the process would be identical to setting the labeled values. A few special-purpose trial databases exist that cover small numbers of trials for a single disease or sub-disease using this manual, labor-intensive approach. But such an approach will not scale to a trial corpus of tens of thousands of cancer trials with new trials being updated every day.

4. Results

4.1 Overall Performance

Below is the aggregate precision, recall, and F1 from the analysis of extracted trial attribute criteria values versus labeled data, as shown at accuracy.cancerbot.org. From this dashboard, you can explore the list of trials analyzed, the labeled data, and accuracy metrics per attribute and per attribute category.

4.2 Performance by Attribute Category

Different attribute categories show significantly different metrics. Interpreting treatment requirements is the biggest challenge today (77% micro precision in determining which therapies are required or excluded). Numeric lab requirements are handled quite well, at 98% micro precision. Micro precision by category and by specific attribute is the primary guide for where the SME prompt engineers spend their ongoing prompt-optimization efforts.

5. Discussion

The ACE engine performs strongly across most attribute types, particularly those with discrete values or strong synonym lists. Attributes involving therapy history inclusion and exclusion require more careful unit normalization and prompt tuning, as evidenced by the lower micro precision.

Prompt engineering plays a critical role in controlling hallucination and improving attribute recall. For example, adding “Do not use ECOG” to a Karnofsky prompt improved extraction accuracy by 11% on related trials. The ease of evaluating results within the Prompt Workbench environment makes creating and modifying these prompts much easier than authoring such text outside the EXACT system.
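
As a hypothetical rendering of that fix (the production prompt wording is not published), the change amounts to appending a single negative instruction:

```python
# Before: the model sometimes returned ECOG values for Karnofsky prompts.
KARNOFSKY_PROMPT_V1 = (
    "Extract the minimum Karnofsky performance status this trial requires."
)
# After: one guardrail sentence steers the model away from the wrong scale.
KARNOFSKY_PROMPT_V2 = (
    "Extract the minimum Karnofsky performance status this trial requires. "
    "Do not use ECOG."
)
```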

LLM model selection also impacted outcomes, with newer models yielding better precision but not always better recall.

6. Conclusion

The EXACT system’s ACE engine demonstrates high precision and recall for eligibility attribute extraction in follicular lymphoma clinical trials. With systematic prompt refinement, domain-specific variables, and unit normalization, it addresses many limitations of prior approaches to clinical trial matching, including more recent LLM-based approaches. These results suggest that structured extraction from unstructured eligibility text is feasible at scale and can support rapid and reliable trial-patient matching.

7. Future Work

  • Add error type classification to an updated version of this post shortly.

  • Evaluate multiple myeloma and breast cancer accuracy in an updated version of this post.

  • Extend the evaluation to additional cancers over time. EXACT and its ACE engine support follicular lymphoma and multiple myeloma today; breast cancer work is underway, and all other cancers will be added.

About CancerBot

Turning frustration into innovation

After being diagnosed with follicular lymphoma, AI tech entrepreneur Adam Blum assumed he could easily find cutting-edge treatment options. Instead, he faced resistance from doctors and an exhausting search process. Determined to fix this, he built CancerBot—an AI-powered tool that makes clinical trials more accessible, helping patients find potential life-saving treatments faster.

Start your search for clinical trials now

New treatment options could be just a click away. Start a chat with CancerBot today and get matched with clinical trials tailored to you—quickly, easily, and at no cost.
