Implementing the Variant Agent: Part 2


This post continues a series about how AI agents might aid the genomic diagnosis of rare diseases. See the previous four posts for additional context.

Introduction

agent flowchart with candidate variants highlighted

My previous post described how to get annotated variants into a searchable database using Illumina’s Nirvana, DuckDB, and SQLAlchemy. This post will continue from there by defining the Variant Agent and its variant search tool.

Agent instructions

Below are the instructions given to the Variant Agent:

Role

You are a medical expert specializing in genetic diagnostics for pediatric rare disease patients. Use the query_variants tool to search the patient’s GRCh38 variants (with parental genotypes when available) across a ranked list of candidate genes. Only report variants actually returned by the tool; do not speculate. Prioritize rare variants (default gnomAD AF < 0.005). If no candidate variants are found, state that clearly.

Key clinical rules

Report a variant only if the proband genotype is non-reference.

Use provided gnomAD v4.0 frequencies to determine rarity.

Analyze inheritance from maternal and paternal genotypes; note that X-chromosome variants are typically hemizygous in males; call out probable de novo when suggested by parental genotypes.

If the query tool fails, stop and return the failure message.

Tool

def query_variants(
    gene: str | None = None,
    clinvar: List[str] | None = None,  # see Valid ClinVar values
    consequence: List[str] | None = None,  # see Valid Consequences
    max_gnomad_freq: float | None = None,
    limit: int = 20,
    offset: int = 0,
) -> str:

Valid ClinVar values

When querying by ClinVar classification, the following are valid classifications. Use only values from this list. Do not fabricate new labels.

Affects
Affects; association; other
association
association; drug response
association; drug response; risk factor
association not found
Benign
Benign; association
Benign; confers sensitivity
Benign; drug response
Benign/Likely benign
Benign; other
Benign; risk factor
confers sensitivity
confers sensitivity; other
Conflicting classifications of pathogenicity
Conflicting classifications of pathogenicity; Affects
Conflicting classifications of pathogenicity; other
Conflicting classifications of pathogenicity; other; risk factor
Conflicting classifications of pathogenicity; protective
Conflicting classifications of pathogenicity; risk factor
drug response
drug response; other
drug response; risk factor
Likely benign
Likely pathogenic
Likely pathogenic; protective
Likely risk allele
not provided
other
other; risk factor
Pathogenic
Pathogenic/Likely pathogenic
Pathogenic; risk factor
protective
protective; risk factor
risk factor
Uncertain risk allele
Uncertain significance
Uncertain significance; association
Uncertain significance; drug response

Valid consequences

When filtering by transcript consequences, these are the valid transcript consequence values. Use only values from this list. Do not fabricate new labels.

3_prime_UTR_variant
5_prime_UTR_variant
coding_sequence_variant
downstream_gene_variant
frameshift_variant
incomplete_terminal_codon_variant
inframe_deletion
inframe_insertion
intron_variant
mature_miRNA_variant
missense_variant
NMD_transcript_variant
non_coding_transcript_exon_variant
non_coding_transcript_variant
splice_acceptor_variant
splice_donor_variant
splice_region_variant
start_lost
start_retained_variant
stop_gained
stop_lost
stop_retained_variant
synonymous_variant
transcript_variant
upstream_gene_variant

Search strategy

Initialize thresholds

max_gnomad_freq: 0.005 by default (can lower to 0.001 for stricter passes if too many hits).

Start with severe consequences:

frameshift_variant

stop_gained

stop_lost

splice_acceptor_variant

splice_donor_variant

start_lost

inframe_deletion

inframe_insertion

missense_variant

(You may relax later to include splice_region_variant, coding_sequence_variant, synonymous_variant only if earlier passes return zero.)

Iterate over ranked genes

For each candidate gene (highest rank first), query in batches: limit=50, paginate with offset += limit until the tool returns fewer than limit results.

Triage order

Prefer hits with ClinVar Pathogenic / Likely pathogenic / Pathogenic/Likely pathogenic, then Conflicting classifications of pathogenicity, then Uncertain significance, then others.

Within the same ClinVar tier, sort by consequence severity (PVS/PS-like: LoF > canonical splice > missense > inframe > synonymous/UTR), then by lowest gnomAD AF.

Inheritance analysis

Use parent genotypes to infer autosomal recessive (biallelic), compound het (if data provided), de novo, X-linked, dominant models. Explicitly state the evidence (e.g., proband het, parents ref/ref ⇒ likely de novo).

Note male hemizygosity on X.

Reportability checks

Proband must be non-reference.

Variant must pass the gnomAD threshold (unless explaining an override).

Do not include variants not returned by the tool.

Failure handling

If query_variants returns an error, stop and return the message verbatim.

Output format

Return descriptions of any candidate variants found as well as your reasoning. Keep explanations concise and evidence-based. Avoid clinical assertions beyond what the tool returns and the basic population frequency/context rules above.

The instructions describe a minimal approach: search a provided, ranked list of candidate genes for rare variants with severe predicted consequences (loss-of-function first, then in-frame and missense changes). The variant search tool reflects the same minimal approach and offers only the search criteria explicitly mentioned in the instructions.
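
The agent applies the triage order from its prompt on its own; no ranking code is involved. Purely as an illustration of that ordering, here is a minimal sketch that expresses it as a sort key. The tier and severity tables, the `triage_key` function, and the sample hits are all hypothetical, and the mapping of individual consequences to severity groups is an assumption rather than something the prompt spells out.

```python
# Hypothetical tier tables that mirror the "Triage order" section of the prompt.
CLINVAR_TIER = {
    "Pathogenic": 0,
    "Likely pathogenic": 0,
    "Pathogenic/Likely pathogenic": 0,
    "Conflicting classifications of pathogenicity": 1,
    "Uncertain significance": 2,
}  # anything else (or no ClinVar entry) falls into tier 3

# Assumed grouping: LoF > canonical splice > missense > inframe > synonymous/UTR.
SEVERITY = {
    "frameshift_variant": 0,
    "stop_gained": 0,
    "start_lost": 0,
    "splice_acceptor_variant": 1,
    "splice_donor_variant": 1,
    "missense_variant": 2,
    "inframe_deletion": 3,
    "inframe_insertion": 3,
    "synonymous_variant": 4,
    "3_prime_UTR_variant": 4,
    "5_prime_UTR_variant": 4,
}


def triage_key(hit: dict) -> tuple[int, int, float]:
    """Sort key: ClinVar tier, then consequence severity, then lowest gnomAD AF."""
    tier = CLINVAR_TIER.get(hit.get("clinvar"), 3)
    severity = SEVERITY.get(hit.get("consequence"), 5)
    af = hit.get("gnomad_af")
    # Assumption: a variant absent from gnomAD is treated as the rarest possible.
    return (tier, severity, af if af is not None else 0.0)


# Made-up hits, only to exercise the ordering.
hits = [
    {"vid": "a", "clinvar": "Uncertain significance", "consequence": "frameshift_variant", "gnomad_af": 0.0001},
    {"vid": "b", "clinvar": "Pathogenic", "consequence": "missense_variant", "gnomad_af": 0.001},
    {"vid": "c", "clinvar": "Pathogenic", "consequence": "stop_gained", "gnomad_af": 0.002},
    {"vid": "d", "clinvar": None, "consequence": "missense_variant", "gnomad_af": None},
]
ranked = [hit["vid"] for hit in sorted(hits, key=triage_key)]
print(ranked)
```

Note that the ClinVar tier dominates: the pathogenic missense hit outranks the frameshift of uncertain significance, which is what the prompt's "within the same ClinVar tier" wording implies.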

The variant search tool

The function definition for the query_variants tool described in the agent’s instructions accepts four optional query parameters: a gene name, a list of ClinVar classifications, a list of transcript consequences, and a gnomAD maximum frequency value. Our search tool needs to build a dynamic query based on those parameters. With SQLAlchemy, we can progressively build such a query using method chaining. Here’s an example with two query conditions:

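from sqlalchemy import MetaData, Table, create_engine, func, select

# Open the annotated variant database in read-only mode
engine = create_engine("duckdb:///colombian_trio.exome.duckdb?access_mode=READ_ONLY")
conn = engine.connect()

# Reflect the existing tables rather than redefining their columns
md = MetaData()
variants = Table("variants", md, autoload_with=conn)
var_genes = Table("variant_genes", md, autoload_with=conn)

# simulate optional query parameters
gene = 'SAMD11'
max_gnomad_freq = None

# DISTINCT guards against duplicate variant rows introduced by the join
query = select(variants).distinct()

# Each optional parameter adds a condition only when it was supplied
if gene is not None:
    query = query.join(
        var_genes,
        (var_genes.c.vid == variants.c.vid)
        & (var_genes.c.chromosome == variants.c.chromosome),
    ).where(var_genes.c.gene_symbol == gene)

if max_gnomad_freq is not None:
    query = query.where(variants.c.gnomad_af <= max_gnomad_freq)

# Count the matches by wrapping the built query in a subquery
total_variants: int = conn.execute(
    select(func.count()).select_from(query.subquery())
).scalar_one()

print(f"Total variants: {total_variants}")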

That’s really all we need to build out the full query_variants function. One small issue I encountered when writing the function is that the DuckDB driver for Python is not async-aware. As all of the previous agents and tools in the genomics workflow have been run as async functions, I implemented a workaround to keep query_variants async by executing DuckDB queries in a separate thread via asyncio.to_thread().
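
The threading workaround can be sketched in a few lines. This is not the code from the gist: `blocking_query` and `query_variants_demo` are hypothetical names, the rows are made up, and the standard library's `sqlite3` stands in for DuckDB only so that the example runs without extra dependencies. The point is the shape of the pattern: the synchronous database call lives in an ordinary function, and the async tool awaits it through `asyncio.to_thread()`.

```python
import asyncio
import sqlite3

# Made-up rows; sqlite3 is a stand-in for the DuckDB database in this sketch.
DEMO_ROWS = [("1-100-A-T", 0.0001), ("1-200-G-C", 0.2), ("1-300-C-G", 0.004)]


def blocking_query(max_gnomad_freq: float) -> list[str]:
    """Synchronous query, as a driver without async support would run it."""
    # The connection is opened inside the worker thread that uses it.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE variants (vid TEXT, gnomad_af REAL)")
        conn.executemany("INSERT INTO variants VALUES (?, ?)", DEMO_ROWS)
        rows = conn.execute(
            "SELECT vid FROM variants WHERE gnomad_af <= ? ORDER BY vid",
            (max_gnomad_freq,),
        ).fetchall()
        return [vid for (vid,) in rows]
    finally:
        conn.close()


async def query_variants_demo(max_gnomad_freq: float) -> list[str]:
    """Async wrapper: the blocking call runs in a worker thread."""
    return await asyncio.to_thread(blocking_query, max_gnomad_freq)


rare_vids = asyncio.run(query_variants_demo(0.005))
print(rare_vids)
```

While the worker thread waits on the database, the event loop remains free to run other coroutines, which is what lets the tool sit alongside the other async agents and tools in the workflow.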

Previously, the HPO Agent and Gene Agent tools accessed resources from the global scope. While that technique would also work for the Variant Agent, I decided to take an improved approach and use local context. When a resources object is supplied as the run context, the OpenAI Agents SDK passes it on to each tool. This object provides access to the DuckDB variant database. I also included some helper functions in the resources object to build text outputs for a variant’s transcript information and ClinVar entries.
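
To make the idea of a resources object concrete, here is a minimal sketch that deliberately leaves the SDK out so it runs on its own. `DemoResources`, `format_clinvar`, and `describe_variant` are hypothetical names, and the variant and ClinVar values are made up; the real class and its helpers are in the gist. In the OpenAI Agents SDK, a tool would typically receive this object through the run context wrapper passed as its first argument, instead of taking it as a plain parameter as shown here.

```python
from dataclasses import dataclass


@dataclass
class DemoResources:
    """Hypothetical stand-in for the resources object given to the agent."""

    database_url: str

    def format_clinvar(self, entries: list[dict]) -> str:
        """Build a short text summary of ClinVar entries for the model to read."""
        if not entries:
            return "ClinVar: no entries"
        parts = [f"{e['significance']} ({e['id']})" for e in entries]
        return "ClinVar: " + "; ".join(parts)


def describe_variant(resources: DemoResources, vid: str, clinvar_entries: list[dict]) -> str:
    # The tool uses helpers from the object it was handed; nothing is read
    # from module-level globals.
    return f"{vid} | {resources.format_clinvar(clinvar_entries)}"


resources = DemoResources(database_url="duckdb:///example.duckdb")
summary = describe_variant(
    resources,
    "X-1000-C-T",
    [{"id": "demo-entry-1", "significance": "Pathogenic"}],
)
print(summary)
```

Keeping the database handle and the formatting helpers on one object also makes the tool easier to test, since a test can pass in a small fake instead of patching globals.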

Due to its length, I’ve placed the Variant Agent implementation in this GitHub gist: variant_agent.py.

To run the agent, we can write a simple test script that imports and runs the agent with our resource object as context:

import argparse
import os
import asyncio
from agents import Runner, trace
from dotenv import load_dotenv
from variant_agent import VariantAgentResources, create_variant_agent

load_dotenv()


async def main():
    """Run the Variant Agent on a phenotype description and hard-coded candidate genes."""

    parser = argparse.ArgumentParser(description="Test the Variant Agent.")
    parser.add_argument(
        "-p",
        "--phenotype",
        type=str,
        required=True,
        help="File path to a description of the patient's phenotype.",
    )
    args = parser.parse_args()

    if not os.path.exists(args.phenotype):
        print(f"Phenotype file not found: {args.phenotype}")
        return

    with open(args.phenotype, "r", encoding="utf-8") as file:
        phenotype_description = file.read().strip()

    variant_resources = VariantAgentResources(
        database_url="duckdb:///colombian_trio.exome.duckdb?access_mode=READ_ONLY",
        prompt_file="prompts/variant_agent.md",
    )
    variant_agent = create_variant_agent(variant_resources)

    with trace("variant-agent workflow"):
        ranked_genes = [
            {
                "gene": "MECP2",
                "rank": 1,
                "reasoning": "MECP2 is associated with three relevant HPO IDs",
            },
            {
                "gene": "CDKL5",
                "rank": 2,
                "reasoning": "CDKL5 is associated with three relevant HPO IDs",
            },
        ]

        genes_input = ""
        for gene in ranked_genes:
            genes_input += (
                f"Gene: {gene['gene']}, "
                f"Rank: {gene['rank']}, "
                f"Reasoning: {gene['reasoning']}\n"
            )
        print(genes_input)

        print("Starting variant search...")
        variant_result = await Runner.run(
            starting_agent=variant_agent,
            context=variant_resources,
            input=phenotype_description + "\n\nCandidate genes:\n" + genes_input,
            max_turns=20,
        )
        print("Variant search completed. Results:")
        print(variant_result.final_output)


if __name__ == "__main__":
    asyncio.run(main())

The test script hard-codes two candidate genes based on the initial example case in the first post of this series. The candidate genes would normally be provided by the Gene Agent.

Ideas for improvement

In my first post, I mentioned that a significant limitation of the Variant Agent is that it relies on only three criteria when attempting to locate causative variants: transcript consequences, ClinVar classifications, and gnomAD allele frequency. This strategy is effective only for well-characterized diseases caused by obvious deleterious variants. For more complex cases, the agent will have little chance of finding a diagnostic variant given the limited annotations provided to it.

Exposing additional variant annotation data to the agent can be done in three steps:

1. Import the annotation data into the DuckDB database.
2. Modify the query_variants tool to allow for the search and retrieval of the new fields.
3. Update the Variant Agent instructions to explain how and when to utilize the new annotations.

I’ve also been thinking about how to find challenging real-world cases to help guide improvements to the multi-agent system. One promising source could be the Critical Assessment of Genome Interpretation (CAGI) group, which organizes periodic challenges related to the interpretation of genetic variants. A previous challenge from 2021, the CAGI6 Rare Genomes Project, is one possible source of cases for testing the current system’s capabilities and developing improvement plans.

What’s next

With all three agents now described in this series of posts, I plan to release the full agent trio in a public GitHub repo so that others can explore the potential of this multi-agent system.