Implementing the Variant Agent: Part 2
This post is a continuation of a series about how AI agents might aid in the diagnosis of rare diseases in genomics. See the previous four posts for additional context:
- Can an AI agent help diagnose genetic diseases?
- Implementing the HPO Agent
- Implementing the Gene Agent
- Implementing the Variant Agent: Part 1
Introduction
My previous post described how to get annotated variants into a searchable database using Illumina’s Nirvana, DuckDB, and SQLAlchemy. This post will continue from there by defining the Variant Agent and its variant search tool.
Agent instructions
Below are the instructions given to the Variant Agent:
Role
You are a medical expert specializing in genetic diagnostics for pediatric rare disease patients. Use the query_variants tool to search the patient’s GRCh38 variants (with parental genotypes when available) across a ranked list of candidate genes. Only report variants actually returned by the tool; do not speculate. Prioritize rare variants (default gnomAD AF < 0.005). If no candidate variants are found, state that clearly.
Key clinical rules
- Report a variant only if the proband genotype is non-reference.
- Use provided gnomAD v4.0 frequencies to determine rarity.
- Analyze inheritance from maternal and paternal genotypes; note that X-chromosome variants are typically hemizygous in males; call out probable de novo when suggested by parental genotypes.
- If the query tool fails, stop and return the failure message.
Tool
def query_variants(
    gene: str | None = None,
    clinvar: List[str] | None = None,  # see Valid ClinVar values
    consequence: List[str] | None = None,  # see Valid Consequences
    max_gnomad_freq: float | None = None,
    limit: int = 20,
    offset: int = 0,
) -> str:
Valid ClinVar values
When querying by ClinVar classification, the following are valid classifications. Use only values from this list. Do not fabricate new labels.
Affects
Affects; association; other
association
association; drug response
association; drug response; risk factor
association not found
Benign
Benign; association
Benign; confers sensitivity
Benign; drug response
Benign/Likely benign
Benign; other
Benign; risk factor
confers sensitivity
confers sensitivity; other
Conflicting classifications of pathogenicity
Conflicting classifications of pathogenicity; Affects
Conflicting classifications of pathogenicity; other
Conflicting classifications of pathogenicity; other; risk factor
Conflicting classifications of pathogenicity; protective
Conflicting classifications of pathogenicity; risk factor
drug response
drug response; other
drug response; risk factor
Likely benign
Likely pathogenic
Likely pathogenic; protective
Likely risk allele
not provided
other
other; risk factor
Pathogenic
Pathogenic/Likely pathogenic
Pathogenic; risk factor
protective
protective; risk factor
risk factor
Uncertain risk allele
Uncertain significance
Uncertain significance; association
Uncertain significance; drug response
Valid consequences
When filtering by transcript consequences, these are the valid transcript consequence values. Use only values from this list. Do not fabricate new labels.
3_prime_UTR_variant
5_prime_UTR_variant
coding_sequence_variant
downstream_gene_variant
frameshift_variant
incomplete_terminal_codon_variant
inframe_deletion
inframe_insertion
intron_variant
mature_miRNA_variant
missense_variant
NMD_transcript_variant
non_coding_transcript_exon_variant
non_coding_transcript_variant
splice_acceptor_variant
splice_donor_variant
splice_region_variant
start_lost
start_retained_variant
stop_gained
stop_lost
stop_retained_variant
synonymous_variant
transcript_variant
upstream_gene_variant
Search strategy
- Initialize thresholds
  - max_gnomad_freq: 0.005 by default (can lower to 0.001 for stricter passes if too many hits).
  - Start with severe consequences:
    - frameshift_variant
    - stop_gained
    - stop_lost
    - splice_acceptor_variant
    - splice_donor_variant
    - start_lost
    - inframe_deletion
    - inframe_insertion
    - missense_variant
    (You may relax later to include splice_region_variant, coding_sequence_variant, synonymous_variant only if earlier passes return zero.)
- Iterate over ranked genes
  - For each candidate gene (highest rank first), query in batches: limit=50, paginate with offset += limit until the tool returns fewer than limit results.
- Triage order
  - Prefer hits with ClinVar Pathogenic / Likely pathogenic / Pathogenic/Likely pathogenic, then Conflicting classifications of pathogenicity, then Uncertain significance, then others.
  - Within the same ClinVar tier, sort by consequence severity (PVS/PS-like: LoF > canonical splice > missense > inframe > synonymous/UTR), then by lowest gnomAD AF.
- Inheritance analysis
  - Use parent genotypes to infer autosomal recessive (biallelic), compound het (if data provided), de novo, X-linked, dominant models. Explicitly state the evidence (e.g., proband het, parents ref/ref ⇒ likely de novo).
  - Note male hemizygosity on X.
- Reportability checks
  - Proband must be non-reference.
  - Variant must pass the gnomAD threshold (unless explaining an override).
  - Do not include variants not returned by the tool.
- Failure handling
  - If query_variants returns an error, stop and return the message verbatim.
Output format
Return descriptions of any candidate variants found as well as your reasoning. Keep explanations concise and evidence-based. Avoid clinical assertions beyond what the tool returns and the basic population frequency/context rules above. 
The instructions describe a minimal approach for identifying rare loss-of-function variants based on a provided ranked list of candidate genes. The variant search tool itself also reflects this minimal approach and only offers search criteria explicitly mentioned in the instructions.
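To make the "Triage order" rules concrete, here is a minimal sketch of that ordering written as a Python sort key. The agent applies these rules through its own reasoning, not through code like this. The dictionary keys (`clinvar`, `consequence`, `gnomad_af`), the numeric tiers, and the grouping of start_lost and stop_lost with loss-of-function consequences are all assumptions made for illustration.

```python
# Hypothetical tiers mirroring the "Triage order" rules in the prompt.
CLINVAR_TIER = {
    "Pathogenic": 0,
    "Likely pathogenic": 0,
    "Pathogenic/Likely pathogenic": 0,
    "Conflicting classifications of pathogenicity": 1,
    "Uncertain significance": 2,
}
OTHER_CLINVAR_TIER = 3

# Assumed severity groups: LoF > canonical splice > missense > inframe > other.
CONSEQUENCE_RANK = {
    "frameshift_variant": 0,
    "stop_gained": 0,
    "stop_lost": 0,   # assumption: grouped with LoF
    "start_lost": 0,  # assumption: grouped with LoF
    "splice_acceptor_variant": 1,
    "splice_donor_variant": 1,
    "missense_variant": 2,
    "inframe_deletion": 3,
    "inframe_insertion": 3,
}
OTHER_CONSEQUENCE_RANK = 4


def triage_key(variant: dict) -> tuple[int, int, float]:
    """Sort key: ClinVar tier, then consequence severity, then gnomAD AF."""
    clinvar_tier = CLINVAR_TIER.get(variant.get("clinvar"), OTHER_CLINVAR_TIER)
    severity = CONSEQUENCE_RANK.get(variant.get("consequence"), OTHER_CONSEQUENCE_RANK)
    af = variant.get("gnomad_af")
    # Assumption: a missing frequency means the variant is absent from gnomAD.
    return (clinvar_tier, severity, 0.0 if af is None else af)


# Made-up hits, only to exercise the ordering.
hits = [
    {"id": "v1", "clinvar": "Uncertain significance", "consequence": "frameshift_variant", "gnomad_af": 0.0001},
    {"id": "v2", "clinvar": "Pathogenic", "consequence": "missense_variant", "gnomad_af": 0.001},
    {"id": "v3", "clinvar": "Pathogenic", "consequence": "stop_gained", "gnomad_af": None},
    {"id": "v4", "clinvar": "Pathogenic", "consequence": "missense_variant", "gnomad_af": 0.0002},
]
ranked = sorted(hits, key=triage_key)
print([v["id"] for v in ranked])
```

Because the key is a tuple, ClinVar tier always dominates, so a frameshift of uncertain significance still sorts below a Pathogenic missense variant.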
The variant search tool
The function definition for the query_variants tool described in the agent’s instructions accepts four optional query parameters: a gene name, a list of ClinVar classifications, a list of transcript consequences, and a gnomAD maximum frequency value. Our search tool needs to build a dynamic query based on those parameters. With SQLAlchemy, we can progressively build such a query using method chaining. Here’s an example with two query conditions:
from sqlalchemy import MetaData, Table, create_engine, func, select

# Open the DuckDB database read-only (the duckdb:// URL requires the duckdb-engine package)
engine = create_engine("duckdb:///colombian_trio.exome.duckdb?access_mode=READ_ONLY")
conn = engine.connect()

# Reflect the existing tables from the database
md = MetaData()
variants = Table("variants", md, autoload_with=conn)
var_genes = Table("variant_genes", md, autoload_with=conn)

# simulate optional query parameters
gene = 'SAMD11'
max_gnomad_freq = None

query = select(variants).distinct()

# Add each condition only when its parameter was supplied
if gene is not None:
    query = query.join(
        var_genes,
        (var_genes.c.vid == variants.c.vid)
        & (var_genes.c.chromosome == variants.c.chromosome),
    ).where(var_genes.c.gene_symbol == gene)

if max_gnomad_freq is not None:
    query = query.where(variants.c.gnomad_af <= max_gnomad_freq)

# Count the matching rows by wrapping the query in a subquery
total_variants: int = conn.execute(
    select(func.count()).select_from(query.subquery())
).scalar_one()
print(f"Total variants: {total_variants}")
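The list-valued parameters and the limit/offset pagination can be layered on with the same chaining pattern. The sketch below is not the code from the gist: it runs against an in-memory SQLite table as a stand-in for the DuckDB file, and `toy_variants`, the `clinvar_significance` column, and `build_query` are hypothetical names chosen for illustration.

```python
from sqlalchemy import Column, Float, MetaData, String, Table, create_engine, insert, select

# In-memory SQLite stand-in so the example runs without the DuckDB file.
engine = create_engine("sqlite://")
md = MetaData()

# Toy table; `clinvar_significance` is a hypothetical column name.
toy_variants = Table(
    "variants",
    md,
    Column("vid", String, primary_key=True),
    Column("clinvar_significance", String),
    Column("gnomad_af", Float),
)
md.create_all(engine)


def build_query(clinvar=None, max_gnomad_freq=None, limit=20, offset=0):
    """Chain optional filters, then apply ordered pagination."""
    query = select(toy_variants)
    if clinvar:
        # A list-valued filter becomes an IN (...) clause.
        query = query.where(toy_variants.c.clinvar_significance.in_(clinvar))
    if max_gnomad_freq is not None:
        query = query.where(toy_variants.c.gnomad_af <= max_gnomad_freq)
    # A deterministic ORDER BY keeps pages stable across calls.
    return query.order_by(toy_variants.c.vid).limit(limit).offset(offset)


with engine.begin() as conn:
    # Made-up rows, only to exercise the filters.
    conn.execute(
        insert(toy_variants),
        [
            {"vid": "v1", "clinvar_significance": "Pathogenic", "gnomad_af": 0.0001},
            {"vid": "v2", "clinvar_significance": "Benign", "gnomad_af": 0.0002},
            {"vid": "v3", "clinvar_significance": "Likely pathogenic", "gnomad_af": 0.01},
            {"vid": "v4", "clinvar_significance": "Likely pathogenic", "gnomad_af": 0.001},
            {"vid": "v5", "clinvar_significance": "Pathogenic", "gnomad_af": 0.004},
        ],
    )
    filters = {"clinvar": ["Pathogenic", "Likely pathogenic"], "max_gnomad_freq": 0.005}
    first_page = [r.vid for r in conn.execute(build_query(**filters, limit=2, offset=0)).all()]
    second_page = [r.vid for r in conn.execute(build_query(**filters, limit=2, offset=2)).all()]

print(first_page, second_page)
```

The second page returns fewer rows than the limit, which is the stopping signal the agent's instructions rely on when paginating.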
That’s really all we need to build out the full query_variants function. One small issue I encountered when writing the function is that the DuckDB driver for Python is not async-aware. As all of the previous agents and tools in the genomics workflow have been run as async functions, I implemented a workaround to keep query_variants async by executing DuckDB queries in a separate thread via asyncio.to_thread().
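As a rough illustration of that workaround, the sketch below wraps a blocking database call in asyncio.to_thread() so it can be awaited without stalling the event loop. It uses the standard library's sqlite3 module as a stand-in for DuckDB, and the function names are hypothetical; the real tool in the gist may be organized differently.

```python
import asyncio
import sqlite3


def count_rows_blocking(db_path: str) -> int:
    """Blocking query; the connection lives entirely inside the worker thread."""
    conn = sqlite3.connect(db_path)
    try:
        # Toy data so the example is self-contained.
        conn.execute("CREATE TABLE variants (vid TEXT)")
        conn.executemany(
            "INSERT INTO variants VALUES (?)", [("v1",), ("v2",), ("v3",)]
        )
        return conn.execute("SELECT COUNT(*) FROM variants").fetchone()[0]
    finally:
        conn.close()


async def count_rows(db_path: str) -> int:
    """Async wrapper: run the blocking call in a separate thread."""
    return await asyncio.to_thread(count_rows_blocking, db_path)


total = asyncio.run(count_rows(":memory:"))
print(f"Total variants: {total}")
```

Opening and closing the connection inside the worker function avoids sharing one connection object across threads, which some database drivers do not allow.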
Previously, the HPO Agent and Gene Agent tools accessed resources from the global scope. While that technique would also work for the Variant Agent, I decided to take an improved approach and utilize local context. By providing a resources object to the agent, it will be passed on to each tool by the OpenAI Agents SDK. This object provides access to the DuckDB variant database. I also included some helper functions in the resources object to build text outputs for a variant’s transcript information and ClinVar entries.
Due to its length, I’ve placed the Variant Agent implementation in this GitHub gist: variant_agent.py.
To run the agent, we can write a simple test script that imports and runs the agent with our resource object as context:
import argparse
import os
import asyncio

from agents import Runner, trace
from dotenv import load_dotenv

from variant_agent import VariantAgentResources, create_variant_agent

load_dotenv()


async def main():
    """Main function to run the multi-agent workflow."""
    parser = argparse.ArgumentParser(description="Test the Variant Agent.")
    parser.add_argument(
        "-p",
        "--phenotype",
        type=str,
        required=True,
        help="File path to a description of the patient's phenotype.",
    )
    args = parser.parse_args()

    if not os.path.exists(args.phenotype):
        print(f"Phenotype file not found: {args.phenotype}")
        return

    with open(args.phenotype, "r", encoding="utf-8") as file:
        phenotype_description = file.read().strip()

    variant_resources = VariantAgentResources(
        database_url="duckdb:///colombian_trio.exome.duckdb?access_mode=READ_ONLY",
        prompt_file="prompts/variant_agent.md",
    )
    variant_agent = create_variant_agent(variant_resources)

    with trace("variant-agent workflow"):
        ranked_genes = [
            {
                "gene": "MECP2",
                "rank": 1,
                "reasoning": "MECP2 is associated with three relevant HPO IDs",
            },
            {
                "gene": "CDKL5",
                "rank": 2,
                "reasoning": "CDKL5 is associated with three relevant HPO IDs",
            },
        ]

        genes_input = ""
        for gene in ranked_genes:
            genes_input += (
                f"Gene: {gene['gene']}, "
                f"Rank: {gene['rank']}, "
                f"Reasoning: {gene['reasoning']}\n"
            )
        print(genes_input)

        print("Starting variant search...")
        variant_result = await Runner.run(
            starting_agent=variant_agent,
            context=variant_resources,
            input=phenotype_description + "\n\nCandidate genes:\n" + genes_input,
            max_turns=20,
        )
        print("Variant search completed. Results:")
        print(variant_result.final_output)


if __name__ == "__main__":
    asyncio.run(main())
The test script hard-codes two candidate genes based on the initial example case in the first post of this series. The candidate genes would normally be provided by the Gene Agent.
Ideas for improvement
In my first post, I mentioned that a significant limitation of the Variant Agent is that it relies on only three criteria when attempting to locate causative variants: transcript consequences, ClinVar classifications, and gnomAD allele frequency. This strategy is effective only for well-characterized diseases caused by obvious deleterious variants. For more complex cases, the agent will have little chance of finding a diagnostic variant given the limited annotations provided to it.
Exposing additional variant annotation data to the agent can be done in three steps:
- Import the annotation data into the DuckDB database
- Modify the query_variants tool to allow for the search and retrieval of the new fields
- Update the Variant Agent instructions to explain how and when to utilize the new annotations
I’ve also been thinking about how to find challenging real-world cases to help guide improvements to the multi-agent system. One promising source could be the Critical Assessment of Genome Interpretation group, which organizes periodic challenges related to the interpretation of genetic variants. A previous challenge from 2021, the CAGI6 Rare Genomes Project, is one possible source of cases for testing the current system’s capabilities and developing improvement plans.
What’s next
With all three agents now described in this series of posts, I plan to release the full agent trio in a public GitHub repo so that others can explore the potential of this multi-agent system.