Implementing the Gene Agent
This post is a continuation of a series about how AI agents might aid in the diagnosis of rare diseases in genomics. See the previous two posts for additional context:
Introduction
Of the three agents in the AI system, the Gene Agent is the simplest. In fact, it doesn’t really need to be an agent at all in this original implementation, but I decided to keep it as one because I think there are a number of ways it could be improved by adopting more agent-like behavior. Like the HPO Agent, its functionality relies on just two components: its instructions and a provided tool.
Agent instructions
You are the Gene Agent in a pediatric rare-disease analysis workflow.
Objective
Given a list of Human Phenotype Ontology (HPO) term IDs, use the provided tool to retrieve genes associated with those terms and produce a ranked list of gene symbols. Provide concise reasoning for each gene that is grounded ONLY in the tool output (matched HPO IDs and counts).
Tool
- ranked_genes_for_hpo_terms(term_ids: list[str]) -> str
- Returns a newline-delimited string. Each line has:
"Gene: <symbol>, HPO IDs: <[HP:xxxx,...]>, Count: <n>"
The agent is given a straightforward task: provide the given HPO term IDs to the single available tool, and respond with a prioritized list of genes based on the tool’s output.
The ranked gene tool
I decided to implement the ranked gene tool in the simplest possible way for this proof-of-concept. The easiest way to map HPO terms to genes is via the downloadable phenotype-to-genes annotation file (phenotype_to_genes.txt) from the HPO website. This is a tab-delimited text file containing over one million HPO-to-gene mappings:
import pandas as pd
hpo_to_genes_df = pd.read_csv("phenotype_to_genes.txt", sep="\t")
print(hpo_to_genes_df.head().to_string(index=False))
print(f"{len(hpo_to_genes_df):,} entries")
hpo_id hpo_name ncbi_gene_id gene_symbol disease_id
HP:0025700 Anhydramnios 26281 FGF20 OMIM:615721
HP:0025700 Anhydramnios 2674 GFRA1 OMIM:619887
HP:0025700 Anhydramnios 79867 TCTN2 OMIM:613885
HP:0025700 Anhydramnios 6091 ROBO1 OMIM:620305
HP:0025700 Anhydramnios 8516 ITGA8 OMIM:191830
1,039,980 entries
With this resource, I wrote a basic tool to return a list of genes associated with a list of HPO terms. The returned list specifies which of the provided HPO terms are associated with each gene:
def ranked_genes_for_hpo_terms(term_ids: list[str]) -> str:
    """Get a ranked list of genes for a set of HPO terms.

    Args:
        term_ids (list[str]): List of HPO term IDs.

    Returns:
        str: Formatted string of gene symbols ranked by HPO term count
    """
    print(f"Finding genes for HPO terms: {', '.join(term_ids)}")
    term_ids_set = set(term_ids)
    # Keep only the annotation rows for the requested HPO terms
    mask = hpo_to_genes_df["hpo_id"].isin(term_ids_set)
    # For each gene, collect the distinct matched terms and count them
    matches = (
        hpo_to_genes_df.loc[mask, ["hpo_id", "gene_symbol"]]
        .groupby("gene_symbol", as_index=False, sort=False)
        .agg(
            hpo_ids=("hpo_id", lambda x: list(set(x))),
            count=("hpo_id", "nunique")
        )
        # Most matched terms first; ties broken alphabetically by gene symbol
        .sort_values(["count", "gene_symbol"], ascending=[False, True])
    )
    # Format the top 30 genes, one per line
    results = []
    for row in matches.head(30).itertuples(index=False):
        results.append(
            f"Gene: {row.gene_symbol}; "
            f"HPO IDs: {', '.join(row.hpo_ids)}; "
            f"Count: {row.count}"
        )
    return "\n".join(results)
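To make the ranking logic concrete, here is a minimal, dependency-free sketch of the same filter, group, count, and sort steps applied to a tiny made-up table. The rows, the `HP:TOY…` IDs, and the helper name `rank_genes` are illustrative assumptions, not real HPO annotations; the actual tool performs the equivalent pandas operations over the full annotation file.

```python
from collections import defaultdict

# Hypothetical (hpo_id, gene_symbol) pairs, invented for illustration only.
TOY_ROWS = [
    ("HP:TOY1", "GENE_A"),
    ("HP:TOY2", "GENE_A"),
    ("HP:TOY2", "GENE_B"),
    ("HP:TOY3", "GENE_C"),
    ("HP:TOY1", "GENE_A"),  # duplicate pair, e.g. listed under a second disease
]


def rank_genes(rows, term_ids):
    """Return (gene, sorted matched terms, count) tuples, best match first."""
    wanted = set(term_ids)
    matched = defaultdict(set)
    for hpo_id, gene in rows:
        if hpo_id in wanted:
            matched[gene].add(hpo_id)  # a set ignores duplicate pairs
    # Sort by descending number of matched terms, then by gene symbol
    ranked = sorted(matched.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(gene, sorted(terms), len(terms)) for gene, terms in ranked]


for gene, terms, count in rank_genes(TOY_ROWS, ["HP:TOY1", "HP:TOY2"]):
    print(f"Gene: {gene}; HPO IDs: {', '.join(terms)}; Count: {count}")
```

Note that the sketch sorts the matched IDs so the output is stable between runs, which is a small design choice of this example rather than something the original tool does.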
Defining the agent’s output
Just as with the HPO Agent, the output of the Gene Agent will be sent to the next agent in the system. For that reason, we will again use structured output to give us better control over the agent’s output format. The Gene Agent’s instructions request concise reasoning for the selection and ranking of each gene, so I decided on the following data model:
from pydantic import BaseModel

class RankedGene(BaseModel):
    """Model for ranked gene."""
    gene: str
    """ Gene symbol """
    rank: int
    """ Rank based on HPO term count """
    reasoning: str
    """ Reasoning for the ranking"""
    hpo_ids: list[str]
    """ List of HPO term IDs """

class RankedGeneList(BaseModel):
    """Model for a list of ranked genes."""
    genes: list[RankedGene]
    """ List of ranked genes """
We will have the agent return a list of RankedGene objects via RankedGeneList. Each ranked gene will include the gene’s symbol, rank, the model’s reasoning for selecting the gene, and a list of the provided HPO IDs that were associated with the gene.
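Because the model fills in these fields itself, it may be worth sanity-checking the result before handing it to the next agent. The sketch below is one possible check and is not part of the original implementation: it works on plain dictionaries shaped like `RankedGene` (for instance, what pydantic's `model_dump()` would produce), and the function name `check_ranked_genes` and the sample data are assumptions made for illustration.

```python
def check_ranked_genes(genes: list[dict], input_term_ids: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    allowed = set(input_term_ids)
    for position, gene in enumerate(genes, start=1):
        # Ranks are expected to run 1, 2, 3, ... in list order
        if gene["rank"] != position:
            problems.append(
                f"{gene['gene']}: rank {gene['rank']} at position {position}"
            )
        # Every reported HPO ID should be one of the IDs given to the agent
        unknown = sorted(set(gene["hpo_ids"]) - allowed)
        if unknown:
            problems.append(
                f"{gene['gene']}: HPO IDs not in input: {', '.join(unknown)}"
            )
    return problems


# Hypothetical agent output with two deliberate mistakes in the second entry
sample = [
    {"gene": "GENE_A", "rank": 1, "reasoning": "...", "hpo_ids": ["HP:TOY1", "HP:TOY2"]},
    {"gene": "GENE_B", "rank": 3, "reasoning": "...", "hpo_ids": ["HP:TOY9"]},
]
for problem in check_ranked_genes(sample, ["HP:TOY1", "HP:TOY2"]):
    print(problem)
```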
Defining the agent
Now that we have the components we need, we can define the agent itself:
from agents import Agent, ModelSettings, function_tool

@function_tool
async def ranked_genes_for_hpo_terms(term_ids: list[str]) -> str:
    ...  # body as defined above

gene_agent_prompt_file = "prompts/gene_agent.md"
with open(gene_agent_prompt_file, mode="r", encoding="utf-8") as f:
    gene_agent_prompt = f.read()

gene_agent = Agent(
    name="Gene Assistant",
    instructions=gene_agent_prompt,
    tools=[ranked_genes_for_hpo_terms],
    model="gpt-4o",
    output_type=RankedGeneList,
    model_settings=ModelSettings(tool_choice="required"),
)
Running the agent
We can test the agent with a minimal script. We’ll use the HPO terms from one of the example cases used in the previous posts:
import asyncio
from agents import Runner, trace

# ensure we have the previous resources available:
# - gene_agent
# - RankedGene and RankedGeneList
# - ranked_genes_for_hpo_terms
# - hpo_to_genes_df

async def main():
    hpo_terms = {
        "HP:0032588": "Hand apraxia",
        "HP:0007064": "Progressive language deterioration",
        "HP:0012171": "Stereotypical hand wringing",
        "HP:6000950": "Breath-holding spell",
        "HP:0002883": "Hyperventilation"
    }
    terms_input = ""
    for term_id, term_name in hpo_terms.items():
        terms_input += f"{term_id}: {term_name}\n"
    print(terms_input)
    print("Starting gene search...")
    gene_result = await Runner.run(starting_agent=gene_agent, input=terms_input)
    ranked_genes = gene_result.final_output_as(RankedGeneList)
    print("Gene search completed. Ranked genes:")
    genes_output = ""
    for gene in ranked_genes.genes:
        genes_output += (
            f"Gene: {gene.gene}; "
            f"Rank: {gene.rank}; "
            f"Reasoning: {gene.reasoning}\n"
        )
    print(genes_output)

if __name__ == "__main__":
    asyncio.run(main())
Running the above produces the following output from the agent (truncated to just the first 8 genes):
Starting gene search...
Finding genes for HPO terms: HP:0032588, HP:0007064, HP:0012171, HP:6000950, HP:0002883
Gene search completed. Ranked genes:
Gene: MECP2, Rank: 1, Reasoning: MECP2 is associated with four out of the five HPO terms: progressive language deterioration, hand apraxia, stereotypical hand wringing, and hyperventilation. This extensive overlap makes MECP2 highly relevant.
Gene: GABBR2, Rank: 2, Reasoning: GABBR2 is connected to three HPO terms: hand apraxia, stereotypical hand wringing, and hyperventilation, indicating its significant association with the phenotypes.
Gene: CNTNAP2, Rank: 3, Reasoning: CNTNAP2 is linked to progressive language deterioration, stereotypical hand wringing, and hyperventilation, covering three key phenotypes.
Gene: CDKL5, Rank: 4, Reasoning: CDKL5 shares associations with hand apraxia, stereotypical hand wringing, and hyperventilation. Three matched terms indicate its relevance.
Gene: CASK, Rank: 5, Reasoning: CASK is associated with stereotypical hand wringing and hyperventilation, making it pertinent to a subset of the given phenotypes.
Gene: SMC1A, Rank: 6, Reasoning: SMC1A is linked to hand apraxia and stereotypical hand wringing, indicating a focused relevance to the phenotypes.
Gene: NTNG1, Rank: 7, Reasoning: NTNG1 is connected to hand apraxia and stereotypical hand wringing, two important terms that highlight its significance.
Gene: AP1S2, Rank: 8, Reasoning: AP1S2 is associated with stereotypical hand wringing only, indicating limited relevance.
Ideas for improvement
One significant limitation of this agent is that it does not take into account the hierarchical structure of HPO terms. Each term has one or more parent terms, all the way up to the root entry. For example, the code below takes an HPO term and walks up the tree (following the first listed parent at each level), counting the number of genes associated with each successive term:
# `hpo` is assumed to be the HPO ontology loaded beforehand, for example
# with the pronto library: hpo = pronto.Ontology("hp.obo")
def genes_for_term(term_id: str) -> list[str]:
    mask = hpo_to_genes_df["hpo_id"] == term_id
    return list(set(hpo_to_genes_df[mask]["gene_symbol"].tolist()))

t = hpo['HP:0012171']
t_genes = genes_for_term(t.id)
print(f"{t.id}: {t.name} (Genes: {len(t_genes):,})")
depth = " "
# Walk up the hierarchy until a term has no parents (the root)
while parents := list(t.superclasses(distance=1, with_self=False)):
    parent = parents[0]
    gene_count = len(genes_for_term(parent.id))
    print(depth + f"{parent.id}: {parent.name} (Genes: {gene_count:,})")
    depth += " "
    t = parent
Output:
HP:0012171: Stereotypical hand wringing (Genes: 30)
HP:5200009: Midline hand movements (Genes: 30)
HP:5200018: Stereotypic upper-extremity movements (Genes: 57)
HP:0000733: Motor stereotypy (Genes: 282)
HP:0031432: Restricted or repetitive behaviors or interests (Genes: 288)
HP:5200241: Recurrent maladaptive behavior (Genes: 1,089)
HP:0000708: Atypical behavior (Genes: 1,642)
HP:0011446: Abnormality of mental function (Genes: 3,223)
HP:0012638: Abnormal nervous system physiology (Genes: 3,822)
HP:0000707: Abnormality of the nervous system (Genes: 4,066)
HP:0000118: Phenotypic abnormality (Genes: 5,124)
HP:0000001: All (Genes: 5,126)
You can see that the number of genes associated with each term grows as the terms become more general. One consequence of this is that our gene selection process is highly dependent on the specificity of the individual HPO terms selected by the previous HPO Agent. If the agent selects terms that are too general, the list of associated genes will be excessive. Conversely, if the agent selects terms that are too specific given the description of the patient’s symptoms, relevant genes might be excluded by the Gene Agent.
There isn’t a perfect solution to this problem. We of course want the HPO Agent to be as specific as possible when selecting HPO terms, but we also need to acknowledge that it won’t always select the optimal terms. And don’t forget that the original free-text description of the patient’s symptoms might not be detailed enough to select the optimal terms.
One route we could take to address this is to have our Gene Agent return two tiers of candidate genes: one tier based only on the provided HPO terms, and another that takes into account additional HPO terms that are parents of the provided terms. This could be rule-based, for example by requiring a minimum number of genes to be associated with each term. For terms that don’t have the minimum number of associated genes, we could walk up the hierarchy, including parent terms until the minimum is met.
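As a rough illustration of that rule, the sketch below walks up from a term until a minimum gene count is reached and then splits the genes into two tiers. Everything here is an assumption made for the example: the toy term IDs, the `TOY_PARENT` and `TOY_GENES` lookups, the function names, and the simplification that each term has a single parent. A real version would read parents from the ontology and genes from the annotation file.

```python
# Hypothetical toy data: child -> parent links and term -> associated genes.
TOY_PARENT = {"HP:LEAF": "HP:MID", "HP:MID": "HP:TOP"}
TOY_GENES = {
    "HP:LEAF": {"GENE_A"},
    "HP:MID": {"GENE_A", "GENE_B"},
    "HP:TOP": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
}


def expand_term(term_id: str, min_genes: int) -> list[str]:
    """Return the term plus as many ancestors as needed to reach min_genes genes."""
    chain = [term_id]
    genes = set(TOY_GENES.get(term_id, set()))
    current = term_id
    # Stop once enough genes are collected or the root is reached
    while len(genes) < min_genes and current in TOY_PARENT:
        current = TOY_PARENT[current]
        chain.append(current)
        genes |= TOY_GENES.get(current, set())
    return chain


def two_tier_genes(term_ids: list[str], min_genes: int) -> tuple[set[str], set[str]]:
    """Tier 1: genes for the given terms. Tier 2: extra genes reached via ancestors."""
    tier1: set[str] = set()
    expanded: set[str] = set()
    for term_id in term_ids:
        tier1 |= TOY_GENES.get(term_id, set())
        for ancestor in expand_term(term_id, min_genes)[1:]:
            expanded |= TOY_GENES.get(ancestor, set())
    # Tier 2 only holds genes that are not already in tier 1
    return tier1, expanded - tier1


tier1, tier2 = two_tier_genes(["HP:LEAF"], min_genes=2)
print(sorted(tier1), sorted(tier2))
```

The `min_genes` threshold is the main tuning knob in this sketch: a higher value pulls in more general ancestors and therefore a larger, less specific second tier.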
With the two-tier approach, the next agent in the system, the Variant Agent, could be instructed to first attempt to locate candidate diagnostic variants in the first tier of genes. If it cannot find any candidates, it could then fall back to the second tier of genes.
We could also consider a disease-level approach to candidate genes. Imagine an agent that analyzes the patient’s symptoms and comes up with a list of candidate diseases. We could then leverage HPO’s genes-to-disease mapping to generate a list of candidate genes.
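The annotation file already carries a `disease_id` column, so the lookup half of this idea could be quite small. The sketch below is only an illustration under assumptions: the rows are the five (gene, disease) pairs from the file excerpt shown earlier, the function name `genes_for_diseases` is hypothetical, and the disease IDs passed in stand in for whatever a disease-suggesting agent might return.

```python
from collections import defaultdict

# (gene_symbol, disease_id) pairs copied from the annotation excerpt shown earlier.
ROWS = [
    ("FGF20", "OMIM:615721"),
    ("GFRA1", "OMIM:619887"),
    ("TCTN2", "OMIM:613885"),
    ("ROBO1", "OMIM:620305"),
    ("ITGA8", "OMIM:191830"),
]


def genes_for_diseases(rows, disease_ids):
    """Map each requested disease ID to the sorted gene symbols annotated to it."""
    wanted = set(disease_ids)
    by_disease = defaultdict(set)
    for gene, disease in rows:
        if disease in wanted:
            by_disease[disease].add(gene)
    return {disease: sorted(genes) for disease, genes in by_disease.items()}


# Placeholder input: in practice these IDs would come from the candidate-disease step
print(genes_for_diseases(ROWS, ["OMIM:615721", "OMIM:191830"]))
```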
Next up will be the final agent in the system: the Variant Agent.