Implementing the HPO Agent

Dives into the HPO Agent that converts free-text symptom descriptions into precise HPO terms for the genomics workflow. Breaks down the instructions that force tool use, forbid guessing, and require reasoning tied to search results. Builds a vector database of HPO terms and an `hpo_search` tool so the agent can reliably match phenotypes. Shares example outputs, pitfalls, and ideas for improving recall without sacrificing specificity.

Posted Aug 7, 2025 Updated Dec 12, 2025

By Jeff Johnston

Introduction

In my previous post, I gave an overview of an AI system that could assist in finding diagnostic variants for pediatric rare disease patients. As promised, I’m going to discuss each of the three AI agents in that system in more detail. This post will be about the HPO Agent, whose task is to take a free-text description of a patient’s symptoms and convert it to a list of Human Phenotype Ontology terms. These terms are then used by the next agent in the system, the Gene Agent, to compile a prioritized list of genes that are potentially involved in the patient’s symptoms.

The HPO Agent’s two most important components are the agent instructions (the prompt) and the tools provided to the agent, each discussed below.

Agent instructions

Below are the instructions provided to the HPO Agent:

SYSTEM

You are an experienced clinical geneticist and ontology curator. Your task is to translate a free-text phenotype description into precise Human Phenotype Ontology (HPO) terms.

TOOL

hpo_search(phenotype_text: str, top_k: int = 10)

INSTRUCTIONS

1. Read the entire description once. Extract every distinct phenotype concept (one physical finding, symptom, or laboratory abnormality per concept).

2. For each concept:
a. Call hpo_search with a concise query (core clinical keywords only). You MUST call the tool and only report terms returned by the tool’s output.
b. From the returned list, pick one term that is (i) clinically specific, (ii) best matches the phenotype concept, and (iii) not a duplicate of a term you have already selected.
c. If no result has a similarity distance ≥ 0.25, omit the concept; do not guess.

3. Return the list of HPO terms along with your reasoning for selecting each one.

The agent is tasked with taking the description of the patient’s symptoms, breaking it up into individual phenotypes, and searching for each phenotype using the hpo_search tool to find an appropriate HPO term. We explicitly tell the agent not to guess and to return only HPO terms that appeared in the search tool’s output.

Implementing the hpo_search tool is key for this agent’s performance, as the symptom descriptions won’t always exactly match HPO terms. We will use a vector database to store the terms and allow for semantic searches.

The HPO search tool

To build the HPO search tool, we will first do the following:

Obtain a list of all current HPO terms
Create a vector representation of each term
Insert the vectors into a vector database

Once we have our vector database of HPO terms, we can then build the search tool that the agent will utilize.

Obtaining HPO terms

The first step is to obtain a list of current HPO terms. We can download the latest version from the HPO website in Open Biomedical Ontologies (OBO) format and parse it using the Python pronto package. HPO terms are organized in a hierarchy, and the terms we want are all descendants of the Phenotypic abnormality entry (HP:0000118):

        
      
import random

import pronto

# load the ontology from a local copy of hp.obo
hpo = pronto.Ontology("hp.obo", encoding="utf-8")

# this is Phenotypic Abnormality
pheno_abnormal = hpo["HP:0000118"]

# build one searchable string per term: ID | name | definition | synonyms
term_texts = []
for term in pheno_abnormal.subclasses():
    if term.obsolete:
        continue
    parts = [
        term.id,
        term.name or "",
        term.definition or "",
    ]
    parts += [syn.description for syn in term.synonyms]
    term_texts.append(" | ".join(parts))

print(f"Found {len(term_texts):,} terms")
print("Sample terms:")
# show 6 random terms without reordering term_texts
for item in random.sample(term_texts, 6):
    print(item)

For each term, I’ve collected the term ID, name, definition, and any synonyms. The output will look something like this:

Found 18,507 terms
Sample terms:
HP:0032193 | Decreased low-density lipoprotein particle size | An abnormal decrease in the average size of low-density lipoprotein particle size in the blood circulation.
HP:0025532 | Positive pathergy test | With the pathergy test, a small, sterile needle is inserted into the skin of the forearm. The site of injectionis circuled and observed after one and two days. If a small red bump or pustule at the site of needle insertion occurs, the pathergy test is considered to have a positive (abnormal) result.
HP:0025570 | Choroidal vascular hyperpermeability | Increased tendency of choiroidal blood vessels to allow fluids to leak characterized by multifocal choroidal hyperfluorescence on indocyanine green angiography (ICGA). | Choroidal hyperpermeability
HP:0032958 | Urinary oval fat bodies | The presence in the urine of desquamated tubular epithelial cells or macrophages filled with lipid droplets.
HP:0009633 | Osteolytic defect of the proximal phalanx of the thumb | Dissolution or degeneration of bone tissue of the proximal phalanx of the thumb. | Osteolytic defects of the proximal phalanx of the thumb
HP:6001154 | Pain at the medial aspect of the knee | An unpleasant sensation characterized by physical discomfort (such as pricking, throbbing, or aching) localized to the inner side of knee, that is, the region of the knee that faces the other knee.

These are the entries we want the HPO Agent to be able to search through, so the next step is to put them into a queryable database.
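
Each entry is a single string joined with " | ", which is convenient for embedding but less so when you need the individual fields again. A minimal sketch of splitting an entry back apart, assuming no field itself contains the separator (the `HPORecord` class and `parse_term_text` function are hypothetical helpers, not part of the original code):

```python
from dataclasses import dataclass, field


@dataclass
class HPORecord:
    """Structured view of one ' | '-joined term string (hypothetical helper)."""

    id: str
    name: str
    definition: str
    synonyms: list[str] = field(default_factory=list)


def parse_term_text(term_text: str) -> HPORecord:
    """Split a joined term string into ID, name, definition, and synonyms.

    Assumes the first three fields are always present (possibly empty)
    and that no field contains the ' | ' separator.
    """
    term_id, name, definition, *synonyms = term_text.split(" | ")
    return HPORecord(term_id, name, definition, synonyms)


# One of the sample entries printed above
record = parse_term_text(
    "HP:0009633 | Osteolytic defect of the proximal phalanx of the thumb | "
    "Dissolution or degeneration of bone tissue of the proximal phalanx of the thumb. | "
    "Osteolytic defects of the proximal phalanx of the thumb"
)
print(record.id)
print(record.name)
print(record.synonyms)
```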

Indexing HPO terms

To enable effective searching, we will use the Python SentenceTransformers package to generate a vector representation of each term. Instead of using a generic pre-trained model to generate these embeddings, we can use a model specifically trained on biomedical text to improve search performance with our clinical symptom descriptions. I selected SapBERT-from-PubMedBERT-fulltext:

        
      
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
embeddings = model.encode(
    sentences=term_texts,
    normalize_embeddings=True,
    show_progress_bar=True
)

Now that we have the embeddings, we put them into an hnswlib index, which supports fast approximate nearest-neighbor queries over cosine distance:

        
      
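import hnswlib

# cosine-distance index sized to the embedding dimension
index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)
# without explicit ids, each vector is labeled by its position,
# which matches its position in term_texts
index.add_items(embeddings)
# ef trades query speed for accuracy and cannot be lower than k
index.set_ef(200)
index.save_index("hpo_search_index.bin")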
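
The saved index file holds vectors and integer labels, not the term strings themselves. To reuse it in a later session, the same `term_texts` list is needed in the same order. One possible approach, sketched here under the assumption that a JSON sidecar file is acceptable (the file name and both helper names are my own choices):

```python
import json
import tempfile
from pathlib import Path


def save_term_texts(term_texts: list[str], path: Path) -> None:
    """Write the term strings to JSON, preserving their order."""
    path.write_text(json.dumps(term_texts, ensure_ascii=False), encoding="utf-8")


def load_term_texts(path: Path) -> list[str]:
    """Read the term strings back in the order they were saved."""
    return json.loads(path.read_text(encoding="utf-8"))


# Round-trip two shortened example strings through a temporary file
example_terms = ["HP:0001252 | Hypotonia", "HP:0002883 | Hyperventilation"]
with tempfile.TemporaryDirectory() as tmp_dir:
    sidecar = Path(tmp_dir) / "hpo_term_texts.json"
    save_term_texts(example_terms, sidecar)
    restored = load_term_texts(sidecar)

print(restored == example_terms)
```

In a real run the sidecar would be written next to hpo_search_index.bin right after the index is saved.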

Searching HPO terms

With our model and index of HPO terms, querying becomes a straightforward process:

        
      
query_text = "poor muscle tone"
query_embedding = model.encode(sentences=[query_text], normalize_embeddings=True)
labels, distances = index.knn_query(query_embedding, k=5)
results = []
for idx, dist in zip(labels[0], distances[0]):
    # the cosine space returns 1 - cosine similarity, so invert it
    sim = 1.0 - dist
    # format with exactly 3 decimal places
    sim_text = f"{sim:.3f}"
    results.append(
        f"{term_texts[idx]} (similarity distance: {sim_text})"
    )
print("\n".join(results))

The above outputs:

HP:0001252 | Hypotonia | Hypotonia is an abnormally low muscle tone (the amount of tension or resistance to movement in a muscle). Even when relaxed, muscles have a continuous and passive partial contraction which provides some resistance to passive stretching. Hypotonia thus manifests as diminished resistance to passive stretching. Hypotonia is not the same as muscle weakness, although the two conditions can co-exist. | Low muscle tone | Muscle hypotonia | Low or weak muscle tone | Muscular hypotonia | Central hypotonia | Peripheral hypotonia (similarity distance: 0.708)
HP:0003725 | Firm muscles | A state of increased firmness (i.e., a reduced tendency to yield to pressure) of skeletal muscle tissue. This finding is ascertained by palpation and is to be distinguished from increased muscle tone (abnormal elevation in the resting tension of a muscle, leading to stiffness and resistance to movement). (similarity distance: 0.671)
HP:0009004 | Hypoplasia of the musculature | Underdevelopment of the musculature. | Underdeveloped muscles | Underdeveloped muscle | Poorly developed skeletal musculature | Muscle hypoplasia (similarity distance: 0.648)
HP:0000297 | Facial hypotonia | Reduced muscle tone of a muscle that is innervated by the facial nerve (the seventh cranial nerve). | Hypotonic facies | Atony of facial musculature | Low facial muscle tone | Reduced facial muscle tone | Decreased facial muscle tone (similarity distance: 0.636)
HP:0002375 | Hypokinesia | Abnormally diminished motor activity. In contrast to paralysis, hypokinesia is not characterized by a lack of motor strength, but rather by a poverty of movement. The typical habitual movements (e.g., folding the arms, crossing the legs) are reduced in frequency. | Decreased muscle movement | Decreased spontaneous movements | Decreased spontaneous movement (similarity distance: 0.634)
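
The number in parentheses is 1 minus the cosine distance reported by the index, which is the cosine similarity between the query and the term. For readers who want to see the arithmetic, here is a small pure-Python illustration using made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors, invented for illustration only
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query, same_direction))  # identical direction
print(cosine_distance(query, same_direction))
print(cosine_similarity(query, orthogonal))  # unrelated direction
print(cosine_distance(query, orthogonal))
```

A higher score therefore means a closer match, with 1.0 for vectors pointing in the same direction.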

Putting it all together

Now that we have the components we need, the last step is to define an agent using the OpenAI Agents SDK. Since our system will use a series of agents, I opted to use structured output so that the HPO Agent’s response will be in a consistent format. This will allow us to pass the next agent in the system an appropriately formatted list of HPO terms.

Let’s first define our output format:

        
      
from pydantic import BaseModel

class HPOTerm(BaseModel):
    """Model for HPO term."""

    id: str
    """HPO term ID."""

    text: str
    """HPO term text."""

    reasoning: str
    """Reasoning behind the term's relevance."""


class HPOTermList(BaseModel):
    """Model for a list of HPO terms."""

    terms: list[HPOTerm]
    """List of HPO terms."""

Since our agent instructions specify that the agent should include its reasoning for selecting each term, I’ve added a reasoning attribute to HPOTerm.

Next, let’s define our hpo_search tool. We do that with the SDK’s @function_tool decorator. For simplicity, we’ll assume that the variables model, index, and term_texts, which were defined earlier, are available in the global scope so they can be accessed directly within the search function.

        
      
from agents import function_tool

@function_tool
async def hpo_search(
    phenotype_text: str,
    top_k: int = 10,
) -> str:
    """Search for HPO terms similar to a phenotype description.
    Args:
        phenotype_text (str): The text description of the phenotype.
        top_k (int): Number of top similar terms to return.
    Returns:
        str: Formatted string containing HPO term id, text, and similarity score.
    """
    # Log the query so each tool call is visible, then encode it
    print(f"Searching for: {phenotype_text}")
    query_embedding = model.encode(
        sentences=[phenotype_text], normalize_embeddings=True
    )

    # Search the index
    labels, distances = index.knn_query(query_embedding, k=top_k)

    # Prepare results
    results = []
    for idx, dist in zip(labels[0], distances[0]):
        # the cosine space returns 1 - cosine similarity, so invert it
        sim = 1.0 - dist
        # format with exactly 3 decimal places
        sim_text = f"{sim:.3f}"
        results.append(
            f"{term_texts[idx]} (similarity distance: {sim_text})"
        )

    return "\n".join(results)
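
The tool returns every hit regardless of score and leaves the 0.25 cutoff from the instructions to the model. A hedged alternative is to enforce the cutoff in code before the results are formatted. The sketch below assumes the hits are available as (term_text, similarity) pairs; `filter_by_similarity` and its parameters are illustrative names, not part of the original tool:

```python
def filter_by_similarity(
    results: list[tuple[str, float]],
    min_similarity: float = 0.25,
) -> list[tuple[str, float]]:
    """Keep only hits whose similarity score meets the cutoff.

    `results` is assumed to hold (term_text, similarity) pairs, where
    similarity is 1 - cosine distance, as computed in the search tool.
    """
    return [(text, sim) for text, sim in results if sim >= min_similarity]


hits = [
    ("HP:0001252 | Hypotonia", 0.708),
    # made-up low-scoring entry, included only to show the filter working
    ("HP:0000000 | Hypothetical weak match", 0.120),
]
kept = filter_by_similarity(hits)
print(kept)
```

Whether to filter in the tool or in the prompt is a design choice: filtering in code is deterministic, while leaving it to the model lets it see near-misses and explain why a concept was omitted.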

Now we are ready to define the agent itself. Rather than include the prompt directly in the code below, I am instead reading it from a markdown file:

        
      
from agents import Agent, ModelSettings

hpo_agent_prompt_file = "prompts/hpo_agent.md"
with open(hpo_agent_prompt_file, mode="r", encoding="utf-8") as f:
    hpo_agent_prompt = f.read()

hpo_agent = Agent(
    name="HPO Assistant",
    instructions=hpo_agent_prompt,
    tools=[hpo_search],
    model="gpt-4.1",
    output_type=HPOTermList,
    model_settings=ModelSettings(tool_choice="required")
)

Specifying a pydantic model as the output_type enables the structured output feature. Setting tool_choice="required" in model_settings tells the model that it must call a tool rather than answer directly, as we do not want it to guess appropriate HPO terms without searching.

Let’s try it out! Here’s how to run it against the first test case in my previous post:

        
      
import asyncio
from agents import Runner

# be sure to have our previous resources defined as well (not shown):
# - model
# - index
# - term_texts
# - hpo_search function 
# - pydantic models (HPOTerm, HPOTermList)
# - hpo_agent

patient_symptoms = """
The patient is an 18-month-old female. The patient was born at term after an 
uncomplicated pregnancy and delivery. Early development appeared normal: she 
held her head steadily by 3 months, rolled over by 5 months, and sat unsupported 
at 7 months. She began babbling at 6 months and used single consonant–vowel 
syllables ("ma," "ba") by 9 months. Starting around 12 months of age, her parents 
noted a gradual loss of previously acquired purposeful hand skills and spoken 
sounds. Over the next 6 months she stopped babbling altogether and developed 
repetitive hand-wringing movements, breath-holding spells, and episodes of 
hyperventilation.
"""

async def main():
    print("Starting HPO search...")
    hpo_result = await Runner.run(
        starting_agent=hpo_agent, input=patient_symptoms, max_turns=20
    )
    hpo_terms = hpo_result.final_output_as(HPOTermList)
    print("HPO search completed. Found terms:")
    for term in hpo_terms.terms:
        print(f"{term.id}: {term.text} (Reasoning: {term.reasoning})")

if __name__ == "__main__":
    asyncio.run(main())

Running the agent produces the following output:

Starting HPO search...
Searching for: Loss of purposeful hand skills
Searching for: Loss of spoken language
Searching for: Repetitive hand-wringing movements
Searching for: Breath-holding spells
Searching for: Episodes of hyperventilation
HPO search completed. Found terms:
HP:0032588: Hand apraxia (Reasoning: The patient exhibited a loss of purposeful hand skills, and 'Hand apraxia' (HP:0032588) directly describes inability to perform purposeful hand movements. This is the most precise, clinically relevant term in the returned results.)
HP:0002371: Loss of speech (Reasoning: The patient had a gradual loss of previously acquired spoken sounds and stopped babbling, matching 'Loss of speech' (HP:0002371) as the most specific term representing this regression in language.)
HP:0012171: Stereotypical hand wringing (Reasoning: The development of repetitive, hand-wringing movements is matched most specifically and accurately by 'Stereotypical hand wringing' (HP:0012171), which describes the habitual wringing of hands observed here.)
HP:6000950: Breath-holding spell (Reasoning: Breath-holding spells are described in the clinical course, and 'Breath-holding spell' (HP:6000950) was explicitly returned as a direct match for this non-epileptic paroxysmal event.)
HP:0002883: Hyperventilation (Reasoning: Episodes of hyperventilation were noted, and the term 'Hyperventilation' (HP:0002883) directly matches the abnormal episodic increased respiratory rate described in the vignette.)
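
The instructions forbid guessing, but nothing in the code above enforces it. A cheap post-hoc safeguard is to confirm that every returned ID is well formed and exists in the indexed terms. This is only a sketch: `find_unknown_ids` is a hypothetical helper, and it assumes each `term_texts` entry starts with the term ID followed by " | ":

```python
import re

# HPO identifiers are "HP:" followed by seven digits
HPO_ID_PATTERN = re.compile(r"^HP:\d{7}$")


def find_unknown_ids(returned_ids: list[str], term_texts: list[str]) -> list[str]:
    """Return the IDs that are malformed or absent from the indexed terms."""
    known_ids = {text.split(" | ", 1)[0] for text in term_texts}
    return [
        term_id
        for term_id in returned_ids
        if not HPO_ID_PATTERN.match(term_id) or term_id not in known_ids
    ]


# Shortened stand-ins for the real term_texts list
indexed_terms = [
    "HP:0001252 | Hypotonia",
    "HP:0002883 | Hyperventilation",
]
# "HP:9999999" and "hypotonia" are deliberately invalid examples
returned = ["HP:0001252", "HP:9999999", "hypotonia"]
unknown = find_unknown_ids(returned, indexed_terms)
print(unknown)
```

An empty result means every term the agent reported can be traced back to the index.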

Future improvements

I developed this HPO Agent as a minimal proof-of-concept to explore the feasibility of having an AI model determine appropriate HPO terms from a free-text description. While it did an acceptable job with the straightforward cases I provided, there are a number of avenues to explore for improving its performance.

First, we would need a way to systematically evaluate its performance across a range of scenarios. For that, we could gather clinical symptom descriptions from the medical literature that include expert-curated HPO terms. With the ability to benchmark the agent’s performance against a set of real-world cases, we could start making changes to the agent and observe whether its HPO selection improves.
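
As a starting point, a benchmark could score each case by comparing the agent's term IDs with the expert-curated set. The sketch below uses exact-match precision, recall, and F1. That is a strict choice on my part, since it gives no credit for selecting a parent or child of the expected term. The function name and the example sets are illustrative and do not come from a real curated case:

```python
def score_terms(predicted: set[str], expected: set[str]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 between two sets of HPO IDs."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: three predicted terms, four "expected" terms
predicted_ids = {"HP:0002371", "HP:0012171", "HP:0002883"}
expected_ids = {"HP:0002371", "HP:0012171", "HP:6000950", "HP:0032588"}

scores = score_terms(predicted_ids, expected_ids)
for metric, value in scores.items():
    print(f"{metric}: {value:.3f}")
```

Averaging these scores over many cases would give a single number to track as the agent changes.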

I would start by exploring changes to the following three components of the agent:

The OpenAI model selection: With an appropriate benchmark, it would be straightforward to see if switching the agent’s model from gpt-4.1 to another OpenAI model improves its ability to select HPO terms. OpenAI’s reasoning models, like o3 and o4-mini, might do a better job of breaking down the symptom description and searching for relevant phenotypes.
The embedding model used for the HPO term semantic search: I used the SapBERT-from-PubMedBERT-fulltext embedding model, as it was trained on medical text, but I don’t know for sure whether it performs better than a state-of-the-art general text model. By embedding the HPO terms using different models, we might find one that performs better in our benchmark cases. There is also the possibility of fine-tuning an existing model by compiling numerous examples of queries and their respective HPO term matches.
The agent instructions: Modifying the instructions given to the agent is another possible way to improve performance. The agent might benefit from some specific examples for converting common phrases seen in symptom descriptions to effective query terms.
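
The embedding model can also be evaluated separately from the agent by asking how often the expected term appears among the top k search results. A minimal sketch of that measurement, assuming a search function that takes a query and k and returns ranked HPO IDs (the `hit_rate_at_k` helper, the toy search function, and its rankings are all made up for illustration):

```python
from typing import Callable


def hit_rate_at_k(
    cases: list[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 10,
) -> float:
    """Fraction of (query, expected_id) cases whose ID is in the top k hits."""
    if not cases:
        return 0.0
    hits = sum(1 for query, expected_id in cases if expected_id in search(query, k))
    return hits / len(cases)


# Toy stand-in for a real embedding search; the rankings are invented
toy_rankings = {
    "poor muscle tone": ["HP:0001252", "HP:0003725"],
    "loss of speech": ["HP:0002375", "HP:0002371"],
}


def toy_search(query: str, k: int) -> list[str]:
    """Return the first k IDs from the canned ranking for this query."""
    return toy_rankings.get(query, [])[:k]


cases = [
    ("poor muscle tone", "HP:0001252"),
    ("loss of speech", "HP:0002371"),
]
print(hit_rate_at_k(cases, toy_search, k=1))
print(hit_rate_at_k(cases, toy_search, k=2))
```

Running the same cases against indexes built from different embedding models would show which one retrieves the expected terms most often.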

The key to improving the agent’s performance is having a realistic benchmark, so it would be worthwhile dedicating some time to collecting various clinical symptom descriptions with corresponding HPO terms selected by human experts.

My next post will continue with the Gene Agent, which will receive as input the HPO terms selected by the HPO Agent and use them to generate a prioritized list of genes.

This post is licensed under CC BY 4.0 by the author.