
Introduction

I’m frequently asked if I have any guidance for getting started with AI, so I decided to write a brief post describing a few concepts I think are important to learn. I’ve written about most of these in various posts over the last few months, but putting them all together in a single place will hopefully be more useful to those wanting to learn how to build things with AI.

If you want to build tools and capabilities that go beyond what you can do by opening ChatGPT in your browser, I think familiarity with these three concepts is critical:

  • The OpenAI API
  • Structured output
  • Custom tool use

Knowledge of the above opens up enormous possibilities for building unique tools. Before diving in, I highly recommend using the Python programming language. OpenAI has official SDKs in other languages, but unless you have a strong preference for one of them, I would stick with Python for several reasons:

  • An official Python SDK from OpenAI
  • Tons of resources available for learning the language
  • A rich ecosystem of packages covering a wide variety of functionality
  • Jupyter notebooks make experimentation easy

For these reasons, I’ve chosen to use Python when building the vast majority of my AI-enabled tools and when writing this blog. The rest of this post will go over each concept with Python examples using the official OpenAI Python SDK.

The OpenAI API

The OpenAI API is what allows you to interact with OpenAI’s LLMs programmatically. Nearly everything you can do in ChatGPT can be done via the API. For example:

  • Submit text queries and get responses back
  • Send images, documents, and audio as part of your queries
  • Generate images from text, existing images, or a combination of both
  • Generate audio from queries

To get started, you will need an OpenAI account. Since OpenAI uses a pay-as-you-go model for API usage, add some funds to your account via the billing settings page. Next, you'll need to generate an API key and make it available to your Python code. One easy way to do this is to store your key in a .env file, either in your home directory or in the project directory where your Python code will live. The .env file contents would look like this:

OPENAI_API_KEY="your_api_key"

Then, in your Python scripts and Jupyter notebooks, you can load the .env file before you start using the OpenAI API:

# pip install python-dotenv
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

This will look for a .env file in the script’s directory, and if one is not found, it will continue searching parent directories until one is located.

Now you can start using the API. Let’s go through two examples: one text-based and another where we send an image to the API.

Text queries

Sending text queries to the OpenAI API requires only two parameters: the model you want to use and the query text.

# ensure you first install required packages in your Python environment:
# pip install openai python-dotenv pydantic
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5-mini",
    input=[{
        "role": "user",
        "content": "Provide a brief description of what an API is in programming."
    }]
)

print(response.output_text)

The code above will print out the model’s response, which will be slightly different every time it’s generated:

An API (Application Programming Interface) is a defined set of rules and routines that lets one software component communicate with another. It exposes functions, data structures, or web endpoints so developers can use functionality without needing to know the internal implementation. Examples include library APIs (function calls in code) and web APIs/REST endpoints (HTTP requests returning data). APIs provide abstraction, reuse, and interoperability between programs and services.

A single, one-off query is a bit limited compared to the ChatGPT website, where queries and responses are organized into conversations so that you can ask follow-up questions. There are multiple ways to accomplish the same thing using the API, and the easiest way is to create a Conversation object and refer to it when submitting each query:

conversation = client.conversations.create()
response = client.responses.create(
    model="gpt-5-mini",
    conversation=conversation.id,
    input="List three popular programming languages.",
)

print("First response:")
print(response.output_text)

response = client.responses.create(
    model="gpt-5-mini",
    conversation=conversation.id,
    input="List two more.",
)

print("Second response:")
print(response.output_text)

This produces:

First response:
1. Python
2. JavaScript
3. Java  
Second response:
4. C++
5. C#
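
If you don't need a persistent Conversation object, another option is to chain requests with the previous_response_id parameter, which points a new request at an earlier response so the model sees the prior exchange. Here is a minimal sketch (this relies on responses being stored by the API, which is the default behavior):

first = client.responses.create(
    model="gpt-5-mini",
    input="List three popular programming languages.",
)

followup = client.responses.create(
    model="gpt-5-mini",
    previous_response_id=first.id,
    input="List two more.",
)

print(followup.output_text)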

Those are the basics for text queries! We will see that text queries are often the foundation of AI-enabled apps, and when combined with structured output and tool use (explained below), they allow for some cool possibilities!

Queries with images

Most of OpenAI's current models are multimodal, meaning they were trained not just on text but also on images and audio. This allows them to accept text, images, and audio as input and, depending on the model, to produce output in those formats as well. Let's see how to submit an image to the API along with a text query. We will use the example photo below and ask the model to guess the location.

Grassy ridge with a small white–red lighthouse above sea cliffs

This is one approach described in OpenAI’s images and vision API documentation. The input parameter can specify both text and image components:

import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

image_path = "IMG_2765.jpeg"
base64_image = encode_image(image_path)

response = client.responses.create(
    model="gpt-5-mini",
    input=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "input_text",
                    "text": "Where might this image be from?",
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ],
)

print(response.output_text)

This outputs:

This looks like the Faroe Islands — specifically the famous Kallur lighthouse on the island of Kalsoy. The narrow grassy ridge, small white/red lighthouse and steep sea cliffs match that spot.

Now that we’ve covered the OpenAI API basics, let’s look at how to get programmatically useful output.

Structured output

Earlier this year I wrote an entire blog post about structured output, but I’ll review it again here as it is an extremely useful concept.

In all of the examples above, what we got back from the OpenAI API was a block of text. This might be sufficient for chat-based applications where you simply want to relay to the user the output of the model, but for other applications we often need to consume the output programmatically. For example, let’s say we have this photo of a Post-it Note:

post-it note

How can we use the OpenAI API to extract the date and time specified in the note? The first step is to define the exact output format that we want. We do this using Pydantic models, which are Python classes with defined attributes - not to be confused with the model parameter in an OpenAI API call! Since we want a date and a time, we’ll use Python’s built-in datetime module to define a model with the two desired attributes:

from datetime import date, time
from pydantic import BaseModel

class Event(BaseModel):
    event_date: date
    event_time: time

Now, we can send a query to the API with the image and specify that we want the result to be of type Event:

from datetime import date, time
import base64
from pydantic import BaseModel
from openai import OpenAI

class Event(BaseModel):
    event_date: date
    event_time: time

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

image_path = "post-it-note.jpeg"
base64_image = encode_image(image_path)

response = client.responses.parse(
    model="gpt-5-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Return the date and time from this image, assuming the current year."
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ],
    text_format=Event,
)

event = response.output_parsed
print(f"Event date: {event.event_date}, Event time: {event.event_time}")

The response object’s output_parsed attribute gave us an instance of our Event model, which we used to retrieve the date and time:

Event date: 2025-10-15, Event time: 15:00:00+00:00
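
Because output_parsed is an ordinary Pydantic object, we can use its fields like any other Python data. For example, we could combine the date and time into a single datetime:

from datetime import datetime

event_datetime = datetime.combine(event.event_date, event.event_time)
print(event_datetime.isoformat())  # 2025-10-15T15:00:00+00:00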

Models can have attributes referring to other models, allowing the API to return complex data structures. Let’s use a cooking recipe as an example. We can define separate models for the ingredients and steps, and have the recipe itself be composed of a list of ingredients and steps, along with some additional attributes:

from pydantic import BaseModel

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class Step(BaseModel):
    step_number: int
    instructions: str

class Recipe(BaseModel):
    title: str
    ingredients: list[Ingredient]
    steps: list[Step]
    prep_time_minutes: int
    cook_time_minutes: int
    servings: int

Now we can provide a recipe to the OpenAI API in any format and ask for the response to be a Recipe model. Let’s try it with a screenshot from this Instant Pot peanut chicken recipe page:

peanut chicken recipe

# remember to also include our pydantic model definitions above
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

image_path = "recipe-screenshot.png"
base64_image = encode_image(image_path)

response = client.responses.parse(
    model="gpt-5-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Return a structured recipe from this image."
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ],
    text_format=Recipe,
)

recipe = response.output_parsed
print(f"Recipe title: {recipe.title}")
print(f"Ingredients:")
for ingredient in recipe.ingredients:
    print(f" - {ingredient.name}, quantity: {ingredient.quantity} {ingredient.unit}")
print(f"Steps:")
for step in recipe.steps:
    print(f" {step.step_number}. {step.instructions}")
print(f"Preparation time: {recipe.prep_time_minutes} minutes")
print(f"Cooking time: {recipe.cook_time_minutes} minutes")
print(f"Servings: {recipe.servings}")

As expected, we get back a nicely structured recipe:

Recipe title: Slow-Cooker Peanut Butter Teriyaki Chicken with Spaghetti
Ingredients:
 - chicken breast tenders, cubed, quantity: 1.5 pounds
 - cornstarch, quantity: 3.0 tablespoons
 - teriyaki sauce, quantity: 2.0 tablespoons
 - fresh garlic, minced, quantity: 2.0 teaspoons
 - crushed red pepper, quantity: 0.25 teaspoons
 - dark sesame oil, quantity: 1.0 teaspoon
 - chicken broth, quantity: 2.0 cups
 - peanut butter, quantity: 0.25 cup
 - sugar snap peas, trimmed, quantity: 2.5 cups
 - carrots, julienned, quantity: 1.0 cup
 - spaghetti, quantity: 12.0 ounces
 - scallions, sliced, quantity: 0.5 cup
 - unsalted peanuts, chopped, roasted, quantity: 0.25 cup
 - lime wedges (for serving), quantity: 0.0 wedges
Steps:
 1. Combine chicken, 2 tablespoons of the cornstarch, 1 tablespoon of the teriyaki sauce, 1 teaspoon of the garlic, and red pepper in a bowl; toss well.
 2. Heat a large nonstick skillet over Medium-High heat. Add oil to pan; swirl to coat. Add the chicken mixture to the pan; cook for 6 minutes, browning on all sides. Stir in 1/2 cup of the broth, scraping the pan to loosen any browned bits. Transfer the chicken mixture to the inner pot of a 6-quart Instant Pot.
 3. Combine the remaining 1 1/2 cups broth, peanut butter, remaining 1 tablespoon cornstarch, remaining 1 tablespoon teriyaki sauce, and remaining 1 teaspoon garlic in a bowl; pour over the chicken mixture.
 4. Close and lock the lid of the Instant Pot. Turn the steam release handle to 'Venting' position. Press Slow Cook, and use Adjust to select More mode. Press [-] or [+] to choose 1 hour 30 minutes cook time.
 5. When the time is up, open the lid and stir in the peas and carrots. Repeat the Slow Cook procedure, this time choosing 30 minutes as the cook time. When the time is up, the peas should be crisp-tender.
 6. While the peas and carrots cook, cook the pasta according to package directions, omitting salt and fat; drain. Add the cooked spaghetti to the chicken mixture in the Instant Pot; toss well.
 7. Sprinkle with scallions and peanuts; serve with lime wedges, if desired. Enjoy!
Preparation time: 11 minutes
Cooking time: 120 minutes
Servings: 6

Structured output is important because it allows us to very easily use the output of an OpenAI API call as input to another system - whether that be a database, another API, or our own custom program.
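
For instance, here is a minimal sketch (assuming the Recipe model and the parsed recipe object from above, plus a hypothetical recipes.db file) that stores the parsed recipe in a local SQLite database:

import sqlite3

conn = sqlite3.connect("recipes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS recipes (title TEXT, servings INTEGER, data TEXT)"
)
conn.execute(
    "INSERT INTO recipes (title, servings, data) VALUES (?, ?, ?)",
    (recipe.title, recipe.servings, recipe.model_dump_json()),
)
conn.commit()
conn.close()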

Next, let’s look at providing the AI model access to custom tools that allow it to request information or perform actions that it would not otherwise be capable of.

Tool use

In this context, tools are Python functions that we describe to the LLM as part of a query. If the model decides that it needs to use one of the described tools in order to fulfill the query, it will send back a tool request. We are then responsible for executing the specified tool with the provided parameters and sending the result back through the API so it can continue processing the query using the output of the tool.

Tools either provide information to the model or let it perform specific actions. Let's walk through an example to see how this works: say we have a Python function that searches a private contact database. For this example, we can implement it with a list of fake contacts:

import json
from openai import OpenAI

my_contacts = """[
{ "first_name": "Alice", "last_name": "Nguyen", "email_address": "alice.nguyen@example.com", "mobile_number": "+1-202-555-0147" },
{ "first_name": "Brian", "last_name": "Henderson", "email_address": "brian.henderson@example.com", "mobile_number": "+1-303-555-0198" },
{ "first_name": "Carla", "last_name": "Mendez", "email_address": "carla.mendez@example.com", "mobile_number": "+1-415-555-0132" },
{ "first_name": "David", "last_name": "Sharma", "email_address": "david.sharma@example.com", "mobile_number": "+1-646-555-0174" },
{ "first_name": "Elena", "last_name": "Petrov", "email_address": "elena.petrov@example.com", "mobile_number": "+1-718-555-0129" },
{ "first_name": "Felix", "last_name": "Johnson", "email_address": "felix.johnson@example.com", "mobile_number": "+1-210-555-0163" },
{ "first_name": "Grace", "last_name": "Martinez", "email_address": "grace.martinez@example.com", "mobile_number": "+1-512-555-0186" },
{ "first_name": "Hiroshi", "last_name": "Tanaka", "email_address": "hiroshi.tanaka@example.com", "mobile_number": "+1-917-555-0155" },
{ "first_name": "Isabella", "last_name": "Moretti", "email_address": "isabella.moretti@example.com", "mobile_number": "+1-305-555-0118" },
{ "first_name": "Jonas", "last_name": "Keller", "email_address": "jonas.keller@example.com", "mobile_number": "+1-408-555-0192" }
]"""

contacts = json.loads(my_contacts)

def search_contacts(first_name: str | None = None, last_name: str | None = None) -> str:
    """Search the contact list by (partial) first and/or last name."""
    matches = []
    for contact in contacts:
        if first_name and first_name.lower() not in contact["first_name"].lower():
            continue
        if last_name and last_name.lower() not in contact["last_name"].lower():
            continue
        matches.append(contact)
    results = ""
    for match in matches:
        results += f'{match["first_name"]} {match["last_name"]}, email: {match["email_address"]}, mobile: {match["mobile_number"]}\n'
    return results

Next we need to describe the tool using OpenAI’s function definition syntax:

tools = [
    {
        "type": "function",
        "name": "search_contacts",
        "description": "Search for contacts by first name and/or last name.",
        "parameters": {
            "type": "object",
            "properties": {
                "first_name": {
                    "type": "string",
                    "description": "The first name of the contact to search for."
                },
                "last_name": {
                    "type": "string",
                    "description": "The last name of the contact to search for."
                }
            },
        },
    },
]

Now we can submit a query that includes the tool definition, which effectively tells the LLM that our search_contacts function is available for it to use.

client = OpenAI()

inputs = [
    {"role": "user", "content": "Find the contact information for Jonas."}
]

response = client.responses.create(
    model="gpt-5-mini",
    tools=tools,
    input=inputs,
)

Since we, as the API caller, are responsible for actually executing the tool, we have to check whether the response includes a function_call request. We do this by iterating over the items in response.output. For each tool call request we find, we execute our function with the arguments provided by the LLM and construct a special function_call_output dictionary to send back as part of a follow-up API call.

inputs += response.output

function_call_requests = filter(lambda x: x.type == "function_call", response.output)

for req in function_call_requests:
    if req.name == "search_contacts":
        print(f"Calling search_contacts with arguments: {req.arguments}")
        contact_info = search_contacts(**json.loads(req.arguments))

        inputs.append({
            "type": "function_call_output",
            "call_id": req.call_id,
            "output": contact_info
        })

Output:

Calling search_contacts with arguments: {"first_name":"Jonas","last_name":""}

Now that we've performed the requested tool call and appended the result to our inputs list, we send a follow-up query so the model can use the tool's output to produce its final response:

response = client.responses.create(
    model="gpt-5-mini",
    instructions="Display only the requested contact information.",
    tools=tools,
    input=inputs,
)
print(response.output_text)

This results in the final output for our query:

Jonas Keller
Email: jonas.keller@example.com
Mobile: +1-408-555-0192

While this is a simple example, it demonstrates how custom tools can give LLMs new capabilities. Combined with Python's rich ecosystem of packages, they open up endless ways of harnessing LLMs for useful tasks. Here are just a few examples:

  • Some home security cameras allow programmatic access to a still image. You could periodically retrieve an image and send it to the LLM for analysis, then deliver the results via email or SMS (a rough sketch of this idea follows the list).
  • Tools like macOS Shortcuts can execute Python scripts in response to events, such as downloading a file. A script could, using the OpenAI API and a custom tool, automatically organize the new files in your Downloads folder based on criteria you specify in a prompt.
  • A script could retrieve the day’s weather forecast from a weather provider API, retrieve your calendar appointments for the day using a custom tool, and then notify you if the weather might disrupt your plans.
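
As a rough sketch of the first idea, the snippet below grabs a snapshot from a camera (the URL is a hypothetical placeholder) using the requests package and asks the model to describe anything unusual; delivering the result via email or SMS is left out:

import base64
import requests
from openai import OpenAI

client = OpenAI()

# hypothetical snapshot endpoint exposed by a home security camera
CAMERA_SNAPSHOT_URL = "http://192.168.1.50/snapshot.jpg"

snapshot = requests.get(CAMERA_SNAPSHOT_URL, timeout=10)
base64_image = base64.b64encode(snapshot.content).decode("utf-8")

response = client.responses.create(
    model="gpt-5-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "Describe anything unusual in this security camera image."},
            {"type": "input_image",
             "image_url": f"data:image/jpeg;base64,{base64_image}"},
        ],
    }],
)

print(response.output_text)  # send this via email or SMS instead of printing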

Other concepts

Once you become familiar with the OpenAI API, structured output, and custom tools, I'd suggest you look at the OpenAI Agents SDK. It packages these concepts together and abstracts away some of the details; for example, it automatically executes function calls when the LLM requests them. I used the Agents SDK to implement my genomics rare disease assistant, using custom tools to provide the agents with the resources needed to search a database of genetic variants.
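
To give a flavor of what that looks like, here is a minimal sketch in the style of the Agents SDK quickstart (installed with pip install openai-agents); the weather tool and its canned reply are placeholders, and details may differ between SDK versions:

from agents import Agent, Runner, function_tool

@function_tool
def get_weather(city: str) -> str:
    # placeholder implementation; a real tool would call a weather API
    return f"The weather in {city} is sunny."

agent = Agent(
    name="Weather assistant",
    instructions="Answer weather questions using the available tools.",
    tools=[get_weather],
)

# the SDK runs the function_call / function_call_output loop for us
result = Runner.run_sync(agent, "What's the weather like in Tokyo?")
print(result.final_output)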

I hope this post proves useful for those wanting to get started building AI-enabled tools and apps. I always recommend that people start by experimenting - oftentimes, you won’t know exactly how well an AI model will perform at specific tasks until you try!