Pydantic-AI: Image Processing with Multi-Model Support

Raheel Siddiqui
2 min read · Dec 21, 2024


Pydantic-AI is revolutionizing how developers interact with Large Language Models (LLMs). By bringing type safety, structured outputs, and seamless LLM integration, this library makes LLM-powered applications more robust and user-friendly. Whether you’re building agents, setting up system prompts, or processing streaming responses, Pydantic-AI streamlines it all.

In this article, we’ll dive into a practical implementation: extracting structured information from resume images using Pydantic-AI and an OpenAI model. Let’s explore how this combination bridges the gap between unstructured visual data and reliable, validated outputs, and how to process images with Pydantic-AI in a multi-modal workflow.

Turning Resumes into Structured Insights

Imagine automating the extraction of LinkedIn profiles, GitHub links, emails, work experiences, and more from a pile of resume images. That’s precisely what our project does, thanks to Pydantic-AI’s type-safe, structured output capabilities.

Here’s how we made it happen.

Breaking Down the Implementation

Our solution revolves around two key components: data structure definition and image processing logic.

1. Defining the Data Structure

We begin by defining what information we want to extract using a Pydantic model. This ensures that every output is not only structured but also validated.

from pydantic import BaseModel

class Summary(BaseModel):
    linkedin_profile: str
    github_profile: str
    email: str
    work_experience: str
    education: str
    skills: str

This Summary class acts as a blueprint for the data we need. Each field represents a critical piece of information from the resumes.
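
To see the validation in action before any LLM is involved, you can instantiate the model directly. The field values below are made up purely for illustration:

from pydantic import ValidationError

# Well-formed data passes straight through and gives us typed attributes.
summary = Summary(
    linkedin_profile="https://linkedin.com/in/example",
    github_profile="https://github.com/example",
    email="candidate@example.com",
    work_experience="3 years as a backend engineer",
    education="BSc Computer Science",
    skills="Python, FastAPI, PostgreSQL",
)
print(summary.email)

# Missing fields raise a ValidationError instead of silently producing an
# incomplete record; the agent's output gets the same guarantee.
try:
    Summary(email="candidate@example.com")
except ValidationError as exc:
    print(exc.error_count(), "fields failed validation")

This is the same check the agent applies to the model’s response, so a malformed extraction fails loudly rather than slipping through.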

2. Building the Image Processor

Next, we handle the heavy lifting with an ImageSummarizer class.

Initializing the Agent

The first step is to set up an agent using Pydantic-AI.

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel

# settings holds the project's model name and API key configuration.
class ImageSummarizer:
    def __init__(self, model_name: str = settings.LLM_MODEL, api_key: str = settings.OPENAI_API_KEY):
        self._model = OpenAIModel(model_name, api_key=api_key)
        self._agent = Agent(
            model=self._model,
            system_prompt="You are a helpful assistant that can summarize images",
            result_type=Summary,
            model_settings={"temperature": 0, "max_tokens": 10000},
        )

Here’s what’s happening:

  • Model Initialization: We wrap the OpenAI model we want to use, passing in the API key from our settings.
  • Agent Setup: The agent is built with the Summary class as its result type, ensuring outputs adhere to our predefined structure. A minimal instantiation sketch follows below.
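
If you want to try the class outside the original project, a minimal sketch looks like the following; the model name and environment variable are assumptions, not values from the project’s settings module:

import os

# Instantiate the summarizer with an explicit model and API key rather than
# relying on the project's settings defaults.
summarizer = ImageSummarizer(
    model_name="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
)

Setting temperature to 0 keeps the extraction as deterministic as possible, which matters when the same resume should always produce the same structured record.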

The Summarize Method

This is where the magic happens.

    # ChatCompletionContentPartImageParam / ChatCompletionContentPartTextParam are
    # the OpenAI SDK's chat content-part types; ImageURL is the image-part payload.
    def summarize(self, image_urls: List[str], prompt: str) -> Summary:
        image_params = [
            ChatCompletionContentPartImageParam(
                type='image_url',
                image_url=ImageURL(url=url, detail='low'),
            )
            for url in image_urls
        ]
        result = self._agent.run_sync([
            ChatCompletionContentPartTextParam(text=prompt, type='text'),
            *image_params,
        ])
        return result.data

In this method:

  1. Input Conversion: Resume image URLs are transformed into OpenAI-compatible parameters.
  2. Agent Execution: The agent processes the prompt and images together.
  3. Structured Output: Results are returned as validated Summary objects; a usage sketch follows below.
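
Putting it all together, a call could look roughly like this; the URLs and prompt are illustrative, not part of the original project:

resume_pages = [
    "https://example.com/resumes/jane-doe-page-1.png",
    "https://example.com/resumes/jane-doe-page-2.png",
]

summary = summarizer.summarize(
    image_urls=resume_pages,
    prompt="Extract the candidate's contact details, work experience, education and skills.",
)

# The result is a validated Summary instance, so the fields are plain typed
# attributes rather than keys in a loose dict.
print(summary.email)
print(summary.work_experience)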

Enhancing Multi-Modal Support

While our implementation works well, Pydantic-AI is still evolving. For example, there’s an open issue to introduce first-class multi-modal input support. Once resolved, the library will make projects like ours even more streamlined.

Here’s a detailed guide on how to set up and run this project: GitHub Repo Link

Have you tried Pydantic-AI? Let me know your experience in the comments! And don’t forget to check out my GitHub repo for the complete code.

