Pydantic-AI: Image Processing with Multi-Modal Support
Pydantic-AI is revolutionizing how developers interact with Large Language Models (LLMs). By bringing type safety, structured outputs, and seamless LLM integration, this library makes LLM-powered applications more robust and user-friendly. Whether you’re building agents, setting up system prompts, or processing streaming responses, Pydantic-AI streamlines it all.
In this article, we’ll dive into a practical implementation: extracting structured information from resume images using Pydantic-AI and an OpenAI model. Let’s explore how this combination bridges the gap between unstructured visual data and reliable, validated outputs, and how to process multi-modal inputs (text plus images) with Pydantic-AI.
Turning Resumes into Structured Insights
Imagine automating the extraction of LinkedIn profiles, GitHub links, emails, work experiences, and more from a pile of resume images. That’s precisely what our project does, thanks to Pydantic-AI’s type-safe, structured output capabilities.
Here’s how we made it happen.
Breaking Down the Implementation
Our solution revolves around two key components: data structure definition and image processing logic.
1. Defining the Data Structure
We begin by defining what information we want to extract using a Pydantic model. This ensures that every output is not only structured but also validated.
from pydantic import BaseModel

class Summary(BaseModel):
    # Required fields; a response missing any of these fails validation
    linkedin_profile: str
    github_profile: str
    email: str
    work_experience: str
    education: str
    skills: str
This Summary class acts as a blueprint for the data we need. Each field represents a critical piece of information from the resumes.
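Because Summary is a Pydantic model, any response that is missing a field or has the wrong type fails fast. Here’s a quick standalone check (not part of the project code) that illustrates this:

from pydantic import ValidationError

try:
    Summary(email="jane@example.com")  # the remaining required fields are missing
except ValidationError as exc:
    print(exc)  # lists every missing field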
2. Building the Image Processor
Next, we handle the heavy lifting with an ImageSummarizer class.
Initializing the Agent
The first step is to set up an agent using Pydantic-AI.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel

class ImageSummarizer:
    def __init__(self, model_name: str = settings.LLM_MODEL, api_key: str = settings.OPENAI_API_KEY):
        # Wrap the OpenAI model and bind it to an agent that must return a Summary
        self._model = OpenAIModel(model_name, api_key=api_key)
        self._agent = Agent(
            model=self._model,
            system_prompt="You are a helpful assistant that can summarize images",
            result_type=Summary,
            model_settings={"temperature": 0, "max_tokens": 10000}
        )
Here’s what’s happening:
- Model Initialization: We configure OpenAI’s model with the model name and API key pulled from our settings (a minimal settings sketch follows below).
- Agent Setup: The agent is built with the Summary class as its result type, ensuring outputs adhere to our predefined structure.
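The settings object referenced above is project-specific. Here’s a minimal sketch of what it might look like, assuming pydantic-settings with the API key read from the environment (field names are illustrative, not from the project):

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Illustrative defaults; adjust to match your project
    LLM_MODEL: str = "gpt-4o-mini"
    OPENAI_API_KEY: str = ""  # typically loaded from the environment

settings = Settings()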
The Summarize Method
This is where the magic happens.
from typing import List

from openai.types.chat import (
    ChatCompletionContentPartImageParam,
    ChatCompletionContentPartTextParam,
)
from openai.types.chat.chat_completion_content_part_image_param import ImageURL

# Method of ImageSummarizer (imports shown at module level)
def summarize(self, image_urls: List[str], prompt: str) -> Summary:
    # Convert each resume URL into an OpenAI-compatible image content part
    image_params = [
        ChatCompletionContentPartImageParam(
            type='image_url',
            image_url=ImageURL(url=url, detail='low')
        ) for url in image_urls
    ]
    # Send the text prompt and all images to the agent in a single call
    result = self._agent.run_sync([
        ChatCompletionContentPartTextParam(text=prompt, type='text'),
        *image_params
    ])
    return result.data
In this method:
- Input Conversion: Resume image URLs are transformed into OpenAI-compatible parameters.
- Agent Execution: The agent processes the prompt and images together.
- Structured Output: Results are returned as validated Summary objects.
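Putting it all together, here’s a short usage sketch; the URL and prompt are placeholders rather than values from the project:

summarizer = ImageSummarizer()
summary = summarizer.summarize(
    image_urls=["https://example.com/resume-page-1.png"],  # placeholder URL
    prompt="Extract the candidate's profile details from this resume."
)
print(summary.email, summary.skills)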
Enhancing Multi-Modal Support
While our implementation works, it relies on OpenAI-specific content-part types, and Pydantic-AI is still evolving: there’s an open issue to introduce native multi-modal support. Once it’s resolved, the library will make projects like ours even more streamlined (see the sketch below).
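For a taste of where this is heading, native image inputs could look something like the following. Treat the ImageUrl name and the model string as assumptions about a future API, not the library’s current interface:

from pydantic_ai import Agent, ImageUrl  # ImageUrl is assumed, not available at the time of writing

agent = Agent("openai:gpt-4o", result_type=Summary)  # assumed model identifier
result = agent.run_sync([
    "Extract the candidate's profile details from this resume.",
    ImageUrl(url="https://example.com/resume-page-1.png"),  # placeholder URL
])
print(result.data)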
Here’s a detailed guide on how to set up and run this project: GitHub Repo Link
Have you tried Pydantic-AI? Let me know your experience in the comments! And don’t forget to check out my GitHub repo for the complete code.