When you need to serve AI models behind an API, Python is the natural choice. The entire machine learning ecosystem lives in Python, from PyTorch and TensorFlow to HuggingFace Transformers and LangChain. But choosing the right web framework to wrap those models matters just as much as the models themselves. FastAPI has emerged as the clear winner for AI backend development, and for good reason. It is async-native, it generates OpenAPI documentation automatically, and its type hint system catches errors before they reach production. In this guide, we build a complete AI API backend with FastAPI that serves text generation, embeddings, and classification through clean REST endpoints.
Whether you are building an internal tool that needs AI capabilities or exposing AI services to external consumers, this article gives you the architecture and code to do it right.
Why FastAPI Is Ideal for AI Services
FastAPI was built for exactly the kind of workloads that AI services demand. Here is why it stands apart from alternatives like Flask or Django for this use case:
Async-first design. AI inference can take anywhere from milliseconds to minutes. With Flask, a long-running inference call blocks the entire worker thread. FastAPI is built on Starlette and supports native async/await, which means your server can handle hundreds of concurrent requests even when individual inferences take several seconds. This is critical for AI APIs where you often have a mix of fast and slow endpoints.
Automatic type validation with Pydantic. AI endpoints have complex input and output schemas. A text generation request might include the prompt, temperature, max tokens, stop sequences, and model selection. Pydantic models validate all of this automatically and generate clear error messages when clients send malformed requests:
from pydantic import BaseModel, Field

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=10000)
    model: str = Field(default="llama3.2")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=512, ge=1, le=4096)
    stop_sequences: list[str] = Field(default_factory=list)

class GenerationResponse(BaseModel):
    text: str
    model: str
    usage: dict[str, int]
    latency_ms: float
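To see that validation in action, here is a quick sketch (using a trimmed-down copy of the request model) showing how Pydantic rejects out-of-range values and reports every violated constraint at once:

```python
from pydantic import BaseModel, Field, ValidationError

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=10000)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

try:
    # Empty prompt and out-of-range temperature: both are caught in one pass
    GenerationRequest(prompt="", temperature=5.0)
except ValidationError as exc:
    rejected = sorted({err["loc"][0] for err in exc.errors()})
    print(rejected)  # ['prompt', 'temperature']
```

Clients get the same structured error detail back as a 422 response, with no error-handling code on your part.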
Auto-generated OpenAPI docs. The moment you define your endpoints and Pydantic models, FastAPI generates interactive Swagger UI documentation at /docs. For AI APIs, this is invaluable. Your team, your frontend developers, and your API consumers can test endpoints directly in the browser without writing a single line of client code.
Performance. FastAPI consistently benchmarks as one of the fastest Python web frameworks, approaching Node.js and Go for I/O-bound workloads. While inference speed is bound by your model, the framework overhead is minimal.
Project Structure for AI Backends
A well-organized project structure keeps your AI backend maintainable as you add models and endpoints. Here is the layout we recommend:
ai-backend/
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI app setup, middleware, startup
│ ├── config.py # Settings via pydantic-settings
│ ├── routers/
│ │ ├── __init__.py
│ │ ├── generation.py # Text generation endpoints
│ │ ├── embeddings.py # Embedding endpoints
│ │ ├── classification.py # Classification endpoints
│ │ └── health.py # Health check endpoints
│ ├── services/
│ │ ├── __init__.py
│ │ ├── ollama.py # Ollama integration
│ │ ├── huggingface.py # HuggingFace model loading
│ │ └── task_queue.py # Background task management
│ ├── models/
│ │ ├── __init__.py
│ │ └── schemas.py # Pydantic request/response models
│ └── middleware/
│ ├── __init__.py
│ └── rate_limit.py # Rate limiting for AI endpoints
├── tests/
│ ├── test_generation.py
│ ├── test_embeddings.py
│ └── conftest.py
├── Dockerfile
├── requirements.txt
└── docker-compose.yml
The main application file ties everything together:
# app/main.py
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.config import settings
from app.routers import generation, embeddings, classification, health
from app.services.ollama import OllamaService
from app.services.huggingface import HuggingFaceService

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load models into memory
    app.state.ollama = OllamaService(base_url=settings.ollama_url)
    app.state.hf_service = HuggingFaceService()
    await app.state.hf_service.load_models()
    print("Models loaded and ready to serve")
    yield
    # Shutdown: clean up resources
    await app.state.hf_service.unload_models()

app = FastAPI(
    title="AI Backend API",
    description="Production AI inference API powered by FastAPI",
    version="1.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.allowed_origins,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(health.router, tags=["Health"])
app.include_router(generation.router, prefix="/api/v1", tags=["Generation"])
app.include_router(embeddings.router, prefix="/api/v1", tags=["Embeddings"])
app.include_router(classification.router, prefix="/api/v1", tags=["Classification"])
The lifespan context manager is critical for AI services. Loading a model from disk into GPU memory can take 10 to 30 seconds. You want this to happen once at startup, not on every request.
Creating AI Endpoints: Generation, Embeddings, and Classification
Now let us build the core endpoints that serve AI models. Each endpoint follows the same pattern: validate input, run inference, return structured output.
Text Generation with Ollama
Ollama makes it straightforward to run open-source language models locally. Here is a generation endpoint that streams responses for a better user experience:
# app/routers/generation.py
import time

from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

from app.models.schemas import GenerationRequest, GenerationResponse
from app.services.ollama import OllamaService

router = APIRouter()

@router.post("/generate", response_model=GenerationResponse)
async def generate_text(req: GenerationRequest, request: Request):
    """Generate text using a local language model via Ollama."""
    start = time.perf_counter()
    ollama: OllamaService = request.app.state.ollama
    result = await ollama.generate(
        model=req.model,
        prompt=req.prompt,
        temperature=req.temperature,
        max_tokens=req.max_tokens,
        stop=req.stop_sequences,
    )
    latency = (time.perf_counter() - start) * 1000
    return GenerationResponse(
        text=result["response"],
        model=req.model,
        usage={
            "prompt_tokens": result.get("prompt_eval_count", 0),
            "completion_tokens": result.get("eval_count", 0),
        },
        latency_ms=round(latency, 2),
    )

@router.post("/generate/stream")
async def generate_text_stream(req: GenerationRequest, request: Request):
    """Stream generated text token by token."""
    ollama: OllamaService = request.app.state.ollama

    async def token_stream():
        async for chunk in ollama.generate_stream(
            model=req.model,
            prompt=req.prompt,
            temperature=req.temperature,
            max_tokens=req.max_tokens,
        ):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        token_stream(),
        media_type="text/event-stream",
    )
The Ollama service wraps the HTTP calls to the local Ollama server:
# app/services/ollama.py
import json
from typing import AsyncIterator

import httpx

class OllamaService:
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url
        self.client = httpx.AsyncClient(base_url=base_url, timeout=120.0)

    async def generate(
        self, model: str, prompt: str, temperature: float = 0.7,
        max_tokens: int = 512, stop: list[str] | None = None,
    ) -> dict:
        response = await self.client.post("/api/generate", json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": temperature,
                "num_predict": max_tokens,
                "stop": stop or [],
            },
        })
        response.raise_for_status()
        return response.json()

    async def generate_stream(
        self, model: str, prompt: str, temperature: float = 0.7,
        max_tokens: int = 512,
    ) -> AsyncIterator[str]:
        async with self.client.stream(
            "POST", "/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": True,
                "options": {
                    "temperature": temperature,
                    "num_predict": max_tokens,
                },
            },
        ) as response:
            # Ollama streams newline-delimited JSON; skip blank keep-alive lines
            async for line in response.aiter_lines():
                if not line:
                    continue
                data = json.loads(line)
                if not data.get("done"):
                    yield data["response"]
Embeddings with HuggingFace
Embeddings power semantic search, retrieval-augmented generation, and recommendation systems. Here is an endpoint that generates embeddings using a locally loaded HuggingFace model:
# app/routers/embeddings.py
from fastapi import APIRouter, Request
from pydantic import BaseModel, Field

from app.services.huggingface import HuggingFaceService

router = APIRouter()

class EmbeddingRequest(BaseModel):
    texts: list[str] = Field(..., min_length=1, max_length=100)
    model: str = Field(default="all-MiniLM-L6-v2")

class EmbeddingResponse(BaseModel):
    embeddings: list[list[float]]
    model: str
    dimensions: int

@router.post("/embeddings", response_model=EmbeddingResponse)
async def create_embeddings(req: EmbeddingRequest, request: Request):
    """Generate vector embeddings for a list of texts."""
    hf: HuggingFaceService = request.app.state.hf_service
    vectors = await hf.embed(texts=req.texts, model_name=req.model)
    return EmbeddingResponse(
        embeddings=vectors,
        model=req.model,
        dimensions=len(vectors[0]),
    )
The HuggingFace service loads the sentence transformer model at startup and runs inference in a thread pool to avoid blocking the event loop:
# app/services/huggingface.py
import asyncio
from functools import partial

from sentence_transformers import SentenceTransformer

class HuggingFaceService:
    def __init__(self):
        self.models: dict[str, SentenceTransformer] = {}

    async def load_models(self):
        """Pre-load commonly used models at startup."""
        loop = asyncio.get_running_loop()
        self.models["all-MiniLM-L6-v2"] = await loop.run_in_executor(
            None, SentenceTransformer, "all-MiniLM-L6-v2"
        )

    async def embed(
        self, texts: list[str], model_name: str = "all-MiniLM-L6-v2"
    ) -> list[list[float]]:
        model = self.models.get(model_name)
        if not model:
            raise ValueError(f"Model {model_name} is not loaded")
        loop = asyncio.get_running_loop()
        embeddings = await loop.run_in_executor(
            None, partial(model.encode, texts, normalize_embeddings=True)
        )
        return embeddings.tolist()

    async def unload_models(self):
        self.models.clear()
Running synchronous model inference inside run_in_executor is the key pattern here. Sentence transformer encoding is CPU-bound and would block the async event loop if called directly. By offloading it to a thread pool, you keep the server responsive to other requests.
Text Classification
Classification endpoints let you categorize text into predefined labels. This is useful for sentiment analysis, content moderation, intent detection, and ticket routing:
# app/routers/classification.py
from fastapi import APIRouter, Request
from pydantic import BaseModel, Field

from app.services.huggingface import HuggingFaceService

router = APIRouter()

class ClassificationRequest(BaseModel):
    text: str = Field(..., min_length=1)
    labels: list[str] = Field(..., min_length=2)
    multi_label: bool = Field(default=False)

class ClassificationResult(BaseModel):
    label: str
    score: float

class ClassificationResponse(BaseModel):
    results: list[ClassificationResult]

@router.post("/classify", response_model=ClassificationResponse)
async def classify_text(req: ClassificationRequest, request: Request):
    """Classify text into one or more of the provided labels."""
    hf: HuggingFaceService = request.app.state.hf_service
    scores = await hf.zero_shot_classify(
        text=req.text,
        labels=req.labels,
        multi_label=req.multi_label,
    )
    return ClassificationResponse(
        results=[
            ClassificationResult(label=label, score=round(score, 4))
            for label, score in scores
        ]
    )
Zero-shot classification is particularly powerful because it does not require fine-tuning. You pass in arbitrary labels at inference time, and the model ranks them by relevance to the input text.
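The zero_shot_classify method on the service side is not shown above. A minimal sketch, assuming a transformers zero-shot pipeline (for example facebook/bart-large-mnli) loaded at startup, follows the same thread-pool pattern as embed; the class and parameter names here are illustrative:

```python
import asyncio
from functools import partial

class ZeroShotClassifier:
    """Runs a zero-shot pipeline in a thread pool so it never blocks the event loop.

    pipeline_fn is any callable with the transformers zero-shot signature, e.g.
    pipeline("zero-shot-classification", model="facebook/bart-large-mnli").
    """

    def __init__(self, pipeline_fn):
        self.pipeline_fn = pipeline_fn

    async def classify(
        self, text: str, labels: list[str], multi_label: bool = False
    ) -> list[tuple[str, float]]:
        loop = asyncio.get_running_loop()
        out = await loop.run_in_executor(
            None,
            partial(self.pipeline_fn, text, candidate_labels=labels,
                    multi_label=multi_label),
        )
        # transformers returns parallel "labels" and "scores" lists, best match first
        return list(zip(out["labels"], out["scores"]))
```

Injecting the pipeline callable rather than constructing it inside the class keeps the service testable without loading model weights.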
Async Processing for Long-Running AI Tasks
Some AI workloads take minutes, not milliseconds. Document summarization, batch embedding generation, and fine-tuning jobs should not block your API. Instead, use a background task pattern with status polling:
# app/services/task_queue.py
import uuid
from datetime import datetime, timezone
from enum import Enum

class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

class TaskManager:
    def __init__(self):
        self.tasks: dict[str, dict] = {}

    def create_task(self, task_type: str) -> str:
        task_id = str(uuid.uuid4())
        self.tasks[task_id] = {
            "id": task_id,
            "type": task_type,
            "status": TaskStatus.PENDING,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "result": None,
            "error": None,
        }
        return task_id

    async def run_task(self, task_id: str, coroutine):
        self.tasks[task_id]["status"] = TaskStatus.RUNNING
        try:
            result = await coroutine
            self.tasks[task_id]["status"] = TaskStatus.COMPLETED
            self.tasks[task_id]["result"] = result
        except Exception as e:
            self.tasks[task_id]["status"] = TaskStatus.FAILED
            self.tasks[task_id]["error"] = str(e)

    def get_task(self, task_id: str) -> dict | None:
        return self.tasks.get(task_id)
Then create endpoints that accept a job, return a task ID immediately, and let clients poll for the result:
# In your router
from fastapi import BackgroundTasks, HTTPException

@router.post("/batch/embeddings")
async def create_batch_embeddings(
    req: BatchEmbeddingRequest,
    request: Request,
    background_tasks: BackgroundTasks,
):
    """Submit a batch embedding job. Returns a task ID for polling."""
    task_mgr: TaskManager = request.app.state.task_manager
    task_id = task_mgr.create_task("batch_embeddings")

    async def process():
        hf = request.app.state.hf_service
        results = await hf.embed(req.texts, req.model)
        return {"count": len(results), "dimensions": len(results[0])}

    background_tasks.add_task(task_mgr.run_task, task_id, process())

    return {
        "task_id": task_id,
        "status": "pending",
        "poll_url": f"/api/v1/tasks/{task_id}",
    }

@router.get("/tasks/{task_id}")
async def get_task_status(task_id: str, request: Request):
    """Check the status of a background task."""
    task_mgr: TaskManager = request.app.state.task_manager
    task = task_mgr.get_task(task_id)
    if not task:
        raise HTTPException(status_code=404, detail="Task not found")
    return task
These endpoints assume a TaskManager instance is attached as app.state.task_manager during the lifespan startup, alongside the model services. For production systems with multiple workers, replace the in-memory task manager with Celery backed by Redis, or use a proper job queue like Dramatiq or ARQ; an in-process dictionary only works with a single worker, since each worker process would otherwise hold its own task state. The API contract (submit job, get task ID, poll for result) stays the same regardless of the backend.
Deployment Tips for AI Services
Deploying AI backends differs from typical web services because of GPU requirements and model sizes. Here are practical recommendations:
Dockerize everything. A slim Python base image keeps your container lean while still including all the ML dependencies:
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
Use model caching. Mount a persistent volume for model weights so they do not re-download on every deployment. Set the HF_HOME environment variable to point to your mounted volume, and HuggingFace will cache models there automatically.
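In docker-compose terms, that caching setup might look like the following sketch (the volume and service names are illustrative):

```yaml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      HF_HOME: /data/hf-cache    # HuggingFace caches model weights here
    volumes:
      - hf-cache:/data/hf-cache  # persists across container restarts and deploys

volumes:
  hf-cache:
```

With the named volume in place, only the first deployment pays the model download cost.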
Scale with GPU awareness. If you are running on GPU instances, set --workers 1 for uvicorn and let the GPU handle parallelism through batching. Multiple workers competing for the same GPU will cause out-of-memory errors. For CPU-only inference, scale workers to match your CPU count.
Add health checks. Kubernetes and container orchestrators need to know when your service is ready to accept traffic. A health endpoint that verifies model availability prevents routing requests to pods that are still loading:
@router.get("/health")
async def health_check(request: Request):
    ollama = request.app.state.ollama
    models_loaded = bool(request.app.state.hf_service.models)
    ollama_ok = await ollama.ping()
    return {
        "status": "healthy" if (models_loaded and ollama_ok) else "degraded",
        "models_loaded": models_loaded,
        "ollama_connected": ollama_ok,
    }
Rate limiting. AI inference is expensive. Protect your endpoints with rate limiting based on API keys or user tiers. The slowapi library integrates cleanly with FastAPI and supports per-endpoint limits.
Wrapping Up
Python and FastAPI give you the fastest path from a trained model to a production AI API. The async-first architecture handles the inherently variable latency of AI inference, Pydantic validates complex request schemas automatically, and the auto-generated docs make your API immediately accessible to consumers. Combined with local model serving through Ollama or HuggingFace, you can build powerful AI backends without depending on expensive third-party APIs.
The patterns in this guide (structured project layout, streaming endpoints, thread-pool offloading for CPU-bound inference, and async task queues for long-running jobs) scale from a weekend prototype to a production service handling thousands of requests per day.
If you are looking to integrate AI capabilities into your products or need help building a robust AI API backend, our team at Maranatha Technologies has deep experience in AI solutions that deliver real business value. Whether you need a custom inference pipeline, a retrieval-augmented generation system, or a full AI platform, we would enjoy helping you bring it to life. Reach out to us and let us talk about what you are building.