When you need to serve AI models behind an API, Python is the natural choice. The entire machine learning ecosystem lives in Python, from PyTorch and TensorFlow to HuggingFace Transformers and LangChain. But choosing the right web framework to wrap those models matters just as much as the models themselves. FastAPI has emerged as the clear winner for AI backend development, and for good reason. It is async-native, it generates OpenAPI documentation automatically, and its type hint system catches errors before they reach production. In this guide, we build a complete AI API backend with FastAPI that serves text generation, embeddings, and classification through clean REST endpoints.
Whether you are building an internal tool that needs AI capabilities or exposing AI services to external consumers, this article gives you the architecture and code to do it right.
Why FastAPI Is Ideal for AI Services
FastAPI was built for exactly the kind of workloads that AI services demand. Here is why it stands apart from alternatives like Flask or Django for this use case:
Async-first design. AI inference can take anywhere from milliseconds to minutes. With Flask, a long-running inference call blocks the entire worker thread. FastAPI is built on Starlette and supports native async/await, which means your server can handle hundreds of concurrent requests even when individual inferences take several seconds. This is critical for AI APIs where you often have a mix of fast and slow endpoints.
Automatic type validation with Pydantic. AI endpoints have complex input and output schemas. A text generation request might include the prompt, temperature, max tokens, stop sequences, and model selection. Pydantic models validate all of this automatically and generate clear error messages when clients send malformed requests:
from pydantic import BaseModel, Field

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=10000)
    model: str = Field(default="llama3.2")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=512, ge=1, le=4096)
    stop_sequences: list[str] = Field(default_factory=list)

class GenerationResponse(BaseModel):
    text: str
    model: str
    usage: dict[str, int]
    latency_ms: float
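To see that validation in action, here is a quick sketch (using a trimmed-down copy of the request model) showing how Pydantic rejects out-of-range values and reports every violated constraint at once:

```python
from pydantic import BaseModel, Field, ValidationError

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=10000)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

try:
    # Empty prompt and out-of-range temperature: both are caught in one pass
    GenerationRequest(prompt="", temperature=5.0)
except ValidationError as exc:
    rejected = sorted({err["loc"][0] for err in exc.errors()})
    print(rejected)  # ['prompt', 'temperature']
```

Clients get the same structured error detail back as a 422 response, with no error-handling code on your part.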
Auto-generated OpenAPI docs. The moment you define your endpoints and Pydantic models, FastAPI generates interactive Swagger UI documentation at /docs. For AI APIs, this is invaluable. Your team, your frontend developers, and your API consumers can test endpoints directly in the browser without writing a single line of client code.
Performance. FastAPI consistently benchmarks as one of the fastest Python web frameworks, approaching Node.js and Go for I/O-bound workloads. While inference speed is bound by your model, the framework overhead is minimal.
Project Structure for AI Backends
A well-organized project structure keeps your AI backend maintainable as you add models and endpoints. Here is the layout we recommend:
ai-backend/
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI app setup, middleware, startup
│ ├── config.py # Settings via pydantic-settings
│ ├── routers/
│ │ ├── __init__.py
│ │ ├── generation.py # Text generation endpoints
│ │ ├── embeddings.py # Embedding endpoints
│ │ ├── classification.py # Classification endpoints
│ │ └── health.py # Health check endpoints
│ ├── services/
│ │ ├── __init__.py
│ │ ├── ollama.py # Ollama integration
│ │ ├── huggingface.py # HuggingFace model loading
│ │ └── task_queue.py # Background task management
│ ├── models/
│ │ ├── __init__.py
│ │ └── schemas.py # Pydantic request/response models
│ └── middleware/
│ ├── __init__.py
│ └── rate_limit.py # Rate limiting for AI endpoints
├── tests/
│ ├── test_generation.py
│ ├── test_embeddings.py
│ └── conftest.py
├── Dockerfile
├── requirements.txt
└── docker-compose.yml
The main application file ties everything together:
# app/main.py
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.config import settings
from app.routers import generation, embeddings, classification, health
from app.services.ollama import OllamaService
from app.services.huggingface import HuggingFaceService

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load models into memory
    app.state.ollama = OllamaService(base_url=settings.ollama_url)
    app.state.hf_service = HuggingFaceService()
    await app.state.hf_service.load_models()
    print("Models loaded and ready to serve")
    yield
    # Shutdown: clean up resources
    await app.state.hf_service.unload_models()

app = FastAPI(
    title="AI Backend API",
    description="Production AI inference API powered by FastAPI",
    version="1.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.allowed_origins,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(health.router, tags=["Health"])
app.include_router(generation.router, prefix="/api/v1", tags=["Generation"])
app.include_router(embeddings.router, prefix="/api/v1", tags=["Embeddings"])
app.include_router(classification.router, prefix="/api/v1", tags=["Classification"])
The lifespan context manager is critical for AI services. Loading a model from disk into GPU memory can take 10 to 30 seconds. You want this to happen once at startup, not on every request.
Creating AI Endpoints: Generation, Embeddings, and Classification
Now let us build the core endpoints that serve AI models. Each endpoint follows the same pattern: validate input, run inference, return structured output.
Text Generation with Ollama
Ollama makes it straightforward to run open-source language models locally. Here is a generation endpoint that streams responses for a better user experience:
# app/routers/generation.py
import time

from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

from app.models.schemas import GenerationRequest, GenerationResponse
from app.services.ollama import OllamaService

router = APIRouter()

@router.post("/generate", response_model=GenerationResponse)
async def generate_text(req: GenerationRequest, request: Request):
    """Generate text using a local language model via Ollama."""
    start = time.perf_counter()
    ollama: OllamaService = request.app.state.ollama
    result = await ollama.generate(
        model=req.model,
        prompt=req.prompt,
        temperature=req.temperature,
        max_tokens=req.max_tokens,
        stop=req.stop_sequences,
    )
    latency = (time.perf_counter() - start) * 1000
    return GenerationResponse(
        text=result["response"],
        model=req.model,
        usage={
            "prompt_tokens": result.get("prompt_eval_count", 0),
            "completion_tokens": result.get("eval_count", 0),
        },
        latency_ms=round(latency, 2),
    )

@router.post("/generate/stream")
async def generate_text_stream(req: GenerationRequest, request: Request):
    """Stream generated text token by token."""
    ollama: OllamaService = request.app.state.ollama

    async def token_stream():
        async for chunk in ollama.generate_stream(
            model=req.model,
            prompt=req.prompt,
            temperature=req.temperature,
            max_tokens=req.max_tokens,
        ):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        token_stream(),
        media_type="text/event-stream",
    )
The Ollama service wraps the HTTP calls to the local Ollama server:
# app/services/ollama.py
import json
from typing import AsyncIterator

import httpx

class OllamaService:
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url
        self.client = httpx.AsyncClient(base_url=base_url, timeout=120.0)

    async def generate(
        self, model: str, prompt: str, temperature: float = 0.7,
        max_tokens: int = 512, stop: list[str] | None = None,
    ) -> dict:
        response = await self.client.post("/api/generate", json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": temperature,
                "num_predict": max_tokens,
                "stop": stop or [],
            },
        })
        response.raise_for_status()
        return response.json()

    async def generate_stream(
        self, model: str, prompt: str, temperature: float = 0.7,
        max_tokens: int = 512,
    ) -> AsyncIterator[str]:
        async with self.client.stream(
            "POST", "/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": True,
                "options": {
                    "temperature": temperature,
                    "num_predict": max_tokens,
                },
            },
        ) as response:
            # Ollama streams newline-delimited JSON; skip blank keep-alive lines
            async for line in response.aiter_lines():
                if not line:
                    continue
                data = json.loads(line)
                if not data.get("done"):
                    yield data["response"]
Embeddings with HuggingFace
Embeddings power semantic search, retrieval-augmented generation, and recommendation systems. Here is an endpoint that generates embeddings using a locally loaded HuggingFace model:
# app/routers/embeddings.py
from fastapi import APIRouter, Request
from pydantic import BaseModel, Field

from app.services.huggingface import HuggingFaceService

router = APIRouter()

class EmbeddingRequest(BaseModel):
    texts: list[str] = Field(..., min_length=1, max_length=100)
    model: str = Field(default="all-MiniLM-L6-v2")

class EmbeddingResponse(BaseModel):
    embeddings: list[list[float]]
    model: str
    dimensions: int

@router.post("/embeddings", response_model=EmbeddingResponse)
async def create_embeddings(req: EmbeddingRequest, request: Request):
    """Generate vector embeddings for a list of texts."""
    hf: HuggingFaceService = request.app.state.hf_service
    vectors = await hf.embed(texts=req.texts, model_name=req.model)
    return EmbeddingResponse(
        embeddings=vectors,
        model=req.model,
        dimensions=len(vectors[0]),
    )
The HuggingFace service loads the sentence transformer model at startup and runs inference in a thread pool to avoid blocking the event loop:
# app/services/huggingface.py
import asyncio
from functools import partial

from sentence_transformers import SentenceTransformer

class HuggingFaceService:
    def __init__(self):
        self.models: dict[str, SentenceTransformer] = {}

    async def load_models(self):
        """Pre-load commonly used models at startup."""
        loop = asyncio.get_running_loop()
        self.models["all-MiniLM-L6-v2"] = await loop.run_in_executor(
            None, SentenceTransformer, "all-MiniLM-L6-v2"
        )

    async def embed(
        self, texts: list[str], model_name: str = "all-MiniLM-L6-v2"
    ) -> list[list[float]]:
        model = self.models.get(model_name)
        if not model:
            raise ValueError(f"Model {model_name} is not loaded")
        loop = asyncio.get_running_loop()
        embeddings = await loop.run_in_executor(
            None, partial(model.encode, texts, normalize_embeddings=True)
        )
        return embeddings.tolist()

    async def unload_models(self):
        self.models.clear()
Running synchronous model inference inside run_in_executor is the key pattern here. Sentence transformer encoding is CPU-bound and would block the async event loop if called directly. By offloading it to a thread pool, you keep the server responsive to other requests.
Text Classification
Classification endpoints let you categorize text into predefined labels. This is useful for sentiment analysis, content moderation, intent detection, and ticket routing:
# app/routers/classification.py
from fastapi import APIRouter, Request
from pydantic import BaseModel, Field

from app.services.huggingface import HuggingFaceService

router = APIRouter()

class ClassificationRequest(BaseModel):
    text: str = Field(..., min_length=1)
    labels: list[str] = Field(..., min_length=2)
    multi_label: bool = Field(default=False)

class ClassificationResult(BaseModel):
    label: str
    score: float

class ClassificationResponse(BaseModel):
    results: list[ClassificationResult]

@router.post("/classify", response_model=ClassificationResponse)
async def classify_text(req: ClassificationRequest, request: Request):
    """Classify text into one or more of the provided labels."""
    hf: HuggingFaceService = request.app.state.hf_service
    scores = await hf.zero_shot_classify(
        text=req.text,
        labels=req.labels,
        multi_label=req.multi_label,
    )
    return ClassificationResponse(
        results=[
            ClassificationResult(label=label, score=round(score, 4))
            for label, score in scores
        ]
    )
Zero-shot classification is particularly powerful because it does not require fine-tuning. You pass in arbitrary labels at inference time, and the model ranks them by relevance to the input text.
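The zero_shot_classify method on the service side is not shown above. A minimal sketch, assuming a transformers zero-shot pipeline (for example facebook/bart-large-mnli) loaded at startup, follows the same thread-pool pattern as embed; the class and parameter names here are illustrative:

```python
import asyncio
from functools import partial

class ZeroShotClassifier:
    """Runs a zero-shot pipeline in a thread pool so it never blocks the event loop.

    pipeline_fn is any callable with the transformers zero-shot signature, e.g.
    pipeline("zero-shot-classification", model="facebook/bart-large-mnli").
    """

    def __init__(self, pipeline_fn):
        self.pipeline_fn = pipeline_fn

    async def classify(
        self, text: str, labels: list[str], multi_label: bool = False
    ) -> list[tuple[str, float]]:
        loop = asyncio.get_running_loop()
        out = await loop.run_in_executor(
            None,
            partial(self.pipeline_fn, text, candidate_labels=labels,
                    multi_label=multi_label),
        )
        # transformers returns parallel "labels" and "scores" lists, best match first
        return list(zip(out["labels"], out["scores"]))
```

Injecting the pipeline callable rather than constructing it inside the class keeps the service testable without loading model weights.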
Async Processing for Long-Running AI Tasks
Some AI workloads take minutes, not milliseconds. Document summarization, batch embedding generation, and fine-tuning jobs should not block your API. Instead, use a background task pattern with status polling:
# app/services/task_queue.py
import uuid
from datetime import datetime, timezone
from enum import Enum

class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

class TaskManager:
    def __init__(self):
        self.tasks: dict[str, dict] = {}

    def create_task(self, task_type: str) -> str:
        task_id = str(uuid.uuid4())
        self.tasks[task_id] = {
            "id": task_id,
            "type": task_type,
            "status": TaskStatus.PENDING,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "result": None,
            "error": None,
        }
        return task_id

    async def run_task(self, task_id: str, coroutine):
        self.tasks[task_id]["status"] = TaskStatus.RUNNING
        try:
            result = await coroutine
            self.tasks[task_id]["status"] = TaskStatus.COMPLETED
            self.tasks[task_id]["result"] = result
        except Exception as e:
            self.tasks[task_id]["status"] = TaskStatus.FAILED
            self.tasks[task_id]["error"] = str(e)

    def get_task(self, task_id: str) -> dict | None:
        return self.tasks.get(task_id)
Then create endpoints that accept a job, return a task ID immediately, and let clients poll for the result:
# In your router
from fastapi import BackgroundTasks, HTTPException

@router.post("/batch/embeddings")
async def create_batch_embeddings(
    req: BatchEmbeddingRequest,
    request: Request,
    background_tasks: BackgroundTasks,
):
    """Submit a batch embedding job. Returns a task ID for polling."""
    task_mgr: TaskManager = request.app.state.task_manager
    task_id = task_mgr.create_task("batch_embeddings")

    async def process():
        hf = request.app.state.hf_service
        results = await hf.embed(req.texts, req.model)
        return {"count": len(results), "dimensions": len(results[0])}

    background_tasks.add_task(task_mgr.run_task, task_id, process())

    return {
        "task_id": task_id,
        "status": "pending",
        "poll_url": f"/api/v1/tasks/{task_id}",
    }

@router.get("/tasks/{task_id}")
async def get_task_status(task_id: str, request: Request):
    """Check the status of a background task."""
    task_mgr: TaskManager = request.app.state.task_manager
    task = task_mgr.get_task(task_id)
    if not task:
        raise HTTPException(status_code=404, detail="Task not found")
    return task
These endpoints assume a TaskManager instance is attached as app.state.task_manager during the lifespan startup, alongside the model services. For production systems with multiple workers, replace the in-memory task manager with Celery backed by Redis, or use a proper job queue like Dramatiq or ARQ; an in-process dictionary only works with a single worker, since each worker process would otherwise hold its own task state. The API contract (submit job, get task ID, poll for result) stays the same regardless of the backend.
Deployment Tips for AI Services
Deploying AI backends differs from typical web services because of GPU requirements and model sizes. Here are practical recommendations:
Dockerize everything. A slim Python base image keeps your container lean while still including all the ML dependencies:
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
Use model caching. Mount a persistent volume for model weights so they do not re-download on every deployment. Set the HF_HOME environment variable to point to your mounted volume, and HuggingFace will cache models there automatically.
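In docker-compose terms, that caching setup might look like the following sketch (the volume and service names are illustrative):

```yaml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      HF_HOME: /data/hf-cache    # HuggingFace caches model weights here
    volumes:
      - hf-cache:/data/hf-cache  # persists across container restarts and deploys

volumes:
  hf-cache:
```

With the named volume in place, only the first deployment pays the model download cost.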
Scale with GPU awareness. If you are running on GPU instances, set --workers 1 for uvicorn and let the GPU handle parallelism through batching. Multiple workers competing for the same GPU will cause out-of-memory errors. For CPU-only inference, scale workers to match your CPU count.
Add health checks. Kubernetes and container orchestrators need to know when your service is ready to accept traffic. A health endpoint that verifies model availability prevents routing requests to pods that are still loading:
@router.get("/health")
async def health_check(request: Request):
    ollama = request.app.state.ollama
    models_loaded = bool(request.app.state.hf_service.models)
    ollama_ok = await ollama.ping()
    return {
        "status": "healthy" if (models_loaded and ollama_ok) else "degraded",
        "models_loaded": models_loaded,
        "ollama_connected": ollama_ok,
    }
Rate limiting. AI inference is expensive. Protect your endpoints with rate limiting based on API keys or user tiers. The slowapi library integrates cleanly with FastAPI and supports per-endpoint limits.
Wrapping Up
Python and FastAPI give you the fastest path from a trained model to a production AI API. The async-first architecture handles the inherently variable latency of AI inference, Pydantic validates complex request schemas automatically, and the auto-generated docs make your API immediately accessible to consumers. Combined with local model serving through Ollama or HuggingFace, you can build powerful AI backends without depending on expensive third-party APIs.
The patterns in this guide (structured project layout, streaming endpoints, thread-pool offloading for CPU-bound inference, and async task queues for long-running jobs) scale from a weekend prototype to a production service handling thousands of requests per day.
If you are looking to integrate AI capabilities into your products or need help building a robust AI API backend, our team at Maranatha Technologies has deep experience in AI solutions that deliver real business value. Whether you need a custom inference pipeline, a retrieval-augmented generation system, or a full AI platform, we would enjoy helping you bring it to life. Reach out to us and let us talk about what you are building.