Inference
Inference is how you use a trained AI model - you send it an input, and it returns a prediction or response. On E2E AI Cloud, inference lets you deploy models as live API endpoints that your applications can call.
You can serve models using popular frameworks like vLLM or SGLang, or bring your own container. Models can be sourced from Hugging Face or your own repository. Endpoints are OpenAI-compatible, so they work with tools you already use.
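Because the endpoints follow the OpenAI API shape, a client request is just an HTTP POST to a chat-completions route. The sketch below builds such a request with the Python standard library; the endpoint URL, token, and model name are placeholders, not real values from E2E AI Cloud.

```python
import json
import urllib.request

# Hypothetical values -- replace with your deployment's endpoint URL,
# API token, and the model name you deployed.
ENDPOINT = "https://your-endpoint.example.com/v1/chat/completions"
API_TOKEN = "YOUR_API_TOKEN"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )

# Sending the request (uncomment once ENDPOINT and API_TOKEN are set):
# with urllib.request.urlopen(build_request("my-model", "Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request shape is the standard OpenAI one, you can also point existing OpenAI SDK clients at your endpoint by overriding their base URL.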
E2E AI Cloud handles the infrastructure - including automatic scaling and scale-to-zero for serverless deployments - so you can focus on your model, not the servers.
Model Repository
Model Endpoints
Inference Tutorials