FAQs

FAQs to help you resolve common issues and understand Model Endpoints behavior.

Creation and Deployment

Q: Why do I get "framework is mandatory" or "allowed values are..."?

A: You must specify a framework when creating an endpoint (e.g. vLLM, SGLang, Triton, Custom, Riva, Finetuned). The framework must be one of the supported values for your region.

Q: Why does validation fail with "failed to validate the bucket path" or "No files found at the given path"?

A: The model path (bucket + prefix, or Hugging Face ID) is incorrect, inaccessible, or empty. Ensure the path exists, your account has read access, and (for Hugging Face) your token has Read scope and you have accepted the license for gated models.

Q: Why does my endpoint stay in "Deploying" for a long time?

A: Large models can take 10–15+ minutes to download and load. If the endpoint stays in Deploying longer than that:

  • Check Logs and Deployment events for errors (e.g. model download failure, insufficient quota).
  • For Hugging Face: verify token, model ID, and license acceptance for gated models.
  • Confirm the SKU/plan has enough quota in your region.
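While waiting, you can poll status programmatically instead of watching the console. A minimal sketch, assuming a `get_status` callable as a stand-in for your provider's endpoint-status API (the status strings shown are illustrative):

```python
import time

def wait_until_running(get_status, timeout_s=1800, poll_s=15):
    """Poll a status callable until the endpoint leaves 'Deploying'.

    `get_status` is a stand-in for the provider's endpoint-status API;
    it is assumed to return strings such as 'Deploying', 'Running', or 'Failed'.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "Running":
            return status
        if status == "Failed":
            raise RuntimeError("Deployment failed - check Logs and Deployment events")
        time.sleep(poll_s)
    raise TimeoutError("Still Deploying - check quota and model download logs")
```

With a 30-minute timeout this covers the typical 10–15+ minute download window for large models before flagging a problem.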

Q: Why do I get "SKU Inventory temporarily unavailable. Please try again"?

A: The selected GPU/compute type (SKU) has no available capacity in your region at that moment. Retry after some time, or select a different SKU or region if one is available.

Q: Do I need an access token before creating an endpoint?

A: Yes. Access tokens (e.g. API key, Hugging Face token for model download) must be created and configured before creating an inference endpoint. Without a valid token:

  • Endpoint creation may fail (e.g. when pulling from Hugging Face or a private registry).
  • API authentication to invoke the endpoint will not work.

Create the required tokens in the appropriate service (e.g. Hugging Face, container registry) and enter them when prompted during endpoint creation.
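A fail-fast check before creation avoids a half-deployed endpoint. A minimal sketch, assuming tokens are passed via environment variables (the variable name `HF_TOKEN` below is illustrative, not prescribed by the platform):

```python
import os

def require_token(name):
    """Fail fast if a required access token is missing.

    `name` is the environment variable holding the token, e.g. HF_TOKEN
    for Hugging Face downloads (the variable name is an assumption here).
    """
    token = os.environ.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set - create the token before creating the endpoint"
        )
    return token
```

Running this at the top of a deployment script surfaces a missing token immediately, rather than as a download failure mid-deployment.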

Custom Models and Integrations

Q: Can I bring my own custom model?

A: Yes, you can bring your own custom model, provided:

  • It is compatible with a supported runtime framework (e.g. vLLM, Triton, Custom).
  • Required dependencies are included (e.g. in your container image or model package).
  • Model artifacts follow the expected repository structure for the framework you use.

Use Bring your own container or attach a Model Repository (object storage) with the correct layout. Validate the model before creating the endpoint.
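A pre-flight layout check can catch a malformed model repository before validation fails. This sketch assumes the common Hugging Face layout used by vLLM (config.json, tokenizer files, weight shards); the exact expected structure depends on your framework:

```python
from pathlib import Path

def check_vllm_layout(model_dir):
    """Rough pre-flight check for a vLLM-style model directory.

    Assumes the common Hugging Face format: config.json, tokenizer
    files, and *.safetensors or *.bin weight shards. Returns a list of
    problems; an empty list means the basic layout looks OK.
    """
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    if not any(root.glob("tokenizer*")):
        problems.append("missing tokenizer files")
    if not (any(root.glob("*.safetensors")) or any(root.glob("*.bin"))):
        problems.append("missing weight files (*.safetensors or *.bin)")
    return problems
```

Run it against the local copy of your artifacts before uploading to the model repository or building the container.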

Q: Can inference endpoints integrate with external systems?

A: Yes. Inference endpoints can integrate with:

  • External APIs
  • Databases
  • Object storage
  • Monitoring systems

Outbound access depends on your VPC and security group configurations. Define outbound rules to allow traffic to the services you need. Use Reserved VPC IP and private networking when possible for secure communication.
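A quick TCP reachability check, run from inside the endpoint's container or VPC, can confirm that outbound rules actually allow the traffic (host and port are whatever external service you depend on):

```python
import socket

def can_reach(host, port, timeout_s=3.0):
    """Quick TCP reachability check for an external dependency.

    Useful for verifying that security-group outbound rules allow
    traffic to a database, API, or object store before going live.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

A `False` result from inside the VPC usually points at a missing outbound rule rather than at the external service itself.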

Committed (Reserved) Endpoints

Q: Why can't I stop my endpoint? I get "Committed Model Endpoints cannot be Stopped".

A: Committed (reserved) endpoints cannot be stopped during the commitment period. You pay for the reserved capacity for the full term. Use Hourly billing if you need to stop the endpoint when idle.

Q: Can I change what happens when my committed plan ends?

A: Yes. Use update reserve instance updation policy (or equivalent in the UI) to set:

  • Convert to hourly — At the end of the term, the endpoint switches to hourly billing (you must specify the hourly plan).
  • Auto renew — The endpoint renews for another committed term (you must specify the next committed plan).
  • Auto terminate — The endpoint is terminated at the end of the term.
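The three options above can be sketched as a small payload builder. The action and field names here are illustrative, not the platform's actual API schema; the point is that the first two options require a follow-on plan and the third does not:

```python
def end_of_term_policy(action, next_plan=None):
    """Build an end-of-term policy payload (field names are illustrative).

    'convert_to_hourly' and 'auto_renew' both require a follow-on plan;
    'auto_terminate' does not.
    """
    if action not in ("convert_to_hourly", "auto_renew", "auto_terminate"):
        raise ValueError(f"unknown action: {action}")
    if action in ("convert_to_hourly", "auto_renew") and not next_plan:
        raise ValueError(f"{action} requires a plan to switch or renew into")
    policy = {"action": action}
    if next_plan:
        policy["plan"] = next_plan
    return policy
```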

Q: Why can't I scale below my committed replica count?

A: You cannot scale below the number of committed replicas. The error message will indicate the minimum. Scale down only to the committed replica count, or change your committed plan if you need fewer replicas.
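In automation, clamping the requested replica count to the committed floor avoids this error entirely. A minimal sketch:

```python
def safe_replica_target(desired, committed_min):
    """Clamp a scale-down request to the committed replica floor.

    Committed endpoints cannot run fewer replicas than the plan commits
    to, so any request below that floor is raised to the committed count.
    """
    return max(desired, committed_min)
```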

Scaling and Autoscaling

Q: Why do I get "Scaling replicas (min and max) to zero is not supported. You can stop the inference instead."?

A: You cannot set both min and max replicas to 0 via scaling. Stop the endpoint when you want zero cost. For Hourly endpoints, stopping avoids charges.
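An autoscaling script can reject this case up front rather than hitting the API error. A minimal validation sketch (the min > max check is an added sanity check, not a quoted platform rule):

```python
def validate_scaling(min_replicas, max_replicas):
    """Reject a min/max replica pair that scaling cannot accept.

    Scaling both bounds to zero is not supported; stop the endpoint
    instead when you want zero cost.
    """
    if min_replicas == 0 and max_replicas == 0:
        raise ValueError("min and max cannot both be 0; stop the inference instead")
    if min_replicas > max_replicas:
        raise ValueError("min replicas cannot exceed max replicas")
    return min_replicas, max_replicas
```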

Q: Why do I get "Cannot downscale below the committed plan's worker count"?

A: Your committed replicas set a floor. You cannot scale below that. Keep replicas at or above the committed count, or change your committed plan.

Q: How long does scaling down take (e.g. from 4 replicas to 1)?

A: Scaling down typically takes a few seconds to a few minutes, depending on:

  • Model size — Larger models may take longer to drain.
  • Active request load — Replicas finish in-flight requests before terminating.
  • Graceful shutdown configuration — Proper readiness/liveness probes and drain settings affect how quickly replicas are removed.

Ongoing requests are handled gracefully before termination when configured correctly.
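The drain behaviour above can be sketched as a wait loop. `in_flight` is a stand-in for however your serving framework reports active requests; real frameworks handle this internally via their shutdown hooks:

```python
import time

def drain(in_flight, timeout_s=30.0, poll_s=0.1):
    """Wait for in-flight requests to finish before terminating a replica.

    `in_flight` is a callable returning the current number of active
    requests (a stand-in for the serving framework's own drain logic).
    Returns True if the replica drained within the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if in_flight() == 0:
            return True
        time.sleep(poll_s)
    return False  # timed out - remaining requests would be cut off
```

The timeout plays the same role as a Kubernetes termination grace period: long enough for typical requests to finish, short enough that scale-down is not blocked indefinitely.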

Async Invocation

Q: Why does my async request never complete, or why don't I see a result file?

A: Check the following:

  • Confirm async is enabled for the endpoint and the target (dataset) is set and writable.
  • Ensure the route you called is in the async routes list and accepts POST (and OPTIONS).
  • Use the async status API with the returned request_id to check progress or errors.
  • For dataset destination: the result is written to api/request/<request_id>.json relative to the dataset root. Verify your app has read access to that dataset.
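If the dataset is mounted or synced locally, polling for the result file looks like this (the `api/request/<request_id>.json` path is from the docs above; the timeout values are illustrative):

```python
import json
import time
from pathlib import Path

def read_async_result(dataset_root, request_id, timeout_s=600, poll_s=5):
    """Poll the dataset for an async result file.

    The result is written to api/request/<request_id>.json relative to
    the dataset root once the request completes.
    """
    result_path = Path(dataset_root) / "api" / "request" / f"{request_id}.json"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if result_path.is_file():
            return json.loads(result_path.read_text())
        time.sleep(poll_s)
    raise TimeoutError(f"no result at {result_path} - check the async status API")
```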

Q: How do I set up async inference (AsyncQ)?

A: For long-running requests, async inference is recommended. Setup includes:

  1. Enable async mode during endpoint creation (or via update_async_configuration).
  2. Configure the target — Set the dataset (EOS) where results will be stored.
  3. Configure routes — List the API routes that should use async (e.g. QUEUE_1/v1/chat/completions/).
  4. Callback or polling — Use the returned request_id to poll the async status API or read the result file from the dataset at api/request/<request_id>.json.
  5. Queue limits and retry — Define queue limits and retry policies in your application logic.

Monitor job status using the request_id returned from each async invocation.
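The submit-then-poll flow in steps 1–5 can be sketched as follows. `submit` and `get_status` are stand-ins for the endpoint's async invoke and status APIs, and the status strings are assumptions, not the platform's documented values:

```python
import time

def await_async_result(submit, get_status, payload, timeout_s=900, poll_s=5):
    """Submit an async request and poll its status until it completes.

    `submit(payload)` is assumed to return a request_id, and
    `get_status(request_id)` a dict like {"status": ..., "result": ...}.
    """
    request_id = submit(payload)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_status(request_id)
        if state.get("status") == "completed":
            return state.get("result")
        if state.get("status") == "failed":
            raise RuntimeError(f"async request {request_id} failed: {state}")
        time.sleep(poll_s)
    raise TimeoutError(f"async request {request_id} did not finish in time")
```

Retry and queue-limit policy (step 5) would wrap this function in your application code.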

Q: Is WebSocket access supported?

A: Yes, WebSocket is supported for models that require real-time bidirectional communication — for example, nemotron-speech-streaming-en-0.6b for live speech-to-text streaming.

Not all models support WebSocket — it depends on whether the model's serving framework exposes a WebSocket route. For setup details and code examples, see WebSocket API for Model Endpoints.