GPU deployment


  • yatai-deployment

Because GPU support is implemented through the BentoDeployment CRD, it relies on the yatai-deployment component.

GPU Deployment with Kubernetes

Yatai allows you to deploy bentos on Nvidia GPUs on demand. Make sure Nvidia GPUs are available in the cluster first: consult your cluster provider for details, or install the Nvidia device plugin if you manage the cluster yourself. Once the “nvidia.com/gpu” resource is available in your cluster, Yatai is ready to serve GPU-based bentos.
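To confirm that the cluster actually exposes the `nvidia.com/gpu` resource, you can inspect each node's allocatable resources with kubectl (a quick check sketch; nodes without the device plugin will show an empty GPU column):

```shell
# List each node's allocatable Nvidia GPU count.
# The dot in "nvidia.com" must be escaped inside the custom-columns expression.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```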

Through the Web UI

Steps to deploy a GPU-enabled bento through Yatai:

1. Select the “Deployments” tab of your Yatai Web UI and click the “Create” button to create a new deployment.
2. Select the target bento.
3. Scroll down to “Runners”, select the runner you want to accelerate with a GPU, and add a custom resource request with key “nvidia.com/gpu” and value 1 to request one GPU for each replica of this runner.

Note: Typically you don’t need to allocate GPUs to the bento service itself, since it cannot be accelerated by GPUs. Instead, allocate GPUs to the runners that perform the actual inference.

Through the CLI

Apply the following YAML to create a BentoDeployment CR (the apiVersion may vary with your yatai-deployment version):

  apiVersion: serving.yatai.ai/v2alpha1
  kind: BentoDeployment
  metadata:
    name: my-bento-deployment
    namespace: my-namespace
  spec:
    bento: iris-1
    ingress:
      enabled: true
    envs:
    - name: foo
      value: bar
    resources:
      limits:
        cpu: 2000m
        memory: "1Gi"
      requests:
        cpu: 1000m
        memory: "500Mi"
    autoscaling:
      maxReplicas: 5
      minReplicas: 1
    runners:
    - name: runner1
      resources:
        limits:
          cpu: 2000m
          memory: "4Gi"
          nvidia.com/gpu: 1
        requests:
          cpu: 1000m
          memory: "2Gi"
      autoscaling:
        maxReplicas: 2
        minReplicas: 1
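The CR can then be applied and verified with kubectl (a sketch; the file name is illustrative, and the namespace matches the example above):

```shell
# Apply the BentoDeployment CR.
kubectl apply -f my-bento-deployment.yaml

# Confirm the CR was created and that runner pods are being scheduled.
kubectl -n my-namespace get bentodeployments
kubectl -n my-namespace get pods
```

Runner pods requesting `nvidia.com/gpu` will stay Pending until the scheduler finds a node with a free GPU.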

Fractional-GPU resource allocation

Sometimes you may want to allocate a fraction of a GPU to a runner: for example, given a GPU with 8GB of memory, you might allocate 4GB to one runner and 4GB to another. Yatai is designed with this in mind; however, the cluster must first be configured to support fractional allocation.

For managed Kubernetes solutions, ask your cluster provider whether a GPU-sharing solution is available; AWS EKS, for example, has documented approaches for GPU sharing.

For a self-managed Kubernetes cluster, you can install an open-source GPU-sharing solution.

Once set up, you can allocate a fraction of a GPU to a runner by replacing the “nvidia.com/gpu” key in the resource request with the resource name provided by the GPU-sharing solution.
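As a sketch, assuming the GPU-sharing solution registers a memory-sliced resource (the name `aliyun.com/gpu-mem` below is only an illustration; use whatever resource name your solution actually exposes), the runner's resources block would look like:

```yaml
runners:
- name: runner1
  resources:
    limits:
      # Hypothetical fractional-GPU resource name; substitute the
      # resource registered by your GPU-sharing solution.
      aliyun.com/gpu-mem: 4   # e.g. 4GB of GPU memory instead of a whole GPU
```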