Kubernetes v1.36: Resizing Pod Resources on Suspended Jobs (Beta Guide)

From Farkesli, the free encyclopedia of technology

Overview

Starting with Kubernetes v1.36, the ability to modify container resource requests and limits in the pod template of a suspended Job has been promoted to beta. First introduced as alpha in v1.35, this feature allows queue controllers and cluster administrators to dynamically adjust CPU, memory, and extended resource specifications (such as GPUs) on a Job while it is suspended, before it starts or resumes running. This capability addresses a long-standing pain point for batch and machine learning workloads, where optimal resource allocation often depends on real-time cluster capacity, queue priorities, and hardware availability.

Before this feature, resource requirements in a Job's pod template were immutable once set. If a queue controller like Kueue determined that a suspended Job should run with different resources, the only option was to delete and recreate the Job, losing metadata, status, and history. With this new beta feature, you can now adjust resource allocations without destroying the Job, enabling more resilient and efficient scheduling—for example, allowing a specific CronJob instance to progress slowly with reduced resources rather than failing outright under heavy cluster load.

Prerequisites

To take advantage of this feature, ensure your environment meets the following:

  • Kubernetes cluster running version v1.36 or later (the feature is beta, so it is enabled by default); a quick version check is shown after this list.
  • kubectl configured to communicate with the cluster (version v1.36+ recommended for full compatibility).
  • A suspended Job (spec.suspend: true) that you intend to adjust before resuming.
  • For queue controllers: update your controller to support mutable pod resources on suspended Jobs (e.g., Kueue v0.10+).
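
To confirm that the control plane is running v1.36 or later, you can query the server version directly. A minimal check, assuming jq is installed:

kubectl version --output=json | jq -r '.serverVersion.gitVersion'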

Step-by-Step Guide

1. Create a Suspended Job

Start by defining a Job with spec.suspend: true and initial resource requests that may need adjustment later. Below is an example of a machine learning training Job requesting 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

Apply the Job:

kubectl apply -f training-job.yaml
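
Before making any changes, it is worth confirming that the Job really was created in the suspended state, since the mutability relaxation applies only while spec.suspend is true. One way to check is a JSONPath query:

kubectl get job training-job-example-abcd123 -o jsonpath='{.spec.suspend}'

This should print true.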

2. Modify Resources While Suspended

Suppose your queue controller determines that only 2 GPUs are currently available. You can update the container resource requests and limits directly on the suspended Job, using kubectl patch or by editing the resource YAML. The example below uses a strategic merge patch, which merges entries in the containers list by their name key; a plain JSON merge patch (--type='merge') would replace the entire containers array and be rejected because required fields such as image would be missing. To adjust CPU to 4, memory to 16Gi, and the GPU count to 2:

kubectl patch job training-job-example-abcd123 --type='strategic' -p='{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "trainer",
          "resources": {
            "requests": {
              "cpu": "4",
              "memory": "16Gi",
              "example-hardware-vendor.com/gpu": "2"
            },
            "limits": {
              "cpu": "4",
              "memory": "16Gi",
              "example-hardware-vendor.com/gpu": "2"
            }
          }
        }]
      }
    }
  }
}'
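
If you prefer to target individual fields rather than supplying the whole container entry, a JSON patch achieves the same result. A sketch of the equivalent update; note that the / in the extended resource name must be escaped as ~1 in JSON Pointer paths:

kubectl patch job training-job-example-abcd123 --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/example-hardware-vendor.com~1gpu", "value": "2"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/example-hardware-vendor.com~1gpu", "value": "2"}
]'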

The updated YAML would appear as:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
      restartPolicy: Never

3. Resume the Job

Once the resources are updated to match available capacity, resume the Job by setting spec.suspend to false:

kubectl patch job training-job-example-abcd123 --type='merge' -p='{"spec":{"suspend":false}}'

The Job controller will then create pods using the adjusted resource specifications. Note that resource fields in the pod template are mutable only while the Job is suspended; once the Job resumes, they become immutable again until it is suspended once more.
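
If you later need another round of adjustments, the same mechanism works in reverse: suspending the Job again makes the resource fields mutable once more. Keep in mind that suspending a running Job terminates its active pods:

kubectl patch job training-job-example-abcd123 --type='merge' -p='{"spec":{"suspend":true}}'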

4. Verify the Outcome

Check the Job's pods to confirm the new resource requests:

kubectl get pods -l job-name=training-job-example-abcd123 -o yaml | grep -A 10 resources

You should see the updated CPU, memory, and GPU values.
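
For a more targeted view than grep, a JSONPath query can print just the pod names alongside their requests. A sketch, assuming a single container per pod:

kubectl get pods -l job-name=training-job-example-abcd123 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests}{"\n"}{end}'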

Common Mistakes

  • Editing after resuming the Job: Once a suspended Job is resumed (spec.suspend becomes false), the pod template resource fields become immutable again. Any attempt to modify them will result in an API error. Always update resources before setting suspend to false.
  • Forgetting to suspend first: The feature only applies to Jobs where spec.suspend is true. If you try to modify resource requests on a non-suspended Job, the API server will reject the change.
  • Changing non-resource fields: Mutability relaxations are limited to container resource requests/limits and extended resources. Other pod template fields (e.g., image, command, environment variables) remain immutable. Attempts to modify them will fail.
  • Using a controller that doesn't support the feature: If you are using an external queue controller like Kueue, ensure it has been updated to handle mutable pod resources. Older versions may not recognize the ability to adjust resources, leading to errors or incorrect behavior.
  • Forgetting to keep limits consistent with requests: For extended resources such as GPUs, requests and limits must be equal, so reducing a request without reducing the matching limit will be rejected by the API server (see the sketch after this list). For CPU and memory the values may differ, but mismatched settings can still produce surprising scheduling or throttling behavior.
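
As a concrete illustration of the last point, the following patch (applied to the original 4-GPU spec) lowers the GPU request to 2 while leaving the limit at 4. Because extended resources cannot be overcommitted, the API server rejects the update; the exact error wording varies by version:

kubectl patch job training-job-example-abcd123 --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/example-hardware-vendor.com~1gpu", "value": "2"}
]'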

Summary

The beta promotion of mutable pod resources for suspended Jobs in Kubernetes v1.36 brings significant flexibility to batch and ML workloads. By allowing dynamic adjustments to CPU, memory, and extended resources (such as GPUs) before a Job resumes, administrators and queue controllers can optimize resource utilization without the overhead of deleting and recreating Jobs. This feature seamlessly integrates with the existing Job API—no new objects or CRDs required—and works with any controller that respects the spec.suspend field. As the feature matures, expect to see broader adoption in batch scheduling frameworks and improved efficiency in shared cluster environments.