Resource Sizing Guidance
Purpose: This page provides prescriptive starting guidance for sizing Ververica Platform Deployments. All numeric values below are starting points only and must be validated with workload-specific benchmark or load test results. Stateful job sizing depends on event rate, key cardinality, operator mix, state growth, checkpoint behavior, sink latency, and recovery objectives. Use this guidance together with the Load testing item in the Production Readiness Checklist.
This guide is intentionally written in Ververica Platform terms: Deployments, Deployment Templates, parallelism, taskmanager.numberOfTaskSlots, and Deployment resources for jobmanager and taskmanager.
Version scope: the sizing principles and workload signals apply to VVP 2.15.x and VVP 3.0–3.1.x; the application mechanics vary by version, especially default delivery, Kubernetes Operator CR shape, and Autopilot behavior.
Deployment sizing uses spec.template.spec.resources.jobmanager and spec.template.spec.resources.taskmanager; slots are set in spec.template.spec.flinkConfiguration as taskmanager.numberOfTaskSlots.
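As a rough orientation, the field paths named above can be pictured as a nested structure. The sketch below is illustrative only: every numeric value is a placeholder rather than a recommendation, the cpu and memory key names and the position of parallelism under spec.template.spec are assumptions to verify against the deployed VVP version, and no Kubernetes Operator CR wrapper is shown.

```python
# Illustrative fragment of a Deployment body; all values are placeholders.
deployment_fragment = {
    "spec": {
        "template": {
            "spec": {
                # Assumed location of Deployment-level parallelism.
                "parallelism": 4,
                "resources": {
                    # The cpu/memory key names are assumptions; verify them.
                    "jobmanager": {"cpu": 1, "memory": "1G"},
                    "taskmanager": {"cpu": 2, "memory": "4G"},
                },
                "flinkConfiguration": {
                    # Flink configuration keys contain dots, so they are
                    # single map keys rather than nested paths.
                    "taskmanager.numberOfTaskSlots": "1",
                },
            }
        }
    }
}


def get_path(document, keys):
    """Walk a nested dict along a list of keys and return the value found."""
    node = document
    for key in keys:
        node = node[key]
    return node


slots = get_path(
    deployment_fragment,
    ["spec", "template", "spec", "flinkConfiguration",
     "taskmanager.numberOfTaskSlots"],
)
taskmanager_resources = get_path(
    deployment_fragment,
    ["spec", "template", "spec", "resources", "taskmanager"],
)
```

Passing the path as a list of keys rather than a dotted string avoids ambiguity with Flink configuration keys that themselves contain dots.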
Do not hard-code a full Kubernetes Operator CR wrapper path unless it has been verified for the deployed version: VVP2 Operator documentation uses a spec.deployment wrapper with userMetadata, while VVP3 Operator documentation maps the VVP deployment body under spec.deployment with metadata; earlier or beta CRDs may differ.
The exact Deployment-spec field for JobManager replica count and the exact namespace-level limit-factor field name could not be verified from the available official VVP version documentation, so keep those names version-checked in the target environment before using them as YAML/API fields.
Do not treat these values as universal defaults. Start here, run a realistic load test, then adjust from observed signals such as OOM kills, spill, checkpoint duration, backpressure, GC, failover time, and CPU throttling.
Deployment Templates
Use Deployment Templates to standardize sizing by namespace or workload class. The goal is not to force one universal shape, but to make the default shape explicit and repeatable.
- Put the most common starting profile into the template: JobManager availability setting, spec.template.spec.resources.jobmanager, spec.template.spec.resources.taskmanager, parallelism, and taskmanager.numberOfTaskSlots under spec.template.spec.flinkConfiguration.
- For VVP 2.15.x, standardize namespace defaults through the namespace deployment-defaults API where it is available, complemented by Helm global defaults for platform-wide baselines.
- For VVP 3.0–3.1.x, do not rely on the namespace deployment-defaults API as a delivery mechanism. Use Helm global defaults plus versioned Kubernetes Operator CR templates managed through the normal GitOps workflow.
- Keep defaults conservative enough to start safely, but not so large that every new Deployment begins overprovisioned.
- Standardize by namespace when namespaces map to similar workload types, SLOs, or quota envelopes.
- Keep exceptions explicit. A Deployment that needs materially higher memory, unusual slot density, or tighter CPU limits should document the measured signal that justified the deviation.
Recommended operating model: define 2–3 approved sizing profiles per namespace, then require a lightweight exception review for anything outside those profiles. In VVP2 this usually means namespace defaults plus Helm globals; in VVP3 this usually means Helm globals plus reviewed CR templates in Git.
Signals that template defaults need revision: repeated per-Deployment overrides, recurring scheduling failures, namespace quota saturation, consistent throttling under peak load, or measured evidence that most jobs in the namespace need the same adjustment.
JobManager
Size the JobManager through spec.template.spec.resources.jobmanager. Set JobManager availability through the JobManager replica setting exposed by the deployed VVP version; the exact YAML/API field name for that replica count was not verified in the available official version documentation and should not be guessed.
Replicas and HA
For production Deployments using the Ververica Platform HA service, start with 2 JobManager replicas for HA. This is the normal starting point, not a universal rule.
- Confirm with: leader failover time, restart duration, and whether recovery meets the expected RTO.
- Increase only if measured need exists: unusually large metadata coordination burden, operational pattern requiring more resilience testing, or version/platform-specific guidance for the environment.
CPU and memory starting points
JobManager demand is driven less by raw record throughput and more by job graph complexity, checkpoint coordination, metadata volume, and failover or restore operations.
Adjustment signals: increase memory if the JobManager shows heap pressure, long GC pauses, or unstable restore behavior. Increase CPU if failover, submission, or checkpoint coordination slows under load. Reduce only after steady-state and failure scenarios show stable headroom.
TaskManager
TaskManager sizing is the main capacity decision for most stateful jobs. In Ververica Platform, size it through spec.template.spec.resources.taskmanager.
Memory model as exposed by VVP
Review TaskManager memory as a split across the Flink memory areas surfaced through VVP. For state backends, read RocksDB guidance as applicable to VVP2 and VVP3; Gemini applies only where available in VVP3.
- Heap for framework and task heap, JVM objects, and user-code allocations.
- Managed memory for state backends and memory-managed operators, including RocksDB or Gemini working areas.
- Network memory for shuffle and network buffers.
- Off-heap / JVM overhead for native memory, metaspace, and process overhead outside ordinary Java heap.
For stateful jobs, container memory alone is not a safe sizing proxy. A larger state footprint can increase backend working set, checkpoint I/O pressure, restore time, and native-memory demand even when Java heap appears healthy.
For the managed-memory share, set taskmanager.memory.managed.fraction in the Deployment spec.template.spec.flinkConfiguration. Increase this fraction for large-state RocksDB/Gemini jobs before increasing slot density, especially when backend memory pressure, spill growth, or compaction stalls appear while Java heap still has headroom.
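To make the effect of the fraction concrete, the arithmetic can be sketched as follows. In Flink, when no explicit managed memory size is configured, managed memory is derived as taskmanager.memory.managed.fraction multiplied by total Flink memory (which is smaller than the container memory), and it is shared evenly across the slots of a TaskManager. The function names and the input numbers are hypothetical; they are planning aids, not VVP APIs or recommended values.

```python
def managed_memory_mib(total_flink_memory_mib, managed_fraction):
    """Estimate managed memory per TaskManager from the configured fraction."""
    if not 0.0 <= managed_fraction <= 1.0:
        raise ValueError("managed_fraction must be between 0 and 1")
    if total_flink_memory_mib <= 0:
        raise ValueError("total_flink_memory_mib must be positive")
    return total_flink_memory_mib * managed_fraction


def managed_memory_per_slot_mib(total_flink_memory_mib, managed_fraction, slots):
    """Estimate the managed memory share available to each slot."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return managed_memory_mib(total_flink_memory_mib, managed_fraction) / slots


# Hypothetical inputs: 4096 MiB total Flink memory, two candidate fractions.
per_slot_at_040 = managed_memory_per_slot_mib(4096, 0.4, slots=2)
per_slot_at_050 = managed_memory_per_slot_mib(4096, 0.5, slots=2)
one_slot_at_040 = managed_memory_per_slot_mib(4096, 0.4, slots=1)
```

The comparison shows why the fraction and slot density interact: halving the slot count doubles the managed memory available to each slot without changing the fraction.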
CPU and memory starting points
Start with enough TaskManager headroom to survive realistic peaks, checkpointing, and restore operations.
Adjustment logic: add memory when you see OOM kills, backend pressure, spill, or checkpoint instability. Add CPU when busy time is high, backpressure persists, or recovery becomes too slow. Scale out with more TaskManagers when a single TaskManager becomes too dense or when larger parallelism is needed for throughput and recovery.
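The adjustment logic above can be written down as a small lookup so that load test reviews apply it consistently. This is a minimal sketch: the signal and action names are hypothetical labels invented for illustration, not metric names exposed by VVP or Flink, and the mapping encodes only the rules stated in this section.

```python
# Hypothetical signal labels mapped to the adjustments described above.
ADJUSTMENTS = {
    "add_memory": {
        "oom_kills", "backend_pressure", "spill", "checkpoint_instability",
    },
    "add_cpu": {
        "high_busy_time", "persistent_backpressure", "slow_recovery",
    },
    "scale_out": {
        "taskmanager_too_dense", "needs_more_parallelism",
    },
}


def suggest_adjustments(observed_signals):
    """Return the sorted list of adjustments triggered by the observed signals."""
    observed = set(observed_signals)
    return sorted(
        action
        for action, triggers in ADJUSTMENTS.items()
        if triggers & observed
    )


# Example: a load test showed spill growth and slow recovery.
suggestions = suggest_adjustments(["spill", "slow_recovery"])
```

The output is a list of candidate adjustments to investigate, not an automatic decision; each change should still be confirmed by a further load test.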
Parallelism and TaskManagers
In VVP, effective execution capacity is shaped by three related inputs:
- parallelism at the Deployment level
- taskmanager.numberOfTaskSlots in spec.template.spec.flinkConfiguration
- the resulting number of TaskManagers needed to host that work
A useful planning approximation is:
TaskManagers needed ≈ parallelism ÷ slots per TaskManager, rounded up.
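The approximation can be sketched as a short helper. The function name is hypothetical, and the calculation is a planning estimate only: it assumes the job needs one slot per unit of parallelism, which holds for the common case where operators share slots.

```python
def taskmanagers_needed(parallelism, slots_per_taskmanager):
    """Estimate TaskManager count as parallelism / slots, rounded up."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    if slots_per_taskmanager < 1:
        raise ValueError("slots_per_taskmanager must be at least 1")
    # Integer ceiling division avoids floating-point rounding.
    return -(-parallelism // slots_per_taskmanager)


# Hypothetical planning inputs.
one_slot_layout = taskmanagers_needed(parallelism=12, slots_per_taskmanager=1)
two_slot_layout = taskmanagers_needed(parallelism=12, slots_per_taskmanager=2)
uneven_layout = taskmanagers_needed(parallelism=13, slots_per_taskmanager=2)
```

When parallelism is not a multiple of the slot count, as in the last case, the final TaskManager runs with an unused slot, which is worth accounting for in quota planning.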
Prefer simple slot layouts unless a measured reason justifies denser packing.
- Start with 1 slot per TaskManager for many stateful or operationally sensitive jobs.
- Use 2 slots per TaskManager when the workload is well understood and benchmarked to behave well with denser packing.
- Avoid high slot density by default. It can make noisy-neighbor effects, checkpoint contention, memory debugging, and recovery behavior harder to reason about.
Practical rule: if you are still learning the workload, simplify first: fewer slots per TaskManager, clearer memory ownership, easier failure isolation. Increase slot density only when benchmarks show that it improves efficiency without hurting checkpoints, latency, or recovery.
Adjustment signals: sustained backpressure on a subset of subtasks, long checkpoint duration, uneven CPU usage, hotspot TaskManagers, or slower restore after failures. If scaling parallelism helps throughput but makes each TaskManager unstable, reduce slot density and spread the job across more TaskManagers.
VVP3 Autopilot note: when Autopilot 2.0 is active, it can adjust parallelism and memory, so the values in this page should be treated as the initial starting point and guardrail baseline. In VVP2, expect more of this tuning to remain manual.
Requests vs limits
Use the Ververica Platform resource model deliberately. VVP derives Kubernetes requests and limits from Deployment resources and configured limit factors. The exact field name and scope for limit factors in the deployed VVP version could not be verified from the available official version documentation, so do not copy an unverified YAML/API path into templates. The Deployment resource value is not just documentation; it directly affects scheduling reservation, burst behavior, and throttling risk.
If pod templates are used for Kubernetes-level customization, note the version-specific shape: VVP3 exposes separate JobManager and TaskManager pod templates (kubernetes.jobManagerPodTemplate / kubernetes.taskManagerPodTemplate), while VVP2 uses the shared kubernetes.pods.* block.
- Requests influence what the scheduler reserves.
- Limits constrain burst capacity and can trigger CPU throttling or memory OOM kills.
- Limit factors should be explicit at namespace default level so teams understand how much burst they really have.
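To reason about how much burst a limit factor actually grants, the relationship can be sketched as below. This assumes the limit is derived as the Deployment resource value multiplied by the limit factor; the function and parameter names are hypothetical and do not correspond to verified VVP field names. The check that the factor is at least 1 reflects the Kubernetes rule that a limit cannot be lower than its request.

```python
def derive_request_and_limit(resource_value, limit_factor):
    """Return (request, limit), assuming limit = request * limit_factor."""
    if resource_value <= 0:
        raise ValueError("resource_value must be positive")
    if limit_factor < 1.0:
        # Kubernetes rejects a limit that is below the request.
        raise ValueError("limit_factor must be at least 1.0")
    request = float(resource_value)
    return request, request * limit_factor


def burst_headroom(resource_value, limit_factor):
    """Return the capacity available above the reserved request."""
    request, limit = derive_request_and_limit(resource_value, limit_factor)
    return limit - request


# Hypothetical values: 2 CPU reserved, with and without burst.
no_burst = derive_request_and_limit(2, 1.0)
with_burst = derive_request_and_limit(2, 1.5)
headroom = burst_headroom(2, 1.5)
```

A factor of 1.0 leaves no burst headroom at all, which is the configuration most likely to surface CPU throttling on latency-sensitive jobs.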
Be careful with tight CPU limits on latency-sensitive streaming jobs. CPU throttling often looks like application backpressure even when the job logic is otherwise healthy.
Adjustment signals: increase CPU request when the steady state is consistently above the reserved level; increase CPU limit or limit factor when throttling appears during load or recovery; increase memory when pods are OOMKilled or when native/backend pressure appears before Java heap looks full.
Reference workload profiles
These profiles are reference starting ranges for VVP Deployments. They are not production guarantees. Validate every profile by benchmark and load test before standardizing it in a Deployment Template.
Validation reminder: every profile must be confirmed with benchmark or load test results against realistic event rate, key cardinality, state growth, sink behavior, and recovery scenarios. If a workload is stateful, do not sign off sizing without explicitly checking the Load testing item in the Production Readiness Checklist.
Suggested exception workflow for non-standard sizing
Require the Deployment owner to record: the chosen profile, the exact deviation from template defaults, the benchmark evidence supporting the change, the expected improvement, and the review point after production observation. Good exception candidates include higher TaskManager memory for state growth, lower slot density for operational isolation, or higher CPU limit factors to avoid throttling during checkpointing and recovery.
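One way to keep exception records complete is to capture them in a structured form that a review step can check. The sketch below is a hypothetical record type, not a VVP feature; the field names simply mirror the five items listed above, and the example values are invented.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SizingException:
    """Record of a deviation from the approved sizing profiles."""
    chosen_profile: str
    deviation: str
    benchmark_evidence: str
    expected_improvement: str
    review_point: str

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty or blank."""
        return [
            f.name for f in fields(self)
            if not getattr(self, f.name).strip()
        ]


# Hypothetical example: evidence has not been attached yet.
record = SizingException(
    chosen_profile="stateful-medium",
    deviation="TaskManager memory raised above template default",
    benchmark_evidence="",
    expected_improvement="No OOM kills during checkpointing at peak load",
    review_point="Two weeks after production rollout",
)
incomplete = record.missing_fields()
```

A review gate could reject any record for which missing_fields() is non-empty, which keeps the benchmark evidence requirement from being skipped.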