Resource Sizing Guidance

Applies to: Self-Managed v2

Info

Purpose: This page provides prescriptive starting guidance for sizing Ververica Platform Deployments. All numeric values below are starting points only and must be validated with workload-specific benchmark or load test results. Stateful job sizing depends on event rate, key cardinality, operator mix, state growth, checkpoint behavior, sink latency, and recovery objectives. Use this guidance together with the Load testing item in the Production Readiness Checklist.

This guide is intentionally written in Ververica Platform terms: Deployments, Deployment Templates, parallelism, taskmanager.numberOfTaskSlots, and Deployment resources for jobmanager and taskmanager.

Version scope: the sizing principles and workload signals apply to VVP 2.15.x and VVP 3.0–3.1.x; the application mechanics vary by version, especially default delivery, Kubernetes Operator CR shape, and Autopilot behavior.

Deployment sizing uses spec.template.spec.resources.jobmanager and spec.template.spec.resources.taskmanager; slots are set in spec.template.spec.flinkConfiguration as taskmanager.numberOfTaskSlots.

Do not hard-code a full Kubernetes Operator CR wrapper path unless it has been verified for the deployed version. VVP2 Operator documentation uses a spec.deployment wrapper with userMetadata, while VVP3 Operator documentation maps the VVP deployment body under spec.deployment with metadata. Earlier or beta CRDs may differ.

The exact Deployment-spec field for the JobManager replica count and the exact name of the namespace-level limit-factor field could not be verified from the available official VVP version documentation. Check both names against the version running in the target environment before using them as YAML/API fields.

Do not treat these values as universal defaults. Start here, run a realistic load test, then adjust from observed signals such as OOM kills, spill, checkpoint duration, backpressure, GC, failover time, and CPU throttling.

Deployment Templates

Use Deployment Templates to standardize sizing by namespace or workload class. The goal is not to force one universal shape, but to make the default shape explicit and repeatable.

Put the most common starting profile into the template: JobManager availability setting, spec.template.spec.resources.jobmanager, spec.template.spec.resources.taskmanager, parallelism, and taskmanager.numberOfTaskSlots under spec.template.spec.flinkConfiguration.
For VVP 2.15.x, standardize namespace defaults through the namespace deployment-defaults API where it is available, complemented by Helm global defaults for platform-wide baselines.
For VVP 3.0–3.1.x, do not rely on the namespace deployment-defaults API as a delivery mechanism. Use Helm global defaults plus versioned Kubernetes Operator CR templates managed through the normal GitOps workflow.
Keep defaults conservative enough to start safely, but not so large that every new Deployment begins overprovisioned.
Standardize by namespace when namespaces map to similar workload types, SLOs, or quota envelopes.
Keep exceptions explicit. A Deployment that needs materially higher memory, unusual slot density, or tighter CPU limits should document the measured signal that justified the deviation.
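
A starting profile of the kind described above can be kept as a small, reviewable data structure. The following is a minimal sketch in Python, not a verified Deployment Template. The resources and flinkConfiguration paths come from this page; the cpu and memory keys, the placement of parallelism next to resources, and the helper name starting_profile are assumptions to check against the deployed VVP version. The JobManager replica field is left out on purpose because its name is not verified.

```python
import json


def starting_profile(jm_cpu, jm_memory, tm_cpu, tm_memory, parallelism, slots):
    """Build a dict shaped like the spec.template.spec part of a Deployment.

    Key names below 'resources' and the position of 'parallelism' are
    assumptions for illustration; verify them for the target VVP version.
    """
    if parallelism < 1 or slots < 1:
        raise ValueError("parallelism and slots must be positive integers")
    return {
        "spec": {
            "template": {
                "spec": {
                    "parallelism": parallelism,  # assumed placement
                    "resources": {
                        "jobmanager": {"cpu": jm_cpu, "memory": jm_memory},
                        "taskmanager": {"cpu": tm_cpu, "memory": tm_memory},
                    },
                    "flinkConfiguration": {
                        # Flink configuration values are passed as strings here.
                        "taskmanager.numberOfTaskSlots": str(slots),
                    },
                }
            }
        }
    }


# Example values taken from the low end of the ranges on this page.
small = starting_profile(1, "2G", 1, "4G", parallelism=2, slots=1)
print(json.dumps(small, indent=2))
```

Keeping the profile in one function makes deviations visible in review: an exception shows up as a changed argument rather than a hand-edited template.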

Info

Recommended operating model: define 2–3 approved sizing profiles per namespace, then require a lightweight exception review for anything outside those profiles. In VVP2 this usually means namespace defaults plus Helm globals; in VVP3 this usually means Helm globals plus reviewed CR templates in Git.

Signals that template defaults need revision: repeated per-Deployment overrides, recurring scheduling failures, namespace quota saturation, consistent throttling under peak load, or measured evidence that most jobs in the namespace need the same adjustment.

JobManager

Size the JobManager through spec.template.spec.resources.jobmanager. Set JobManager availability through the JobManager replica setting exposed by the deployed VVP version. The exact YAML/API field name for that replica count was not verified in the available official version documentation, so look it up for the target version rather than guessing it.

Replicas and HA

For production Deployments using the Ververica Platform HA service, start with 2 JobManager replicas for HA. This is the normal starting point, not a universal rule.

Confirm with: leader failover time, restart duration, and whether recovery meets the expected RTO.
Increase only when there is a measured need: an unusually large metadata coordination burden, an operational pattern that requires more resilience testing, or version- or platform-specific guidance for the environment.

CPU and memory starting points

JobManager demand is driven less by raw record throughput and more by job graph complexity, checkpoint coordination, metadata volume, and failover or restore operations.

Job shape	Starting point	Why	Signals to confirm or adjust
Simple graph, modest state, limited operator count	Start around 0.5–1 CPU and 1–2G memory	Usually enough for coordination without large heap pressure	Heap usage, GC pauses, slow submission, checkpoint coordination latency
Moderate graph complexity or larger checkpoint metadata	Start around 1 CPU and 2–4G memory	More room for coordination and recovery operations	Heap growth during checkpoints, failover duration, restore stability
Large graph, high operator count, frequent restore or failover exercises	Start around 1–2 CPU and 4G+ memory	Protects coordination path during metadata-heavy operations	GC, restore time, checkpoint coordinator pressure, restart loops

Adjustment signals: increase memory if the JobManager shows heap pressure, long GC pauses, or unstable restore behavior. Increase CPU if failover, submission, or checkpoint coordination slows under load. Reduce only after steady-state and failure scenarios show stable headroom.

TaskManager

TaskManager sizing is the main capacity decision for most stateful jobs. In Ververica Platform, size it through spec.template.spec.resources.taskmanager.

Memory model as exposed by VVP

Review TaskManager memory as a split across the Flink memory areas surfaced through VVP. For state backends, read RocksDB guidance as applicable to VVP2 and VVP3; Gemini applies only where available in VVP3.

- Heap: framework and task heap, JVM objects, and user-code allocations.
- Managed memory: state backends and memory-managed operators, including RocksDB or Gemini working areas.
- Network memory: shuffle and network buffers.
- Off-heap / JVM overhead: native memory, metaspace, and process overhead outside ordinary Java heap.

For stateful jobs, container memory alone is not a safe sizing proxy. A larger state footprint can increase backend working set, checkpoint I/O pressure, restore time, and native-memory demand even when Java heap appears healthy.

For the managed-memory share, set taskmanager.memory.managed.fraction in the Deployment spec.template.spec.flinkConfiguration. Increase this fraction for large-state RocksDB/Gemini jobs before increasing slot density, especially when backend memory pressure, spill growth, or compaction stalls appear while Java heap still has headroom.
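
To see what a change to taskmanager.memory.managed.fraction means in absolute terms, the managed-memory share can be estimated from the TaskManager memory value. The sketch below follows the open-source Flink memory model with its documented defaults (256 MiB metaspace, JVM overhead of 10% clamped to 192 MiB to 1 GiB, managed fraction 0.4). It assumes the TaskManager memory value is used as Flink's total process memory; whether VVP applies it exactly that way in a given version is an assumption to verify. The function name is hypothetical.

```python
def estimate_managed_memory_mib(
    process_mib,
    managed_fraction=0.4,
    metaspace_mib=256,
    overhead_fraction=0.1,
    overhead_min_mib=192,
    overhead_max_mib=1024,
):
    """Estimate managed memory (MiB) from total process memory (MiB)."""
    # JVM overhead is a fraction of process memory, clamped to [min, max].
    overhead = min(max(process_mib * overhead_fraction, overhead_min_mib),
                   overhead_max_mib)
    # Total Flink memory is what remains after metaspace and JVM overhead.
    flink_total = process_mib - metaspace_mib - overhead
    if flink_total <= 0:
        raise ValueError("process memory too small for metaspace and overhead")
    # Managed memory is a fraction of total Flink memory.
    return flink_total * managed_fraction


for fraction in (0.4, 0.6):
    managed = estimate_managed_memory_mib(8192, managed_fraction=fraction)
    print(f"fraction={fraction}: about {managed:.0f} MiB managed memory")
```

An estimate like this is only a planning aid. Confirm the effective values in the TaskManager startup logs or the Flink UI before drawing conclusions from them.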

TaskManager concern	What to size for	Common starting view	Signals to confirm or adjust
Heap pressure	User objects, serialization churn, operator logic	Leave headroom for spikes and checkpoint activity	JVM OOM, high old-gen usage, long GC pauses
Managed memory	RocksDB/Gemini and managed operators	Increase for state-heavy jobs before adding slot density	Backend memory pressure, compaction stalls, spill growth
Network memory	Shuffle and data exchange buffers	Validate under peak rate and checkpoint alignment	Network buffer exhaustion, backpressure, alignment delay
Off-heap / overhead	Native memory and process overhead	Do not consume all container memory with heap-centric assumptions	Container OOM kill without obvious Java heap saturation

CPU and memory starting points

Start with enough TaskManager headroom to survive realistic peaks, checkpointing, and restore operations.

Workload tendency	Starting point per TaskManager	What to watch
Small or light stateful streaming job	1 CPU, 2–4G memory	Backpressure, JVM OOM, CPU saturation, checkpoint duration
State-heavy job with RocksDB/Gemini and larger keyed state	1–2 CPU, 4–8G memory	Spill, compaction pressure, checkpoint growth, restore time, container OOM
High-throughput pipeline with heavier serialization or network demand	2–4 CPU, 4–8G memory	CPU busy time, throttling, backpressure, network buffer pressure, sink lag

Adjustment logic: add memory when you see OOM kills, backend pressure, spill, or checkpoint instability. Add CPU when busy time is high, backpressure persists, or recovery becomes too slow. Scale out with more TaskManagers when a single TaskManager becomes too dense or when larger parallelism is needed for throughput and recovery.
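
The adjustment logic above can be written down as a simple lookup from observed signals to candidate actions. This is an illustrative sketch only: the signal and action names are hypothetical labels for the conditions named in the paragraph, not metric names from VVP or Flink.

```python
# Candidate actions and the observed signals that suggest them,
# transcribed from the adjustment logic paragraph above.
ADJUSTMENTS = {
    "add_memory": {"oom_kill", "backend_pressure", "spill",
                   "checkpoint_instability"},
    "add_cpu": {"high_busy_time", "persistent_backpressure", "slow_recovery"},
    "scale_out_taskmanagers": {"taskmanager_too_dense",
                               "needs_more_parallelism"},
}


def suggest_adjustments(observed_signals):
    """Return candidate actions whose signal set overlaps the observations."""
    observed = set(observed_signals)
    unknown = observed - set().union(*ADJUSTMENTS.values())
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return [action for action, signals in ADJUSTMENTS.items()
            if signals & observed]


print(suggest_adjustments(["spill", "high_busy_time"]))
```

The output is a list of candidates to investigate, not a decision. Each change should still be confirmed by a load test.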

Parallelism and TaskManagers

In VVP, effective execution capacity is shaped by three related inputs:

- parallelism at the Deployment level
- taskmanager.numberOfTaskSlots in spec.template.spec.flinkConfiguration
- the resulting number of TaskManagers needed to host that work

A useful planning approximation is:

TaskManagers needed ≈ parallelism ÷ slots per TaskManager, rounded up.
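
As a worked example of this approximation, the following minimal Python sketch computes the rounded-up TaskManager count. The function name is hypothetical, and the calculation ignores anything beyond the simple division stated above.

```python
def taskmanagers_needed(parallelism: int, slots_per_taskmanager: int) -> int:
    """Planning approximation: parallelism / slots, rounded up."""
    if parallelism < 1 or slots_per_taskmanager < 1:
        raise ValueError("parallelism and slots must be positive integers")
    # Integer ceiling division avoids floating-point rounding.
    return (parallelism + slots_per_taskmanager - 1) // slots_per_taskmanager


print(taskmanagers_needed(16, 1))  # 16 TaskManagers with 1 slot each
print(taskmanagers_needed(16, 2))  # 8 TaskManagers with 2 slots each
print(taskmanagers_needed(9, 2))   # 5 TaskManagers, one slot left unused
```

The last case shows why rounding up matters: a parallelism of 9 with 2 slots per TaskManager needs 5 TaskManagers, and one slot stays idle.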

Prefer simple slot layouts unless a measured reason justifies denser packing.

Start with 1 slot per TaskManager for many stateful or operationally sensitive jobs.
Use 2 slots per TaskManager when the workload is well understood and benchmarked to behave well with denser packing.
Avoid high slot density by default. It can make noisy-neighbor effects, checkpoint contention, memory debugging, and recovery behavior harder to reason about.

Info

Practical rule: if you are still learning the workload, simplify first: fewer slots per TaskManager, clearer memory ownership, easier failure isolation. Increase slot density only when benchmarks show that it improves efficiency without hurting checkpoints, latency, or recovery.

Adjustment signals: sustained backpressure on a subset of subtasks, long checkpoint duration, uneven CPU usage, hotspot TaskManagers, or slower restore after failures. If scaling parallelism helps throughput but makes each TaskManager unstable, reduce slot density and spread the job across more TaskManagers.

VVP3 Autopilot note: when Autopilot 2.0 is active, it can adjust parallelism and memory, so treat the values on this page as the initial starting point and guardrail baseline. In VVP2, expect more of this tuning to remain manual.

Requests vs limits

Use the Ververica Platform resource model deliberately. VVP derives Kubernetes requests and limits from Deployment resources and configured limit factors, so the Deployment resource value is not just documentation: it directly affects scheduling reservation, burst behavior, and throttling risk. The exact field name and scope for limit factors in the deployed VVP version could not be verified from the available official version documentation, so do not copy an unverified YAML/API path into templates. If pod templates are used for Kubernetes-level customization, note the version-specific shape: VVP3 exposes separate JobManager and TaskManager pod templates (kubernetes.jobManagerPodTemplate / kubernetes.taskManagerPodTemplate), while VVP2 uses the shared kubernetes.pods.* block.

Requests influence what the scheduler reserves.
Limits constrain burst capacity and can trigger CPU throttling or memory OOM kills.
Limit factors should be explicit at namespace default level so teams understand how much burst they really have.
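
The relationship between a Deployment resource value, its request, and its limit can be made concrete with a small calculation. The sketch below assumes that the request equals the Deployment resource value and that the limit equals the request multiplied by a limit factor. That derivation, the parameter names, and the function name are assumptions for illustration; confirm the actual behavior and field names for the deployed VVP version.

```python
def requests_and_limits(cpu, memory_mib,
                        cpu_limit_factor=1.0, memory_limit_factor=1.0):
    """Derive Kubernetes-style requests and limits from one resource value.

    Assumes request = resource value and limit = request * factor.
    """
    if cpu_limit_factor < 1.0 or memory_limit_factor < 1.0:
        raise ValueError("a limit factor below 1.0 would put the limit "
                         "under the request")
    return {
        "requests": {"cpu": cpu, "memory_mib": memory_mib},
        "limits": {
            "cpu": cpu * cpu_limit_factor,
            "memory_mib": memory_mib * memory_limit_factor,
        },
    }


# 1 CPU reserved, allowed to burst to 2 CPUs; memory has no burst room.
print(requests_and_limits(1, 4096, cpu_limit_factor=2.0))
```

With a CPU factor of 2.0 the scheduler reserves 1 CPU while the container may use up to 2 before throttling. With a memory factor of 1.0 any use above 4096 MiB risks an OOM kill.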

Be careful with tight CPU limits on latency-sensitive streaming jobs. CPU throttling often looks like application backpressure even when the job logic is otherwise healthy.

Resource topic	Guidance	Signals to watch
CPU request too low	Scheduler may overpack nodes relative to steady-state need	Scheduling pressure, noisy-neighbor behavior, unstable latency
CPU limit too tight	Can throttle during peaks, checkpoints, or recovery	Container throttled time, source lag, backpressure, slow restart
Memory limit too tight	Leaves no room for native overhead or state backend bursts	OOMKilled pods, evictions, restore failures, checkpoint instability

Adjustment signals: increase CPU request when the steady state is consistently above the reserved level; increase CPU limit or limit factor when throttling appears during load or recovery; increase memory when pods are OOMKilled or when native/backend pressure appears before Java heap looks full.

Reference workload profiles

These profiles are reference starting ranges for VVP Deployments. They are not production guarantees. Validate every profile by benchmark and load test before standardizing it in a Deployment Template.

Profile	Typical use	JobManager starting range	TaskManager starting range	Slots per TaskManager	Parallelism starting range	Primary signals to validate
Small	Simple pipelines, modest state, limited operator fan-out	2 replicas, 0.5–1 CPU, 1–2G	1 CPU, 2–4G	1	1–4	Backpressure, JVM OOM, GC pauses, checkpoint duration, CPU throttling
Stateful-heavy	Large keyed state, RocksDB/Gemini-heavy workloads, stricter restore expectations	2 replicas, 1 CPU, 2–4G	1–2 CPU, 4–8G	1 preferred	4–16	Container OOM, backend pressure, spill, checkpoint duration, restore time, backpressure during checkpointing
High-throughput	CPU-heavy pipelines, high event rate, heavier serialization or network traffic	2 replicas, 1–2 CPU, 2–4G	2–4 CPU, 4–8G	1–2	8–32	CPU busy time, throttling, source lag, sustained backpressure, network buffer pressure, checkpoint slowness
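
For quota planning, a profile from the table can be turned into an approximate total footprint per Deployment. The following is a minimal sketch under stated assumptions: it uses the starting ranges from the table as (low, high) pairs, counts 2 JobManager replicas, rounds the TaskManager count up, and ignores limit factors and any other pods. The class, dictionary, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    jm_cpu: tuple       # (low, high) CPU per JobManager
    jm_memory_g: tuple  # (low, high) memory in G per JobManager
    tm_cpu: tuple       # (low, high) CPU per TaskManager
    tm_memory_g: tuple  # (low, high) memory in G per TaskManager
    jm_replicas: int = 2


# Starting ranges transcribed from the reference profile table.
PROFILES = {
    "small": Profile((0.5, 1), (1, 2), (1, 1), (2, 4)),
    "stateful-heavy": Profile((1, 1), (2, 4), (1, 2), (4, 8)),
    "high-throughput": Profile((1, 2), (2, 4), (2, 4), (4, 8)),
}


def footprint(profile_name, parallelism, slots_per_taskmanager):
    """Return (taskmanagers, (cpu_low, cpu_high), (mem_low_g, mem_high_g))."""
    p = PROFILES[profile_name]
    # TaskManagers needed = parallelism / slots, rounded up.
    tms = (parallelism + slots_per_taskmanager - 1) // slots_per_taskmanager
    cpu = tuple(tms * t + p.jm_replicas * j
                for t, j in zip(p.tm_cpu, p.jm_cpu))
    mem = tuple(tms * t + p.jm_replicas * j
                for t, j in zip(p.tm_memory_g, p.jm_memory_g))
    return tms, cpu, mem


tms, cpu, mem = footprint("stateful-heavy", parallelism=16,
                          slots_per_taskmanager=1)
print(f"{tms} TaskManagers, {cpu[0]}-{cpu[1]} CPU, {mem[0]}-{mem[1]}G memory")
```

A Stateful-heavy Deployment at parallelism 16 with 1 slot per TaskManager lands at 18 to 34 CPU and 68 to 136G of requested memory under these assumptions, which is worth comparing with the namespace quota before standardizing the profile.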

Note

Validation reminder: every profile above must be confirmed with benchmark or load test results against realistic event rate, key cardinality, state growth, sink behavior, and recovery scenarios. If a workload is stateful, do not sign off sizing without explicitly checking the Load testing item in the Production Readiness Checklist.

Suggested exception workflow for non-standard sizing

Require the Deployment owner to record: the chosen profile, the exact deviation from template defaults, the benchmark evidence supporting the change, the expected improvement, and the review point after production observation. Good exception candidates include higher TaskManager memory for state growth, lower slot density for operational isolation, or higher CPU limit factors to avoid throttling during checkpointing and recovery.
