Viewing docs for: Self-Managed (not available for BYOC)

Resource Sizing Guidance

This guide is intentionally written in Ververica Platform terms: Deployments, Deployment Templates, parallelism, taskmanager.numberOfTaskSlots, and Deployment resources for jobmanager and taskmanager.

Version scope: the sizing principles and workload signals apply to VVP 2.15.x and VVP 3.0–3.1.x; the application mechanics vary by version, especially default delivery, Kubernetes Operator CR shape, and Autopilot behavior.

Deployment sizing uses spec.template.spec.resources.jobmanager and spec.template.spec.resources.taskmanager; slots are set in spec.template.spec.flinkConfiguration as taskmanager.numberOfTaskSlots.

Do not hard-code a full Kubernetes Operator CR wrapper path unless it has been verified for the deployed version: VVP2 Operator documentation uses a spec.deployment wrapper with userMetadata, while VVP3 Operator documentation maps the VVP deployment body under spec.deployment with metadata; earlier or beta CRDs may differ.

The exact Deployment-spec field for the JobManager replica count and the exact field name for the namespace-level limit factor could not be verified against the available official documentation for these VVP versions. Verify both names in the target environment before using them as YAML/API fields.

Do not treat these values as universal defaults. Start here, run a realistic load test, then adjust from observed signals such as OOM kills, spill, checkpoint duration, backpressure, GC, failover time, and CPU throttling.

Deployment Templates

Use Deployment Templates to standardize sizing by namespace or workload class. The goal is not to force one universal shape, but to make the default shape explicit and repeatable.

  • Put the most common starting profile into the template: JobManager availability setting, spec.template.spec.resources.jobmanager, spec.template.spec.resources.taskmanager, parallelism, and taskmanager.numberOfTaskSlots under spec.template.spec.flinkConfiguration.
  • For VVP 2.15.x, standardize namespace defaults through the namespace deployment-defaults API where it is available, complemented by Helm global defaults for platform-wide baselines.
  • For VVP 3.0–3.1.x, do not rely on the namespace deployment-defaults API as a delivery mechanism. Use Helm global defaults plus versioned Kubernetes Operator CR templates managed through the normal GitOps workflow.
  • Keep defaults conservative enough to start safely, but not so large that every new Deployment begins overprovisioned.
  • Standardize by namespace when namespaces map to similar workload types, SLOs, or quota envelopes.
  • Keep exceptions explicit. A Deployment that needs materially higher memory, unusual slot density, or tighter CPU limits should document the measured signal that justified the deviation.

Signals that template defaults need revision: repeated per-Deployment overrides, recurring scheduling failures, namespace quota saturation, consistent throttling under peak load, or measured evidence that most jobs in the namespace need the same adjustment.
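One way to make repeated per-Deployment overrides visible is to diff each Deployment against the template defaults. The following is a minimal sketch, not a VVP API: the dictionary mirrors the field paths named in this guide, but the location of parallelism under spec.template.spec, the cpu/memory keys, and all values are assumptions for illustration. The helper names flatten and deviations are hypothetical.

```python
import copy
from typing import Any

# Hypothetical template defaults. The resources and flinkConfiguration paths
# follow this guide; the parallelism location and all values are assumptions.
TEMPLATE_DEFAULTS: dict[str, Any] = {
    "spec": {"template": {"spec": {
        "parallelism": 2,
        "resources": {
            "jobmanager": {"cpu": 1, "memory": "2G"},
            "taskmanager": {"cpu": 1, "memory": "4G"},
        },
        "flinkConfiguration": {"taskmanager.numberOfTaskSlots": "1"},
    }}}
}


def flatten(node: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into {path: value}; '/' joins path segments
    because Flink configuration keys already contain dots."""
    flat: dict[str, Any] = {}
    for key, value in node.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def deviations(defaults: dict[str, Any], deployment: dict[str, Any]) -> dict[str, tuple]:
    """Return {path: (template_value, deployment_value)} for every difference."""
    base, actual = flatten(defaults), flatten(deployment)
    return {
        path: (base.get(path), actual.get(path))
        for path in sorted(base.keys() | actual.keys())
        if base.get(path) != actual.get(path)
    }


# Example: a Deployment that raises TaskManager memory above the template.
deployment = copy.deepcopy(TEMPLATE_DEFAULTS)
deployment["spec"]["template"]["spec"]["resources"]["taskmanager"]["memory"] = "8G"
overrides = deviations(TEMPLATE_DEFAULTS, deployment)
for path, (default, actual) in overrides.items():
    print(f"{path}: template={default} deployment={actual}")
```

If the same path shows up in the deviations of most Deployments in a namespace, that is a candidate for a template revision rather than another exception.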

JobManager

Size the JobManager through spec.template.spec.resources.jobmanager. Set JobManager availability through the JobManager replica setting exposed by the deployed VVP version. The exact YAML/API field name for the replica count was not verified in the available official version documentation, so confirm it in the target environment rather than guessing.

Replicas and HA

For production Deployments using the Ververica Platform HA service, start with 2 JobManager replicas for HA. This is the normal starting point, not a universal rule.

  • Confirm with: leader failover time, restart duration, and whether recovery meets the expected RTO.
  • Increase only if measured need exists: unusually large metadata coordination burden, operational pattern requiring more resilience testing, or version/platform-specific guidance for the environment.

CPU and memory starting points

JobManager demand is driven less by raw record throughput and more by job graph complexity, checkpoint coordination, metadata volume, and failover or restore operations.

| Job shape | Starting point | Why | Signals to confirm or adjust |
|---|---|---|---|
| Simple graph, modest state, limited operator count | Start around 0.5–1 CPU and 1–2G memory | Usually enough for coordination without large heap pressure | Heap usage, GC pauses, slow submission, checkpoint coordination latency |
| Moderate graph complexity or larger checkpoint metadata | Start around 1 CPU and 2–4G memory | More room for coordination and recovery operations | Heap growth during checkpoints, failover duration, restore stability |
| Large graph, high operator count, frequent restore or failover exercises | Start around 1–2 CPU and 4G+ memory | Protects coordination path during metadata-heavy operations | GC, restore time, checkpoint coordinator pressure, restart loops |

Adjustment signals: increase memory if the JobManager shows heap pressure, long GC pauses, or unstable restore behavior. Increase CPU if failover, submission, or checkpoint coordination slows under load. Reduce only after steady-state and failure scenarios show stable headroom.

TaskManager

TaskManager sizing is the main capacity decision for most stateful jobs. In Ververica Platform, size it through spec.template.spec.resources.taskmanager.

Memory model as exposed by VVP

Review TaskManager memory as a split across the Flink memory areas surfaced through VVP. For state backends, the RocksDB guidance applies to both VVP2 and VVP3; the Gemini guidance applies only where Gemini is available in VVP3.

  • Heap for framework and task heap, JVM objects, and user-code allocations.
  • Managed memory for state backends and memory-managed operators, including RocksDB or Gemini working areas.
  • Network memory for shuffle and network buffers.
  • Off-heap / JVM overhead for native memory, metaspace, and process overhead outside ordinary Java heap.

For stateful jobs, container memory alone is not a safe sizing proxy. A larger state footprint can increase backend working set, checkpoint I/O pressure, restore time, and native-memory demand even when Java heap appears healthy.

For the managed-memory share, set taskmanager.memory.managed.fraction in the Deployment spec.template.spec.flinkConfiguration. Increase this fraction for large-state RocksDB/Gemini jobs before increasing slot density, especially when backend memory pressure, spill growth, or compaction stalls appear while Java heap still has headroom.
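To see how a change in taskmanager.memory.managed.fraction shifts memory between areas, the split can be estimated before a load test. The sketch below is a planning aid under stated assumptions: it follows the derivation and default values described in the Apache Flink memory configuration documentation (JVM overhead fraction 0.1 bounded to 192–1024 MiB, metaspace 256 MiB, network fraction 0.1 with a 64 MiB minimum, framework heap and off-heap 128 MiB each), it ignores any network upper bound and task off-heap, and it assumes the TaskManager memory resource maps to the Flink process size. Verify all of these against the Flink version bundled with the target VVP release. The function name is hypothetical.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Bound value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def estimate_taskmanager_memory(
    process_mib: float,
    managed_fraction: float = 0.4,       # taskmanager.memory.managed.fraction
    network_fraction: float = 0.1,       # assumed Flink default
    network_min_mib: float = 64.0,       # assumed Flink default
    overhead_fraction: float = 0.1,      # assumed Flink default
    overhead_min_mib: float = 192.0,     # assumed Flink default
    overhead_max_mib: float = 1024.0,    # assumed Flink default
    metaspace_mib: float = 256.0,        # assumed Flink default
    framework_heap_mib: float = 128.0,   # assumed Flink default
    framework_off_heap_mib: float = 128.0,  # assumed Flink default
) -> dict[str, float]:
    """Estimate the TaskManager memory areas (MiB) for a given process size."""
    overhead = clamp(process_mib * overhead_fraction, overhead_min_mib, overhead_max_mib)
    total_flink = process_mib - metaspace_mib - overhead
    managed = total_flink * managed_fraction
    # The network upper bound is deliberately ignored in this sketch.
    network = max(network_min_mib, total_flink * network_fraction)
    task_heap = (total_flink - managed - network
                 - framework_heap_mib - framework_off_heap_mib)
    if task_heap <= 0:
        raise ValueError("no task heap left; lower the fractions or add memory")
    return {
        "jvm_overhead_mib": overhead,
        "jvm_metaspace_mib": metaspace_mib,
        "managed_mib": managed,
        "network_mib": network,
        "framework_heap_mib": framework_heap_mib,
        "framework_off_heap_mib": framework_off_heap_mib,
        "task_heap_mib": task_heap,
    }


# Compare the assumed default fraction with a state-heavy setting at 8 GiB.
for fraction in (0.4, 0.6):
    split = estimate_taskmanager_memory(8192, managed_fraction=fraction)
    print(fraction, round(split["managed_mib"]), round(split["task_heap_mib"]))
```

Raising the fraction takes memory directly from task heap in this model, which is why the heap signals in the table below should be rechecked after every change.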

| TaskManager concern | What to size for | Common starting view | Signals to confirm or adjust |
|---|---|---|---|
| Heap pressure | User objects, serialization churn, operator logic | Leave headroom for spikes and checkpoint activity | JVM OOM, high old-gen usage, long GC pauses |
| Managed memory | RocksDB/Gemini and managed operators | Increase for state-heavy jobs before adding slot density | Backend memory pressure, compaction stalls, spill growth |
| Network memory | Shuffle and data exchange buffers | Validate under peak rate and checkpoint alignment | Network buffer exhaustion, backpressure, alignment delay |
| Off-heap / overhead | Native memory and process overhead | Do not consume all container memory with heap-centric assumptions | Container OOM kill without obvious Java heap saturation |

CPU and memory starting points

Start with enough TaskManager headroom to survive realistic peaks, checkpointing, and restore operations.

| Workload tendency | Starting point per TaskManager | What to watch |
|---|---|---|
| Small or light stateful streaming job | 1 CPU, 2–4G memory | Backpressure, JVM OOM, CPU saturation, checkpoint duration |
| State-heavy job with RocksDB/Gemini and larger keyed state | 1–2 CPU, 4–8G memory | Spill, compaction pressure, checkpoint growth, restore time, container OOM |
| High-throughput pipeline with heavier serialization or network demand | 2–4 CPU, 4–8G memory | CPU busy time, throttling, backpressure, network buffer pressure, sink lag |

Adjustment logic: add memory when you see OOM kills, backend pressure, spill, or checkpoint instability. Add CPU when busy time is high, backpressure persists, or recovery becomes too slow. Scale out with more TaskManagers when a single TaskManager becomes too dense or when larger parallelism is needed for throughput and recovery.
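The adjustment logic can be written down as a small lookup so that reviews apply it consistently. This is an illustrative sketch only: the signal and action names are hypothetical labels for the conditions listed in the paragraph above, not metric names exposed by VVP or Flink.

```python
# Hypothetical signal labels mapped to the adjustment they suggest.
ADJUSTMENTS: dict[str, set[str]] = {
    "add_memory": {"oom_kill", "backend_pressure", "spill", "checkpoint_instability"},
    "add_cpu": {"high_busy_time", "persistent_backpressure", "slow_recovery"},
    "scale_out": {"taskmanager_too_dense", "needs_more_parallelism"},
}


def suggest_adjustments(observed: set[str]) -> list[str]:
    """Return every adjustment with at least one matching observed signal,
    in the order the adjustments are declared."""
    return [action for action, triggers in ADJUSTMENTS.items() if observed & triggers]


print(suggest_adjustments({"spill", "slow_recovery"}))
```

More than one adjustment can apply at once; the lookup only narrows what to test next and does not replace a load test.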

Parallelism and TaskManagers

In VVP, effective execution capacity is shaped by three related inputs:

  • parallelism at the Deployment level
  • taskmanager.numberOfTaskSlots in spec.template.spec.flinkConfiguration
  • the resulting number of TaskManagers needed to host that work

A useful planning approximation is:

TaskManagers needed ≈ parallelism ÷ slots per TaskManager, rounded up.
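As a quick check of this approximation, the rounding can be computed directly. The helper name is hypothetical, and the calculation is a planning estimate rather than a statement of how VVP schedules pods.

```python
import math


def taskmanagers_needed(parallelism: int, slots_per_taskmanager: int) -> int:
    """Planning estimate: ceil(parallelism / slots per TaskManager)."""
    if parallelism < 1 or slots_per_taskmanager < 1:
        raise ValueError("parallelism and slots must both be at least 1")
    return math.ceil(parallelism / slots_per_taskmanager)


for parallelism, slots in [(16, 1), (16, 2), (9, 2)]:
    tms = taskmanagers_needed(parallelism, slots)
    idle = tms * slots - parallelism  # slots left unused by the rounding
    print(f"parallelism={parallelism} slots={slots} -> {tms} TaskManagers, {idle} idle slot(s)")
```

When parallelism is not a multiple of the slot count, the last TaskManager carries idle slots, which is one more reason to keep slot layouts simple.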

Prefer simple slot layouts unless a measured reason justifies denser packing.

  • Start with 1 slot per TaskManager for many stateful or operationally sensitive jobs.
  • Use 2 slots per TaskManager when the workload is well understood and benchmarked to behave well with denser packing.
  • Avoid high slot density by default. It can make noisy-neighbor effects, checkpoint contention, memory debugging, and recovery behavior harder to reason about.

Adjustment signals: sustained backpressure on a subset of subtasks, long checkpoint duration, uneven CPU usage, hotspot TaskManagers, or slower restore after failures. If scaling parallelism helps throughput but makes each TaskManager unstable, reduce slot density and spread the job across more TaskManagers.

VVP3 Autopilot note: when Autopilot 2.0 is active, it can adjust parallelism and memory, so the values in this page should be treated as the initial starting point and guardrail baseline. In VVP2, expect more of this tuning to remain manual.

Requests vs limits

Use the Ververica Platform resource model deliberately. VVP derives Kubernetes requests and limits from Deployment resources and configured limit factors. The exact field name and scope for limit factors in the deployed VVP version could not be verified from the available official version documentation, so do not copy an unverified YAML/API path into templates.

If pod templates are used for Kubernetes-level customization, note the version-specific shape: VVP3 exposes separate JobManager and TaskManager pod templates (kubernetes.jobManagerPodTemplate / kubernetes.taskManagerPodTemplate), while VVP2 uses the shared kubernetes.pods.* block.

The Deployment resource value is not just documentation; it directly affects scheduling reservation, burst behavior, and throttling risk.

  • Requests influence what the scheduler reserves.
  • Limits constrain burst capacity and can trigger CPU throttling or memory OOM kills.
  • Limit factors should be explicit at namespace default level so teams understand how much burst they really have.
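The relationship between a Deployment resource value, a limit factor, and the resulting burst room can be sketched as follows. This assumes the request equals the Deployment resource value and the limit equals the request multiplied by the limit factor; that multiplicative model, the function name, and the parameter names are assumptions for illustration and should be checked against the deployed VVP version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerResources:
    cpu_request: float
    cpu_limit: float
    memory_request_mib: float
    memory_limit_mib: float


def derive_requests_and_limits(
    cpu: float,
    memory_mib: float,
    cpu_limit_factor: float = 1.0,
    memory_limit_factor: float = 1.0,
) -> ContainerResources:
    """Assumed model: request = resource value, limit = request * factor."""
    if cpu_limit_factor < 1.0 or memory_limit_factor < 1.0:
        raise ValueError("a limit factor below 1.0 would put the limit under the request")
    return ContainerResources(
        cpu_request=cpu,
        cpu_limit=cpu * cpu_limit_factor,
        memory_request_mib=memory_mib,
        memory_limit_mib=memory_mib * memory_limit_factor,
    )


resources = derive_requests_and_limits(cpu=2, memory_mib=4096, cpu_limit_factor=1.5)
burst_cpu = resources.cpu_limit - resources.cpu_request
print(f"CPU burst headroom: {burst_cpu} cores")
```

With a factor of 1.0 there is no burst room at all, so any peak above the request is throttled; stating the factor next to the resource value makes that trade-off visible to the team.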

Be careful with tight CPU limits on latency-sensitive streaming jobs. CPU throttling often looks like application backpressure even when the job logic is otherwise healthy.

| Resource topic | Guidance | Signals to watch |
|---|---|---|
| CPU request too low | Scheduler may overpack nodes relative to steady-state need | Scheduling pressure, noisy-neighbor behavior, unstable latency |
| CPU limit too tight | Can throttle during peaks, checkpoints, or recovery | Container throttled time, source lag, backpressure, slow restart |
| Memory limit too tight | Leaves no room for native overhead or state backend bursts | OOMKilled pods, evictions, restore failures, checkpoint instability |

Adjustment signals: increase CPU request when the steady state is consistently above the reserved level; increase CPU limit or limit factor when throttling appears during load or recovery; increase memory when pods are OOMKilled or when native/backend pressure appears before Java heap looks full.

Reference workload profiles

These profiles are reference starting ranges for VVP Deployments. They are not production guarantees. Validate every profile by benchmark and load test before standardizing it in a Deployment Template.

| Profile | Typical use | JobManager starting range | TaskManager starting range | Slots per TaskManager | Parallelism starting range | Primary signals to validate |
|---|---|---|---|---|---|---|
| Small | Simple pipelines, modest state, limited operator fan-out | 2 replicas, 0.5–1 CPU, 1–2G | 1 CPU, 2–4G | 1 | 1–4 | Backpressure, JVM OOM, GC pauses, checkpoint duration, CPU throttling |
| Stateful-heavy | Large keyed state, RocksDB/Gemini-heavy workloads, stricter restore expectations | 2 replicas, 1 CPU, 2–4G | 1–2 CPU, 4–8G | 1 preferred | 4–16 | Container OOM, backend pressure, spill, checkpoint duration, restore time, backpressure during checkpointing |
| High-throughput | CPU-heavy pipelines, high event rate, heavier serialization or network traffic | 2 replicas, 1–2 CPU, 2–4G | 2–4 CPU, 4–8G | 1–2 | 8–32 | CPU busy time, throttling, source lag, sustained backpressure, network buffer pressure, checkpoint slowness |
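Before standardizing a profile, it helps to estimate the total request footprint it implies against the namespace quota. The sketch below is a rough calculation under assumptions: it sums JobManager and TaskManager resource values only, uses the ceiling approximation for the TaskManager count, and ignores limit factors and any sidecar containers. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    jm_replicas: int
    jm_cpu: float
    jm_memory_g: float
    tm_cpu: float
    tm_memory_g: float
    slots_per_tm: int
    parallelism: int


def footprint(profile: Profile) -> tuple[int, float, float]:
    """Return (TaskManager count, total CPU, total memory in G) for one Deployment."""
    taskmanagers = math.ceil(profile.parallelism / profile.slots_per_tm)
    cpu = profile.jm_replicas * profile.jm_cpu + taskmanagers * profile.tm_cpu
    memory = profile.jm_replicas * profile.jm_memory_g + taskmanagers * profile.tm_memory_g
    return taskmanagers, cpu, memory


# Upper end of the Stateful-heavy row: 2 JM replicas at 1 CPU / 4G,
# TaskManagers at 2 CPU / 8G, 1 slot each, parallelism 16.
stateful_heavy_max = Profile(2, 1, 4, 2, 8, 1, 16)
tms, cpu, memory_g = footprint(stateful_heavy_max)
print(f"{tms} TaskManagers, {cpu} CPU, {memory_g}G memory")
```

Running the lower and upper end of a row through the same function gives the quota envelope a namespace needs to host a given number of Deployments of that profile.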

Suggested exception workflow for non-standard sizing

Require the Deployment owner to record: the chosen profile, the exact deviation from template defaults, the benchmark evidence supporting the change, the expected improvement, and the review point after production observation. Good exception candidates include higher TaskManager memory for state growth, lower slot density for operational isolation, or higher CPU limit factors to avoid throttling during checkpointing and recovery.
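The required fields can be captured in a small record so that incomplete exception requests are rejected early. This is a sketch of one possible structure, not a VVP feature: the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class SizingException:
    deployment: str
    profile: str               # chosen reference profile
    deviation: str             # exact deviation from template defaults
    benchmark_evidence: str    # pointer to the supporting benchmark or load test
    expected_improvement: str
    review_date: date          # review point after production observation

    def __post_init__(self) -> None:
        # Reject records where any text field is empty or whitespace only.
        for field in fields(self):
            value = getattr(self, field.name)
            if isinstance(value, str) and not value.strip():
                raise ValueError(f"{field.name} must not be empty")


record = SizingException(
    deployment="example-deployment",  # hypothetical name
    profile="Stateful-heavy",
    deviation="TaskManager memory raised from template default to 8G",
    benchmark_evidence="load test report (placeholder reference)",
    expected_improvement="no container OOM kills during checkpointing",
    review_date=date(2030, 1, 1),     # placeholder date
)
print(record.profile, record.review_date.isoformat())
```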
