Self-Managed (not available for BYOC)

Savepoints

A Savepoint resource points to a single savepoint or retained checkpoint in Apache Flink®. A single Apache Flink® savepoint can be referenced by multiple Ververica Platform Savepoint resources.

For more details on savepoints and checkpoints, please consult the official Apache Flink® documentation (https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/checkpoints/).

Specification

The Restore Strategy of your Deployment resources controls which Savepoint will be used to restore the state of an Apache Flink® job.

Ververica Platform only keeps track of Apache Flink® savepoints that are created within the Ververica Platform.

Savepoint Origins

A Savepoint can be created in various ways. Its origin is described by the metadata.origin attribute:

  • USER_REQUEST: The Savepoint was requested manually by a user through Ververica Platform.
  • SUSPEND: The Savepoint was requested when the corresponding Deployment was suspended.
  • COPIED: The Savepoint is either a copy of another Savepoint resource, or was created manually using an existing savepointLocation (see below). Both Savepoint resources point to the same physical Apache Flink® savepoint.
  • RETAINED_CHECKPOINT: The Savepoint is a retained Apache Flink® checkpoint that was not discarded after the Apache Flink® job was cancelled.

Savepoint States

The current state of a Savepoint resource is described by the status.state attribute:

  • STARTED: The Savepoint was started, but is not completed yet.
  • COMPLETED: The Savepoint was completed successfully and can be restored from.
  • FAILED: Creation of the Savepoint failed. Details on the cause of failure can be found in the status.failure field.
  • PENDING_DELETION: The Savepoint was marked for deletion. It will automatically be deleted if it meets all prerequisites. It can no longer be restored from.
  • DELETING: The Savepoint is currently being deleted. It can no longer be restored from.
  • FAILED_DELETION: Deletion of the Savepoint failed. Details on the cause of failure can be found in the status.failure field. It can no longer be restored from. Deletion can be retried.
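
The states above can be summarized in a small client-side helper. This is only a sketch: the `SavepointState` enum and `can_restore_from` function are hypothetical names, not part of Ververica Platform. Treating STARTED and FAILED as not restorable is an inference from the descriptions above, which name only COMPLETED as restorable.

```python
from enum import Enum


class SavepointState(Enum):
    """Values of status.state as listed above."""
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    PENDING_DELETION = "PENDING_DELETION"
    DELETING = "DELETING"
    FAILED_DELETION = "FAILED_DELETION"


def can_restore_from(state: str) -> bool:
    """Return True only for COMPLETED, the one state described as restorable.

    Assumption: STARTED and FAILED are treated as not restorable because the
    Savepoint is either unfinished or was never created successfully.
    Raises ValueError for a state name that is not listed above.
    """
    return SavepointState(state) is SavepointState.COMPLETED
```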

Savepoint Types

The metadata.type attribute of a Savepoint resource describes the structure of the underlying savepoint or checkpoint in Apache Flink®. More information on incremental checkpoints can be found in the Apache Flink® documentation.

  • INCREMENTAL: The Savepoint resource references an incremental checkpoint.
  • FULL: The Savepoint resource references a savepoint or a full checkpoint.
  • UNKNOWN: The type of the underlying savepoint or checkpoint is not known.

Savepoint resources created using a version of Ververica Platform prior to 2.5 do not have metadata.type populated. They will be treated as if the type was UNKNOWN.

Requirements

Triggering Savepoints requires configuration of a path under which to store savepoints. If Ververica Platform was configured with blob storage, it will preconfigure each Deployment for checkpoints, savepoints and high-availability.

Otherwise, please provide an entry in the flinkConfiguration map with the key state.savepoints.dir:

YAML
kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.savepoints.dir: s3://flink/savepoints

The provided blob storage location needs to be accessible by all nodes of your cluster. If Ververica Platform was configured with blob storage, the platform will handle the distribution of credentials transparently and no further action is required. Otherwise, you can, for instance, use a custom volume mount or a custom filesystem configuration.

Manually Adding a Savepoint Resource

Savepoints triggered by or through Ververica Platform are automatically added to the Deployment. In some cases, however, you might want to recover or start your Deployment from a specific Apache Flink® state snapshot that is not yet tracked by Ververica Platform. If you already have a savepoint or checkpoint to resume from, you can manually add a Savepoint resource to your Deployment in either of the following ways:

Using the Ververica Platform user interface

In the Deployment list view, select the Deployment you want your Savepoint to be added to. In the Snapshots Tab, find the Add Savepoint Manually button and fill out the form that opens.

Using the REST API

Send a request to the following endpoint. The request body must specify the ID of the Deployment to add the Savepoint to:

TEXT
POST /api/v1/namespaces/{namespace}/savepoints

Using either method, savepointLocation is required and flinkSavepointId is optional. If the Deployment annotation com.dataartisans.appmanager.controller.deployment.spec.version is not specified, it will be set to that of the current Deployment. The type of the Savepoint will default to UNKNOWN. The origin of the Savepoint will be COPIED.
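
As a rough illustration, the request could be assembled with the Python standard library as sketched below. Only the endpoint path and the field names savepointLocation and flinkSavepointId come from this page. The placement of those fields under `spec`, the `kind` and `apiVersion` entries, the `metadata.deploymentId` key, and the helper names are assumptions made for the example and should be checked against the API reference of your Ververica Platform version. Authentication is omitted.

```python
import json
import urllib.request
from typing import Optional


def build_savepoint_body(deployment_id: str,
                         savepoint_location: str,
                         flink_savepoint_id: Optional[str] = None) -> dict:
    """Build a request body for manually adding a Savepoint resource.

    Assumption: the nesting of the fields below is illustrative only.
    """
    body = {
        "kind": "Savepoint",                              # assumed
        "apiVersion": "v1",                               # assumed
        "metadata": {"deploymentId": deployment_id},      # assumed key name
        "spec": {"savepointLocation": savepoint_location},  # required field
    }
    if flink_savepoint_id is not None:
        # Optional field; left out entirely when not provided.
        body["spec"]["flinkSavepointId"] = flink_savepoint_id
    return body


def build_add_savepoint_request(base_url: str, namespace: str,
                                body: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the endpoint above."""
    url = f"{base_url}/api/v1/namespaces/{namespace}/savepoints"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can then be sent with `urllib.request.urlopen(request)` once the base URL and any required credentials for your installation are in place.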

Afterwards the web user interface for this Deployment will show (in the Snapshots Tab) that the Deployment will be started from this Savepoint.

Deleting a Savepoint Resource

Savepoint resources which are no longer needed can be deleted to free up space. The underlying data in the configured blob storage will be deleted automatically.

Both Savepoint resources referencing Apache Flink® savepoints and those referencing retained Apache Flink® checkpoints can be deleted.

When deleting a Savepoint resource, Ververica Platform will also attempt to delete all other Savepoint resources that point to the same physical location in blob storage.

Prerequisites

To delete a Savepoint resource, the following conditions must be true:

  • Universal blob storage is enabled.
  • The user requesting the deletion has the editor or owner role inside the Namespace.

Additionally, the below conditions must be true for all Savepoint resources pointing to the same physical location:

  • The Savepoint resource is in state COMPLETED, FAILED, or FAILED_DELETION.
  • The Savepoint resource references a savepoint or a full checkpoint (precise logic below).
  • If the Savepoint resource is associated with an active Deployment, it must not be the latest snapshot (savepoint or checkpoint) to ensure its deletion will not impact the underlying Job's failure recovery.

In order to provide better handling of Savepoint resources created using Ververica Platform 2.4 or below, a Savepoint resource passes the second condition if either of the following is true:

  • metadata.type is FULL.
  • metadata.type is UNKNOWN or not set, and metadata.origin is USER_REQUEST or SUSPEND.

Additionally, if multiple Savepoint resources share the same physical location, it is sufficient if one of them passes and none are incremental (according to metadata.type).
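
The per-resource conditions above can be sketched as a local check. This is not the platform's implementation: `SavepointInfo`, `references_full_snapshot`, `group_is_deletable`, and the `is_latest_of_active_deployment` flag are hypothetical names introduced for illustration. The sketch covers only the conditions on the Savepoint resources themselves, not the blob storage or role requirements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

DELETABLE_STATES = {"COMPLETED", "FAILED", "FAILED_DELETION"}


@dataclass
class SavepointInfo:
    state: str                  # status.state
    type: Optional[str]         # metadata.type, None if not set
    origin: str                 # metadata.origin
    # Hypothetical flag: True if this is the latest snapshot of an
    # active Deployment.
    is_latest_of_active_deployment: bool = False


def references_full_snapshot(sp: SavepointInfo) -> bool:
    """Second condition: a savepoint or a full checkpoint is referenced."""
    if sp.type == "FULL":
        return True
    return (sp.type in (None, "UNKNOWN")
            and sp.origin in ("USER_REQUEST", "SUSPEND"))


def group_is_deletable(group: Iterable[SavepointInfo]) -> bool:
    """Check all Savepoint resources that share one physical location."""
    group = list(group)
    if not group:
        return False
    states_ok = all(sp.state in DELETABLE_STATES for sp in group)
    none_latest = not any(sp.is_latest_of_active_deployment for sp in group)
    # With a shared location, one resource passing the second condition is
    # sufficient, provided none of them is incremental.
    none_incremental = all(sp.type != "INCREMENTAL" for sp in group)
    one_full = any(references_full_snapshot(sp) for sp in group)
    return states_ok and none_latest and none_incremental and one_full
```

For example, a group consisting of a COPIED resource of type UNKNOWN and a USER_REQUEST resource without a type passes this check, while any group containing an INCREMENTAL resource does not.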

Force Deletion

To skip the above prerequisites and to ensure that the Savepoint resource will always be removed, regardless of any failures while deleting the underlying data, it is possible to "force delete" the resource. Deletion of the underlying data will be attempted, but regardless of its outcome, deletion of the Savepoint resource will proceed.

Methods of deletion

The following call to the REST API marks the affected Savepoint resource for deletion:

TEXT
DELETE /api/v1/namespaces/{namespace}/savepoints/{savepointId}[?force=true]

To trigger force deletion, specify the force=true query parameter.

Regular responses will exhibit one of the following status codes:

  • 202: The request for deletion was accepted. Both the Savepoint resource and the underlying physical Apache Flink® savepoint or checkpoint are scheduled for deletion. Additionally, all Savepoint resources with the same physical location will be scheduled for deletion.
  • 409: One of the prerequisites was not met. Check the error message for details.
  • 400: A user error occurred, likely a data type issue. Check the error message for details.
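
A minimal client-side sketch of this call, using only the Python standard library, might look as follows. The endpoint, the force=true parameter, and the three status codes are taken from this page. The function names and the wording of the status descriptions are illustrative, and authentication is omitted.

```python
import urllib.error
import urllib.request

# Short descriptions of the regular status codes listed above.
STATUS_MEANINGS = {
    202: "accepted: resource and underlying data are scheduled for deletion",
    409: "conflict: a deletion prerequisite was not met",
    400: "bad request: user error, likely a data type issue",
}


def build_delete_request(base_url: str, namespace: str, savepoint_id: str,
                         force: bool = False) -> urllib.request.Request:
    """Prepare (but do not send) the DELETE request for a Savepoint."""
    url = f"{base_url}/api/v1/namespaces/{namespace}/savepoints/{savepoint_id}"
    if force:
        url += "?force=true"  # triggers force deletion
    return urllib.request.Request(url, method="DELETE")


def describe_delete_status(status_code: int) -> str:
    """Map a response status code to a short description."""
    return STATUS_MEANINGS.get(status_code, f"unexpected status {status_code}")


def delete_savepoint(base_url: str, namespace: str, savepoint_id: str,
                     force: bool = False) -> str:
    """Send the request and describe the outcome."""
    request = build_delete_request(base_url, namespace, savepoint_id, force)
    try:
        with urllib.request.urlopen(request) as response:
            return describe_delete_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx responses such as 409 and 400.
        return describe_delete_status(err.code)
```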

Savepoint resources can also be deleted using the web user interface. To do so, navigate to the Snapshots tab of the corresponding Deployment and choose the Delete Snapshot or Force delete Snapshot action.

Limitations

  • Job-specific Apache Flink® configuration is only picked up if set via the Deployment template. If incremental checkpoints are configured directly in the code of the submitted JAR, this will not be recognized.
  • While Savepoint resources referencing incremental checkpoints can be force deleted, this will not remove the part of the underlying data that is shared with other incremental checkpoints.