Configuring Lake Tiering

Applies to Self-Managed v3

Overview

Lake tiering moves older Fluss data into an open table format on object storage so that historical data becomes addressable as a standard lake table. Fresh data continues to live in Fluss for sub-second read and write latency. Older data is compacted into the lake by the Fluss tiering service and is then readable by any engine that speaks the lake format. Flink's Fluss connector can union-read both layers transparently.

This manual covers Apache Iceberg as the lake format. Fluss supports two Iceberg catalog flavors and several remote-storage flavors. The supported combinations are:

Catalog	Storage Backend	Plugin
Hadoop (filesystem metastore)	Amazon S3	lake-iceberg-s3
Hadoop (filesystem metastore)	Azure Blob Storage (ADLS Gen2)	lake-iceberg-abs
Hadoop (filesystem metastore)	NooBaa or S3-compatible	lake-iceberg-s3
Iceberg REST (for example, Apache Polaris)	Amazon S3	lake-iceberg-s3

REST catalog on ABS is not currently supported in this distribution and is not documented below.

Enabling lake tiering requires three things:

A remote-storage backend already configured on the Fluss server. Lake tiering does not replace remote storage. Both coexist and the lake warehouse can technically live on a different object-store family from the remote-data bucket, but in practice you typically use the same backend for both (one S3 account, one ADLS Gen2 account, and so on).
The matching lake-iceberg-* plugin installed into the Fluss pods using an init container.
datalake.* keys set in configurationOverrides to point Fluss at the catalog and warehouse.

After the server side is configured, a separate Flink-side tiering service (a long-running Flink streaming job) does the actual tiering. It reads from Fluss and writes to the Iceberg warehouse. This manual covers both the Fluss server-side configuration (sections below) and the per-flavor JAR, configuration, and argument set the tiering job needs (see Configuring Lake Tiering.docx). The Flink-side configuration for reading tiered tables is covered in the upstream Fluss Iceberg integration documentation.

Prerequisites

A running Fluss cluster deployed using the fluss-bundle chart. See Deploying Fluss on Kubernetes.docx.
A remote storage backend configured on the Fluss server (S3, Azure Blob Storage, or an S3-compatible store such as NooBaa). Lake tiering does not replace remote storage. Both coexist and Fluss writes to both. See Configuring Remote Storage.docx for credentials, IAM, and the filesystem-plugin (fs-s3 or fs-azure) install.
A warehouse location for the Iceberg tables: an S3 bucket prefix, an ADLS Gen2 path, or, for the REST flavor, a warehouse name registered in your Iceberg REST catalog. The warehouse can be (and typically is) different from the remote-storage bucket.
For the REST catalog flavor: an Iceberg REST catalog endpoint already deployed and reachable from the Fluss pods, with an OAuth2 client credential and scope. Bringing up a REST catalog (Polaris, Tabular, Lakekeeper, Nessie's REST adapter, and so on) is out of scope for this manual.

Choosing a Catalog Flavor

	Hadoop Catalog	REST Catalog
Metastore	Filesystem only. Table state lives in object-storage paths next to the data.	External REST service (Polaris, Tabular, Nessie REST, Lakekeeper, and others).
Bring-up effort	None beyond the Fluss config.	Requires a separate REST catalog service to be deployed and reachable.
Concurrency and multi-engine writes	Limited. Hadoop catalog uses object-store atomic-rename semantics, which most S3-compatible stores do not provide reliably.	Designed for it. The REST catalog mediates commits.
Authentication	Inherits from the underlying object store.	Catalog-level OAuth2 (client credential and scope), independent of object-store auth.
Engine support	Universal. Every Iceberg client supports Hadoop tables.	Most modern engines support REST. Check your engine's Iceberg version.

Pick Hadoop catalog for the simplest operational footprint, especially when Fluss is the only writer to the warehouse and you do not already run an Iceberg REST service. Pick REST catalog when you already operate one (or want to from day one), need governance and authentication at the catalog level, or expect multiple Iceberg writers against the same warehouse.

Configuring Hadoop Catalog

The Hadoop catalog stores table metadata as files inside the warehouse directory itself. There is no external metastore service to deploy. Fluss reaches the warehouse through Iceberg's HadoopFileIO, which is driven by Hadoop S3A (for S3 and S3-compatible stores) or the ABFS connector (for Azure Blob Storage).

Install the lake-iceberg-s3 or lake-iceberg-abs plugin on both coordinator and tablet. See Installing Fluss.docx.

Note

About the datalake.iceberg.iceberg.hadoop.* keys.
All Hadoop-catalog snippets below pass Hadoop S3A and ABFS settings under the doubled prefix datalake.iceberg.iceberg.hadoop.*. This is the prefix Fluss's current Iceberg integration expects: datalake.iceberg. is stripped to feed the Iceberg catalog, and the remaining iceberg.hadoop. is what Iceberg's HadoopUtils looks for on the resulting catalog properties. The doubled name is upstream Fluss behavior and is awkward but not a typo. If a future Fluss release simplifies it to datalake.iceberg.hadoop.*, the keys below will need to be updated accordingly.
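To make the prefix handling in this note concrete, the following sketch traces one Hadoop key and one plain catalog property through the stripping described above. The final Hadoop-side key name is an assumption inferred from the note, not a verified trace of the upstream code.

```yaml
fluss:
  configurationOverrides:
    # Key as written in the Fluss server configuration:
    datalake.iceberg.iceberg.hadoop.fs.s3a.endpoint: <S3_ENDPOINT>
    # After Fluss strips "datalake.iceberg.", the Iceberg catalog property is:
    #   iceberg.hadoop.fs.s3a.endpoint
    # The "iceberg.hadoop." prefix marks it as a Hadoop setting, so the
    # Hadoop-side key is assumed to be:
    #   fs.s3a.endpoint

    # For comparison, a plain catalog property carries the prefix only once:
    datalake.iceberg.warehouse: s3a://<LAKE_BUCKET>/<WAREHOUSE_PREFIX>
    # After stripping "datalake.iceberg.", the catalog property is:
    #   warehouse
```

A key written with a single prefix (datalake.iceberg.hadoop.fs.s3a.endpoint) would, under this reading, arrive as the catalog property hadoop.fs.s3a.endpoint and not be recognized as a Hadoop setting.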

On Amazon S3

Make sure remote storage on S3 is already configured per Configuring Remote Storage.docx › Amazon S3, with the fs-s3 plugin installed and s3.access.key, s3.secret.key, and s3.region set.

You need:

An S3 bucket prefix for the lake warehouse. This can be the same bucket as remote storage but a different prefix, or a different bucket entirely.
The same IAM access key used for remote storage typically suffices, provided its policy covers the warehouse path. If you separate the buckets, extend the policy to include the warehouse bucket.

Add the following to configurationOverrides:

fluss:
  configurationOverrides:
    # ... your existing remote.data.dir and s3.* keys (see Configuring Remote Storage) ...
    datalake.format: iceberg
    datalake.iceberg.type: hadoop
    datalake.iceberg.metastore: filesystem
    # The warehouse is read by Iceberg through HadoopFileSystem; use the s3a:// scheme,
    # not s3:// (which is Fluss's own filesystem layer).
    datalake.iceberg.warehouse: s3a://<LAKE_BUCKET>/<WAREHOUSE_PREFIX>
    datalake.iceberg.iceberg.hadoop.fs.s3a.access.key: <AWS_ACCESS_KEY_ID>
    datalake.iceberg.iceberg.hadoop.fs.s3a.secret.key: <AWS_SECRET_ACCESS_KEY>
    datalake.iceberg.iceberg.hadoop.fs.s3a.region: <AWS_REGION>

For the full set of configurable fields under fluss:, see the Fluss Helm chart documentation.

Storing S3 Credentials in a Kubernetes Secret

The same approach used for remote storage credentials applies. Source AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from a Kubernetes Secret using extraEnv or envFrom and reference them with ${VAR} placeholders inside configurationOverrides, or use a Helm-render-time injector. See Configuring Remote Storage.docx › Keeping Credentials Out of values.yaml and Fluss Helm Chart: Additional Notes › Environment-Variable Substitution in server.yaml.
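As a minimal sketch of the extraEnv variant, the values fragment below injects both variables into the coordinator and tablet pods and references them from the lake keys. It assumes the same fluss.coordinator.extraEnv and fluss.tablet.extraEnv layout this manual uses for the REST catalog credential. The Secret name <S3_CREDENTIALS_SECRET> and its key names (access-key-id, secret-access-key) are hypothetical placeholders.

```yaml
fluss:
  coordinator:
    extraEnv:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: <S3_CREDENTIALS_SECRET>   # hypothetical Secret name
            key: access-key-id              # hypothetical key name
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: <S3_CREDENTIALS_SECRET>
            key: secret-access-key          # hypothetical key name
  tablet:
    extraEnv:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: <S3_CREDENTIALS_SECRET>
            key: access-key-id
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: <S3_CREDENTIALS_SECRET>
            key: secret-access-key
  configurationOverrides:
    # Placeholders are resolved from the pod environment, so no key material
    # appears in values.yaml.
    datalake.iceberg.iceberg.hadoop.fs.s3a.access.key: ${AWS_ACCESS_KEY_ID}
    datalake.iceberg.iceberg.hadoop.fs.s3a.secret.key: ${AWS_SECRET_ACCESS_KEY}
```

If remote storage already injects the same two variables, the extraEnv blocks can be shared and only the two datalake.* lines need to be added.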

On Azure Blob Storage (ADLS Gen2)

Make sure remote storage on Azure Blob Storage is already configured per Configuring Remote Storage.docx › Azure Blob Storage (ADLS Gen2), with the fs-azure plugin installed, fs.azure.* keys set, and the OAuth endpoint configured.

You need:

An ADLS Gen2 path for the lake warehouse.
The storage account access key for the account hosting the warehouse. The Hadoop ABFS connector reads this from the catalog properties below.

Add the following to configurationOverrides:

fluss:
  configurationOverrides:
    # ... your existing remote.data.dir and fs.azure.* keys (see Configuring Remote Storage) ...
    datalake.format: iceberg
    datalake.iceberg.type: hadoop
    datalake.iceberg.metastore: filesystem
    # ABFS path: abfs://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<PATH>
    datalake.iceberg.warehouse: abfs://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<WAREHOUSE_PATH>
    datalake.iceberg.iceberg.hadoop.fs.azure.account.key: <STORAGE_ACCOUNT_KEY>

For the full set of configurable fields under fluss:, see the Fluss Helm chart documentation.

Storing Azure Credentials in a Kubernetes Secret

The Azure account key has no documented env-var fallback. Use either a Helm-render-time secret injector (External Secrets Operator, Vault Agent, sealed-secrets) or ${VAR} substitution in server.yaml. See Configuring Remote Storage.docx › Storing Azure Credentials in a Kubernetes Secret and Fluss Helm Chart: Additional Notes › Environment-Variable Substitution in server.yaml.
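The ${VAR} option could look like the sketch below. It assumes the same extraEnv layout this manual uses for the REST catalog credential. The environment variable name AZURE_STORAGE_ACCOUNT_KEY, the Secret name <AZURE_CREDENTIALS_SECRET>, and the key name account-key are hypothetical and carry no special meaning to Fluss or Hadoop.

```yaml
fluss:
  coordinator:
    extraEnv:
      - name: AZURE_STORAGE_ACCOUNT_KEY       # hypothetical variable name
        valueFrom:
          secretKeyRef:
            name: <AZURE_CREDENTIALS_SECRET>  # hypothetical Secret name
            key: account-key                  # hypothetical key name
  tablet:
    extraEnv:
      - name: AZURE_STORAGE_ACCOUNT_KEY
        valueFrom:
          secretKeyRef:
            name: <AZURE_CREDENTIALS_SECRET>
            key: account-key
  configurationOverrides:
    # Resolved from the pod environment when server.yaml is processed.
    datalake.iceberg.iceberg.hadoop.fs.azure.account.key: ${AZURE_STORAGE_ACCOUNT_KEY}
```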

On NooBaa or Another S3-Compatible Store

Make sure remote storage on the same S3-compatible store is already configured per Configuring Remote Storage.docx › OpenShift Data Foundation (ODF) and Other S3-Compatible Stores, with the fs-s3 plugin installed and s3.endpoint, s3.path-style-access, and static credentials set.

NooBaa, MinIO, Ceph RGW, and similar S3-compatible services work with the Hadoop catalog by pointing Hadoop S3A at the same endpoint and credentials Fluss uses for remote storage. The warehouse is typically a different bucket from the remote-storage bucket. In OBC-provisioned environments (ODF), provision a second OBC for the lake warehouse.
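For ODF, a second claim for the lake warehouse might look like the following sketch. The claim name fluss-lake-warehouse is hypothetical, and the storage class shown is the NooBaa class that ODF commonly installs; confirm the class name available in your cluster before applying.

```yaml
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: fluss-lake-warehouse        # hypothetical claim name
  namespace: <NAMESPACE>
spec:
  # Prefix for the generated bucket name; the provisioner appends a suffix.
  generateBucketName: fluss-lake-warehouse
  # Assumed storage class; list the classes in your cluster to confirm.
  storageClassName: openshift-storage.noobaa.io
```

Once the claim is bound, the provisioner typically creates a ConfigMap and a Secret named after the claim that hold the bucket name, endpoint, and access keys. Use those values for <LAKE_BUCKET>, <S3_ENDPOINT>, and the credentials in the configuration for this flavor.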

Warning

Flink-side configuration is required for STS-less stores.
S3-compatible stores without an STS endpoint need explicit Iceberg and Hadoop configuration on the Flink side so that HadoopFileIO can read and write tiered files. The tiering-service core-site.xml for this flavor is shown in Lake-Job Credentials › On NooBaa or another S3-compatible store (Hadoop catalog). Reader-side configuration is covered in the upstream Fluss Iceberg integration documentation.

Add the following to configurationOverrides:

fluss:
  configurationOverrides:
    # ... your existing remote.data.dir, s3.endpoint, s3.* keys (see Configuring Remote Storage) ...
    datalake.format: iceberg
    datalake.iceberg.type: hadoop
    datalake.iceberg.metastore: filesystem
    datalake.iceberg.warehouse: s3a://<LAKE_BUCKET>/<WAREHOUSE_PREFIX>
    datalake.iceberg.iceberg.hadoop.fs.s3a.endpoint: <S3_ENDPOINT>
    datalake.iceberg.iceberg.hadoop.fs.s3a.access.key: <ACCESS_KEY_ID>
    datalake.iceberg.iceberg.hadoop.fs.s3a.secret.key: <SECRET_ACCESS_KEY>
    datalake.iceberg.iceberg.hadoop.fs.s3a.region: us-east-1
    datalake.iceberg.iceberg.hadoop.fs.s3a.path.style.access: "true"
    datalake.iceberg.iceberg.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

Note

fs.s3a.path.style.access: "true" and the explicit SimpleAWSCredentialsProvider are required for S3-compatible stores. fs.s3a.region is required by the S3 client, but NooBaa-class services do not enforce its value.

For the full set of configurable fields under fluss:, see the Fluss Helm chart documentation.

Configuring REST Catalog

With the REST flavor, table metadata lives in an external Iceberg REST catalog service. Fluss talks to the catalog over HTTP, authenticates with an OAuth2 client credential, and writes the table data files directly to object storage through Iceberg's HadoopFileIO. The warehouse name and storage location are properties of the catalog, not of Fluss.

Install the lake-iceberg-s3 plugin on both coordinator and tablet. See Installing Fluss.docx. This is the same plugin used for the Hadoop-on-S3 flavor.

On Amazon S3

Make sure remote storage on S3 is already configured per Configuring Remote Storage.docx › Amazon S3, with the fs-s3 plugin installed and s3.* keys set.

You need:

A reachable Iceberg REST catalog endpoint (for example, http://<catalog-host>:8181/api/catalog).
A warehouse name registered in the catalog. Its S3 location is configured on the catalog side, not in Fluss.
An OAuth2 client credential in <CLIENT_ID>:<CLIENT_SECRET> form and an OAuth2 scope authorized for that warehouse.
An S3 access key that can write data files into the warehouse's storage location. Even with a REST catalog mediating metadata commits, Fluss writes the actual Parquet and ORC files through HadoopFileIO and needs S3 credentials of its own.

Add the following to configurationOverrides:

fluss:
  configurationOverrides:
    # ... your existing remote.data.dir and s3.* keys (see Configuring Remote Storage) ...
    datalake.format: iceberg
    datalake.iceberg.type: rest
    # REST catalog endpoint reachable from the Fluss pods.
    datalake.iceberg.uri: <REST_CATALOG_URI>
    datalake.iceberg.warehouse: <WAREHOUSE_NAME>
    # OAuth2 client credential and scope.
    datalake.iceberg.credential: <CLIENT_ID>:<CLIENT_SECRET>
    datalake.iceberg.scope: <OAUTH2_SCOPE>
    # FileIO impl for data files. HadoopFileIO uses Hadoop S3A under the hood,
    # which is why S3 credentials still need to be supplied below.
    datalake.iceberg.io-impl: org.apache.iceberg.hadoop.HadoopFileIO
    datalake.iceberg.iceberg.hadoop.fs.s3a.access.key: <AWS_ACCESS_KEY_ID>
    datalake.iceberg.iceberg.hadoop.fs.s3a.secret.key: <AWS_SECRET_ACCESS_KEY>
    datalake.iceberg.iceberg.hadoop.fs.s3a.region: <AWS_REGION>

If the catalog vends scoped storage credentials at table-load time (the default for many REST catalog implementations, including Polaris when not configured to skip credential subscoping), those take precedence over the static fs.s3a.* keys above for any given table operation. The static keys remain useful as a fallback for catalogs that do not vend credentials and for cases where Fluss needs to write the initial metadata file before the catalog has a chance to scope credentials.

For the full set of configurable fields under fluss:, see the Fluss Helm chart documentation.

Storing REST Catalog Credentials in a Kubernetes Secret

Treat the OAuth2 client credential the same way you treat S3 credentials. The simplest path is ${VAR} substitution against an extraEnv-injected secret:

fluss:
  coordinator:
    extraEnv:
      - name: ICEBERG_REST_CREDENTIAL
        valueFrom:
          secretKeyRef:
            name: <ICEBERG_REST_SECRET>
            key: credential
  tablet:
    extraEnv:
      - name: ICEBERG_REST_CREDENTIAL
        valueFrom:
          secretKeyRef:
            name: <ICEBERG_REST_SECRET>
            key: credential
  configurationOverrides:
    datalake.iceberg.credential: ${ICEBERG_REST_CREDENTIAL}

See Fluss Helm Chart: Additional Notes › Environment-Variable Substitution in server.yaml for the full mechanism and caveats. Apply the same extraEnv block to both coordinator and tablet.
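The Secret referenced by those extraEnv blocks is an ordinary Kubernetes Secret with a single key. A minimal sketch, assuming the Secret lives in the same namespace as the Fluss pods and that you create it outside the Helm release:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <ICEBERG_REST_SECRET>
  namespace: <NAMESPACE>
type: Opaque
stringData:
  # Must match the key named in secretKeyRef. The value uses the
  # <CLIENT_ID>:<CLIENT_SECRET> form the catalog expects.
  credential: "<CLIENT_ID>:<CLIENT_SECRET>"
```

Using stringData lets you supply the value in plain form; Kubernetes stores it base64-encoded. In practice this manifest would usually be produced by a secret manager and not committed to source control.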

Running the Tiering Service

The tiering service is a long-running Flink streaming job that reads from Fluss and writes to the Iceberg warehouse. It is a Flink client of the Fluss cluster, operationally similar to a union-read job, and is documented alongside other Flink-side configuration in Running Lakehouse (Iceberg) Jobs against Fluss.docx.

For the per-table DDL that enables tiering, and for reading tiered tables (union reads through the Fluss connector or direct reads through any Iceberg-compatible engine), see Running Lakehouse (Iceberg) Jobs against Fluss.docx.

Verify Lake Tiering Is Configured

After applying the updated values, confirm the lake configuration reached the running server:

kubectl exec -n <NAMESPACE> coordinator-server-0 -- \
  cat /opt/fluss/conf/server.yaml | grep -E 'datalake\.'

You should see the datalake.format, datalake.iceberg.type, datalake.iceberg.warehouse, and (for REST) datalake.iceberg.uri values you configured.

Confirm the Iceberg plugin directory is populated:

kubectl exec -n <NAMESPACE> coordinator-server-0 -- ls /opt/fluss/plugins/iceberg/

The directory should contain the Iceberg JARs delivered by the lake-iceberg-s3 or lake-iceberg-abs plugin.

End-to-end verification (confirming that data is actually being tiered into the Iceberg warehouse) requires running the Flink tiering service and a writer. See Running Lakehouse (Iceberg) Jobs against Fluss.docx.

Troubleshooting

These items are server-side. For client-side issues (Flink tiering job credentials, Flink-side REST catalog connectivity, and so on), see Running Lakehouse (Iceberg) Jobs against Fluss.docx.

Iceberg Plugin Directory Empty After Pod Start

Inspect the init container logs:

kubectl logs -n <NAMESPACE> <POD_NAME> -c install-plugins

Verify that install.sh fluss lake-iceberg-s3 (or lake-iceberg-abs) ran without error, and that the extraVolumeMounts subPath is iceberg on both the init container and the main container. See Installing Fluss › Troubleshooting for general init-container diagnostics.

datalake.iceberg.iceberg.hadoop.* Keys Appear Ignored

The doubled prefix is intentional. See the note at the top of Configuring Hadoop Catalog. If credentials are still not picked up:

For Hadoop on S3 or S3-compatible: confirm datalake.iceberg.iceberg.hadoop.fs.s3a.aws.credentials.provider is set to org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider when using static keys. The default Hadoop provider chain might otherwise fall back to a credential source that has no access to the lake bucket.
For Hadoop on Azure Blob Storage: confirm the storage account in the warehouse URI matches the one whose key you set in datalake.iceberg.iceberg.hadoop.fs.azure.account.key. Account-key auth is per-account.
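For the S3 case, note that the Amazon S3 snippet in Configuring Hadoop Catalog does not set the credentials provider, whereas the S3-compatible snippet does. If static keys appear to be ignored on Amazon S3, adding the provider key alongside them would look like this sketch:

```yaml
fluss:
  configurationOverrides:
    datalake.iceberg.iceberg.hadoop.fs.s3a.access.key: <AWS_ACCESS_KEY_ID>
    datalake.iceberg.iceberg.hadoop.fs.s3a.secret.key: <AWS_SECRET_ACCESS_KEY>
    # Pin S3A to the static keys above instead of the default provider chain.
    datalake.iceberg.iceberg.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
```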

REST Catalog: Unauthorized or Failed to Obtain Access Token (Fluss Server Side)

The catalog rejected the OAuth2 client credential when the Fluss server connected to it. Verify that:

datalake.iceberg.uri is reachable from inside the Fluss pods (DNS, network policy, TLS).
datalake.iceberg.credential is in <CLIENT_ID>:<CLIENT_SECRET> form (single colon, no scheme).
datalake.iceberg.scope matches a scope that the catalog associates with the principal.
The warehouse name in datalake.iceberg.warehouse exists in the catalog.

REST Catalog: Failed to Get Subscoped Credentials: roleArn Must Not Be Null

The catalog tried to vend STS-scoped credentials for the warehouse but is missing the IAM role ARN it needs to call AssumeRole. This is a catalog-side configuration issue. Register a roleArn against the warehouse's storage configuration in the catalog, or configure the catalog to skip credential subscoping. The exact mechanism is catalog-specific. For Polaris this is SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION, but production deployments should provide a roleArn instead.