This documentation is for an unreleased version of Apache Flink. We recommend you use the latest stable version.

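Production Readiness Checklist #

The production readiness checklist provides an overview of configuration options that should be carefully considered before bringing an Apache Flink job into production. While the Flink community has attempted to provide sensible defaults for each configuration, it is important to review this list and ensure the options chosen are sufficient for your needs.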

Set An Explicit Max Parallelism #

The max parallelism, set on a per-job and per-operator granularity, determines the maximum parallelism to which a stateful operator can scale. There is currently no way to change the maximum parallelism of an operator after a job has started without discarding that operator's state. Maximum parallelism exists, rather than stateful operators being infinitely scalable, because it affects your application's performance and state size. To be able to rescale state, Flink has to maintain specific metadata, and that metadata grows linearly with max parallelism. In general, you should choose a max parallelism that is high enough to fit your future scalability needs while keeping it low enough to maintain reasonable performance.

Maximum parallelism must fulfill the following conditions: 0 < parallelism <= max parallelism <= 2^15

You can explicitly set maximum parallelism by using setMaxParallelism(int maxparallelism). If no max parallelism is set, Flink will decide on a value using a function of the operator's parallelism when the job is first started:

128 : for all parallelism <= 128.
MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15) : for all parallelism > 128.
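
The default rule above can be written out as a small, standalone calculation. The following is a minimal sketch in plain Java, not Flink's own code: the class and method names are hypothetical, and it assumes that nextPowerOfTwo(x) means the smallest power of two greater than or equal to x. It also rejects parallelism values outside the range 0 < parallelism <= 2^15 given in the conditions above.

```java
final class DefaultMaxParallelism {
    /** Upper bound on max parallelism stated in the text: 2^15. */
    static final int UPPER_BOUND = 1 << 15;

    /** Default value for all parallelism <= 128, as stated in the text. */
    static final int SMALL_JOB_DEFAULT = 128;

    /** Assumption: smallest power of two that is >= x, for x >= 1. */
    static int nextPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    /** Applies the two-case default rule to an operator parallelism. */
    static int defaultMaxParallelism(int parallelism) {
        if (parallelism <= 0 || parallelism > UPPER_BOUND) {
            throw new IllegalArgumentException(
                "parallelism must satisfy 0 < parallelism <= 2^15, got " + parallelism);
        }
        if (parallelism <= SMALL_JOB_DEFAULT) {
            return SMALL_JOB_DEFAULT;
        }
        // parallelism / 2 is integer division, mirroring the formula in the text.
        return Math.min(nextPowerOfTwo(parallelism + parallelism / 2), UPPER_BOUND);
    }

    public static void main(String[] args) {
        for (int p : new int[] {1, 128, 129, 200, 1000, 30000}) {
            System.out.println(p + " -> " + defaultMaxParallelism(p));
        }
    }
}
```

Under these assumptions, a job first started with parallelism 200 would get a default max parallelism of 512, so it could later be scaled up to 512 but no further without discarding state. Setting the value explicitly avoids depending on whatever parallelism the job happened to start with.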

Set UUIDs For All Operators #

As mentioned in the documentation for savepoints, users should set uids for each operator in their DataStream. Uids are necessary for Flink's mapping of operator states to operators, which in turn is essential for savepoints. By default, operator uids are generated by traversing the JobGraph and hashing specific operator properties. While this is convenient from a user perspective, it is also very fragile, as changes to the JobGraph (e.g., exchanging an operator) result in new UUIDs. To establish a stable mapping, we need stable operator uids provided by the user through setUid(String uid).
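
To see why generated uids are fragile, consider the following toy model in plain Java. It is a sketch only and does not reproduce Flink's actual hashing: the hypothetical autoId method simply derives an identifier from an operator's position and name, standing in for "a hash of operator properties obtained by traversing the graph". State in a savepoint is modeled as a map keyed by operator id.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class OperatorIdDemo {
    /** Toy stand-in for a generated id: depends on graph position and operator name. */
    static String autoId(int position, String name) {
        return Integer.toHexString(Objects.hash(position, name));
    }

    /** Returns the explicit uid if one was given, otherwise the generated id. */
    static String idFor(List<String> topology, String operator, String explicitUid) {
        if (explicitUid != null) {
            return explicitUid;
        }
        return autoId(topology.indexOf(operator), operator);
    }

    public static void main(String[] args) {
        List<String> v1 = List.of("source", "map", "sink");
        // Version 2 of the job inserts a filter in front of the stateful map.
        List<String> v2 = List.of("source", "filter", "map", "sink");

        // Savepoint taken from v1 using generated ids.
        Map<String, String> generated = new HashMap<>();
        generated.put(idFor(v1, "map", null), "map-state");
        // Restoring into v2: the generated id has changed, so the lookup misses.
        System.out.println("generated id lookup: " + generated.get(idFor(v2, "map", null)));

        // Savepoint taken from v1 using a user-provided uid (hypothetical name).
        Map<String, String> explicit = new HashMap<>();
        explicit.put(idFor(v1, "map", "uid-map"), "map-state");
        // Restoring into v2: the uid is unchanged, so the state is found.
        System.out.println("explicit uid lookup: " + explicit.get(idFor(v2, "map", "uid-map")));
    }
}
```

In this model the first lookup finds nothing because the id of the map operator changed when the graph changed, while the second lookup succeeds because the user-provided uid is independent of the graph's shape.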

Choose The Right State Backend #

See the description of state backends for choosing the right one for your use case.

Choose The Right Checkpoint Interval #

Checkpointing is Flink's primary fault-tolerance mechanism, wherein a snapshot of your job's state is persisted periodically to some durable location. In the case of failure, Flink will restart from the most recent checkpoint and resume processing. A job's checkpoint interval configures how often Flink will take these snapshots. While there is no single correct answer on the perfect checkpoint interval, the community can offer guidance on which factors to consider when configuring this parameter.

What is the SLA of your service: Checkpoint interval is best understood as an expression of the job's service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes' worth of data after a failure.
How often must your service deliver results: Exactly-once sinks, such as Kafka or the FileSink, only make results visible on checkpoint completion. Shorter checkpoint intervals make results available more quickly but may also put additional pressure on these systems. It is important to work with stakeholders to find a delivery time that meets product requirements without putting undue load on your sinks.
How much load can your Task Managers sustain: All of Flink's built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still requires CPU cycles and network bandwidth from your machines. Incremental checkpointing can be a powerful tool to reduce the cost of any given checkpoint.
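
The SLA question above can be turned into a rough number. The following plain-Java sketch estimates how many records would be replayed in the worst case for a given checkpoint interval. It follows the simplification used in the text (at most one interval's worth of data is reprocessed) and assumes a constant input rate; the class name, method name, and the rate of 10,000 records per second are hypothetical values chosen for illustration.

```java
import java.time.Duration;

final class ReplayEstimate {
    /**
     * Worst-case number of records to reprocess after a failure, assuming the
     * job fails just before the next checkpoint and input arrives at a
     * constant rate.
     */
    static long worstCaseReplayRecords(Duration checkpointInterval, long recordsPerSecond) {
        if (checkpointInterval.isNegative() || checkpointInterval.isZero()) {
            throw new IllegalArgumentException("checkpoint interval must be positive");
        }
        if (recordsPerSecond < 0) {
            throw new IllegalArgumentException("rate must not be negative");
        }
        // Work in milliseconds so sub-second intervals are not rounded down to zero.
        return Math.multiplyExact(checkpointInterval.toMillis(), recordsPerSecond) / 1000;
    }

    public static void main(String[] args) {
        long rate = 10_000L; // hypothetical records per second
        for (long minutes : new long[] {1, 5, 15}) {
            Duration interval = Duration.ofMinutes(minutes);
            System.out.println(interval + " -> "
                + worstCaseReplayRecords(interval, rate) + " records");
        }
    }
}
```

At the assumed rate, a 5 minute interval corresponds to up to 3,000,000 replayed records. Comparing that figure with how quickly the job can catch up is one way to frame the interval discussion with stakeholders.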

And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.

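Configure JobManager High Availability #

The JobManager serves as the central coordinator for each Flink deployment and is responsible for both scheduling and resource management of the cluster. It is a single point of failure within the cluster: if it crashes, no new jobs can be submitted and running applications will fail.

Configuring High Availability, in conjunction with Apache ZooKeeper or Flink's Kubernetes-based service, allows for a swift recovery and is highly recommended for production setups.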