Monitoring Checkpointing
This documentation is for an unreleased version of Apache Flink. We recommend you use the latest stable version.

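Monitoring Checkpointing #

Overview #

Flink's web interface provides a tab to monitor the checkpoints of jobs. These statistics remain available after the job has terminated. Four tabs display information about your checkpoints: Overview, History, Summary, and Configuration. The following sections cover each of them in turn.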

Monitoring #

Overview Tab #

The overview tab lists the following statistics. Note that these statistics don't survive a JobManager loss and are reset if your JobManager fails over.

  • Checkpoint Counts
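    • Triggered: The total number of checkpoints triggered since the job started.
    • In Progress: The current number of checkpoints that are in progress.
    • Completed: The total number of successfully completed checkpoints since the job started.
    • Failed: The total number of failed checkpoints since the job started.
    • Restored: The number of restore operations since the job started. This also tells you how many times the job has restarted since submission. Note that the initial submission with a savepoint also counts as a restore, and that the count is reset if the JobManager was lost during operation.
  • Latest Completed Checkpoint: The latest successfully completed checkpoint. Clicking on More details gives you detailed statistics down to the subtask level.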
  • Latest Failed Checkpoint: The latest failed checkpoint. Clicking on More details gives you detailed statistics down to the subtask level.
  • Latest Savepoint: The latest triggered savepoint with its external path. Clicking on More details gives you detailed statistics down to the subtask level.
  • Latest Restore: There are two types of restore operations.
    • Restore from Checkpoint: The job was restored from a regular periodic checkpoint.
    • Restore from Savepoint: The job was restored from a savepoint.

History Tab #

The checkpoint history keeps statistics about recently triggered checkpoints, including those that are currently in progress.

Note that for failed checkpoints, metrics are updated on a best-effort basis and may not be accurate.

Checkpoint Monitoring: History
  • ID: The ID of the triggered checkpoint. IDs are incremented for each checkpoint, starting at 1.
  • Status: The current status of the checkpoint, which is either In Progress, Completed, or Failed. If the triggered checkpoint is a savepoint, you will see a floppy-disk symbol.
  • Acknowledged: The number of acknowledged subtasks out of the total number of subtasks.
  • Trigger Time: The time when the checkpoint was triggered at the JobManager.
  • Latest Acknowledgement: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).
  • End to End Duration: The duration from the trigger timestamp until the latest acknowledgement (or n/a if no acknowledgement received yet). The end to end duration of a completed checkpoint is determined by the last subtask that acknowledges the checkpoint. This time is usually larger than the time a single subtask needs to actually checkpoint its state.
  • Checkpointed Data Size: The persisted data size during the sync and async phases of that checkpoint. The value can differ from the full checkpoint data size if incremental checkpoints or the changelog are enabled.
  • Full Checkpoint Data Size: The accumulated checkpoint data size over all acknowledged subtasks.
  • Processed (persisted) in-flight data: The approximate number of bytes processed/persisted during the alignment (the time between receiving the first and the last checkpoint barrier) over all acknowledged subtasks. Persisted data can be larger than zero only if unaligned checkpoints are enabled.
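
The relationship between Trigger Time, Acknowledged, Latest Acknowledgement, and End to End Duration can be sketched in a few lines of Python. This is a simplified model for illustration, not Flink's implementation: the class `CheckpointProgress` and its field names are hypothetical, and timestamps are plain milliseconds.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CheckpointProgress:
    """Hypothetical, simplified view of one checkpoint at the JobManager."""
    trigger_time_ms: int
    total_subtasks: int
    # subtask index -> time (ms) at which its acknowledgement was received
    ack_times_ms: Dict[int, int] = field(default_factory=dict)

    def acknowledged(self) -> str:
        # Rendered as "acknowledged/total", as in the Acknowledged column.
        return f"{len(self.ack_times_ms)}/{self.total_subtasks}"

    def latest_acknowledgement_ms(self) -> Optional[int]:
        # None stands in for "n/a" when nothing has been acknowledged yet.
        return max(self.ack_times_ms.values()) if self.ack_times_ms else None

    def end_to_end_duration_ms(self) -> Optional[int]:
        latest = self.latest_acknowledgement_ms()
        return None if latest is None else latest - self.trigger_time_ms


cp = CheckpointProgress(trigger_time_ms=1_000, total_subtasks=3)
cp.ack_times_ms.update({0: 1_120, 1: 1_450, 2: 1_200})
```

In this example the slowest subtask (index 1) determines the end to end duration, even though the other two subtasks acknowledged much earlier.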

For subtasks, a few more detailed stats are available.

Checkpoint Monitoring: History
  • Sync Duration: The duration of the synchronous part of the checkpoint. This includes snapshotting state of the operators and blocks all other activity on the subtask (processing records, firing timers, etc).
  • Async Duration: The duration of the asynchronous part of the checkpoint. This includes the time it took to write the checkpoint to the selected filesystem. For unaligned checkpoints, this also includes the time the subtask had to wait for the last of the checkpoint barriers to arrive (alignment duration) and the time it took to persist the in-flight data.
  • Alignment Duration: The time between processing the first and the last checkpoint barrier. For aligned checkpoints, during the alignment, the channels that have already received a checkpoint barrier are blocked from processing more data.
  • Start Delay: The time it took for the first checkpoint barrier to reach this subtask after the checkpoint barrier was created.
  • Unaligned Checkpoint: Whether the checkpoint for the subtask completed as an unaligned checkpoint. An aligned checkpoint can switch to an unaligned checkpoint if the alignment times out.
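
As a rough illustration of how Start Delay and Alignment Duration follow from barrier timestamps, consider the sketch below. The function name, its parameters, and the millisecond timestamps are assumptions made for this example. The sketch also uses barrier arrival times as a stand-in for the times at which the barriers are processed.

```python
from typing import Dict, List


def subtask_barrier_timings(barrier_created_ms: int,
                            barrier_arrivals_ms: List[int]) -> Dict[str, int]:
    """Derive per-subtask timings from barrier timestamps (illustrative only).

    barrier_created_ms:   when the checkpoint barrier was created
    barrier_arrivals_ms:  when the barrier arrived on each input channel
    """
    if not barrier_arrivals_ms:
        raise ValueError("at least one barrier arrival is required")
    first = min(barrier_arrivals_ms)
    last = max(barrier_arrivals_ms)
    return {
        # Time for the first barrier to reach the subtask after creation.
        "start_delay_ms": first - barrier_created_ms,
        # Time between the first and the last barrier.
        "alignment_duration_ms": last - first,
    }


timings = subtask_barrier_timings(1_000, [1_040, 1_100, 1_065])
```

A subtask with a single input channel sees only one barrier, so its alignment duration in this model is zero.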

History Size Configuration #

You can configure the number of recent checkpoints that are remembered for the history via the following configuration key. The default is 10.

# Number of recent checkpoints that are remembered
web.checkpoints.history: 15
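
The effect of a bounded history can be pictured with a fixed-size queue: once the limit is reached, each new checkpoint pushes out the oldest remembered one. The Python sketch below only illustrates that behaviour. It is not how Flink stores its history, and the variable names are made up for the example.

```python
from collections import deque

history_size = 15  # mirrors the web.checkpoints.history value shown above

# A deque with maxlen silently discards the oldest entry when it is full.
history = deque(maxlen=history_size)

# Simulate 20 triggered checkpoints; IDs start at 1 and increase by one.
for checkpoint_id in range(1, 21):
    history.append(checkpoint_id)

remembered = list(history)
```

After 20 checkpoints with a history size of 15, only checkpoints 6 through 20 are still remembered.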

Summary Tab #

The summary computes simple minimum/average/maximum statistics over all completed checkpoints for the End to End Duration, Incremental Checkpoint Data Size, Full Checkpoint Data Size, and Bytes Buffered During Alignment (see History for details about what these mean).

Checkpoint Monitoring: Summary

Note that these statistics don't survive a JobManager loss and are reset if your JobManager fails over.
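
The kind of aggregation the summary performs can be sketched as follows. The record layout (`status`, `end_to_end_duration_ms`, `full_size_bytes`) and the sample values are hypothetical. The point is only that failed and in-progress checkpoints are excluded before the minimum, average, and maximum are taken.

```python
def summarize(values):
    """Return min/avg/max for a list of numbers, or None if the list is empty."""
    if not values:
        return None
    return {"min": min(values), "avg": sum(values) / len(values), "max": max(values)}


# Hypothetical sample records; only COMPLETED checkpoints enter the summary.
checkpoints = [
    {"status": "COMPLETED", "end_to_end_duration_ms": 120, "full_size_bytes": 4_096},
    {"status": "FAILED", "end_to_end_duration_ms": 900, "full_size_bytes": 0},
    {"status": "COMPLETED", "end_to_end_duration_ms": 300, "full_size_bytes": 8_192},
    {"status": "COMPLETED", "end_to_end_duration_ms": 180, "full_size_bytes": 6_144},
]

completed = [c for c in checkpoints if c["status"] == "COMPLETED"]
duration_summary = summarize([c["end_to_end_duration_ms"] for c in completed])
size_summary = summarize([c["full_size_bytes"] for c in completed])
```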

Configuration Tab #

The configuration tab lists your streaming configuration:

  • Checkpointing Mode: Either Exactly Once or At Least Once.
  • Interval: The configured checkpointing interval. Checkpoints are triggered at this interval.
  • Timeout: Timeout after which a checkpoint is cancelled by the JobManager and a new checkpoint is triggered.
  • Minimum Pause Between Checkpoints: Minimum required pause between checkpoints. After a checkpoint has completed, at least this much time passes before the next one is triggered, which can delay the regular interval.
  • Maximum Concurrent Checkpoints: The maximum number of checkpoints that can be in progress concurrently.
  • Persist Checkpoints Externally: Enabled or Disabled. If enabled, the cleanup configuration for externalized checkpoints is also listed (delete or retain on cancellation).
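
How the Interval and the Minimum Pause Between Checkpoints interact can be illustrated with a small calculation. The sketch assumes a simplified model with at most one checkpoint in progress. The function and parameter names are hypothetical, and Flink's actual scheduling logic may differ in detail.

```python
def next_trigger_time_ms(last_trigger_ms: int,
                         last_completion_ms: int,
                         interval_ms: int,
                         min_pause_ms: int) -> int:
    """Earliest time the next checkpoint may be triggered (simplified model)."""
    # When the regular interval would fire next.
    regular = last_trigger_ms + interval_ms
    # The minimum pause counts from the completion of the previous checkpoint.
    earliest_allowed = last_completion_ms + min_pause_ms
    # Whichever is later wins, so a slow checkpoint delays the regular interval.
    return max(regular, earliest_allowed)


# Interval of 10 s, minimum pause of 5 s, previous checkpoint triggered at t=0.
fast = next_trigger_time_ms(0, 2_000, 10_000, 5_000)  # completed after 2 s
slow = next_trigger_time_ms(0, 8_000, 10_000, 5_000)  # completed after 8 s
```

With a fast checkpoint the regular interval decides the next trigger time (10 s). With a slow one the minimum pause pushes it back to 13 s.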

Checkpoint Details #

When you click on a More details link for a checkpoint, you get a Minimum/Average/Maximum summary over all its operators and also the detailed numbers per single subtask.

Checkpoint Monitoring: Details

Summary per Operator #

Checkpoint Monitoring: Details Summary

All Subtask Statistics #

Checkpoint Monitoring: Subtasks
