オンラインのスキーマ変更

Clustrixはテーブルへのreadまたはwriteのブロッキング無しにオンラインでスキーマを変更することができます。These changes can range from simply adding an index or column to completely redesigning the schema. Many other database systems require partial or full table locks while schema changes process each row and transform the data structure.

Clustrixデータベースは新しい一時的なコンテナ(あるいはコンテナのセット)を作成し、データをその新しい構造にコピーすることで、ロッキングの問題を回避しています。変更されているテーブルへのDML操作がある場合は、一時的なトランザクションログは古いコンテナンへのwriteあるいはupdateのトランザクションを記録し、その後新しいコンテナに適用します。全ての元のレコードがコピーされ、記録されたトランザクションが処理されると、ALTERトランザクションがcommitされ、新しいテーブルがliveになり、古いテーブルが削除されます。

Clustrix maintains read consistency during the ALTER – read and write queries that run against the table before the ALTER commits see the original table schema. ALTERがcommitした後のreadとwriteは新しいスキーマを見ます。一つのクエリあるいはユーザの視点からすると、ALTERは瞬間的です。

準備

Clustrixデータベースエンジンは簡単にオンラインでどのようなサイズのテーブルでもスキーマの変更をサポートするように設計されていますが、以下のbest practiceがアプリケーションへの否定的な予期せぬ影響を最小限にするのに役立つです。以下のものが含まれます:

データのバックアップ
隔離された環境で新しいスキーマをアプリケーションのクエリでテストするCompare the output of EXPLAIN <query> plans before and after the change to ensure that query plans are not negatively impacted.
ピークではない時間あるいはメンテナンスwindowの間に、スキーマの変更を実行します。
十分なディスク容量があることを書くにしてください(以下を参照)。
リプリケーションの影響を理解し、それに応じてプランを変更します。

ディスクスペース要件の見積もり

To understand how much free space you need before running an ALTER, you first need an understanding of Clustrix’s garbage collection process, which we refer to as BigC.

BigC is essentially a checkpoint that constantly moves forward through time as transactions complete. The BigC value of the cluster (shown in system.stats) will always show the point in time at which the longest currently running transaction has started. Any explicit transaction will “pin” BigC. While garbage collection is pinned, the cluster will log any write, delete, update, and DDL transactions into the undo-log in the event any long-running transaction fails and needs to be rolled back.

Because ALTERs are DDL transactions, they pin BigC, which means that all garbage collection is suspended during the ALTER.

ALTERのためにどれだけの空き領域が必要かを計算するために、以下のそれぞれを評価しなければなりません:

現在の総使用領域。
望ましい最小の総空き領域。
- ClustrixDB will kill long-running transactions if free space falls under 5%. This will cause your ALTER to rollback.
- スキーマの変更の間は少なくとも最小で10%の空き領域を確保する予定です。
ALTERされるテーブルのサイズ。
- テーブルのコピーを保持するのに十分な空き領域があることを確認。
- テーブルのサイズは system.table_sizesから推測することができます。
ALTERが完了するのに掛かる時間の推定。
- クラスタのサイズによります(例えば、サーバが多いと速くなります)。
- This can be extrapolated based on the speed of previous, smaller ALTERs operations.
新しいデータの成長率。
undoログの成長率。ALTERの間の全てのINSERT、UPDATE、DELETEのundoログは蓄積されるでしょう。
リプリケートしている場合は、binlogの成長率。
- Binlog trims will still run, but the disk space will not be freed until BigC is unpinned.

When you add up all the planned growth and size of the table copy plus the current space consumed, you should still have at least 10% of disk remaining. そうでなければ、ALTERを開始する前にback binlogの切り取り、テーブルの剪定、あるいはサーバを追加してクラスタを拡張することをお勧めします。

レプリケーションの懸念

Performing ALTERs in replicated environments requires additional planning. The ALTER can be performed in-line with replication, or independently, depending on whether schema structure is changed. Whenever possible, it is highly recommended to allow the ALTER to run over the replication stream in order to avoid errors in replication where transactions are looking for either the old or new schema when the other is present.

リプリケーションの上でALTERを実行することは、ALTERが最初にマスター上で、次にスレーブ上で順番に実行されなければならないことを意味します。更に、スレーブはリプリケーションストリームの中でALTERを順番に実行しなければなりません。スレーブがALTERを完了するまでの間、他のwriteは一時停止されます。大きなテーブルのALTERの場合、これによりスレーブがマスターより遅れることがあります。Plan to adjust binlog retention on the master and monitor replication regularly to ensure the slave does not fall off the back of your master's binlog retention during this process.

列のマッチング

行ベースのbinlogを使ってリプリケートする場合、列の同期が重要です。RBR (row-based replication) は古い行と新しい行の両方をスレーブに送信し、新しい行が適用される前に古い行が一致する必要があります。Alters that modify columns must be executed within replication and in sequence to avoid slave errors trying to applying writes for one schema to the other.

ステートメントベース(SBR)のbinlogを使ってリプリケートしている場合は、ステートメントは古いスキーマと新しいスキーマの上で同じく実行するという条件でもっと柔軟性が許可されています。However, for environments with high concurrency, it is extremely difficult to assess whether all statements are equivalent on the old and schemas.

Exceptions to these column matching concerns include ALTERs adding an index, changing the storage type, or modifying the number of slices. これらは行の挿入または更新に影響しないため、それらはリプリケーションストリームの外部で並行して実行されるかも知れません。

オンラインスキーマ変更のスケールアウトの懸念点

ユーザデータでのALTERのパフォーマンスは環境の特異性と負荷によって変わるでしょう。以下をガイダンスとして解釈し、経験とテストに基づいて適用してください。

単純、小さなテーブルの変更

約 N*10^6 行のテーブルまたは同時実効制御が軽い場合、ALTERはライブでクラスタのサイズ、行数、全体の負荷によりますが、数分以内に完了すべきです。Cache and query plan refreshes occur regularly and the automatic refresh should handle any performance issues.

オンラインスキーマ変更での中程度のスケールアウトの問題点

Tables that are being accessed with high concurrency or have more than N*10^7 rows may experience degraded cluster performance immediately after the ALTER completes. Clustrixはそれぞれのテーブルについて統計とクエリプランを格納し、それらの値はキャッシュされます。Clustrix may not refresh or flush these caches quickly enough to avoid a performance impact, so once the ALTER is complete, flush the cache using the following commands:

qpc_flushとpdcache_updateはノードごとを基本として実行されることに注意してください。Therefore, these should be executed using the clx cmd utility to ensure they are run on all nodes.

Chaining these commands together in the shell as command && command && command is recommended to avoid delays between the completion of the ALTER and flushing of the caches.

オンラインスキーマ変更に関する極端なスケールアウト

N*10^8以上の行のテーブルに関しては、ALTERは完了するまで数時間あるいは数日も掛かるかもしれません。容量についての考慮、同時実効制御等々に関して上の全てのルールと推奨に特に注意を払ってください。In addition, once the ALTER completes, the BigC garbage collection may cause a performance impact on your system for an extended period of time while it catches up. Should this occur during your application's peak hours, you may want to temporarily throttle BigC by adjusting the following system variable to its minimum value of 524,288:

以下のクエリを使って、どれだけの数のキューされたlayer_mergesを残っているか、それらはいつ開始されたかを測定することができます。When you see that the longest running merges have recent start times, the merges related to your ALTER are complete.

Once processing of layer_merges is current, the layer_copy_speed variable can be restored via set global layer_copy_speed_bytes = default;

If you have adequate free space you may want to pin BigC with a separate transaction so you can control the release to a time of your choosing. You can accomplish this with a separate session and an explicit BEGIN; with no COMMIT. Once off-peak, either run COMMIT or ctrl+c out of that session. However, we've found that in most cases, reducing layer_copy_speed is sufficient to avoid excessive impact to production workloads. Keep in mind that artificially pinning BigC is subject to the idle_trx_timeout killer. Be sure to check that global in the database to prevent that session from being killed.

オンラインスキーマ変更の例

以下の例では、1テラバイトのテーブルが追加の列を含むようにALTERされるでしょう。必要な時間を見積もることは科学というよりも芸術であり、ALTERを実行するクラスタのサイズにも依存します。この例では、以前に小さなテーブルで行ったALTERによって、1テラバイトのテーブルがALTERを完了するまで～25時間掛かると信じるに至りました。

Let's consider the free space plan for our 10 TB cluster:

容量の要求の説明	必要とされるもの
クラスタは現在のところ10TBの利用可能な容量があり、内部走査のためにシステムが5%を確保すると知っています。これを超過すると、writeはブロックされ、ALTERはロールバックされるかも知れません。	0.5 TB
負荷の急増を処理するために常に少なくとも10%の空き容量を残すようにユーザがリクエストした。	1.0 TB
新しいテーブルは追加の列により1TB以上のビットを消費するでしょう。バイトの行数倍で計算できますが、1.1TBと仮定します。これはコピーのため、ALTERが完了するまでこの容量は確保しなければなりません。	1.1 TB
クラスタ全体では、1日あたり約～1.5TBのwriteトランザクションを受け取ります(binlogへのwriteを含む)、ですので (1.5 / 24) * 25 hours = 1.5625 TB	1.6 TB
ALTERを安全にbeginするために必要な総空き容量:	4.2 TB

慎重な見積もりとしては、ALTERが完了するにはクラスタ上に少なくとも4.5TBの利用可能な容量があるべきです。

ALTERのパフォーマンス

外部からクラスタに接続している場合は、もしセッションが切断されたら以下の長く実行しているトランザクションが失敗しロールバックするだろうことを忘れないでください。screenまたはその他のなんらかのdetach可能なターミナルの利用をお勧めします。

一つのターミナルの接続を利用して、Session Aを初期化しBigCを固定するトランザクションを開始します:

別の端末では、SessionBを初期化しALTERを開始します。以下の3つのコマンドがぴったりと連続して実行される必要があるため、それらを以下のように一つに連結します:

ずっと後、ALTERとその後キャッシュフラッシュが終わった後で、Session Aに保持されているトランザクションを開放することができます。ガベージコレクションは多少重いため、ロードのピークが過ぎるまで待ちたいと思うかも知れませんが、layer_merges を小さくすると分からないくらいになることが分かりました。

From the Session A terminal:

結果の観察

Over the next few hours (depending on the size of your ALTER) you can examine the system.layer_merges relation for status of post-alter cleanup. この時間の間、システムができるだけ早く空き容量を回復できるように、他の大きな延期することができるタスクを避けます。

一端終了すると、layer_copy_speed を default に再設定します: