
Parquet Format #

Format: Serialization Schema Format: Deserialization Schema

The Apache Parquet format allows you to read and write Parquet data.

Dependencies #

To use the Parquet format, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.

Maven dependency	SQL Client
`<dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-parquet</artifactId> <version>1.19-SNAPSHOT</version> </dependency>`	Only available for stable releases.

How to create a table with Parquet format #

The following example shows how to create a table using the Filesystem connector and the Parquet format.

CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior',
  'format' = 'parquet'
)
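
As a usage sketch, the statements below write one row into the table defined above and read it back from a single partition. The literal values are made up for illustration, and the example assumes the job runs in batch mode (in streaming mode, written files typically become visible only after a checkpoint completes).

```sql
-- Hypothetical sample row; explicit casts keep the literals aligned with the BIGINT columns.
INSERT INTO user_behavior
VALUES (
  CAST(1 AS BIGINT),
  CAST(1001 AS BIGINT),
  CAST(10 AS BIGINT),
  'pv',
  TIMESTAMP '2024-01-01 12:00:00.000',
  '2024-01-01'
);

-- Read back only the partition that was just written.
SELECT user_id, item_id, behavior, ts
FROM user_behavior
WHERE dt = '2024-01-01';
```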

Format Options #

Option	Required	Default	Type	Description
format	required	(none)	String	Specifies which format to use. Here it must be 'parquet'.
parquet.utc-timezone	optional	false	Boolean	Whether to use the UTC timezone (true) or the local timezone (false) for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, whereas Hive 3.x uses the UTC timezone.

The Parquet format also supports configuration from ParquetOutputFormat. For example, you can set parquet.compression=GZIP to enable gzip compression.
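
The options above might be combined in a table definition as in the following sketch. The table name and path are hypothetical, and the example assumes that ParquetOutputFormat settings such as parquet.compression are passed as ordinary entries in the WITH clause.

```sql
-- Hypothetical table; name and path are placeholders for illustration.
CREATE TABLE user_behavior_gzip (
  user_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior_gzip',
  'format' = 'parquet',
  -- Forwarded to ParquetOutputFormat: write gzip-compressed Parquet files.
  'parquet.compression' = 'GZIP',
  -- Convert timestamps using UTC (as Hive 3.x does) instead of the local timezone.
  'parquet.utc-timezone' = 'true'
);
```

Setting parquet.utc-timezone to true is only appropriate when the files are exchanged with systems that also interpret timestamps in UTC; otherwise the default of false matches the older Hive behavior.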

Data Type Mapping #

Currently, the Parquet format type mapping is compatible with Apache Hive, but differs from Apache Spark:

Timestamp: the timestamp type is mapped to int96 regardless of its precision.
Decimal: the decimal type is mapped to a fixed-length byte array whose length depends on the precision.

The following table lists the type mapping from Flink types to Parquet types.

Flink Data Type	Parquet type	Parquet logical type
CHAR / VARCHAR / STRING	BINARY	UTF8
BOOLEAN	BOOLEAN
BINARY / VARBINARY	BINARY
DECIMAL	FIXED_LEN_BYTE_ARRAY	DECIMAL
TINYINT	INT32	INT_8
SMALLINT	INT32	INT_16
INT	INT32
BIGINT	INT64
FLOAT	FLOAT
DOUBLE	DOUBLE
DATE	INT32	DATE
TIME	INT32	TIME_MILLIS
TIMESTAMP	INT96
ARRAY	-	LIST
MAP	-	MAP
ROW	-	STRUCT
