Basic Statistics - spark.mllib

Summary statistics
Correlations
Stratified sampling
Hypothesis testing
- Streaming Significance Testing
Random data generation
Kernel density estimation

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]

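Summary statistics

We provide column summary statistics for RDD[Vector] through the function colStats available in Statistics.

colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

Refer to the MultivariateStatisticalSummary Scala docs for details on the API.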

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD

val observations: RDD[Vector] = ... // an RDD of Vectors

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

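colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

Refer to the MultivariateStatisticalSummary Java docs for details on the API.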

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;

JavaSparkContext jsc = ...

JavaRDD<Vector> mat = ... // an RDD of Vectors

// Compute column summary statistics.
MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
System.out.println(summary.variance()); // column-wise variance
System.out.println(summary.numNonzeros()); // number of nonzeros in each column

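colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

Refer to the MultivariateStatisticalSummary Python docs for details on the API.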

from pyspark.mllib.stat import Statistics

sc = ... # SparkContext

mat = ... # an RDD of Vectors

# Compute column summary statistics.
summary = Statistics.colStats(mat)
print(summary.mean())
print(summary.variance())
print(summary.numNonzeros())
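
To make the quantities in the summary concrete, the following is a minimal plain-Python sketch (no Spark involved) that computes the same kind of column-wise statistics for a small, hypothetical dataset. The helper name column_summary and the toy rows are illustrative only, and the sketch assumes the sample variance with an n - 1 denominator; check the MultivariateStatisticalSummary API docs for the convention Spark uses.

```python
from statistics import mean, variance

# Hypothetical toy data: each inner list is one row (one observation).
rows = [
    [1.0, 0.0, 3.0],
    [2.0, 5.0, 0.0],
    [4.0, 0.0, 6.0],
]

def column_summary(rows):
    """Return column-wise summary statistics for a list of equal-length rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return {
        "count": len(rows),
        "mean": [mean(c) for c in columns],
        # statistics.variance is the sample variance (divides by n - 1);
        # this choice is an assumption of the sketch.
        "variance": [variance(c) for c in columns],
        "numNonzeros": [sum(1 for x in c if x != 0.0) for c in columns],
        "max": [max(c) for c in columns],
        "min": [min(c) for c in columns],
    }

summary = column_summary(rows)
print(summary["mean"])
print(summary["variance"])
print(summary["numNonzeros"])
```

Each statistic is a list with one entry per column, which mirrors how the summary above reports one value per column rather than per row.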

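Correlations

Calculating the correlation between two series of data is a common operation in statistics. In spark.mllib we provide the flexibility to calculate pairwise correlations among many series. The supported correlation methods are currently Pearson's and Spearman's correlation.

Statistics provides methods to calculate correlations between series. Depending on the type of input, two RDD[Double]s or an RDD[Vector], the output will be a Double or the correlation Matrix, respectively.

Refer to the Statistics Scala docs for details on the API.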

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

val seriesX: RDD[Double] = ... // a series
val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

val data: RDD[Vector] = ... // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")

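Statistics provides methods to calculate correlations between series. Depending on the type of input, two JavaDoubleRDDs or a JavaRDD<Vector>, the output will be a Double or the correlation Matrix, respectively.

Refer to the Statistics Java docs for details on the API.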

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.*;
import org.apache.spark.mllib.stat.Statistics;

JavaSparkContext jsc = ...

JavaDoubleRDD seriesX = ... // a series
JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");

JavaRDD<Vector> data = ... // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");

Refer to the Statistics Python docs for details on the API.

from pyspark.mllib.stat import Statistics

sc = ... # SparkContext

seriesX = ... # a series
seriesY = ... # must have the same number of partitions and cardinality as seriesX

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
# method is not specified, Pearson's method will be used by default.
print(Statistics.corr(seriesX, seriesY, method="pearson"))

data = ... # an RDD of Vectors
# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))
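
As a rough illustration of what the two supported methods measure, here is a small plain-Python sketch (not Spark code) that computes Pearson's correlation directly and Spearman's correlation as Pearson's correlation of the ranks. The function names and the sample series are hypothetical, and tied values are assumed to share their average rank.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))

series_x = [1.0, 2.0, 3.0, 4.0, 5.0]
series_y = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotone but non-linear in series_x
print(pearson(series_x, series_y))
print(spearman(series_x, series_y))
```

Because series_y is a monotone but non-linear function of series_x, the Spearman value reaches 1 while the Pearson value stays below 1, which is the usual reason to choose one method over the other.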

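Stratified sampling

Unlike the other statistics functions, which reside in spark.mllib, the stratified sampling methods sampleByKey and sampleByKeyExact can be performed on RDDs of key-value pairs. For stratified sampling, the keys can be thought of as a label and the value as a specific attribute. For example the key can be man or woman, or document ids, and the respective values can be the list of ages of the people in the population or the list of words in the documents. The sampleByKey method flips a coin to decide whether an observation will be sampled or not; it therefore requires one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but provides the exact sample size with 99.99% confidence. sampleByKeyExact is currently not supported in Python.

sampleByKeyExact() allows users to sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of keys. Sampling without replacement requires one additional pass over the RDD to guarantee the sample size, whereas sampling with replacement requires two additional passes.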

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions

val sc: SparkContext = ...

val data = ... // an RDD[(K, V)] of any key value pairs
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

// Get an exact sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = ...

JavaPairRDD<K, V> data = ... // an RDD of any key value pairs
Map<K, Object> fractions = ... // specify the exact fraction desired from each key

// Get an exact sample from each stratum
JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);
JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);

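sampleByKey() allows users to sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of keys.

Note: sampleByKeyExact() is currently not supported in Python.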

sc = ... # SparkContext

data = ... # an RDD of any key value pairs
fractions = ... # specify the exact fraction desired from each key as a dictionary

approxSample = data.sampleByKey(False, fractions)
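
The difference between the approximate and the exact variant can be illustrated on a local list of pairs. The plain-Python sketch below is only a conceptual model with hypothetical helper names; it does not reproduce Spark's distributed algorithm or its number of passes. The first function flips a coin per pair, so the sample size is only right in expectation; the second draws exactly $\lceil f_k \cdot n_k \rceil$ items per key without replacement.

```python
import math
import random
from collections import defaultdict

def sample_by_key_approx(pairs, fractions, rng):
    """Keep each (key, value) pair with probability fractions[key]; size is random."""
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

def sample_by_key_exact(pairs, fractions, rng):
    """Draw exactly ceil(f_k * n_k) pairs per key, without replacement."""
    by_key = defaultdict(list)
    for k, v in pairs:
        by_key[k].append(v)
    sample = []
    for k, values in by_key.items():
        size = math.ceil(fractions[k] * len(values))
        sample.extend((k, v) for v in rng.sample(values, size))
    return sample

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical data: 10 pairs with key "a" and 7 pairs with key "b".
pairs = [("a", i) for i in range(10)] + [("b", i) for i in range(7)]
fractions = {"a": 0.25, "b": 0.5}

approx = sample_by_key_approx(pairs, fractions, rng)
exact = sample_by_key_exact(pairs, fractions, rng)
exact_counts = {k: sum(1 for key, _ in exact if key == k) for k in fractions}
print(len(approx), exact_counts)
```

With these fractions the exact variant always returns 3 items for key "a" and 4 for key "b", while the size of the approximate sample varies from run to run when the seed changes.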

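Hypothesis testing

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant, that is, whether it occurred by chance or not. spark.mllib currently supports Pearson's chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine whether the goodness of fit or the independence test is conducted. The goodness of fit test requires an input type of Vector, whereas the independence test requires a Matrix as input.

spark.mllib also supports the input type RDD[LabeledPoint] to enable feature selection via chi-squared independence tests.

Statistics provides methods to run Pearson's chi-squared tests. The following example demonstrates how to run and interpret hypothesis tests.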

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

val vec: Vector = ... // a vector composed of the frequencies of events

// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
// the test runs against a uniform distribution.  
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
                                 // test statistic, the method used, and the null hypothesis.

val mat: Matrix = ... // a contingency matrix

// conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat) 
println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...

val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
// against the label.
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
var i = 1
featureTestResults.foreach { result =>
    println(s"Column $i:\n$result")
    i += 1
} // summary of the test

Refer to the ChiSqTestResult Java docs for details on the API.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.*;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.ChiSqTestResult;

JavaSparkContext jsc = ...

Vector vec = ... // a vector composed of the frequencies of events

// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
// the test runs against a uniform distribution.  
ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
// summary of the test including the p-value, degrees of freedom, test statistic, the method used, 
// and the null hypothesis.
System.out.println(goodnessOfFitTestResult);

Matrix mat = ... // a contingency matrix

// conduct Pearson's independence test on the input contingency matrix
ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
// summary of the test including the p-value, degrees of freedom...
System.out.println(independenceTestResult);

JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
// against the label.
ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
int i = 1;
for (ChiSqTestResult result : featureTestResults) {
    System.out.println("Column " + i + ":");
    System.out.println(result); // summary of the test
    i++;
}

Refer to the Statistics Python docs for details on the API.

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors, Matrices
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics

sc = SparkContext()

vec = Vectors.dense(...) # a vector composed of the frequencies of events

# compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
# the test runs against a uniform distribution.
goodnessOfFitTestResult = Statistics.chiSqTest(vec)
print(goodnessOfFitTestResult) # summary of the test including the p-value, degrees of freedom,
                               # test statistic, the method used, and the null hypothesis.

mat = Matrices.dense(...) # a contingency matrix

# conduct Pearson's independence test on the input contingency matrix
independenceTestResult = Statistics.chiSqTest(mat)
print(independenceTestResult)  # summary of the test including the p-value, degrees of freedom...

obs = sc.parallelize(...)  # an RDD of LabeledPoint(label, features)

# The contingency table is constructed from an RDD of LabeledPoint and used to conduct
# the independence test. Returns an array containing the ChiSquaredTestResult for every feature
# against the label.
featureTestResults = Statistics.chiSqTest(obs)

for i, result in enumerate(featureTestResults):
    print("Column %d:" % (i + 1))
    print(result)
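
For readers who want to see the arithmetic behind the test statistic, here is a plain-Python sketch (not the Spark implementation) of Pearson's chi-squared statistic and its degrees of freedom for both the goodness of fit and the independence case. The function names and the input frequencies are hypothetical, the rescaling of expected frequencies is an assumption of this sketch, and the p-value is left out because it needs the chi-squared distribution function.

```python
def chi_sq_goodness_of_fit(observed, expected=None):
    """Pearson chi-squared statistic and degrees of freedom for a frequency vector."""
    total = sum(observed)
    if expected is None:
        # No expected frequencies given: test against a uniform distribution.
        expected = [total / len(observed)] * len(observed)
    else:
        # Assumption of this sketch: rescale so both vectors have the same total.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1

def chi_sq_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, (len(table) - 1) * (len(table[0]) - 1)

gof_stat, gof_df = chi_sq_goodness_of_fit([10.0, 20.0, 30.0])
ind_stat, ind_df = chi_sq_independence([[10.0, 20.0], [30.0, 40.0]])
print(gof_stat, gof_df)
print(ind_stat, ind_df)
```

A larger statistic for the same degrees of freedom corresponds to a smaller p-value, which is the evidence against the null hypothesis that the summary printed by the test reports.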

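Additionally, spark.mllib provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test for equality of probability distributions. By providing the name of a theoretical distribution (currently supported only for the normal distribution) and its parameters, or a function that calculates the cumulative distribution according to a given theoretical distribution, the user can test the null hypothesis that their sample is drawn from that distribution. If the user tests against the normal distribution (distName="norm") but does not provide distribution parameters, the test initializes to the standard normal distribution and logs an appropriate message.

Statistics provides methods to run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run and interpret the hypothesis tests.

Refer to the Statistics Scala docs for details on the API.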

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // summary of the test including the p-value, test statistic,
                    // and null hypothesis
                    // if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)

Refer to the Statistics Java docs for details on the API.

import java.util.Arrays;

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;

JavaSparkContext jsc = ...
JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));
KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
// summary of the test including the p-value, test statistic,
// and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis
System.out.println(testResult);

Refer to the Statistics Python docs for details on the API.

from pyspark.mllib.stat import Statistics

parallelData = sc.parallelize([1.0, 2.0, ... ])

# run a KS test for the sample versus a standard normal distribution
testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
print(testResult) # summary of the test including the p-value, test statistic,
                  # and null hypothesis
                  # if our p-value indicates significance, we can reject the null hypothesis
# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
# a lambda to calculate the CDF is not made available in the Python API
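
The test statistic of the 1-sample, 2-sided KS test is the largest vertical gap between the empirical distribution of the sample and the theoretical cumulative distribution. The plain-Python sketch below (not Spark code; the names normal_cdf and ks_statistic and the sample values are hypothetical) computes that gap for a small local sample, and it accepts any CDF function, similar in spirit to the user-supplied CDF mentioned above. The p-value is not computed here.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of N(mean, std^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the theoretical `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        theoretical = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, theoretical - i / n, (i + 1) / n - theoretical)
    return d

sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]  # hypothetical sample data
print(ks_statistic(sample, normal_cdf))
```

A statistic close to 0 means the sample follows the theoretical distribution closely; values near 1 indicate a strong mismatch.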

Streaming Significance Testing

spark.mllib provides online implementations of some tests to support use cases like A/B testing. These tests may be performed on a Spark Streaming DStream[(Boolean,Double)] where the first element of each tuple indicates control group (false) or treatment group (true) and the second element is the value of an observation.

Streaming significance testing supports the following parameters:

peacePeriod - The number of initial data points from the stream to ignore, used to mitigate novelty effects.
windowSize - The number of past batches to perform hypothesis testing over. Setting to 0 will perform cumulative processing using all prior batches.

StreamingTest provides streaming hypothesis testing.

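import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}

// ssc is an existing StreamingContext and dataDir a directory of "label,value" text files
val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {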
  case Array(label, value) => BinarySample(label.toBoolean, value.toDouble)
})

val streamingTest = new StreamingTest()
  .setPeacePeriod(0)
  .setWindowSize(0)
  .setTestMethod("welch")

val out = streamingTest.registerStream(data)
out.print()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/StreamingTestExample.scala" in the Spark repo.
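
The "welch" test method used in the example refers to Welch's t-test, which compares the means of two groups without assuming equal variances. As an offline illustration only (plain Python, not the streaming implementation), the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from a list of (is_treatment, value) pairs shaped like the stream elements described above. The function name, the data, and the sign convention (treatment minus control) are assumptions of this sketch.

```python
import math
from statistics import mean, variance

def welch_t(observations):
    """Welch's t statistic and degrees of freedom for (is_treatment, value) pairs."""
    control = [v for is_treatment, v in observations if not is_treatment]
    treatment = [v for is_treatment, v in observations if is_treatment]
    # Per-group squared standard error: sample variance divided by group size.
    se_c = variance(control) / len(control)
    se_t = variance(treatment) / len(treatment)
    # Sign convention (treatment minus control) is an assumption of this sketch.
    t = (mean(treatment) - mean(control)) / math.sqrt(se_c + se_t)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_c + se_t) ** 2 / (
        se_c ** 2 / (len(control) - 1) + se_t ** 2 / (len(treatment) - 1)
    )
    return t, df

# Hypothetical observations: False marks the control group, True the treatment group.
observations = [
    (False, 1.0), (False, 2.0), (False, 3.0),
    (True, 2.0), (True, 4.0), (True, 6.0),
]
t_stat, dof = welch_t(observations)
print(t_stat, dof)
```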

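Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing. spark.mllib supports generating random RDDs with i.i.d. values drawn from a given distribution: uniform, standard normal, or Poisson.

RandomRDDs provides factory methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution N(0, 1), and then maps it to N(1, 4).

Refer to the RandomRDDs Scala docs for details on the API.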

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)

Refer to the RandomRDDs Java docs for details on the API.

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.DoubleFunction;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
// mapToDouble (rather than map) keeps the result a JavaDoubleRDD.
JavaDoubleRDD v = u.mapToDouble(
  new DoubleFunction<Double>() {
    public double call(Double x) {
      return 1.0 + 2.0 * x;
    }
  });

Refer to the RandomRDDs Python docs for details on the API.

from pyspark.mllib.random import RandomRDDs

sc = ... # SparkContext

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
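
The reason the transform 1.0 + 2.0 * x yields N(1, 4) is that adding a constant shifts the mean and multiplying by 2 scales the variance by 2^2 = 4. The plain-Python sketch below (no Spark; the seed and sample size are arbitrary choices for illustration) checks this empirically on a local list of draws.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# 100,000 draws from the standard normal distribution N(0, 1).
u = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
# Affine transform: shift by 1 and scale by 2, giving mean 1 and variance 2^2 = 4.
v = [1.0 + 2.0 * x for x in u]

mean_v = math.fsum(v) / len(v)
var_v = math.fsum((x - mean_v) ** 2 for x in v) / len(v)
print(mean_v, var_v)
```

The printed mean and variance should land close to 1 and 4 respectively, with small deviations due to sampling noise.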

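Kernel density estimation

Kernel density estimation is a technique useful for visualizing empirical probability distributions without requiring assumptions about the particular distribution that the observed samples are drawn from. It computes an estimate of the probability density function of a random variable, evaluated at a given set of points. It achieves this estimate by expressing the PDF of the empirical distribution at a particular point as the mean of PDFs of normal distributions centered around each of the samples.

KernelDensity provides methods to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so.

Refer to the KernelDensity Scala docs for details on the API.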

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))

KernelDensity provides methods to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so.

Refer to the KernelDensity Java docs for details on the API.

import org.apache.spark.mllib.stat.KernelDensity;
import org.apache.spark.rdd.RDD;

RDD<Double> data = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
KernelDensity kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0);

// Find density estimates for the given values
double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});

KernelDensity provides methods to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so.

Refer to the KernelDensity Python docs for details on the API.

from pyspark.mllib.stat import KernelDensity

data = ... # an RDD of sample data

# Construct the density estimator with the sample data and a standard deviation for the Gaussian
# kernels
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)

# Find density estimates for the given values
densities = kd.estimate([-1.0, 2.0, 5.0])
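
The estimate described above, the mean of normal PDFs centered on each sample, can be written out directly. The following plain-Python sketch (not the Spark implementation) treats the bandwidth as the standard deviation of each Gaussian kernel, as the comments in the examples state; the function names and the sample values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def kernel_density(samples, bandwidth, points):
    """Mean of Gaussian PDFs centered on each sample, evaluated at each point."""
    return [
        sum(gaussian_pdf(p, s, bandwidth) for s in samples) / len(samples)
        for p in points
    ]

samples = [1.0, 1.5, 2.0, 4.0, 7.0]  # hypothetical sample data
densities = kernel_density(samples, bandwidth=3.0, points=[-1.0, 2.0, 5.0])
print(densities)
```

A larger bandwidth spreads each kernel more widely and produces a smoother estimate, while a smaller bandwidth follows the individual samples more closely.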
