Regression - spark.mllib

Isotonic regression

Isotonic regression belongs to the family of regression algorithms. Formally, isotonic regression is the following problem: given a finite set of real numbers $Y = \{y_1, y_2, ..., y_n\}$ representing observed responses and $X = \{x_1, x_2, ..., x_n\}$ representing the unknown response values to be fitted, find a function that minimises

\begin{equation} f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2 \end{equation}

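subject to the complete order $x_1\le x_2\le ...\le x_n$, where $w_i$ are positive weights. The resulting function is called the isotonic regression, and it is unique. It can be viewed as a least squares problem under an order restriction. Essentially, isotonic regression is the monotonic function that best fits the original data points.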
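To make the objective concrete, the following plain-Python sketch evaluates the weighted squared error for a few hand-picked candidate fits on a three-point example. It does not use Spark; the helper names `weighted_squared_error` and `is_non_decreasing` and the sample data are assumptions made purely for illustration.

```python
def weighted_squared_error(y, x, w):
    """Evaluate f(x) = sum_i w_i * (y_i - x_i)^2 for a candidate fit x."""
    return sum(wi * (yi - xi) ** 2 for yi, xi, wi in zip(y, x, w))


def is_non_decreasing(x):
    """Check the order constraint x_1 <= x_2 <= ... <= x_n."""
    return all(a <= b for a, b in zip(x, x[1:]))


y = [1.0, 3.0, 2.0]  # observed responses; the last two violate the ordering
w = [1.0, 1.0, 1.0]  # unit weights

# Three candidate fits that all satisfy the order constraint.
candidates = {
    "pooled": [1.0, 2.5, 2.5],        # replace the violating pair by its mean
    "raise_last": [1.0, 3.0, 3.0],    # lift the last value up to the middle one
    "lower_middle": [1.0, 2.0, 2.0],  # pull the middle value down to the last one
}

errors = {name: weighted_squared_error(y, x, w) for name, x in candidates.items()}
for name, err in errors.items():
    print(name, is_non_decreasing(candidates[name]), err)
```

Among these candidates, pooling the violating pair to its mean gives the smallest error (0.5 versus 1.0), which is the intuition behind fitting by merging adjacent violators.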

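spark.mllib supports a pool adjacent violators algorithm that uses an approach to parallelizing isotonic regression. The training input is an RDD of tuples of three double values that represent the label, feature, and weight, in this order. Additionally, the IsotonicRegression algorithm has one optional parameter called $isotonic$, which defaults to true. This argument specifies whether the isotonic regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).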
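As a rough illustration of the idea, here is a minimal sequential pool adjacent violators sketch in plain Python. It is not spark.mllib's parallel implementation: the function name `pava`, its return format, and the trick of negating labels to obtain an antitonic fit are assumptions of this sketch, and points with equal features get no special treatment.

```python
def pava(points, isotonic=True):
    """Sequential pool-adjacent-violators fit (illustrative sketch).

    points: iterable of (label, feature, weight) tuples with positive weights.
    Returns a list of (feature, fitted_label) pairs sorted by feature.
    """
    # An antitonic fit is obtained by fitting the negated labels and negating back.
    sign = 1.0 if isotonic else -1.0
    ordered = sorted(points, key=lambda p: p[1])

    # Each block stores [weighted label sum, weight sum, number of points].
    blocks = []
    for label, _feature, weight in ordered:
        blocks.append([sign * label * weight, weight, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1:
            prev_mean = blocks[-2][0] / blocks[-2][1]
            curr_mean = blocks[-1][0] / blocks[-1][1]
            if prev_mean <= curr_mean:
                break
            s, w, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w
            blocks[-1][2] += c

    # Every point in a block receives the block's weighted mean.
    fitted = []
    for s, w, c in blocks:
        fitted.extend([sign * s / w] * c)
    return [(p[1], f) for p, f in zip(ordered, fitted)]


# (label, feature, weight) tuples, in the same order as the training input.
points = [(1.0, 1.0, 1.0), (3.0, 2.0, 1.0), (2.0, 3.0, 1.0)]
increasing = pava(points)                  # isotonic fit
decreasing = pava(points, isotonic=False)  # antitonic fit
print(increasing)
print(decreasing)
```

Raising the weight of one point pulls the pooled value towards that point's label, because each block's value is a weighted mean.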

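Training returns an IsotonicRegressionModel that can be used to predict labels for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function. The rules for prediction are therefore: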

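If the prediction input exactly matches a training feature, the associated prediction is returned. If there are multiple predictions with the same feature, one of them is returned. Which one is undefined (the same as java.util.Arrays.binarySearch).
If the prediction input is lower or higher than all training features, the prediction with the lowest or highest feature is returned, respectively. If there are multiple predictions with the same feature, the lowest or highest one is returned, respectively.
If the prediction input falls between two training features, the prediction is treated as a piecewise linear function and the interpolated value is calculated from the predictions of the two closest features. If there are multiple values with the same feature, the same rules as in the previous point are used.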
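The three rules above can be sketched in plain Python as follows. This is an illustrative approximation, not the library's code: the function `predict` and its arguments `boundaries` and `predictions` are hypothetical names, and the sketch assumes the training features are sorted and distinct, so the duplicate-feature cases are not covered.

```python
from bisect import bisect_left


def predict(boundaries, predictions, x):
    """Piecewise-linear prediction over sorted, distinct training features.

    boundaries: training features in increasing order.
    predictions: fitted labels, one per boundary.
    """
    # Rule 2: inputs outside the training range are clamped to the end values.
    if x <= boundaries[0]:
        return predictions[0]
    if x >= boundaries[-1]:
        return predictions[-1]

    # Binary search for the first boundary that is >= x.
    i = bisect_left(boundaries, x)

    # Rule 1: an exact match returns the associated prediction.
    if boundaries[i] == x:
        return predictions[i]

    # Rule 3: interpolate linearly between the two neighbouring features.
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = predictions[i - 1], predictions[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


boundaries = [1.0, 2.0, 4.0]
predictions = [1.0, 2.5, 3.5]
for x in (0.0, 1.5, 2.0, 3.0, 10.0):
    print(x, predict(boundaries, predictions, x))
```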

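Examples

Data are read from a file in which each line has the format label,feature, for example 4710.28,500.00. The data are split into training and test sets. The model is created from the training set, and the mean squared error is calculated from the predicted labels and the real labels in the test set.

Refer to the IsotonicRegression Scala docs and the IsotonicRegressionModel Scala docs for details on the API.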

import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}

val data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")

// Create label, feature, weight tuples from input data with weight set to default value 1.0.
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  (parts(0), parts(1), 1.0)
}

// Split data into training (60%) and test (40%) sets.
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

// Create isotonic regression model from training data.
// Isotonic parameter defaults to true so it is only shown for demonstration
val model = new IsotonicRegression().setIsotonic(true).run(training)

// Create tuples of predicted and real labels.
val predictionAndLabel = test.map { point =>
  val predictedLabel = model.predict(point._2)
  (predictedLabel, point._1)
}

// Calculate mean squared error between predicted and real labels.
val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean()
println("Mean Squared Error = " + meanSquaredError)

// Save and load model
model.save(sc, "target/tmp/myIsotonicRegressionModel")
val sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")

Find the full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala" in the Spark repo.

Refer to the IsotonicRegression Java docs and the IsotonicRegressionModel Java docs for details on the API.

import scala.Tuple2;
import scala.Tuple3;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.IsotonicRegression;
import org.apache.spark.mllib.regression.IsotonicRegressionModel;

JavaRDD<String> data = jsc.textFile("data/mllib/sample_isotonic_regression_data.txt");

// Create label, feature, weight tuples from input data with weight set to default value 1.0.
JavaRDD<Tuple3<Double, Double, Double>> parsedData = data.map(
  new Function<String, Tuple3<Double, Double, Double>>() {
    public Tuple3<Double, Double, Double> call(String line) {
      String[] parts = line.split(",");
      return new Tuple3<>(Double.valueOf(parts[0]), Double.valueOf(parts[1]), 1.0);
    }
  }
);

// Split data into training (60%) and test (40%) sets.
JavaRDD<Tuple3<Double, Double, Double>>[] splits = parsedData.randomSplit(new double[]{0.6, 0.4}, 11L);
JavaRDD<Tuple3<Double, Double, Double>> training = splits[0];
JavaRDD<Tuple3<Double, Double, Double>> test = splits[1];

// Create isotonic regression model from training data.
// Isotonic parameter defaults to true so it is only shown for demonstration
final IsotonicRegressionModel model = new IsotonicRegression().setIsotonic(true).run(training);

// Create tuples of predicted and real labels.
JavaPairRDD<Double, Double> predictionAndLabel = test.mapToPair(
  new PairFunction<Tuple3<Double, Double, Double>, Double, Double>() {
    @Override
    public Tuple2<Double, Double> call(Tuple3<Double, Double, Double> point) {
      Double predictedLabel = model.predict(point._2());
      return new Tuple2<Double, Double>(predictedLabel, point._1());
    }
  }
);

// Calculate mean squared error between predicted and real labels.
Double meanSquaredError = new JavaDoubleRDD(predictionAndLabel.map(
  new Function<Tuple2<Double, Double>, Object>() {
    @Override
    public Object call(Tuple2<Double, Double> pl) {
      return Math.pow(pl._1() - pl._2(), 2);
    }
  }
).rdd()).mean();
System.out.println("Mean Squared Error = " + meanSquaredError);

// Save and load model
model.save(jsc.sc(), "target/tmp/myIsotonicRegressionModel");
IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(jsc.sc(), "target/tmp/myIsotonicRegressionModel");

Find the full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaIsotonicRegressionExample.java" in the Spark repo.

Refer to the IsotonicRegression Python docs and the IsotonicRegressionModel Python docs for more details on the API.

import math
from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel

data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")

# Create label, feature, weight tuples from input data with weight set to default value 1.0.
parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,))

# Split data into training (60%) and test (40%) sets.
training, test = parsedData.randomSplit([0.6, 0.4], 11)

# Create isotonic regression model from training data.
# The isotonic parameter defaults to True, so it is passed here only for demonstration.
model = IsotonicRegression.train(training, isotonic=True)

# Create tuples of predicted and real labels.
predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0]))

# Calculate mean squared error between predicted and real labels.
meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean()
print("Mean Squared Error = " + str(meanSquaredError))

# Save and load model
model.save(sc, "target/tmp/myIsotonicRegressionModel")
sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")

Find the full example code at "examples/src/main/python/mllib/isotonic_regression_example.py" in the Spark repo.
