Scala: Stratified sampling in Spark
Original URL: http://stackoverflow.com/questions/32238727/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Stratified sampling in Spark
Asked by add-semi-colons
I have a data set which contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to make sure I only take 80% of each user's data to build one RDD, while taking the remaining 20% to build another RDD. Let's call them train and test. I would like to stay away from using groupBy to start with, since it can create memory problems because the data set is large. What's the best way to do this?
I could do the following, but this will not give 80% of each user's data.
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
Answered by eliasah
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed) Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it:
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up fractions for each key as follows, since sampleByKeyExact takes a Map of fractions, one for each key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct keys, and then associate each with a fraction equal to 0.8. I collect the whole thing as a Map.
To sample now:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 2L)
You can check the count on your keys, data, or data sample:
scala> data.count
// [...]
// res10: Long = 12
scala> sampleData.count
// [...]
// res11: Long = 10
Using DataFrames:
Let's consider the same data (seq) from the previous section.
import spark.implicits._ // needed for toDF (use sqlContext.implicits._ on Spark 1.x)
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD, on which we create tuples of the elements by defining our key to be the first column:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // you can use sqlContext.createDataFrame(...) instead for Spark 1.6
You can now check the count on your keys, df, or data sample:
scala> df.count
// [...]
// res9: Long = 12
scala> sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use the DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
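For example, here is an illustrative check (not from the original answer), assuming the df and fractions built above. Note that sampleBy samples each stratum approximately, unlike sampleByKeyExact, so the per-key counts can drift slightly around 80%; also, sampleBy expects an immutable Map, so the collection.Map returned by collectAsMap may need a .toMap first:

val sampled = df.stat.sampleBy("keyColumn", fractions.toMap, 2L)
// per-key counts should be close to, but not exactly, 80% of each stratum
sampled.groupBy("keyColumn").count().show()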
Answered by Holden
Something like this may be well suited to something like "Blink DB", but let's look at the question. There are two ways to interpret what you've asked:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data.
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils, or BernoulliCellSampler). You can then filter your input data down to just the set of IDs you want.
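A minimal sketch of #1 (the variable names and the join-based filtering are illustrative, not from the original answer): sample 80% of the distinct user ids with RDD.sample, then keep every record whose user id was drawn. The join keeps the id set distributed instead of collecting it to the driver:

// `data` is assumed to be the RDD of (userId, productId, flag) triples from the question
val userIds = data.map(_._1).distinct()
val sampledUsers = userIds.sample(withReplacement = false, fraction = 0.8, seed = 42L)
// keep all records belonging to a sampled user, via a join on userId
val trainSet = data.map(x => (x._1, x))
  .join(sampledUsers.map(id => (id, ())))
  .values
  .map(_._1)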
For #2 you could look at BernoulliCellSampler and simply apply it directly.
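Here is a hand-rolled sketch of that idea (written out manually rather than calling BernoulliCellSampler itself, whose internal API has shifted across Spark versions): each record gets one uniform draw from a deterministically seeded per-partition RNG, so running the same pass twice with the same seed yields complementary train/test sets, each containing roughly 80% and 20% of every user's records in expectation:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Deterministic per-partition draws: both passes see identical random
// sequences, so train and test are exact complements, provided the input
// partitioning and iteration order are stable (e.g. the RDD is cached).
def bernoulliCellSplit[T: ClassTag](rdd: RDD[T], fraction: Double, seed: Long): (RDD[T], RDD[T]) = {
  def pass(keepLower: Boolean): RDD[T] =
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      val rng = new java.util.Random(seed + idx) // one RNG per partition
      iter.filter { _ =>
        val draw = rng.nextDouble()              // one draw per record
        if (keepLower) draw < fraction else draw >= fraction
      }
    }
  (pass(keepLower = true), pass(keepLower = false))
}

val (train, test) = bernoulliCellSplit(data, 0.8, seed = 42L)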

