Python 如何从 RDD[PYSPARK] 中删除重复值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 
原文地址: http://stackoverflow.com/questions/25905596/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
How to remove duplicate values from a RDD[PYSPARK]
提问 by Prince Bhatti
I have the following table as a RDD:
我有下表作为 RDD:
Key Value
1    y
1    y
1    y
1    n
1    n
2    y
2    n
2    n
I want to remove all the duplicates from Value.
我想删除 Value 中所有的重复项。
Output should come like this:
输出应该是这样的:
Key Value
1    y
1    n
2    y
2    n
While working in pyspark, output should come as a list of key-value pairs like this:
在 pyspark 中工作时,输出应该是这样的键值对列表:
[(u'1',u'n'),(u'2',u'n')]
I don't know how to apply a for loop here. In a normal Python program it would have been very easy. 
我不知道如何在这里应用 for 循环。在普通的 Python 程序中,这会非常容易。
I wonder if there is some function in pyspark for the same.
我想知道 pyspark 中是否有能实现相同功能的函数。
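For reference, in plain Python (without Spark) the same thing could be done with a set, e.g. (a small illustrative sketch, not part of the original question):
作为参考,在不使用 Spark 的普通 Python 中,同样的事情可以用 set 来完成,例如(仅为简单示意,并非原问题内容):
pairs = [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n'), (u'2', u'n')]
deduped = list(set(pairs))  # a set keeps only one copy of each (key, value) tuple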
采纳答案 by Mikel Urkia
I am afraid I have no knowledge of python, so all the references and code I provide in this answer relate to java. However, it should not be very difficult to translate it into python code.
恐怕我对 python 一无所知,因此我在此答案中提供的所有参考资料和代码都基于 java。不过,将其翻译成 python 代码应该不会很难。
You should take a look at the following webpage. It redirects to Spark's official web page, which provides a list of all the transformations and actions supported by Spark.
你应该看看下面的网页。它会指向 Spark 的官方网页,该网页列出了 Spark 支持的所有转换和操作。
If I am not mistaken, the best approach (in your case) would be to use the distinct() transformation, which returns a new dataset that contains the distinct elements of the source dataset (taken from the link). In java, it would be something like:
如果我没记错的话,最好的方法(就你的情况而言)是使用 distinct() 转换,它会返回一个新的数据集,其中只包含源数据集中互不相同的元素(摘自上述链接)。在 java 中,大致是这样的:
JavaPairRDD<Integer, String> myDataSet = ...; // already obtained somewhere else
JavaPairRDD<Integer, String> distinctSet = myDataSet.distinct();
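A rough pyspark equivalent might look like this (a minimal sketch; the variable names are illustrative and assume an existing SparkContext sc, they are not from the original answer):
一个大致等价的 pyspark 写法可能是这样的(仅为简单示意;变量名是示例性的,并假设已经存在一个 SparkContext sc,并非出自原答案):
pairsRdd = sc.parallelize([(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'n')])
distinctPairsRdd = pairsRdd.distinct()  # new RDD containing only the distinct (key, value) pairs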
So that, for example:
因此,例如:
Partition 1:
1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n
Partition 2:
2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n
Would get converted to:
将转换为:
Partition 1:
1-y | 2-y
1-n | 2-n 
Partition 2:
1-y | 2-g | 2-y
1-n | 2-n |
Of course, you would still have multiple partitions within the RDD, each with a list of distinct elements.
当然,RDD 中仍然会有多个分区,每个分区各有一个由不同元素组成的列表。
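If you want to inspect how the elements are spread across partitions in pyspark, glom() can be used to collect each partition as a separate list (a small sketch building on the snippet above, not part of the original answer):
如果想在 pyspark 中查看元素在各个分区中的分布情况,可以用 glom() 把每个分区收集为一个单独的列表(基于上面代码片段的简单示意,并非原答案内容):
# Each element of the collected result is a Python list holding one partition's contents.
for i, partition in enumerate(distinctPairsRdd.glom().collect()):
    print('Partition %d: %s' % (i, partition))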
回答 by jsears
This problem is simple to solve using the distinct operation of the pyspark library from Apache Spark.
使用 Apache Spark 的 pyspark 库中的 distinct 操作可以很容易地解决这个问题。
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set up a SparkContext for local testing
    sc = SparkContext(appName="distinctTuples", conf=SparkConf().set("spark.driver.host", "localhost"))

    # Define the dataset
    dataset = [(u'1',u'y'),(u'1',u'y'),(u'1',u'y'),(u'1',u'n'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n'),(u'2',u'n')]

    # Parallelize and partition the dataset
    # so that the partitions can be operated
    # upon via multiple worker processes.
    allTuplesRdd = sc.parallelize(dataset, 4)

    # Filter out duplicates
    distinctTuplesRdd = allTuplesRdd.distinct()

    # Merge the results from all of the workers
    # into the driver process.
    distinctTuples = distinctTuplesRdd.collect()

    print('Output: %s' % distinctTuples)
This will output the following:
这将输出以下内容:
Output: [(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
回答 by captClueless
If you want to remove all duplicates from a particular column or set of columns, i.e. doing a distinct on a set of columns, then pyspark DataFrames have the dropDuplicates function, which accepts a specific set of columns to distinct on. 
如果你想删除某个特定列或某组列上的所有重复项,也就是对一组列做 distinct,那么 pyspark 的 DataFrame 提供了 dropDuplicates 函数,它可以接受需要去重的特定列集合。
For example:
例如:
df.dropDuplicates(['value']).show()
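A short end-to-end sketch of how this could be used (the DataFrame construction, SparkSession setup, and column names here are assumptions for illustration, not part of the original answer):
下面是一个简短的端到端示意(其中的 DataFrame 构造、SparkSession 初始化和列名只是为了演示而做的假设,并非原答案内容):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropDuplicatesExample").getOrCreate()
df = spark.createDataFrame(
    [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n')],
    ['key', 'value'])
df.dropDuplicates(['key', 'value']).show()  # distinct on both columns
df.dropDuplicates(['value']).show()         # distinct on the value column only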

