Python 如何从 RDD[PYSPARK] 中删除重复值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 
原文地址: http://stackoverflow.com/questions/25905596/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
How to remove duplicate values from a RDD[PYSPARK]
提问 by Prince Bhatti
I have the following table as a RDD:
我有下表作为 RDD:
Key Value
1    y
1    y
1    y
1    n
1    n
2    y
2    n
2    n
I want to remove all the duplicates from Value.
我想删除 Value 中所有的重复项。
Output should come like this:
输出应该是这样的:
Key Value
1    y
1    n
2    y
2    n
While working in pyspark, output should come as a list of key-value pairs like this:
在 pyspark 中工作时,输出应该是这样的键值对列表:
[(u'1',u'n'),(u'2',u'n')]
I don't know how to apply a for loop here. In a normal Python program it would have been very easy. 
我不知道如何在这里应用 for 循环。在普通的 Python 程序中,这会非常容易。
I wonder if there is some function in pyspark for the same.
我想知道 pyspark 中是否有能实现相同功能的函数。
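For reference, in plain Python (without Spark) the same thing could be done with a set, e.g. (a small illustrative sketch, not part of the original question):
作为参考,在不使用 Spark 的普通 Python 中,同样的事情可以用 set 来完成,例如(仅为简单示意,并非原问题内容):
pairs = [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n'), (u'2', u'n')]
deduped = list(set(pairs))  # a set keeps only one copy of each (key, value) tuple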
采纳答案 by Mikel Urkia
I am afraid I have no knowledge of python, so all the references and code I provide in this answer relate to java. However, it should not be very difficult to translate it into python code.
恐怕我对 python 一无所知,因此我在此答案中提供的所有参考资料和代码都基于 java。不过,将其翻译成 python 代码应该不会很难。
You should take a look at the following webpage. It redirects to Spark's official web page, which provides a list of all the transformations and actions supported by Spark.
你应该看看下面的网页。它会指向 Spark 的官方网页,该网页列出了 Spark 支持的所有转换和操作。
If I am not mistaken, the best approach (in your case) would be to use the distinct() transformation, which returns a new dataset that contains the distinct elements of the source dataset (taken from the link). In java, it would be something like:
如果我没记错的话,最好的方法(就你的情况而言)是使用 distinct() 转换,它会返回一个新的数据集,其中只包含源数据集中互不相同的元素(摘自上述链接)。在 java 中,大致是这样的:
JavaPairRDD<Integer, String> myDataSet = ...; // already obtained somewhere else
JavaPairRDD<Integer, String> distinctSet = myDataSet.distinct();
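A rough pyspark equivalent might look like this (a minimal sketch; the variable names are illustrative and assume an existing SparkContext sc, they are not from the original answer):
一个大致等价的 pyspark 写法可能是这样的(仅为简单示意;变量名是示例性的,并假设已经存在一个 SparkContext sc,并非出自原答案):
pairsRdd = sc.parallelize([(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'n')])
distinctPairsRdd = pairsRdd.distinct()  # new RDD containing only the distinct (key, value) pairs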
So that, for example:
因此,例如:
Partition 1:
1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n
Partition 2:
2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n
Would get converted to:
将转换为:
Partition 1:
1-y | 2-y
1-n | 2-n 
Partition 2:
1-y | 2-g | 2-y
1-n | 2-n |
Of course, you would still have multiple partitions within the RDD, each with a list of distinct elements.
当然,RDD 中仍然会有多个分区,每个分区各有一个由不同元素组成的列表。
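If you want to inspect how the elements are spread across partitions in pyspark, glom() can be used to collect each partition as a separate list (a small sketch building on the snippet above, not part of the original answer):
如果想在 pyspark 中查看元素在各个分区中的分布情况,可以用 glom() 把每个分区收集为一个单独的列表(基于上面代码片段的简单示意,并非原答案内容):
# Each element of the collected result is a Python list holding one partition's contents.
for i, partition in enumerate(distinctPairsRdd.glom().collect()):
    print('Partition %d: %s' % (i, partition))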
回答 by jsears
This problem is simple to solve using the distinct operation of the pyspark library from Apache Spark.
使用 Apache Spark 的 pyspark 库中的 distinct 操作可以很容易地解决这个问题。
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set up a SparkContext for local testing
    sc = SparkContext(appName="distinctTuples", conf=SparkConf().set("spark.driver.host", "localhost"))

    # Define the dataset
    dataset = [(u'1',u'y'),(u'1',u'y'),(u'1',u'y'),(u'1',u'n'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n'),(u'2',u'n')]

    # Parallelize and partition the dataset
    # so that the partitions can be operated
    # upon via multiple worker processes.
    allTuplesRdd = sc.parallelize(dataset, 4)

    # Filter out duplicates
    distinctTuplesRdd = allTuplesRdd.distinct()

    # Merge the results from all of the workers
    # into the driver process.
    distinctTuples = distinctTuplesRdd.collect()

    print('Output: %s' % distinctTuples)
This will output the following:
这将输出以下内容:
Output: [(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
回答 by captClueless
If you want to remove all duplicates from a particular column or set of columns, i.e. doing a distinct on a set of columns, then pyspark DataFrames have the dropDuplicates function, which accepts a specific set of columns to distinct on. 
如果你想删除某个特定列或某组列上的所有重复项,也就是对一组列做 distinct,那么 pyspark 的 DataFrame 提供了 dropDuplicates 函数,它可以接受需要去重的特定列集合。
For example:
例如:
df.dropDuplicates(['value']).show()
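A short end-to-end sketch of how this could be used (the DataFrame construction, SparkSession setup, and column names here are assumptions for illustration, not part of the original answer):
下面是一个简短的端到端示意(其中的 DataFrame 构造、SparkSession 初始化和列名只是为了演示而做的假设,并非原答案内容):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropDuplicatesExample").getOrCreate()
df = spark.createDataFrame(
    [(u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'2', u'y'), (u'2', u'n')],
    ['key', 'value'])
df.dropDuplicates(['key', 'value']).show()  # distinct on both columns
df.dropDuplicates(['value']).show()         # distinct on the value column only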

