Scala: Spark SQL DataFrame - distinct() vs dropDuplicates()
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep it under the same license.
Original source: http://stackoverflow.com/questions/35666967/
Spark SQL DataFrame - distinct() vs dropDuplicates()
Asked by Shankar
I was looking at the DataFrame API, and I can see two different methods that do the same thing: removing duplicates from a data set.
I understand that dropDuplicates(colNames) removes duplicates while considering only the given subset of columns.
Are there any other differences between these two methods?
Answered by Bentech
The main difference is the consideration of a subset of columns, which is great!
When using distinct, you need a prior .select to pick the columns on which you want to deduplicate, and the returned DataFrame contains only those selected columns. dropDuplicates(colNames), on the other hand, returns all the columns of the initial DataFrame after removing rows that are duplicated with respect to the given columns.
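To make the column difference concrete, here is a minimal sketch. It assumes Spark 2.x or later (where SparkSession is the entry point, which is newer than the API the question was asked about), a local master, and a made-up three-column data set; the names df, viaDistinct and viaDrop are only for illustration. It is written script-style, so it can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-vs-dropDuplicates")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: the two "alice" rows differ only in id.
val df = Seq(
  ("alice", "NY", 1),
  ("alice", "NY", 2),
  ("bob", "LA", 3)
).toDF("name", "city", "id")

// select + distinct: deduplicates on (name, city), but only those columns survive.
val viaDistinct = df.select("name", "city").distinct()
println(viaDistinct.columns.mkString(", ")) // name, city

// dropDuplicates on a subset: deduplicates on (name, city) and keeps every column.
val viaDrop = df.dropDuplicates(Seq("name", "city"))
println(viaDrop.columns.mkString(", ")) // name, city, id
```

Both results contain two rows here. Note that with dropDuplicates on a subset, which of the two "alice" rows is kept (id 1 or id 2) should not be relied on.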
Answered by Mrinal
According to the Javadoc, there is no difference between distinct() and the no-argument dropDuplicates().
dropDuplicates
public DataFrame dropDuplicates()
Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.
dropDuplicates() was introduced in 1.4 as a replacement for distinct(), since you can use its overloaded methods to get unique rows based on a subset of columns.
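A quick way to check this yourself is to compare row counts. The sketch below again assumes Spark 2.x or later with a local SparkSession, and the events data and column names are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dropDuplicates-alias")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: one fully duplicated row ("u1", "click").
val events = Seq(
  ("u1", "click"),
  ("u1", "click"),
  ("u1", "view"),
  ("u2", "click")
).toDF("user", "action")

// The no-argument forms both compare entire rows.
val distinctCount = events.distinct().count()       // 3
val dropAllCount  = events.dropDuplicates().count() // 3

// The overload taking column names keeps one row per distinct "user".
val perUserCount = events.dropDuplicates(Seq("user")).count() // 2

println(s"distinct=$distinctCount dropDuplicates=$dropAllCount perUser=$perUserCount")
```

The Seq[String] overload is used here because it is the form that matches dropDuplicates(colNames) from the question.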

