scala Spark SQL DataFrame - distinct() vs dropDuplicates()

Question

Asked by Shankar

I was looking at the DataFrame API, and I can see two different methods that provide the same functionality for removing duplicates from a data set.

I understand that dropDuplicates(colNames) removes duplicates considering only the given subset of columns.

Are there any other differences between these two methods?

Answer 1

Answered by Bentech

The main difference is how a subset of columns is handled, which is great! When using distinct, you need a prior .select to pick the columns on which you want to deduplicate, and the returned DataFrame contains only those selected columns, while dropDuplicates(colNames) returns all the columns of the initial DataFrame after removing duplicate rows based on the given columns.
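The difference described above can be sketched as follows. This is a minimal example, not the answerer's own code: it assumes Spark 2.x or later (for SparkSession) running in local mode, and the object name, column names (name, dept, salary), and rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DistinctVsDropDuplicates {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .appName("distinct-vs-dropDuplicates")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: two rows share the same (name, dept).
    val df = Seq(
      ("alice", "eng", 100),
      ("alice", "eng", 200),
      ("bob", "ops", 150)
    ).toDF("name", "dept", "salary")

    // select + distinct: the result keeps only the selected columns.
    val viaDistinct = df.select("name", "dept").distinct()
    println(viaDistinct.columns.mkString(", "))       // name, dept

    // dropDuplicates on a column subset: every original column is kept.
    val viaDropDuplicates = df.dropDuplicates(Seq("name", "dept"))
    println(viaDropDuplicates.columns.mkString(", ")) // name, dept, salary

    // Both results contain one row per distinct (name, dept) pair.
    println(viaDistinct.count())       // 2
    println(viaDropDuplicates.count()) // 2

    spark.stop()
  }
}
```

Note that for the duplicated (alice, eng) pair, which of the two salary values survives dropDuplicates is not something this sketch relies on.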

Answer 2

Answered by Mrinal

From the Javadoc, there is no difference between distinct() and dropDuplicates().

dropDuplicates
public DataFrame dropDuplicates()
Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.

dropDuplicates() was introduced in 1.4 as a replacement for distinct(), since you can use its overloaded methods to get unique rows based on a subset of columns.
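A small check of the alias behavior quoted above might look like this. It is only a sketch under assumptions: Spark 2.x or later in local mode, with an illustrative object name and made-up rows that include one exact duplicate.

```scala
import org.apache.spark.sql.SparkSession

object AliasCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .appName("alias-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: the first two rows are identical.
    val df = Seq(
      ("alice", "eng"),
      ("alice", "eng"),
      ("bob", "ops")
    ).toDF("name", "dept")

    // With no arguments, both calls deduplicate on all columns.
    val a = df.distinct()
    val b = df.dropDuplicates()

    println(a.count()) // 2
    println(b.count()) // 2

    // Neither result contains a row that the other lacks.
    println(a.except(b).count()) // 0
    println(b.except(a).count()) // 0

    spark.stop()
  }
}
```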
