scala: How to compare two data frames in Scala and print the columns that differ

Question

Asked by rominoushana

We have two data frames here.

The expected data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

And the actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two data frames is now:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function, df1.except(df2), but the problem is that it returns the entire rows that differ. What we want is to see which columns differ within each such row (in this case, "romin" and "romino" in "emp_name"). We have been having tremendous difficulty with this, and any help would be great.
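For anyone who wants to reproduce the behavior described above, here is a minimal sketch written as a script for spark-shell or a Scala worksheet. It assumes that df1 in the question is the actual data frame and df2 the expected one, which is consistent with the "romino" row being returned. The names `expected`, `actual`, and `diff`, and the local SparkSession settings, are illustrative assumptions and not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only (assumption; in spark-shell this reuses the existing session)
val spark = SparkSession.builder().master("local[*]").appName("except-demo").getOrCreate()
import spark.implicits._

val columnNames = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

// Expected data, copied from the first table in the question
val expected = Seq(
  (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
  (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
  (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
  (4, "sanjose", "romin", 9848022331L, 45123, "SanRamon")
).toDF(columnNames: _*)

// Actual data: only emp_name of emp_id 4 differs ("romino")
val actual = Seq(
  (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
  (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
  (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
  (4, "sanjose", "romino", 9848022331L, 45123, "SanRamon")
).toDF(columnNames: _*)

// except keeps the whole rows of `actual` that have no identical row in `expected`,
// so it cannot say which column caused the mismatch
val diff = actual.except(expected)
diff.show()
```

The result contains the complete row for emp_id 4, which is exactly the limitation the question is about.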


Answer 1

Answered by himanshuIIITian

From the scenario described in the question above, it looks like the difference has to be found between columns, not rows.

To do that, we need to apply a selective difference, which gives us the columns that have different values, along with those values.

To apply a selective difference, we have to write code something like this:

First, we need to get the column names of the expected and actual data frames.
val columns = df1.schema.fields.map(_.name)
Then we have to find the difference column by column.
val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
Finally, we need to find out which columns contain different values.
selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())

We will then get only the columns that contain different values, like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+
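Note that the column-wise except above prints only the value from df1 and does not say which row it belongs to. Because except is a set difference, it also cannot detect a changed value that happens to occur elsewhere in the same column of df2. If the rows can be matched on a key, a join-based comparison can report both values per row. The following is a minimal sketch under stated assumptions: emp_id is a unique key present in both data frames, both frames have the same columns, and there is at least one non-key column. The helper name `columnDiffs` and the output column names are hypothetical, and rows that exist in only one frame are not reported because an inner join is used.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Local session for experimentation only (assumption; in spark-shell this reuses the existing session)
val spark = SparkSession.builder().master("local[*]").appName("column-diff").getOrCreate()
import spark.implicits._

val columnNames = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

val expected = Seq(
  (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
  (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
  (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
  (4, "sanjose", "romin", 9848022331L, 45123, "SanRamon")
).toDF(columnNames: _*)

val actual = Seq(
  (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
  (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
  (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
  (4, "sanjose", "romino", 9848022331L, 45123, "SanRamon")
).toDF(columnNames: _*)

// Hypothetical helper: one output row per (key, column) pair whose values differ
def columnDiffs(expected: DataFrame, actual: DataFrame, key: String): DataFrame = {
  // Match rows of the two frames on the key column
  val joined = expected.as("e").join(actual.as("a"), col(s"e.$key") === col(s"a.$key"))
  val valueCols = expected.columns.filterNot(_ == key)

  valueCols.map { c =>
    joined
      // <=> is null-safe equality, so a null on one side counts as a difference
      .filter(!(col(s"e.$c") <=> col(s"a.$c")))
      .select(
        col(s"e.$key").as(key),
        lit(c).as("column"),
        // Cast to string so every per-column result has the same schema for the union
        col(s"e.$c").cast("string").as("expected"),
        col(s"a.$c").cast("string").as("actual"))
  }.reduce(_ union _)
}

val diffs = columnDiffs(expected, actual, "emp_id")
diffs.show()
```

Each per-column filter is a separate branch of the union, so for very wide tables this sketch may be slow. It is meant to illustrate the idea, not to be an optimized implementation.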

I hope this helps!


Answer 2

Answered by vivek mishra


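# PySpark version of the same column-wise comparison;
# df1 and df2 are the two data frames being compared
cols = df1.columns

# Build one single-column difference data frame per column
list_col = [df1.select(c).subtract(df2.select(c)) for c in cols]

# Show only the columns that actually contain differing values
for diff in list_col:
    if diff.count() > 0:
        diff.show()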
