How to compare two dataframes and print the columns that are different in Scala
Original question: http://stackoverflow.com/questions/44338412/
This content comes from StackOverFlow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by rominoushana
We have two data frames here:
the expected dataframe:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+
and the actual data frame:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+
the difference between the two dataframes now is:
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+
We are using the except function, df1.except(df2), but the problem is that it returns the entire rows that differ. What we want is to see which columns differ within such a row (in this case, "romin" and "romino" in "emp_name"). We have been having tremendous difficulty with this, and any help would be great.
Answered by himanshuIIITian
From the scenario described in the question above, it looks like the difference has to be found between columns rather than rows.
So, in order to do that, we need to apply a selective difference here, which gives us the columns that have different values, along with those values.
Now, to apply the selective difference, we can write code along these lines:
First, we need to get the column names shared by the expected and actual dataframes (judging by the output shown further down, df1 is the actual dataframe and df2 the expected one).
val columns = df1.schema.fields.map(_.name)
Then we have to find the differences column by column.
val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
Finally, we need to find out which columns contain different values.
selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())
This gives us only the columns that contain different values, like this:
+--------+
|emp_name|
+--------+
|  romino|
+--------+
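The column-by-column difference described in this answer can also be folded into one helper that returns the names of the differing columns instead of only printing them. The following is a minimal sketch rather than the answerer's code: the object name SelectiveDiff, the helper differingColumns, the trimmed two-column sample data, and the local SparkSession settings are all assumptions made for illustration, and it presumes both dataframes share the same column names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SelectiveDiff {
  // Hypothetical helper: names of the columns where df1 holds at least one
  // value that does not appear anywhere in the same column of df2.
  def differingColumns(df1: DataFrame, df2: DataFrame): Seq[String] =
    df1.columns.toSeq.filter { name =>
      // head(1) fetches at most one row, so no full count is needed.
      df1.select(name).except(df2.select(name)).head(1).nonEmpty
    }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("selective-diff")
      .getOrCreate()
    import spark.implicits._

    // Trimmed-down versions of the question's dataframes (assumption).
    val actual = Seq((3, "rahman"), (4, "romino")).toDF("emp_id", "emp_name")
    val expected = Seq((3, "rahman"), (4, "romin")).toDF("emp_id", "emp_name")

    differingColumns(actual, expected).foreach(println)
    spark.stop()
  }
}
```

Using head(1).nonEmpty rather than count avoids scanning the whole difference just to learn whether it is empty, which may matter on larger dataframes.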
I hope this helps!
Answered by vivek mishra
list_col = []
cols = df1.columns
# Prepare list of dataframes/per column
for col in cols:
    list_col.append(df1.select(col).subtract(df2.select(col)))
# Render/persist
for l in list_col:
    if l.count() > 0:
        l.show()
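One caveat applies to both answers: a column-by-column except (or subtract) compares sets of values and loses track of which row a value came from, so it cannot tell you which employee changed, and a value that merely moved from one row to another goes unnoticed. If the dataframes have a unique key, joining on it and comparing column by column keeps that information. Below is a minimal sketch in Scala (the language of the question), assuming emp_id is such a key; the object name KeyedColumnDiff, the helper diffByKey, the output column diff_columns, and the local SparkSession settings are illustrative assumptions, not part of either answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

object KeyedColumnDiff {
  // Hypothetical helper: one row per key whose other columns disagree, with
  // the names of the disagreeing columns joined into "diff_columns".
  // Keys present in only one dataframe are dropped by the inner join.
  def diffByKey(expected: DataFrame, actual: DataFrame, key: String): DataFrame = {
    val valueCols = expected.columns.filter(_ != key)
    // when(...) without otherwise yields null for matching columns, and
    // concat_ws skips nulls, so only the mismatching names survive.
    val changed = valueCols.map { c =>
      when(!(col(s"e.$c") <=> col(s"a.$c")), lit(c))
    }
    expected.alias("e")
      .join(actual.alias("a"), col(s"e.$key") === col(s"a.$key"))
      .select(col(s"e.$key").as(key), concat_ws(",", changed: _*).as("diff_columns"))
      .filter(col("diff_columns") =!= "")
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keyed-column-diff")
      .getOrCreate()
    import spark.implicits._

    val names = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
    val expected = Seq(
      (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
      (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romin", 9848022331L, 45123, "SanRamon")
    ).toDF(names: _*)
    val actual = Seq(
      (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
      (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romino", 9848022331L, 45123, "SanRamon")
    ).toDF(names: _*)

    diffByKey(expected, actual, "emp_id").show()
    spark.stop()
  }
}
```

For the sample data in the question, this should report a single row: emp_id 4 with diff_columns set to emp_name. The null-safe comparison <=> is used so that a null on one side and a value on the other counts as a difference.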

