Python Spark Dataframe: distinguishing columns with duplicated names
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, include the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/33778664/
Spark Dataframe distinguish columns with duplicated name
Asked by resec
As far as I know, in a Spark DataFrame multiple columns can have the same name, as shown in the dataframe snapshot below:
[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]
The result above was created by joining a dataframe with itself; you can see there are 4 columns, two named a and two named f.
The problem is that when I try to do further calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'); both returned the error message below:
AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.
Is there any way in the Spark API to distinguish between columns with duplicated names? Or perhaps some way to let me change the column names?
Accepted answer by Glennie Helles Sindholt
I would recommend that you change the column names for your join:
df1.select('a as "df1_a", 'f as "df1_f")
.join(df2.select('a as "df2_a", 'f as "df2_f"), 'df1_a === 'df2_a)
The resulting DataFrame will have the schema
(df1_a, df1_f, df2_a, df2_f)
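A PySpark sketch of the same idea (assuming df1 and df2 each have columns a and f, as in the question; the df1_/df2_ prefixes are just an illustrative choice):

from pyspark.sql.functions import col

# Rename the columns on each side before joining so no name is duplicated.
left = df1.select(col('a').alias('df1_a'), col('f').alias('df1_f'))
right = df2.select(col('a').alias('df2_a'), col('f').alias('df2_f'))
joined = left.join(right, col('df1_a') == col('df2_a'))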
Answered by resec
After digging into the Spark API, I found I can first use alias to create an alias for the original dataframe, then use withColumnRenamed to manually rename every column on the alias; this performs the join without causing column name duplication.
More details can be found in the Spark DataFrame API below:
pyspark.sql.DataFrame.withColumnRenamed
However, I think this is only a troublesome workaround, and I wonder whether there is a better way to solve my problem.
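A minimal sketch of that workaround, assuming the self-join from the question (the '_2' suffix is an illustrative choice):

# Alias the original dataframe, rename every column on the alias,
# then self-join without producing duplicated column names.
other = df1.alias('other')
for c in other.columns:
    other = other.withColumnRenamed(c, c + '_2')
joined = df1.join(other, df1['a'] == other['a_2'])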
Answered by zero323
Let's start with some data:
from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(a=107831, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=125231, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])
df2 = sqlContext.createDataFrame([
Row(a=107831, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])
There are a few ways you can approach this problem. First of all, you can unambiguously reference child table columns using parent columns:
df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)
## +--------------------+
## | f|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
You can also use table aliases:
from pyspark.sql.functions import col
df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")
df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)
## +--------------------+
## | f|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
Finally you can programmatically rename columns:
df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))
df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)
## +--------------------+
## | f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
Answered by StrongYoung
You can use the def drop(col: Column) method to drop the duplicated column, for example:
DataFrame:df1
+-------+-----+
| a | f |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+
DataFrame:df2
+-------+-----+
| a | f |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+
When I join df1 with df2, the DataFrame will look like below:
val newDf = df1.join(df2,df1("a")===df2("a"))
DataFrame:newDf
+-------+-----+-------+-----+
| a | f | a | f |
+-------+-----+-------+-----+
|107831 | ... |107831 | ... |
|107831 | ... |107831 | ... |
+-------+-----+-------+-----+
Now, we can use the def drop(col: Column) method to drop the duplicated column 'a' or 'f', as follows:
val newDfWithoutDuplicate = df1.join(df2,df1("a")===df2("a")).drop(df2("a")).drop(df2("f"))
Answered by typhoonbxq
Suppose the DataFrames you want to join are df1 and df2, and you are joining them on column 'a'; then you have two methods.
Method 1
df1.join(df2,'a','left_outer')
This is an awesome method and it is highly recommended.
Method 2
df1.join(df2,df1.a == df2.a,'left_outer').drop(df2.a)
Answered by Paul Bendevis
There is a simpler way than writing aliases for all of the columns you are joining on:
df1.join(df2,['a'])
This works if the key that you are joining on is the same in both tables.
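Note that only the join key is de-duplicated this way; a quick sketch (assuming the df1/df2 from earlier, which both also carry an f column):

joined = df1.join(df2, ['a'])
print(joined.columns)  # ['a', 'f', 'f'] -- 'a' appears once, but 'f' is still duplicated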
See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
Answered by Nikhil Redij
This is how we can join two DataFrames on the same column names in PySpark.
df = df1.join(df2, ['col1','col2','col3'])
If you do printSchema() after this, you can see that the duplicate columns have been removed.
Answered by Akash
This might not be the best approach, but if you want to rename the duplicate columns (after a join), you can do so using this tiny function.
def rename_duplicate_columns(dataframe):
    """Append '2' to the first occurrence of any column name that appears exactly twice."""
    columns = dataframe.columns
    # columns.index(col) returns the first occurrence, so only that one is renamed
    duplicate_column_indices = list(set([columns.index(col) for col in columns if columns.count(col) == 2]))
    for index in duplicate_column_indices:
        columns[index] = columns[index] + '2'
    dataframe = dataframe.toDF(*columns)
    return dataframe
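A usage sketch, assuming a join result with two a and two f columns as in the question (the function renames the first occurrence of each duplicated name):

joined = df1.join(df2, df1['a'] == df2['a'])
print(joined.columns)                       # ['a', 'f', 'a', 'f']
deduped = rename_duplicate_columns(joined)
print(deduped.columns)                      # ['a2', 'f2', 'a', 'f']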
Answered by Wassermann
If you have a more complicated use case than described in Glennie Helles Sindholt's answer, e.g. you have other non-join column names that are also the same and want to distinguish them while selecting, it is best to use aliases, e.g.:
df3 = df1.select("a", "b").alias("left")\
.join(df2.select("a", "b").alias("right"), ["a"])\
.select("left.a", "left.b", "right.b")
df3.columns
['a', 'b', 'b']
Answered by Manish Singla
If only the key column is the same in both tables, then try using the following way (Approach 1):
left.join(right, 'key', 'inner')
rather than the one below (Approach 2):
left.join(right, left.key == right.key, 'inner')
Pros of using Approach 1:
- the 'key' will show up only once in the final dataframe
- easy-to-use syntax
Cons of using Approach 1:
- it only helps with the key column
- in the case of a left join, if you plan to use the right table's key for a null count, this will not work; in that case, you have to rename one of the keys as mentioned above (see the sketch after this list)
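A sketch of that second con, assuming left and right are dataframes sharing a key column: with Approach 2 the right-hand key survives the join, so unmatched rows can still be counted.

# Approach 2: both key columns survive the left join, so the
# right-hand key can be checked for nulls (i.e. unmatched left rows).
joined = left.join(right, left.key == right.key, 'left_outer')
unmatched = joined.filter(right['key'].isNull()).count()

# Approach 1 (left.join(right, 'key', 'left_outer')) keeps a single
# merged 'key' column, so this null count is not possible there.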