Python Spark Dataframe: distinguishing columns with duplicated names
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, include the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/33778664/
Spark Dataframe distinguish columns with duplicated name
Asked by resec
As far as I know, in a Spark DataFrame multiple columns can have the same name, as shown in the dataframe snapshot below:
[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]
The result above was created by joining a dataframe with itself; you can see there are 4 columns, two named a and two named f.
The problem is that when I try to do further calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'); both returned the error message below:
AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.
Is there any way in the Spark API to distinguish between columns with duplicated names? Or perhaps some way to let me change the column names?
Accepted answer by Glennie Helles Sindholt
I would recommend that you change the column names for your join:
df1.select('a as "df1_a", 'f as "df1_f")
.join(df2.select('a as "df2_a", 'f as "df2_f"), 'df1_a === 'df2_a)
The resulting DataFrame will have the schema
(df1_a, df1_f, df2_a, df2_f)
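A PySpark sketch of the same idea (assuming df1 and df2 each have columns a and f, as in the question; the df1_/df2_ prefixes are just an illustrative choice):

from pyspark.sql.functions import col

# Rename the columns on each side before joining so no name is duplicated.
left = df1.select(col('a').alias('df1_a'), col('f').alias('df1_f'))
right = df2.select(col('a').alias('df2_a'), col('f').alias('df2_f'))
joined = left.join(right, col('df1_a') == col('df2_a'))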
Answered by resec
After digging into the Spark API, I found I can first use alias to create an alias for the original dataframe, then use withColumnRenamed to manually rename every column on the alias; this performs the join without causing column name duplication.
More details can be found in the Spark DataFrame API below:
pyspark.sql.DataFrame.withColumnRenamed
However, I think this is only a troublesome workaround, and I wonder whether there is a better way to solve my problem.
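A minimal sketch of that workaround, assuming the self-join from the question (the '_2' suffix is an illustrative choice):

# Alias the original dataframe, rename every column on the alias,
# then self-join without producing duplicated column names.
other = df1.alias('other')
for c in other.columns:
    other = other.withColumnRenamed(c, c + '_2')
joined = df1.join(other, df1['a'] == other['a_2'])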
Answered by zero323
Let's start with some data:
from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(a=107831, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=125231, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])
df2 = sqlContext.createDataFrame([
Row(a=107831, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(
5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])
There are a few ways you can approach this problem. First of all, you can unambiguously reference child table columns using parent columns:
df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)
## +--------------------+
## | f|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
You can also use table aliases:
from pyspark.sql.functions import col
df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")
df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)
## +--------------------+
## | f|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
Finally you can programmatically rename columns:
df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))
df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)
## +--------------------+
## | f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
Answered by StrongYoung
You can use the def drop(col: Column) method to drop the duplicated column, for example:
DataFrame:df1
+-------+-----+
| a | f |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+
DataFrame:df2
+-------+-----+
| a | f |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+
When I join df1 with df2, the DataFrame will look like below:
val newDf = df1.join(df2,df1("a")===df2("a"))
DataFrame:newDf
+-------+-----+-------+-----+
| a | f | a | f |
+-------+-----+-------+-----+
|107831 | ... |107831 | ... |
|107831 | ... |107831 | ... |
+-------+-----+-------+-----+
Now, we can use the def drop(col: Column) method to drop the duplicated column 'a' or 'f', as follows:
val newDfWithoutDuplicate = df1.join(df2,df1("a")===df2("a")).drop(df2("a")).drop(df2("f"))
Answered by typhoonbxq
Suppose the DataFrames you want to join are df1 and df2, and you are joining them on column 'a'; then you have two methods.
Method 1
df1.join(df2,'a','left_outer')
This is an awesome method and it is highly recommended.
Method 2
df1.join(df2,df1.a == df2.a,'left_outer').drop(df2.a)
Answered by Paul Bendevis
There is a simpler way than writing aliases for all of the columns you are joining on:
df1.join(df2,['a'])
This works if the key that you are joining on is the same in both tables.
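Note that only the join key is de-duplicated this way; a quick sketch (assuming the df1/df2 from earlier, which both also carry an f column):

joined = df1.join(df2, ['a'])
print(joined.columns)  # ['a', 'f', 'f'] -- 'a' appears once, but 'f' is still duplicated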
See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
Answered by Nikhil Redij
This is how we can join two DataFrames on the same column names in PySpark.
df = df1.join(df2, ['col1','col2','col3'])
If you do printSchema() after this, you can see that the duplicate columns have been removed.
Answered by Akash
This might not be the best approach, but if you want to rename the duplicate columns (after a join), you can do so using this tiny function.
def rename_duplicate_columns(dataframe):
    """Append '2' to the first occurrence of any column name that appears exactly twice."""
    columns = dataframe.columns
    # columns.index(col) returns the first occurrence, so only that one is renamed
    duplicate_column_indices = list(set([columns.index(col) for col in columns if columns.count(col) == 2]))
    for index in duplicate_column_indices:
        columns[index] = columns[index] + '2'
    dataframe = dataframe.toDF(*columns)
    return dataframe
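A usage sketch, assuming a join result with two a and two f columns as in the question (the function renames the first occurrence of each duplicated name):

joined = df1.join(df2, df1['a'] == df2['a'])
print(joined.columns)                       # ['a', 'f', 'a', 'f']
deduped = rename_duplicate_columns(joined)
print(deduped.columns)                      # ['a2', 'f2', 'a', 'f']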
Answered by Wassermann
If you have a more complicated use case than described in Glennie Helles Sindholt's answer, e.g. you have other non-join column names that are also the same and want to distinguish them while selecting, it is best to use aliases, e.g.:
df3 = df1.select("a", "b").alias("left")\
.join(df2.select("a", "b").alias("right"), ["a"])\
.select("left.a", "left.b", "right.b")
df3.columns
['a', 'b', 'b']
Answered by Manish Singla
If only the key column is the same in both tables, then try using the following way (Approach 1):
left.join(right, 'key', 'inner')
rather than the one below (Approach 2):
left.join(right, left.key == right.key, 'inner')
Pros of using Approach 1:
- the 'key' will show up only once in the final dataframe
- easy-to-use syntax
Cons of using Approach 1:
- it only helps with the key column
- in the case of a left join, if you plan to use the right table's key for a null count, this will not work; in that case, you have to rename one of the keys as mentioned above (see the sketch after this list)
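A sketch of that second con, assuming left and right are dataframes sharing a key column: with Approach 2 the right-hand key survives the join, so unmatched rows can still be counted.

# Approach 2: both key columns survive the left join, so the
# right-hand key can be checked for nulls (i.e. unmatched left rows).
joined = left.join(right, left.key == right.key, 'left_outer')
unmatched = joined.filter(right['key'].isNull()).count()

# Approach 1 (left.join(right, 'key', 'left_outer')) keeps a single
# merged 'key' column, so this null count is not possible there.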