Python 连接两个 PySpark 数据帧

Question

提问by Ivan

I'm trying to concatenate two PySpark dataframes with some columns that are only on each of them:

我正在尝试将两个 PySpark 数据帧与一些仅位于其中的列连接起来：

from pyspark.sql.functions import randn, rand

df_1 = sqlContext.range(0, 10)

+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+

df_2 = sqlContext.range(11, 20)

+--+
|id|
+--+
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+--+

df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))

and now I want to generate a third dataframe. I would like something like pandas concat:

现在我想生成第三个数据框。我想要像熊猫这样的东西concat：

df_1.show()
+---+--------------------+--------------------+
| id|             uniform|              normal|
+---+--------------------+--------------------+
|  0|  0.8122802274304282|  1.2423430583597714|
|  1|  0.8642043127063618|  0.3900018344856156|
|  2|  0.8292577771850476|  1.8077401259195247|
|  3|   0.198558705368724| -0.4270585782850261|
|  4|0.012661361966674889|   0.702634599720141|
|  5|  0.8535692890157796|-0.42355804115129153|
|  6|  0.3723296190171911|  1.3789648582622995|
|  7|  0.9529794127670571| 0.16238718777444605|
|  8|  0.9746632635918108| 0.02448061333761742|
|  9|   0.513622008243935|  0.7626741803250845|
+---+--------------------+--------------------+

df_2.show()
+---+--------------------+--------------------+
| id|             uniform|            normal_2|
+---+--------------------+--------------------+
| 11|  0.3221262660507942|  1.0269298899109824|
| 12|  0.4030672316912547|   1.285648175568798|
| 13|  0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876|  -0.678915153834693|
| 15|  0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17|  0.6411908952297819|  0.9161177183227823|
| 18|  0.5669232696934479|  0.7270125277020573|
| 19|   0.513622008243935|  0.7626741803250845|
+---+--------------------+--------------------+

#do some concatenation here, how?

df_concat.show()

| id|             uniform|              normal| normal_2   |
+---+--------------------+--------------------+------------+
|  0|  0.8122802274304282|  1.2423430583597714| None       |
|  1|  0.8642043127063618|  0.3900018344856156| None       |
|  2|  0.8292577771850476|  1.8077401259195247| None       |
|  3|   0.198558705368724| -0.4270585782850261| None       |
|  4|0.012661361966674889|   0.702634599720141| None       |
|  5|  0.8535692890157796|-0.42355804115129153| None       |
|  6|  0.3723296190171911|  1.3789648582622995| None       |
|  7|  0.9529794127670571| 0.16238718777444605| None       |
|  8|  0.9746632635918108| 0.02448061333761742| None       |
|  9|   0.513622008243935|  0.7626741803250845| None       |
| 11|  0.3221262660507942|  None              | 0.123      |
| 12|  0.4030672316912547|  None              |0.12323     |
| 13|  0.9690555459609131|  None              |0.123       |
| 14|0.011913836266515876|  None              |0.18923     |
| 15|  0.9359607054250594|  None              |0.99123     |
| 16| 0.45680471157575453|  None              |0.123       |
| 17|  0.6411908952297819|  None              |1.123       |
| 18|  0.5669232696934479|  None              |0.10023     |
| 19|   0.513622008243935|  None              |0.916332123 |
+---+--------------------+--------------------+------------+

Is that possible?

那可能吗？

Answer 1

回答by Daniel de Paula

Maybe you can try creating the unexisting columns and calling union(unionAllfor Spark 1.6 or lower):

也许您可以尝试创建不存在的列并调用union（unionAll对于 Spark 1.6 或更低版本）：

cols = ['id', 'uniform', 'normal', 'normal_2']    

df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
df_2_new = df_2.withColumn("normal", lit(None)).select(cols)

result = df_1_new.union(df_2_new)

Answer 2

回答by David

df_concat = df_1.union(df_2)

The dataframes may need to have identical columns, in which case you can use withColumn()to create normal_1and normal_2

数据框可能需要具有相同的列，在这种情况下，您可以使用withColumn()创建normal_1和normal_2

Answer 3

回答by Shadowtrooper

You can use unionByName to make this:

您可以使用 unionByName 来实现：

df = df_1.unionByName(df_2)

unionByName is available since Spark 2.3.0.

unionByName 从 Spark 2.3.0 开始可用。

Answer 4

回答by user020314

Here is one way to do it, in case it is still useful: I ran this in pyspark shell, Python version 2.7.12 and my Spark install was version 2.0.1.

这是一种方法，以防它仍然有用：我在 pyspark shell 中运行它，Python 版本为 2.7.12，我的 Spark 安装版本为 2.0.1。

PS: I guess you meant to use different seeds for the df_1 df_2 and the code below reflects that.

PS：我猜你打算为 df_1 df_2 使用不同的种子，下面的代码反映了这一点。

from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand
import pyspark.sql.functions as F

df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))

def get_uniform(df1_uniform, df2_uniform):
    if df1_uniform:
        return df1_uniform
    if df2_uniform:
        return df2_uniform

u_get_uniform = F.udf(get_uniform, FloatType())

df_3 = df_1.join(df_2, on = "id", how = 'outer').select("id", u_get_uniform(df_1["uniform"], df_2["uniform"]).alias("uniform"), "normal", "normal_2").orderBy(F.col("id"))

Here are the outputs I get:

这是我得到的输出：

df_1.show()
+---+-------------------+--------------------+
| id|            uniform|              normal|
+---+-------------------+--------------------+
|  0|0.41371264720975787|  0.5888539012978773|
|  1| 0.7311719281896606|  0.8645537008427937|
|  2| 0.1982919638208397| 0.06157382353970104|
|  3|0.12714181165849525|  0.3623040918178586|
|  4| 0.7604318153406678|-0.49575204523675975|
|  5|0.12030715258495939|  1.0854146699817222|
|  6|0.12131363910425985| -0.5284523629183004|
|  7|0.44292918521277047| -0.4798519469521663|
|  8| 0.8898784253886249| -0.8820294772950535|
|  9|0.03650707717266999| -2.1591956435415334|
+---+-------------------+--------------------+

df_2.show()
+---+-------------------+--------------------+
| id|            uniform|            normal_2|
+---+-------------------+--------------------+
| 11| 0.1982919638208397| 0.06157382353970104|
| 12|0.12714181165849525|  0.3623040918178586|
| 13|0.12030715258495939|  1.0854146699817222|
| 14|0.12131363910425985| -0.5284523629183004|
| 15|0.44292918521277047| -0.4798519469521663|
| 16| 0.8898784253886249| -0.8820294772950535|
| 17| 0.2731073068483362|-0.15116027592854422|
| 18| 0.7784518091224375| -0.3785563841011868|
| 19|0.43776394586845413| 0.47700719174464357|
+---+-------------------+--------------------+

df_3.show()
+---+-----------+--------------------+--------------------+                     
| id|    uniform|              normal|            normal_2|
+---+-----------+--------------------+--------------------+
|  0| 0.41371265|  0.5888539012978773|                null|
|  1|  0.7311719|  0.8645537008427937|                null|
|  2| 0.19829196| 0.06157382353970104|                null|
|  3| 0.12714182|  0.3623040918178586|                null|
|  4|  0.7604318|-0.49575204523675975|                null|
|  5|0.120307155|  1.0854146699817222|                null|
|  6| 0.12131364| -0.5284523629183004|                null|
|  7| 0.44292918| -0.4798519469521663|                null|
|  8| 0.88987845| -0.8820294772950535|                null|
|  9|0.036507078| -2.1591956435415334|                null|
| 11| 0.19829196|                null| 0.06157382353970104|
| 12| 0.12714182|                null|  0.3623040918178586|
| 13|0.120307155|                null|  1.0854146699817222|
| 14| 0.12131364|                null| -0.5284523629183004|
| 15| 0.44292918|                null| -0.4798519469521663|
| 16| 0.88987845|                null| -0.8820294772950535|
| 17| 0.27310732|                null|-0.15116027592854422|
| 18|  0.7784518|                null| -0.3785563841011868|
| 19| 0.43776396|                null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+

Answer 5

回答by Yuchen Zhong

To make it more generic of keeping both columns in df1and df2:

为了更通用地将两列都保留在df1和中df2：

import pyspark.sql.functions as F

# Keep all columns in either df1 or df2
def outter_union(df1, df2):

    # Add missing columns to df1
    left_df = df1
    for column in set(df2.columns) - set(df1.columns):
        left_df = left_df.withColumn(column, F.lit(None))

    # Add missing columns to df2
    right_df = df2
    for column in set(df1.columns) - set(df2.columns):
        right_df = right_df.withColumn(column, F.lit(None))

    # Make sure columns are ordered the same
    return left_df.union(right_df.select(left_df.columns))

Answer 6

回答by furianpandit

Above answers are very elegant. I have written this function long back where i was also struggling to concatenate two dataframe with distinct columns.

上面的答案非常优雅。我很久以前就写过这个函数，我也在努力将两个数据框与不同的列连接起来。

Suppose you have dataframe sdf1 and sdf2

假设你有数据帧 sdf1 和 sdf2

from pyspark.sql import functions as F
from pyspark.sql.types import *

def unequal_union_sdf(sdf1, sdf2):
    s_df1_schema = set((x.name, x.dataType) for x in sdf1.schema)
    s_df2_schema = set((x.name, x.dataType) for x in sdf2.schema)

    for i,j in s_df2_schema.difference(s_df1_schema):
        sdf1 = sdf1.withColumn(i,F.lit(None).cast(j))

    for i,j in s_df1_schema.difference(s_df2_schema):
        sdf2 = sdf2.withColumn(i,F.lit(None).cast(j))

    common_schema_colnames = sdf1.columns
    sdk = \
        sdf1.select(common_schema_colnames).union(sdf2.select(common_schema_colnames))
    return sdk 

sdf_concat = unequal_union_sdf(sdf1, sdf2)

Answer 7

回答by Bobby John

im a dwh turned pyspark developer. Below is what I would do:

我是 dwh 转为 pyspark 开发人员。下面是我会做的：

    from pyspark.sql import SparkSession
    df_1.createOrReplaceTempView("tab_1")
    df_2.createOrReplaceTempView("tab_2")
    df_concat=spark.sql("select tab_1.id,tab_1.uniform,tab_1.normal,tab_2.normal_2  from tab_1 tab_1 left join tab_2 tab_2 on tab_1.uniform=tab_2.uniform\
                union\
                select tab_2.id,tab_2.uniform,tab_1.normal,tab_2.normal_2  from tab_2 tab_2 left join tab_1 tab_1 on tab_1.uniform=tab_2.uniform")
    df_concat.show()

--pls let me know if this worked for you or was your need.

--请让我知道这是否适合您或您需要。

Answer 8

回答by Hans

This should do it for you ...

这应该为你做...

from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand, lit, coalesce, col
import pyspark.sql.functions as F

df_1 = sqlContext.range(0, 6)
df_2 = sqlContext.range(3, 10)
df_1 = df_1.select("id", lit("old").alias("source"))
df_2 = df_2.select("id")

df_1.show()
df_2.show()
df_3 = df_1.alias("df_1").join(df_2.alias("df_2"), df_1.id == df_2.id, "outer")\
  .select(\
    [coalesce(df_1.id, df_2.id).alias("id")] +\
    [col("df_1." + c) for c in df_1.columns if c != "id"])\
  .sort("id")
df_3.show()

Answer 9

回答by Mangnier Lo?c

Maybe, you want to concatenate more of two Dataframes. I found a issue which use pandas Dataframe conversion.

也许，您想连接更多的两个数据帧。我发现了一个使用 Pandas Dataframe 转换的问题。

Suppose you have 3 spark Dataframe who want to concatenate.

假设您有 3 个想要连接的 spark Dataframe。

The code is the following:

代码如下：

list_dfs = []
list_dfs_ = []

df = spark.read.json('path_to_your_jsonfile.json',multiLine = True)
df2 = spark.read.json('path_to_your_jsonfile2.json',multiLine = True)
df3 = spark.read.json('path_to_your_jsonfile3.json',multiLine = True)

list_dfs.extend([df,df2,df3])

for df in list_dfs : 

    df = df.select([column for column in df.columns]).toPandas()
    list_dfs_.append(df)

list_dfs.clear()

df_ = sqlContext.createDataFrame(pd.concat(list_dfs_))

Python 连接两个 PySpark 数据帧

提问by Ivan

回答by Daniel de Paula

回答by David

回答by Shadowtrooper

回答by user020314

回答by Yuchen Zhong

回答by furianpandit

回答by Bobby John

回答by Hans

回答by Mangnier Lo?c

相关推荐

最近更新

标签

Python 连接两个 PySpark 数据帧

提问by Ivan

回答by Daniel de Paula

回答by David

回答by Shadowtrooper

回答by user020314

回答by Yuchen Zhong

回答by furianpandit

回答by Bobby John

回答by Hans

回答by Mangnier Lo?c

相关推荐

Python 当未启用急切执行时，Tensor` 对象不可迭代。迭代这个张量使用`tf.map_fn`

Python Anaconda：我应该在linux中使用`conda activate`还是`source activate`

“即发即弃”python async/await

将数据帧保存到 csv 文件（python）

相关推荐

最近更新

标签