Concatenate columns in an Apache Spark DataFrame

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/31450846/
Asked by Nipun
How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?
Answered by zero323
With raw SQL you can use CONCAT:
In Python:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

In Scala:

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use the concat function with the DataFrame API:
In Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

In Scala:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))
There is also the concat_ws function, which takes a string separator as its first argument.
Answered by muon
Here's how you can do custom naming of the concatenated column:
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
which gives:
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
Create a new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
Answered by Ignacio Alorre
One option for concatenating string columns in Spark Scala is to use concat.
It is necessary to check for null values, because if any one of the columns is null, the result will be null even when the other columns do contain values.
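To see the problem, a small sketch (assuming the df has the COL1 and COL2 columns used below):

import org.apache.spark.sql.functions.{col, concat}

// if COL1 is null in a row, concat returns null for that whole row,
// no matter what COL2 contains
df.select(concat(col("COL1"), col("COL2")))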
Using concat and withColumn:
import org.apache.spark.sql.functions.{col, concat, lit, when}

val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN whose value is a concatenation of the columns COL1 and COL2 from your original df.
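As a side note, concat_ws (mentioned in the first answer) offers a more compact null-tolerant route here, since it skips null inputs instead of propagating them; with an empty separator it behaves like the nvl variant above. A rough sketch:

import org.apache.spark.sql.functions.{col, concat_ws}

// nulls are dropped by concat_ws, so null columns are treated as empty strings
val newDf = df.withColumn("NEW_COLUMN", concat_ws("", col("COL1"), col("COL2")))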
Answered by Danish Shrestha
If you want to do it using the DataFrame API, you could use a UDF to add a new column based on existing columns.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // needed for the $"col" syntax below
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
Answered by Krishas
Since Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.
For example:
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Answered by Teddy Belay
Here is another way of doing this for PySpark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
+------------+
|East African|
+------------+
| Ethiopian|
| Kenyan|
| Ugandan|
| Rwandan|
+------------+
Answered by wones0120
Here is a suggestion for when you don't know the number or names of the columns in the DataFrame; it joins every column, using "," as the separator:
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Answered by u5827450
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
Answered by Roopesh MB
Is there Java syntax corresponding to the process below?
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Answered by Ani Menon
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat, but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concatenate maps; returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using the string concatenation operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark SQL documentation