Concatenate columns in an Apache Spark DataFrame

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/31450846/
Asked by Nipun
How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?
Answered by zero323
With raw SQL you can use CONCAT:
In Python:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

In Scala:

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use the concat function with the DataFrame API:
In Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

In Scala:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))
There is also the concat_ws function, which takes a string separator as its first argument.
Answered by muon
Here's how you can do custom naming of the concatenated column:
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
which gives:
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
Create a new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
Answered by Ignacio Alorre
One option for concatenating string columns in Spark Scala is to use concat.
It is necessary to check for null values, because if any one of the columns is null, the result will be null even when the other columns do contain values.
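To see the problem, a small sketch (assuming the df has the COL1 and COL2 columns used below):

import org.apache.spark.sql.functions.{col, concat}

// if COL1 is null in a row, concat returns null for that whole row,
// no matter what COL2 contains
df.select(concat(col("COL1"), col("COL2")))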
Using concat and withColumn:
import org.apache.spark.sql.functions.{col, concat, lit, when}

val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN whose value is a concatenation of the columns COL1 and COL2 from your original df.
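As a side note, concat_ws (mentioned in the first answer) offers a more compact null-tolerant route here, since it skips null inputs instead of propagating them; with an empty separator it behaves like the nvl variant above. A rough sketch:

import org.apache.spark.sql.functions.{col, concat_ws}

// nulls are dropped by concat_ws, so null columns are treated as empty strings
val newDf = df.withColumn("NEW_COLUMN", concat_ws("", col("COL1"), col("COL2")))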
Answered by Danish Shrestha
If you want to do it using the DataFrame API, you could use a UDF to add a new column based on existing columns.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // needed for the $"col" syntax below
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
Answered by Krishas
Since Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.
For example:
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Answered by Teddy Belay
Here is another way of doing this for PySpark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
+------------+
|East African|
+------------+
| Ethiopian|
| Kenyan|
| Ugandan|
| Rwandan|
+------------+
Answered by wones0120
Here is a suggestion for when you don't know the number or names of the columns in the DataFrame; it joins every column, using "," as the separator:
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Answered by u5827450
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
Answered by Roopesh MB
Is there Java syntax corresponding to the process below?
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Answered by Ani Menon
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat, but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concatenate maps; returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using the string concatenation operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark SQL documentation