Scala: create a substring column in a Spark DataFrame
Original URL: http://stackoverflow.com/questions/42822102/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
create substring column in spark dataframe
Asked by J Smith
I want to take a json file and map it so that one of the columns is a substring of another. For example to take the left table and produce the right table:
 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |
I can do this using spark-sql syntax but how can it be done using the in-built functions?
Answered by pasha701
A statement like this can be used:
import org.apache.spark.sql.functions._
dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
Answered by Balázs Fehér
Suppose you have the following dataframe:
import spark.implicits._
import org.apache.spark.sql.functions._
var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")
+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+
You could derive a new column as a substring of the first column as follows:
// substring(column, pos, len): pos is 1-based and len is the maximum number of characters to take
df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))
+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
Answered by soote
You would use the withColumn function:
import org.apache.spark.sql.functions.{ udf, col }
// Placeholder: replace ??? with your substring code
def substringFn(str: String): String = ???
val substring = udf(substringFn _)
dataframe.withColumn("b", substring(col("a")))
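As an illustration of what the body of substringFn might look like, here is a minimal sketch that assumes the goal from the question: keep the text before the first comma. The object name SubstringFnExample and the null handling are assumptions, not part of the original answer.

```scala
object SubstringFnExample {
  // Hypothetical body for substringFn: returns the text before the first comma,
  // or the whole string when it contains no comma.
  // Null input is passed through unchanged, on the assumption that null column
  // values should stay null.
  def substringFn(str: String): String =
    if (str == null) null
    else {
      val idx = str.indexOf(',')
      if (idx < 0) str else str.substring(0, idx)
    }
}
```

Wrapped as udf(SubstringFnExample.substringFn _), it can be passed to withColumn in the same way as in the answer. For this particular case, the built-in substring_index function from the other answers avoids the need for a UDF.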
Answered by Ignacio Alorre
To add to the existing answers: in case you are interested in the right part of the string column, that is:
 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  world  |
You should use a negative index:
dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))
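Note that with "," as the delimiter, the right-hand part keeps the space that follows the comma (" world"). A minimal sketch, assuming a local SparkSession and that the surrounding whitespace is unwanted, wraps the call in the built-in trim function. The names RightPartExample and withRightPart are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, substring_index, trim}

object RightPartExample {
  // Adds column "b" holding the text after the last comma in column "a",
  // with leading and trailing spaces removed.
  def withRightPart(df: DataFrame): DataFrame =
    df.withColumn("b", trim(substring_index(col("a"), ",", -1)))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("right-part-example")
      .getOrCreate()
    import spark.implicits._

    val dataFrame = Seq("hello, world").toDF("a")
    withRightPart(dataFrame).show()

    spark.stop()
  }
}
```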

