Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42822102/

Time: 2020-10-22 09:07:36  Source: igfitidea

create substring column in spark dataframe

scala apache-spark spark-dataframe

Asked by J Smith

I want to take a JSON file and map it so that one of the columns is a substring of another. For example, take the left table and produce the right table:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |

I can do this using spark-sql syntax, but how can it be done using the built-in functions?

Answered by pasha701

A statement like this can be used:

import org.apache.spark.sql.functions._

dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
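
For context, `substring_index(str, delim, count)` with a positive `count` returns the part of `str` that precedes the `count`-th occurrence of `delim`. The following is a minimal sketch of the statement above, assuming a local SparkSession and a one-row DataFrame built inline (both are illustrative and not part of the original question, which reads its data from a JSON file):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("substring-index-sketch").getOrCreate()
import spark.implicits._

// Illustrative input matching the table in the question
val dataFrame = Seq("hello, world").toDF("a")

// Keep everything before the first comma
val withB = dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))

// Pull the single value of column b back to the driver for inspection
val firstB = withB.select("b").as[String].head()
```

With this input, `firstB` should be `hello`.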

Answered by Balázs Fehér

Suppose you have the following dataframe:

import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+

You could derive a new column from a substring of the first column as follows:

df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
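
Note that `substring(str, pos, len)` uses a 1-based `pos` and treats `len` as a maximum length, which is why `substring(col("a"), 4, 6)` above still yields `bar`: only three characters remain after position 4. A small sketch that makes the arguments explicit, assuming a local SparkSession (the column names `c` and `d` are only for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("substring-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("foobar", "foo")).toDF("a", "b")

val withParts = df.select(
  col("*"),
  substring(col("a"), 4, 3).as("c"), // characters 4 to 6
  substring(col("a"), 1, 3).as("d")  // characters 1 to 3
)

// Collect the two derived values for inspection
val parts = withParts.select("c", "d").as[(String, String)].head()
```

Here `parts` should be `("bar", "foo")`.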

Answered by soote

You would use the withColumn function:

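import org.apache.spark.sql.functions.{ udf, col }
// Example body only (text before the first comma); replace with your own substring code
def substringFn(str: String): String = str.takeWhile(_ != ',')
val substring = udf(substringFn _)
dataframe.withColumn("b", substring(col("a")))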
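
Since the question asks about built-in functions, it may be worth noting that `withColumn` also accepts built-in column expressions directly, and a UDF is opaque to Spark's query optimizer. A minimal sketch of the same idea without a UDF, assuming a local SparkSession and an illustrative one-row input:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("with-column-sketch").getOrCreate()
import spark.implicits._

// Illustrative input matching the table in the question
val dataframe = Seq("hello, world").toDF("a")

// Append column b using a built-in function instead of a UDF
val result = dataframe.withColumn("b", substring_index(col("a"), ",", 1))

val columns = result.columns.toSeq
val firstB = result.select("b").as[String].head()
```

Unlike `select`, `withColumn` keeps all existing columns and appends (or replaces) the named one, so `a` does not have to be listed again.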

Answered by Ignacio Alorre

To add to the existing answers: in case you are interested in the right part of the string column, that is:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  world  |

You should use a negative index:

dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))
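
One detail to watch for: with the sample value `hello, world` and `","` as the delimiter, the part after the last comma keeps its leading space. A hedged sketch of two ways to drop it, assuming a local SparkSession (the column names `raw`, `trimmed` and `wide` are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index, trim}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("negative-index-sketch").getOrCreate()
import spark.implicits._

// Illustrative input matching the table above
val dataFrame = Seq("hello, world").toDF("a")

val rightPart = dataFrame.select(
  substring_index(col("a"), ",", -1).as("raw"),           // keeps the space after the comma
  trim(substring_index(col("a"), ",", -1)).as("trimmed"), // strips surrounding spaces
  substring_index(col("a"), ", ", -1).as("wide")          // uses ", " as the delimiter
)

val values = rightPart.as[(String, String, String)].head()
```

Here `values` should be `(" world", "world", "world")`.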