scala spark sql cast function creates column with NULLS
Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/44664334/
spark sql cast function creates column with NULLS
Asked by ceteris_paribus
I have the following dataframe and schema in Spark
val df = spark.read.options(Map("header"-> "true")).csv("path")
scala> df.show()
+-------+-------+-----+
| user| topic| hits|
+-------+-------+-----+
| om| scala| 120|
| daniel| spark| 80|
|3754978| spark| 1|
+-------+-------+-----+
scala> df.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
I want to change the column hits to integer.
I tried this:
scala> df.createOrReplaceTempView("test")
val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")
scala> dfNew.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
|-- hist2: integer (nullable = true)
but when I print the dataframe, the column hist2 is filled with NULLs
scala> dfNew.show()
+-------+-------+-----+-----+
| user| topic| hits|hist2|
+-------+-------+-----+-----+
| om| scala| 120| null|
| daniel| spark| 80| null|
|3754978| spark| 1| null|
+-------+-------+-----+-----+
I also tried this:
scala> val df2 = df.withColumn("hitsTmp",
df.hits.cast(IntegerType)).drop("hits"
).withColumnRenamed("hitsTmp", "hits")
and got this:
<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame
Also tried this:
scala> val df2 = df.selectExpr ("user","topic","cast(hits as int) hits")
and got this:
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user, topic, hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv
with
scala> val df2 = df.selectExpr ("cast(hits as int) hits")
I get similar error.
Any help will be appreciated. I know this question has been addressed before, but I tried three different approaches (posted here) and none of them works.
Thanks.
Answered by ktheitroadalo
You can cast a column to Integer type in the following ways:
df.withColumn("hits", df("hits").cast("integer"))
df.withColumn("hits", df("hits").cast("integer"))
Or
data.withColumn("hitsTmp",
data("hits").cast(IntegerType)).drop("hits").
withColumnRenamed("hitsTmp", "hits")
Or
data.selectExpr ("user","topic","cast(hits as int) hits")
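Putting the first variant together, here is a minimal runnable sketch assuming the same CSV layout and header as in the question (the "path" placeholder is from the question and stands in for your real file path):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().appName("cast-hits").getOrCreate()

// read the CSV as in the question; every column comes back as a string
val df = spark.read.option("header", "true").csv("path")

// df("hits") (or $"hits") is how a column is referenced in Scala;
// df.hits only works in PySpark, which is why the question's second attempt failed
val dfInt = df.withColumn("hits", df("hits").cast(IntegerType))

dfInt.printSchema()   // hits: integer (nullable = true)
dfInt.show()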
Answered by Sahil Sahay
This response is late, but I was facing the same issue and this worked, so I thought I'd put it here; it might help someone. Try declaring the schema explicitly as a StructType. Reading from CSV files and providing an inferred schema using a case class gives weird data-type errors, even though all the data formats are properly specified.
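For reference, a small sketch of what an explicit StructType schema could look like for the dataframe in the question (the field names and types here are assumptions based on the question's columns):

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// declare hits as IntegerType up front, so no cast is needed after reading
val schema = StructType(Seq(
  StructField("user", StringType, nullable = true),
  StructField("topic", StringType, nullable = true),
  StructField("hits", IntegerType, nullable = true)
))

val dfTyped = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("path")   // "path" is the placeholder path from the question

dfTyped.printSchema()   // hits: integer (nullable = true)

Under the default PERMISSIVE mode, rows whose hits value is not a valid integer still come through, with hits set to null.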
Answered by André Machado
I know that this answer probably won't be useful for the OP, since it comes with a ~2 year delay. It might, however, be helpful for someone facing this problem.
Just like you, I had a dataframe with a column of strings which I was trying to cast to integer:
> df.show
+-------+
| id|
+-------+
|4918088|
|4918111|
|4918154|
...
> df.printSchema
root
|-- id: string (nullable = true)
But after doing the cast to IntegerType, the only thing I obtained, just as you did, was a column of nulls:
> df.withColumn("test", $"id".cast(IntegerType))
.select("id","test")
.show
+-------+----+
| id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
...
By default, if you try to cast a string that contains non-numeric characters to integer, the cast of the column won't fail, but those values will be set to null, as you can see in the following example:
> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
.select("n_str","n_int")
.show
+-----+-----+
|n_str|n_int|
+-----+-----+
| 1| 1|
| 2| 2|
| 3A| null|
+-----+-----+
The thing with our initial dataframe is that, at first sight, when we use the show method, we can't see any non-numeric characters. However, if you look at a row from your dataframe you'll see something different:
> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]
Why is this happening? You are probably reading a csv file with an unsupported encoding.
You can solve this by changing the encoding of the file you are reading. If that is not an option, you can also clean each column before doing the type cast. An example:
> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
.select("id","test")
> df_cast.show
+-------+-------+
| id| test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
...
> df_cast.printSchema
root
|-- id: string (nullable = true)
|-- test: integer (nullable = true)
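If fixing the encoding is possible, here is a hedged sketch of the first option, reading the file with an explicit encoding (UTF-16 is only an assumption here; use whatever encoding the file was actually saved in):

// Spark's CSV reader accepts an "encoding" (alias "charset") option; the default is UTF-8
val dfFixed = spark.read
  .option("header", "true")
  .option("encoding", "UTF-16")   // assumption: replace with the file's real encoding
  .csv("path")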
Answered by Mike Sun
Try removing the quotes around 'hist': with the quotes it is a string literal, not a column reference, so the cast returns null for every row (note that the source column is actually named hits, not hist). If that still does not work, try trimming the column:
val dfNew = spark.sql("select *, cast(trim(hits) as integer) as hist2 from test")

