Python: how to change a DataFrame column from String type to Double type in pyspark

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/32284620/


how to change a Dataframe column from String type to Double type in pyspark

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Abhishek Choudhary

I have a dataframe with a column of type String. I want to change the column type to Double in PySpark.


Here is what I did:


from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

I just want to know whether this is the right way to do it. While running Logistic Regression I get an error, so I wonder whether this cast is the cause of the trouble.


Accepted answer by zero323

There is no need for a UDF here. Column already provides a cast method that takes a DataType instance:


from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:


changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations can be supported as well) correspond to the simpleString value. So for atomic types:


from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType', 
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType', 
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:


types.ArrayType(types.IntegerType()).simpleString()   
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'

Answer by Abhishek Choudhary

The solution was simple:


from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

Answer by Duckling

Preserve the column name and avoid adding an extra column by using the same name as the input column:


changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

Answer by serkan kucukbay

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about it), so the given answers didn't cover it.


We can reference the column in a Spark statement with the col("column_name") function:


from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

Answer by Cristian

PySpark version:


df = <source data>
df.printSchema()

from pyspark.sql.types import *

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()