Python: how to change a DataFrame column from String type to Double type in pyspark

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/32284620/


how to change a Dataframe column from String type to Double type in pyspark

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Abhishek Choudhary

I have a dataframe with a column of type String. I want to change the column type to Double in PySpark.


Here is what I did:


from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

I just want to know whether this is the right way to do it. While running Logistic Regression I get an error, so I wonder whether this cast is the cause of the trouble.


Accepted answer by zero323

There is no need for a UDF here. Column already provides a cast method that takes a DataType instance:


from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:


changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations can be supported as well) correspond to the simpleString value. So for atomic types:


from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType', 
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType', 
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:


types.ArrayType(types.IntegerType()).simpleString()   
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'

Answer by Abhishek Choudhary

The solution was simple:


from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

Answer by Duckling

Preserve the column name and avoid adding an extra column by using the same name as the input column:


changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

Answer by serkan kucukbay

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about it), so the given answers didn't cover it.


We can reference the column in a Spark statement with the col("column_name") function:


from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

Answer by Cristian

PySpark version:


df = <source data>
df.printSchema()

from pyspark.sql.types import *

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()