How to convert a column with string type to int form in a PySpark data frame?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/46956026/
Asked by neha
I have a dataframe in PySpark. Some of its numerical columns contain 'nan', so when I read the data and check the schema of the dataframe, those columns come out as 'string' type. How can I change them to int type? I replaced the 'nan' values with 0 and checked the schema again, but it still shows string type for those columns. I am using the following code:
# Read the CSV, letting Spark infer the column types
data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema="true")
data_df.printSchema()
# Replace missing values with 0, then re-check the schema
data_df = data_df.fillna(0)
data_df.printSchema()
Here the columns 'Plays' and 'drafts' contain integer values, but because 'nan' is present in these columns, they are treated as string type.
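A minimal aside on why the fillna call above has no effect (my reading, not from the original post): fillna with a numeric value only fills nulls in numeric columns, and the literal string 'nan' read from the CSV is neither null nor numeric, so the string-typed columns are skipped entirely and the schema stays the same:

# Sketch: fillna(0) ignores string-typed columns, and the string 'nan'
# is not a null value in the first place, so nothing is replaced.
data_df = data_df.fillna(0)
data_df.schema['Plays'].dataType  # still StringType

The casts shown in the answers below are what actually change the type.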
Answered by Sahil Desai
from pyspark.sql.types import IntegerType

# Cast each string column to integer explicitly
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You could run a loop over the columns instead, but this explicit cast is the simplest way to convert a string column into an integer one; a sketch of the loop variant follows below.
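A minimal sketch of that loop, assuming the column names from the question:

from pyspark.sql.types import IntegerType

# Cast every listed column to integer; 'Plays' and 'drafts' come from
# the question, so adjust the list to your own schema.
for col_name in ["Plays", "drafts"]:
    data_df = data_df.withColumn(col_name, data_df[col_name].cast(IntegerType()))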
Answered by Ani Menon
You could use cast (as int) after replacing NaN with 0:
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))
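Since the 'nan' values here are literal strings coming from the CSV, a hedged sketch of the full replace-then-cast sequence (column name assumed from the question):

# Replace the literal string 'nan' with '0' first; casting 'nan'
# directly would produce null rather than 0.
data_df = data_df.replace('nan', '0', subset=['Plays'])
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))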
Answered by Keshav Pradeep Ramanath
Another way to do it is to define the schema explicitly with StructField, which is useful if you have multiple fields that need to be modified.

Example:
from pyspark.sql.types import StructField, IntegerType, StructType, StringType

# Declare the desired type for every field up front
newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)
         ]
finalStruct = StructType(fields=newDF)
# Apply the schema while reading, instead of casting afterwards
df = spark.read.csv('ctor.csv', schema=finalStruct)
Output:
Before:
root
|-- CLICK_FLG: string (nullable = true)
|-- OPEN_FLG: string (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
After:
root
|-- CLICK_FLG: integer (nullable = true)
|-- OPEN_FLG: integer (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
This is a slightly longer way to cast, but the advantage is that all the required fields can be handled in one pass.
Note that if only the required fields are assigned a data type in the schema, the resulting dataframe will contain only those fields.