How to convert a column with string type to int form in a PySpark data frame?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/46956026/
Asked by neha
I have a dataframe in PySpark. Some of its numerical columns contain 'nan', so when I read the data and check the schema of the dataframe, those columns come out as 'string' type. How can I change them to int type? I replaced the 'nan' values with 0 and checked the schema again, but it still shows string type for those columns. I am using the following code:
# Read the CSV, letting Spark infer the column types
data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema="true")
data_df.printSchema()
# Replace missing values with 0, then re-check the schema
data_df = data_df.fillna(0)
data_df.printSchema()
Here the columns 'Plays' and 'drafts' contain integer values, but because 'nan' is present in these columns, they are treated as string type.
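A minimal aside on why the fillna call above has no effect (my reading, not from the original post): fillna with a numeric value only fills nulls in numeric columns, and the literal string 'nan' read from the CSV is neither null nor numeric, so the string-typed columns are skipped entirely and the schema stays the same:

# Sketch: fillna(0) ignores string-typed columns, and the string 'nan'
# is not a null value in the first place, so nothing is replaced.
data_df = data_df.fillna(0)
data_df.schema['Plays'].dataType  # still StringType

The casts shown in the answers below are what actually change the type.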
Answered by Sahil Desai
from pyspark.sql.types import IntegerType

# Cast each string column to integer explicitly
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You could run a loop over the columns instead, but this explicit cast is the simplest way to convert a string column into an integer one; a sketch of the loop variant follows below.
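A minimal sketch of that loop, assuming the column names from the question:

from pyspark.sql.types import IntegerType

# Cast every listed column to integer; 'Plays' and 'drafts' come from
# the question, so adjust the list to your own schema.
for col_name in ["Plays", "drafts"]:
    data_df = data_df.withColumn(col_name, data_df[col_name].cast(IntegerType()))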
Answered by Ani Menon
You could use cast (as int) after replacing NaN with 0:
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))
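Since the 'nan' values here are literal strings coming from the CSV, a hedged sketch of the full replace-then-cast sequence (column name assumed from the question):

# Replace the literal string 'nan' with '0' first; casting 'nan'
# directly would produce null rather than 0.
data_df = data_df.replace('nan', '0', subset=['Plays'])
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))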
Answered by Keshav Pradeep Ramanath
Another way to do it is to define the schema explicitly with StructField, which is useful if you have multiple fields that need to be modified.

Example:
from pyspark.sql.types import StructField, IntegerType, StructType, StringType

# Declare the desired type for every field up front
newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)
         ]
finalStruct = StructType(fields=newDF)
# Apply the schema while reading, instead of casting afterwards
df = spark.read.csv('ctor.csv', schema=finalStruct)
Output:
Before:
root
|-- CLICK_FLG: string (nullable = true)
|-- OPEN_FLG: string (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
After:
root
|-- CLICK_FLG: integer (nullable = true)
|-- OPEN_FLG: integer (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
This is a slightly longer way to cast, but the advantage is that all the required fields can be handled in one pass.
Note that if only the required fields are assigned a data type in the schema, the resulting dataframe will contain only those fields.