Python pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33129918/

pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

python, apache-spark, apache-spark-sql, pyspark

Asked by Hello lad

I am programming with pyspark on a Spark cluster; the data is large and split into pieces, so it cannot easily be loaded into memory or sanity-checked.

Basically it looks like this:

af.b Current%20events 1 996
af.b Kategorie:Musiek 1 4468
af.b Spesiaal:RecentChangesLinked/Gebruikerbespreking:Freakazoid 1 5209
af.b Spesiaal:RecentChangesLinked/Sir_Arthur_Conan_Doyle 1 5214

Wikipedia data:

I read it from AWS S3 and then try to construct a Spark DataFrame with the following Python code in the pyspark interpreter:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

parts = data.map(lambda l: l.split())
wikis = parts.map(lambda p: (p[0], p[1], p[2], p[3]))

fields = [StructField("project", StringType(), True),
          StructField("title", StringType(), True),
          StructField("count", IntegerType(), True),
          StructField("byte_size", StringType(), True)]

schema = StructType(fields)

df = sqlContext.createDataFrame(wikis, schema)

Everything looks fine; only createDataFrame gives me an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 298, in _createFromRDD
    _verify_type(row, schema)
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1152, in _verify_type
    _verify_type(v, f.dataType)
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1136, in _verify_type
    raise TypeError("%s can not accept object in type %s" % (dataType, type(obj)))
TypeError: IntegerType can not accept object in type <type 'unicode'>

Why can't I set the third column, which should be count, to IntegerType? How can I solve this?

Accepted answer by zero323

As noted by ccheneson, you pass the wrong types.

Assuming your data looks like this:

data = sc.parallelize(["af.b Current%20events 1 996"])

After the first map you get an RDD[List[String]]:

parts = data.map(lambda l: l.split())
parts.first()
## ['af.b', 'Current%20events', '1', '996']

The second map converts it to a tuple (String, String, String, String):

wikis = parts.map(lambda p: (p[0], p[1], p[2],p[3]))
wikis.first()
## ('af.b', 'Current%20events', '1', '996')

Your schema states that the 3rd column is an integer:

schema说第三列是一个整数:

[f.dataType for f in schema.fields]
## [StringType, StringType, IntegerType, StringType]

The schema is mostly used to avoid a full table scan to infer types; it doesn't perform any type casting.

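To see that only verification happens, you can let Spark infer the types instead (a quick check, assuming the wikis RDD and sqlContext from the question); every column comes back as a string, because that is what the tuples actually contain:

inferred = sqlContext.createDataFrame(wikis)
[f.dataType for f in inferred.schema.fields]
## [StringType, StringType, StringType, StringType]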

You can either cast your data during the last map:

wikis = parts.map(lambda p: (p[0], p[1], int(p[2]), p[3]))
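
With count already an int, the original createDataFrame call goes through. A minimal check, assuming the sqlContext and schema from the question (the exact Row repr may differ slightly):

df = sqlContext.createDataFrame(wikis, schema)
df.first()
## Row(project=u'af.b', title=u'Current%20events', count=1, byte_size=u'996')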

Or define count as a StringType and cast the column:

from pyspark.sql.functions import col

fields[2] = StructField("count", StringType(), True)
schema = StructType(fields)

wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
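
A cast on a string column returns null for values that aren't valid integers instead of raising, so this variant is also a bit more forgiving with dirty rows. Roughly, the result should look like this (same assumptions as above):

df = wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
df.printSchema()
## root
##  |-- project: string (nullable = true)
##  |-- title: string (nullable = true)
##  |-- byte_size: string (nullable = true)
##  |-- cnt: integer (nullable = true)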

On a side note, count is a reserved word in SQL and shouldn't be used as a column name. In Spark it will work as expected in some contexts and fail in others.

Answered by Giovanni Bruner

With Apache Spark 2.0 you can let Spark infer the schema of your data. Overall you'll still need to cast in your parser function, as argued above:

"When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict."

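A minimal sketch of that approach, assuming a Spark 2.x SparkSession named spark and the parts RDD from the question; the int cast still happens in the parser, and the Row field names (cnt is used here to avoid the reserved word count) become the column names:

from pyspark.sql import Row

# Parse and cast up front; Spark then infers the schema from the Row objects.
rows = parts.map(lambda p: Row(project=p[0], title=p[1], cnt=int(p[2]), byte_size=p[3]))

df = spark.createDataFrame(rows)  # no explicit schema needed
df.printSchema()                  # cnt is inferred as long, the other columns as string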