Python pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33129918/

pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

python, apache-spark, apache-spark-sql, pyspark

Asked by Hello lad

I am programming with pyspark on a Spark cluster; the data is large and split into pieces, so it cannot easily be loaded into memory or sanity-checked.

Basically it looks like this:

af.b Current%20events 1 996
af.b Kategorie:Musiek 1 4468
af.b Spesiaal:RecentChangesLinked/Gebruikerbespreking:Freakazoid 1 5209
af.b Spesiaal:RecentChangesLinked/Sir_Arthur_Conan_Doyle 1 5214

Wikipedia data:

I read it from AWS S3 and then try to construct a Spark DataFrame with the following Python code in the pyspark interpreter:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

parts = data.map(lambda l: l.split())
wikis = parts.map(lambda p: (p[0], p[1], p[2], p[3]))

fields = [StructField("project", StringType(), True),
          StructField("title", StringType(), True),
          StructField("count", IntegerType(), True),
          StructField("byte_size", StringType(), True)]

schema = StructType(fields)

df = sqlContext.createDataFrame(wikis, schema)

Everything looks fine; only createDataFrame gives me an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 298, in _createFromRDD
    _verify_type(row, schema)
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1152, in _verify_type
    _verify_type(v, f.dataType)
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1136, in _verify_type
    raise TypeError("%s can not accept object in type %s" % (dataType, type(obj)))
TypeError: IntegerType can not accept object in type <type 'unicode'>

Why can't I set the third column, which should be count, to IntegerType? How can I solve this?

Accepted answer by zero323

As noted by ccheneson, you pass the wrong types.

Assuming your data looks like this:

data = sc.parallelize(["af.b Current%20events 1 996"])

After the first map you get an RDD[List[String]]:

parts = data.map(lambda l: l.split())
parts.first()
## ['af.b', 'Current%20events', '1', '996']

The second map converts it to a tuple (String, String, String, String):

wikis = parts.map(lambda p: (p[0], p[1], p[2],p[3]))
wikis.first()
## ('af.b', 'Current%20events', '1', '996')

Your schema states that the 3rd column is an integer:

schema说第三列是一个整数:

[f.dataType for f in schema.fields]
## [StringType, StringType, IntegerType, StringType]

The schema is mostly used to avoid a full table scan to infer types; it doesn't perform any type casting.

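To see that only verification happens, you can let Spark infer the types instead (a quick check, assuming the wikis RDD and sqlContext from the question); every column comes back as a string, because that is what the tuples actually contain:

inferred = sqlContext.createDataFrame(wikis)
[f.dataType for f in inferred.schema.fields]
## [StringType, StringType, StringType, StringType]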

You can either cast your data during the last map:

wikis = parts.map(lambda p: (p[0], p[1], int(p[2]), p[3]))
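
With count already an int, the original createDataFrame call goes through. A minimal check, assuming the sqlContext and schema from the question (the exact Row repr may differ slightly):

df = sqlContext.createDataFrame(wikis, schema)
df.first()
## Row(project=u'af.b', title=u'Current%20events', count=1, byte_size=u'996')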

Or define count as a StringType and cast the column:

from pyspark.sql.functions import col

fields[2] = StructField("count", StringType(), True)
schema = StructType(fields)

wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
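
A cast on a string column returns null for values that aren't valid integers instead of raising, so this variant is also a bit more forgiving with dirty rows. Roughly, the result should look like this (same assumptions as above):

df = wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
df.printSchema()
## root
##  |-- project: string (nullable = true)
##  |-- title: string (nullable = true)
##  |-- byte_size: string (nullable = true)
##  |-- cnt: integer (nullable = true)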

On a side note, count is a reserved word in SQL and shouldn't be used as a column name. In Spark it will work as expected in some contexts and fail in others.

Answered by Giovanni Bruner

With Apache Spark 2.0 you can let Spark infer the schema of your data. Overall you'll still need to cast in your parser function, as argued above:

"When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict."

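A minimal sketch of that approach, assuming a Spark 2.x SparkSession named spark and the parts RDD from the question; the int cast still happens in the parser, and the Row field names (cnt is used here to avoid the reserved word count) become the column names:

from pyspark.sql import Row

# Parse and cast up front; Spark then infers the schema from the Row objects.
rows = parts.map(lambda p: Row(project=p[0], title=p[1], cnt=int(p[2]), byte_size=p[3]))

df = spark.createDataFrame(rows)  # no explicit schema needed
df.printSchema()                  # cnt is inferred as long, the other columns as string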