pandas pyspark: ValueError: Some of types cannot be determined after inferring

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40517553/


pyspark: ValueError: Some of types cannot be determined after inferring

python python-2.7 pandas pyspark spark-dataframe

Asked by Edamame

I have a pandas data frame my_df, and my_df.dtypes gives us:


ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I am trying to convert the pandas data frame my_df to a Spark data frame by doing the following:


spark_my_df = sc.createDataFrame(my_df)

However, I got the following error:


ValueError  Traceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thanks!


Answered by Gregology

In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.

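A minimal sketch that triggers this (assuming a SparkSession named spark, as in the other examples here, and a hypothetical single all-None column):

>>> # Every value in "foo" is None, so inference finds only NullType
>>> spark.createDataFrame([[None], [None]], ["foo"])
ValueError: Some of types cannot be determined after inferring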

Manually defining a schema will resolve the issue:


>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
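Applied to the data frame from the question, a sketch of a full schema might look like the following (this assumes ts maps to a long and the object columns are plain strings; adjust the types to whatever those fields actually hold):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Assumed mapping from the dtypes shown in the question:
# ts is int64 -> LongType; the object columns are treated as nullable strings.
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])
spark_my_df = spark.createDataFrame(my_df, schema=schema)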

Answered by rjurney

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:


# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.

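A hedged sketch of why this helps, using made-up data: if every sampled row has None in some field, inference sees only NullType for it, while a larger sample can reach the later non-null values:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical RDD: "score" is None for the first 200 rows, so sampling
# only the head of the data would leave its type undetermined.
rdd = spark.sparkContext.parallelize(
    [Row(name="a", score=None)] * 200 + [Row(name="b", score=1.0)]
)
df = rdd.toDF(sampleRatio=1.0)  # sample (roughly) every row while inferring
df.printSchema()                # score comes out as double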

Answered by Akavall

To fix this problem, you can provide your own schema.


For example:


To reproduce the error:


>>> df = spark.createDataFrame([[None, None]], ["name", "score"])

To fix the error:


>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+

Answered by Kamaldeep Singh

This is probably caused by columns in which every value is null. You should drop those columns before converting the data frame to a Spark DataFrame.

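A quick pandas sketch for finding those columns first (my_df is the frame from the question):

# Names of the columns in which every value is null;
# these are the ones whose type Spark cannot infer.
all_null_cols = my_df.columns[my_df.isnull().all()]
print(list(all_null_cols))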

Answered by Aaron Robeson

I've run into this same issue. If you do not need the all-null columns, you can simply drop them from the pandas dataframe before importing it into Spark:


my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
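Note that this removes the all-null columns from the resulting Spark DataFrame entirely; if you need to keep them, defining an explicit schema (as in the earlier answers) preserves them with a concrete type.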