pandas pyspark: ValueError: Some of types cannot be determined after inferring

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40517553/


pyspark: ValueError: Some of types cannot be determined after inferring

python python-2.7 pandas pyspark spark-dataframe

Asked by Edamame

I have a pandas data frame my_df, and my_df.dtypes gives us:


ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I am trying to convert the pandas data frame my_df to a Spark data frame by doing the following:


spark_my_df = sc.createDataFrame(my_df)

However, I got the following error:


ValueError  Traceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thanks!


Answered by Gregology

In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.

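A minimal sketch that triggers this (assuming a SparkSession named spark, as in the other examples here, and a hypothetical single all-None column):

>>> # Every value in "foo" is None, so inference finds only NullType
>>> spark.createDataFrame([[None], [None]], ["foo"])
ValueError: Some of types cannot be determined after inferring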

Manually defining a schema will resolve the issue:


>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
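Applied to the data frame from the question, a sketch of a full schema might look like the following (this assumes ts maps to a long and the object columns are plain strings; adjust the types to whatever those fields actually hold):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Assumed mapping from the dtypes shown in the question:
# ts is int64 -> LongType; the object columns are treated as nullable strings.
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])
spark_my_df = spark.createDataFrame(my_df, schema=schema)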

Answered by rjurney

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:


# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.

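A hedged sketch of why this helps, using made-up data: if every sampled row has None in some field, inference sees only NullType for it, while a larger sample can reach the later non-null values:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical RDD: "score" is None for the first 200 rows, so sampling
# only the head of the data would leave its type undetermined.
rdd = spark.sparkContext.parallelize(
    [Row(name="a", score=None)] * 200 + [Row(name="b", score=1.0)]
)
df = rdd.toDF(sampleRatio=1.0)  # sample (roughly) every row while inferring
df.printSchema()                # score comes out as double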

Answered by Akavall

To fix this problem, you can provide your own schema.


For example:


To reproduce the error:


>>> df = spark.createDataFrame([[None, None]], ["name", "score"])

To fix the error:


>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+

Answered by Kamaldeep Singh

This is probably caused by columns in which every value is null. You should drop those columns before converting the data frame to a Spark DataFrame.

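A quick pandas sketch for finding those columns first (my_df is the frame from the question):

# Names of the columns in which every value is null;
# these are the ones whose type Spark cannot infer.
all_null_cols = my_df.columns[my_df.isnull().all()]
print(list(all_null_cols))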

Answered by Aaron Robeson

I've run into this same issue. If you do not need the all-null columns, you can simply drop them from the pandas dataframe before importing it into Spark:


my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
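Note that this removes the all-null columns from the resulting Spark DataFrame entirely; if you need to keep them, defining an explicit schema (as in the earlier answers) preserves them with a concrete type.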