pandas pyspark: ValueError: Some of types cannot be determined after inferring
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40517553/
pyspark: ValueError: Some of types cannot be determined after inferring
Asked by Edamame
I have a pandas data frame my_df, and my_df.dtypes gives us:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I am trying to convert the pandas data frame my_df to a Spark data frame by doing the following:
spark_my_df = sc.createDataFrame(my_df)
However, I got the following error:
ValueErrorTraceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
2 spark_my_df.take(20)
/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
520 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
521 else:
--> 522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
384
385 if schema is None or isinstance(schema, (list, tuple)):
--> 386 struct = self._inferSchemaFromList(data)
387 if isinstance(schema, (list, tuple)):
388 for i, name in enumerate(schema):
/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
318 schema = reduce(_merge_type, map(_infer_schema, data))
319 if _has_nulltype(schema):
--> 320 raise ValueError("Some of types cannot be determined after inferring")
321 return schema
322
ValueError: Some of types cannot be determined after inferring
Does anyone know what the above error means? Thanks!
Answered by Gregology
In order to infer each field's type, PySpark looks at the non-None records in that field. If a field only has None records, PySpark cannot infer the type and will raise that error.
Manually defining a schema will resolve the issue:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
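Applied to the question's my_df, a minimal sketch might look like the following, assuming the int64 column ts should map to a long, the all-null object columns should end up as strings, and spark is a SparkSession (adjust names and types to your data):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema matching the question's my_df; the types are assumptions
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])

# With an explicit schema, Spark skips type inference entirely
spark_my_df = spark.createDataFrame(my_df, schema=schema)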
Answered by rjurney
If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:
# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()
Assuming there are non-null rows in all fields of your RDD, inference will be more likely to find them when you increase the sampleRatio towards 1.0.
Answered by Akavall
To fix this problem, you can provide your own schema.
For example:
To reproduce the error:
>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
To fix the error:
>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
Answered by Kamaldeep Singh
This is probably because of columns that have all null values. You should drop those columns before converting to a Spark dataframe, for example as in the sketch below.
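One way to do that in pandas, assuming my_df and the question's sc (a SparkSession or SQLContext) are in scope:

# Find the columns that are entirely null; these are what break type inference
all_null_cols = my_df.columns[my_df.isna().all()].tolist()

# Drop them and convert the remaining columns
spark_my_df = sc.createDataFrame(my_df.drop(columns=all_null_cols))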
Answered by Aaron Robeson
I've run into this same issue. If you do not need the columns that are all null, you can simply drop them from the pandas dataframe before importing to Spark:
my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)