Original URL: http://stackoverflow.com/questions/51952535/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pyspark Error: "Py4JJavaError: An error occurred while calling o655.count." when calling count() method on dataframe
Asked by Kaushik
I'm new to Spark and I'm using PySpark 2.3.1 to read a CSV file into a dataframe. I'm able to read in the file and print values in a Jupyter notebook running within an Anaconda environment. This is the code I'm using:
# Imports needed for the snippet below
from pyspark.sql import SparkSession
from pyspark import sql
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Start session
spark = SparkSession \
    .builder \
    .appName("Embedding Models") \
    .config('spark.ui.showConsoleProgress', 'true') \
    .config("spark.master", "local[2]") \
    .getOrCreate()

sqlContext = sql.SQLContext(spark)

# Expected three-column schema for the CSV
schema = StructType([
    StructField("Index", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("body", StringType(), True)])

# Drop rows that don't match the schema instead of failing
df = sqlContext.read.csv("../data/faq_data.csv",
                         header=True,
                         mode="DROPMALFORMED",
                         schema=schema)
Output:
df.show()
+-----+--------------------+--------------------+
|Index| title| body|
+-----+--------------------+--------------------+
| 0|What does “quantu...|Quantum theory is...|
| 1|What is a quantum...|A quantum compute...|
However, when I call the .count() method on the dataframe, it throws the error below:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-29-913a2f9eb5fc> in <module>()
----> 1 df.count()
~/anaconda3/envs/Community/lib/python3.6/site-packages/pyspark/sql/dataframe.py in count(self)
453 2
454 """
--> 455 return int(self._jdf.count())
456
457 @ignore_unicode_prefix
~/anaconda3/envs/Community/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/anaconda3/envs/Community/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/anaconda3/envs/Community/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o655.count.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$$anonfun$visitMethodInsn.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$$anonfun$visitMethodInsn.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$$anonfun$foreach.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$$anonfun$foreach.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
at org.apache.spark.sql.Dataset$$anonfun$count.apply(Dataset.scala:2770)
at org.apache.spark.sql.Dataset$$anonfun$count.apply(Dataset.scala:2769)
at org.apache.spark.sql.Dataset$$anonfun.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:844)
I'm using Python 3.6.5 if that makes a difference.
Answered by luminousmen
What Java version do you have on your machine? Your problem is probably related to Java 9: Spark 2.3 inspects closure bytecode with ASM 5 (the org.apache.xbean.asm5.ClassReader frames in your trace), which cannot parse class files compiled for Java 9+, hence the IllegalArgumentException. The java.base/ prefixes in your stack trace also only appear on Java 9 or later.
If you download Java 8, the exception will disappear. If you already have Java 8 installed, just change JAVA_HOME to point to it.
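A minimal sketch of that fix from Python, assuming Java 8 lives at the path shown below (an assumption; adjust it for your machine). JAVA_HOME has to be set before the first SparkSession is created, because that is when py4j launches the JVM:

import os

# Assumed Java 8 location -- adjust to wherever Java 8 is installed on your machine.
# This must run before SparkSession.builder.getOrCreate(), since py4j launches the
# JVM with whatever JAVA_HOME is visible at that moment.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[2]").getOrCreate()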
Answered by Borislav Aymaliev
Could you try df.repartition(1).count() and len(df.toPandas())?
If it works, then the problem is most probably in your Spark configuration.
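For reference, a minimal sketch of both checks side by side, assuming df is the dataframe from the question:

# Both diagnostic checks from the suggestion above.
n1 = df.repartition(1).count()   # the same action, forced onto a single partition
n2 = len(df.toPandas())          # collects every row to the driver instead
print(n1, n2)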
Answered by Rene B.
On Linux, installing Java 8 as follows will help:
sudo apt install openjdk-8-jdk
Then set the default Java to version 8 using:
sudo update-alternatives --config java
***************** : 2 (enter the number next to the Java 8 entry when it asks you to choose, then press Enter)
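To confirm which Java the driver will now pick up, here is a quick standard-library check; it resolves the binary roughly the way Spark's launch scripts do (JAVA_HOME first, then the PATH) and shells out to java -version, which prints to stderr:

import os
import subprocess

# Prefer JAVA_HOME/bin/java if JAVA_HOME is set, otherwise fall back to the PATH.
java_home = os.environ.get("JAVA_HOME")
java = os.path.join(java_home, "bin", "java") if java_home else "java"
subprocess.run([java, "-version"])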
Answered by Nanda
Without being able to actually see the data, I would guess that it's a schema issue. I would recommend loading a smaller sample of the data where you can ensure there are only three columns, to test that.
Since it's a CSV, another simple test could be to load the data and split it by newline and then by comma, to check whether anything is breaking your file.
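A minimal sketch of that check, assuming the file is small enough to read on the driver. The path and the three-column schema are taken from the question; the csv module handles quoted commas, so they won't produce false positives:

import csv

# Flag any row whose field count doesn't match the expected three-column schema.
EXPECTED_COLUMNS = 3

with open("../data/faq_data.csv", newline="") as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        if len(row) != EXPECTED_COLUMNS:
            print(f"Row {lineno} has {len(row)} fields: {row}")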
I've definitely seen this before but I can't remember what exactly was wrong.