
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/51952535/


Pyspark Error: "Py4JJavaError: An error occurred while calling o655.count." when calling count() method on dataframe

python, dataframe, pyspark

Asked by Kaushik

I'm new to Spark and I'm using PySpark 2.3.1 to read a CSV file into a dataframe. I'm able to read the file and print values in a Jupyter notebook running within an Anaconda environment. This is the code I'm using:

# Imports needed for this snippet
from pyspark import sql
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Start session
spark = SparkSession \
.builder \
.appName("Embedding Models") \
.config('spark.ui.showConsoleProgress', 'true') \
.config("spark.master", "local[2]") \
.getOrCreate()

sqlContext = sql.SQLContext(spark)
schema = StructType([
         StructField("Index", IntegerType(), True),
         StructField("title", StringType(), True),
         StructField("body", StringType(), True)])

df = sqlContext.read.csv("../data/faq_data.csv",
                         header=True, 
                         mode="DROPMALFORMED",
                         schema=schema)

Output:


df.show()

+-----+--------------------+--------------------+
|Index|               title|                body|
+-----+--------------------+--------------------+
|    0|What does “quantu...|Quantum theory is...|
|    1|What is a quantum...|A quantum compute...|

However, when I call the .count() method on the dataframe, it throws the error below:

    ---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-29-913a2f9eb5fc> in <module>()
----> 1 df.count()

~/anaconda3/envs/Community/lib/python3.6/site-packages/pyspark/sql/dataframe.py in count(self)
    453         2
    454         """
--> 455         return int(self._jdf.count())
    456 
    457     @ignore_unicode_prefix

~/anaconda3/envs/Community/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~/anaconda3/envs/Community/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/anaconda3/envs/Community/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o655.count.
: java.lang.IllegalArgumentException
    at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
    at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
    at org.apache.spark.util.FieldAccessFinder$$anon$$anonfun$visitMethodInsn.apply(ClosureCleaner.scala:449)
    at org.apache.spark.util.FieldAccessFinder$$anon$$anonfun$visitMethodInsn.apply(ClosureCleaner.scala:432)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anon$$anonfun$foreach.apply(HashMap.scala:103)
    at scala.collection.mutable.HashMap$$anon$$anonfun$foreach.apply(HashMap.scala:103)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap$$anon.foreach(HashMap.scala:103)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.util.FieldAccessFinder$$anon.visitMethodInsn(ClosureCleaner.scala:432)
    at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
    at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean.apply(ClosureCleaner.scala:262)
    at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean.apply(ClosureCleaner.scala:261)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.rdd.RDD$$anonfun$collect.apply(RDD.scala:939)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
    at org.apache.spark.sql.Dataset$$anonfun$count.apply(Dataset.scala:2770)
    at org.apache.spark.sql.Dataset$$anonfun$count.apply(Dataset.scala:2769)
    at org.apache.spark.sql.Dataset$$anonfun.apply(Dataset.scala:3254)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
    at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:844)

I'm using Python 3.6.5 if that makes a difference.


Answered by luminousmen

What Java version do you have on your machine? Your problem is probably related to Java 9.


If you download Java 8, the exception will disappear. If you already have Java 8 installed, just change JAVA_HOME to point to it.
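
If you want to stay inside the notebook, here is a minimal sketch of that idea, assuming Java 8 is already installed; the path below is a guess at a typical Ubuntu/OpenJDK layout and must be adjusted to your machine. JAVA_HOME has to be set before the first SparkSession is created, because that is when PySpark launches the JVM.

import os

# Hypothetical Java 8 location -- check /usr/lib/jvm/ (or your platform's equivalent) for the real path.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark.sql import SparkSession

# Build the session only after JAVA_HOME is set, since this starts the JVM.
spark = (SparkSession.builder
         .appName("Embedding Models")
         .config("spark.master", "local[2]")
         .getOrCreate())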

Answered by Borislav Aymaliev

Could you try df.repartition(1).count() and len(df.toPandas())?

If it works, then the problem is most probably in your spark configuration.

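For reference, a minimal sketch of those two checks, using the df from the question:

print(df.repartition(1).count())   # count after collapsing to a single partition
print(len(df.toPandas()))          # row count after pulling the data to the driver as a pandas DataFrame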

Answered by Rene B.

On Linux, installing Java 8 as follows will help:

sudo apt install openjdk-8-jdk

Then set the default Java to version 8 using:


sudo update-alternatives --config java

***************** : 2   (enter 2 when it asks you to choose, then press Enter)
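
To confirm which Java the driver will actually pick up afterwards, a small check along these lines can help (an illustration, not part of the original answer):

import os
import subprocess

print(os.environ.get("JAVA_HOME"))   # may be None if the variable is not set

# "java -version" writes to stderr, so redirect it to stdout to capture it.
out = subprocess.run(["java", "-version"], stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT, universal_newlines=True)
print(out.stdout)   # should mention 1.8.x after switching to Java 8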

Answered by Nanda

Without being able to actually see the data, I would guess that it's a schema issue. I would recommend loading a smaller sample of the data, where you can ensure there are only 3 columns, to test that.
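
One way to approximate that "smaller sample" idea without editing the file, assuming the sqlContext and schema objects from the question are still in scope, is to only act on a small slice of the data:

sample = sqlContext.read.csv("../data/faq_data.csv",
                             header=True,
                             mode="DROPMALFORMED",
                             schema=schema)

sample.limit(10).show()          # eyeball the first rows
print(sample.limit(10).count())  # if even this small slice fails, the rows themselves are probably not the cause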

Since it's a CSV, another simple test could be to load the data and split it by newline and then by comma, to check whether anything is breaking your file.
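
A rough version of that check in plain Python, assuming the same path as in the question and that no field contains embedded commas or newlines (a real CSV parser would handle those):

from collections import Counter

with open("../data/faq_data.csv", encoding="utf-8") as f:
    # Count how many comma-separated fields each line has.
    field_counts = Counter(len(line.rstrip("\n").split(",")) for line in f)

print(field_counts)   # anything other than 3 fields per line points at rows breaking the file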

I've definitely seen this before but I can't remember what exactly was wrong.
