Original question: http://stackoverflow.com/questions/24046744/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
java+spark: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException
Asked by user2810081
I'm new to Spark, and was trying to run the example JavaSparkPi.java. It runs well, but because I have to use it from another Java class, I copied everything from main into a method of the class and called that method from main. Now it says:
org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException
The code looks like this:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;

public class JavaSparkPi {

    public void cal() {
        JavaSparkContext jsc = new JavaSparkContext("local", "JavaLogQuery");
        int slices = 2;
        int n = 100000 * slices;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            l.add(i);
        }

        JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

        System.out.println("count is: " + dataSet.count());
        dataSet.foreach(new VoidFunction<Integer>() {
            public void call(Integer i) {
                System.out.println(i);
            }
        });

        int count = dataSet.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) throws Exception {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) throws Exception {
                return integer + integer2;
            }
        });

        System.out.println("Pi is roughly " + 4.0 * count / n);
    }

    public static void main(String[] args) throws Exception {
        JavaSparkPi myClass = new JavaSparkPi();
        myClass.cal();
    }
}
Does anyone have an idea about this? Thanks!
Accepted answer by Daniel Darabos
The nested functions hold a reference to the containing object (JavaSparkPi), so that object will get serialized. For this to work, it needs to be serializable. Simple to do:
public class JavaSparkPi implements Serializable {
...
Answered by Anuj J
The main problem is that when you create an anonymous class in Java, it is passed a reference to the enclosing class. This can be fixed in several ways:
Declare the enclosing class Serializable
This works in your case, but it will fall flat if your enclosing class has a field that is not serializable. I would also say that serializing the parent class is a total waste.
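As an illustration (a hedged sketch, not from the original answer; the field is hypothetical), a non-serializable field such as the JavaSparkContext itself can be marked transient so that serialization skips it:

import java.io.Serializable;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaSparkPi implements Serializable {
    // transient fields are skipped during serialization; JavaSparkContext
    // is not serializable and only needs to live on the driver
    private transient JavaSparkContext jsc;
    ...
}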
Create the Closure in a static function
Creating the closure inside a static function means the anonymous class has no enclosing instance to capture, so nothing else needs to be made serializable.
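A minimal sketch of this approach (the helper name sampleInCircle is made up for illustration); in Spark's Java API the Function interface itself extends Serializable, so the anonymous class serializes on its own:

// inside JavaSparkPi
private static Function<Integer, Integer> sampleInCircle() {
    // created in a static context: no enclosing JavaSparkPi instance is
    // captured, so Spark only serializes this anonymous function
    return new Function<Integer, Integer>() {
        @Override
        public Integer call(Integer integer) throws Exception {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return (x * x + y * y < 1) ? 1 : 0;
        }
    };
}

// usage inside cal():
// int count = dataSet.map(sampleInCircle()).reduce(...);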
Answered by Jeetu
This error comes about because you have multiple physical CPUs in your local machine or cluster, and the Spark engine tries to send this function to multiple CPUs over the network. Your function
dataSet.foreach(new VoidFunction<Integer>() {
    public void call(Integer i) {
        System.out.println(i);   // <-- the call the answer highlights
    }
});
uses println(), which is not serializable, so the Spark engine throws the exception. The solution is to use the following instead:
// collect() first brings the data back to the driver as a java.util.List,
// whose forEach takes a java.util.function.Consumer rather than Spark's VoidFunction
dataSet.collect().forEach(new java.util.function.Consumer<Integer>() {
    @Override
    public void accept(Integer i) {
        System.out.println(i);
    }
});