org.apache.thrift.transport.TTransportException error while reading a large JSON file in Zeppelin Scala

Note: this content is from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36835122/

Tags: json, scala, apache-spark, apache-zeppelin

Asked by Kiran Shashi

I am trying to read a large JSON file (1.5 GB) using Zeppelin and Scala.

Zeppelin runs Spark in local mode, installed on Ubuntu on a VM with 10 GB of RAM. I have allotted 8 GB to spark.executor.memory.
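
For context, that property typically lives in Zeppelin's Spark interpreter settings or in conf/spark-defaults.conf. A sketch of the setup described above, with the master URL being an assumption:

spark.master           local[*]
spark.executor.memory  8g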

My code is as follows:

val inputFileWeather="/home/shashi/incubator-zeppelin-master/data/ai/weather.json"
val temp=sqlContext.read.json(inputFileWeather)
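
(A side note for files this size: with no schema given, sqlContext.read.json first scans the whole 1.5 GB file just to infer one. A sketch of supplying an explicit schema up front, with hypothetical field names since weather.json's structure is not shown, avoids that extra pass:)

import org.apache.spark.sql.types._

// Hypothetical schema -- the real fields of weather.json are not shown in the question.
val weatherSchema = StructType(Seq(
  StructField("station", StringType),
  StructField("date", StringType),
  StructField("temperature", DoubleType)
))

// With an explicit schema, Spark skips the schema-inference scan over the file.
val tempWithSchema = sqlContext.read.schema(weatherSchema).json(inputFileWeather)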

I am getting the following error:

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:241)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:225)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:229)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:229)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Answered by user1314742

The error you got is due to a problem running the Spark interpreter, so Zeppelin could not connect to the interpreter process.

You have to check your logs located in /PATH/TO/ZEPPELIN/logs/*.out to know exactly what is happening. Perhaps in the interpreter logs you will see an OOM (OutOfMemoryError).

I think that 8 GB of executor memory on a VM with 10 GB of RAM is unreasonable (and how many executors are you starting?). You have to consider the driver memory as well.
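
One detail worth noting: in local mode the executor runs inside the driver JVM, so spark.executor.memory is effectively ignored and it is the driver's heap that matters. A rough budget for a 10 GB VM, with all values being assumptions, might look like this in conf/spark-defaults.conf:

# Leave roughly 2 GB for the OS and the Zeppelin server itself (assumed split).
spark.driver.memory   6g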

Answered by Aditya Bangard

Increase the driver memory in the Spark interpreter, i.e. spark.driver.memory. By default it is 1 GB.
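
Two common places to set it, with the 4g value being an assumption for the 10 GB VM described above: as a spark.driver.memory property in Zeppelin's Spark interpreter settings, or via spark-submit options in conf/zeppelin-env.sh:

export SPARK_SUBMIT_OPTIONS="--driver-memory 4g"

After changing either one, restart the Spark interpreter from the Zeppelin UI so the new setting takes effect.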