Python: How to turn off INFO logging in Spark?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25193488/

How to turn off INFO logging in Spark?

python, scala, apache-spark, hadoop, pyspark

Asked by horatio1701d

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also do the Quick Start guide successfully.

However, I cannot for the life of me figure out how to stop all of the verbose INFO logging after each command.

I have tried nearly every possible scenario in the code below (commenting out, setting to OFF) in my log4j.properties file in the conf folder where I launch the application from, as well as on each node, and nothing is doing anything. I still get the INFO logging statements printing after executing each statement.

I am very confused with how this is supposed to work.

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender 
log4j.appender.console.target=System.err     
log4j.appender.console.layout=org.apache.log4j.PatternLayout 
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Here is my full classpath when I use SPARK_PRINT_LAUNCH_COMMAND:

Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/bin/java -cp :/root/spark-1.0.1-bin-hadoop2/conf:/root/spark-1.0.1-bin-hadoop2/conf:/root/spark-1.0.1-bin-hadoop2/lib/spark-assembly-1.0.1-hadoop2.2.0.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/root/spark-1.0.1-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main

contents of spark-env.sh:

#!/usr/bin/env bash

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with 
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH=/root/spark-1.0.1-bin-hadoop2/conf/

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default')
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

export SPARK_SUBMIT_CLASSPATH="$FWDIR/conf"

Accepted answer by poiuytrez

Just execute this command in the spark directory:

cp conf/log4j.properties.template conf/log4j.properties

Edit log4j.properties:

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Replace the first line:

log4j.rootCategory=INFO, console

with:

log4j.rootCategory=WARN, console

Save and restart your shell. It works for me for Spark 1.1.0 and Spark 1.5.1 on OS X.

Answer by Josh Rosen

This may be due to how Spark computes its classpath. My hunch is that Hadoop's log4j.properties file is appearing ahead of Spark's on the classpath, preventing your changes from taking effect.

If you run

SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell

then Spark will print the full classpath used to launch the shell; in my case, I see

Spark Command: /usr/lib/jvm/java/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.1.jar:/root/spark/lib/datanucleus-core-3.2.2.jar:/root/spark/lib/datanucleus-rdbms-3.2.1.jar -XX:MaxPermSize=128m -Djava.library.path=:/root/ephemeral-hdfs/lib/native/ -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main

where /root/ephemeral-hdfs/conf is at the head of the classpath.

I've opened an issue [SPARK-2913] to fix this in the next release (I should have a patch out soon).

In the meantime, here are a couple of workarounds:

与此同时,这里有几个解决方法:

  • Add export SPARK_SUBMIT_CLASSPATH="$FWDIR/conf" to spark-env.sh.
  • Delete (or rename) /root/ephemeral-hdfs/conf/log4j.properties.

Answer by AkhlD

Edit your conf/log4j.properties file and change the following line:

   log4j.rootCategory=INFO, console

to

    log4j.rootCategory=ERROR, console

Another approach would be to:

Fire up spark-shell and type in the following:

import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

You won't see any logs after that.

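The same idea can be used from a pyspark shell through the py4j gateway. The lines below are only a sketch, assuming the sc object that the pyspark shell provides:

# Sketch: PySpark equivalent of the Scala snippet above, using the shell's sc
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getLogger("org").setLevel(log4j.Level.OFF)
log4j.LogManager.getLogger("akka").setLevel(log4j.Level.OFF)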

Answer by oleksii

I used this with Amazon EC2 with 1 master and 2 slaves and Spark 1.2.1.

# Step 1. Change config file on the master node
nano /root/ephemeral-hdfs/conf/log4j.properties

# Before
hadoop.root.logger=INFO,console
# After
hadoop.root.logger=WARN,console

# Step 2. Replicate this change to slaves
~/spark-ec2/copy-dir /root/ephemeral-hdfs/conf/

Answer by FDS

Inspired by pyspark/tests.py, I did

def quiet_logs(sc):
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
    logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

Calling this just after creating the SparkContext reduced the stderr lines logged for my test from 2647 to 163. However, creating the SparkContext itself still logs those 163 lines, up to

15/08/25 10:14:16 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

and it's not clear to me how to adjust those programmatically.

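For reference, a minimal usage sketch of the quiet_logs helper above; the local master and the app name are made-up values for illustration:

from pyspark import SparkContext

sc = SparkContext("local", "quiet-test")  # master and app name are assumptions for this sketch
quiet_logs(sc)  # from here on, only ERROR and above from the org/akka loggers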

Answer by wannik

>>> log4j = sc._jvm.org.apache.log4j
>>> log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

Answer by Galen Long

For PySpark, you can also set the log level in your scripts with sc.setLogLevel("FATAL"). From the docs:

Control our logLevel. This overrides any user-defined log settings. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

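As an illustration, here is a minimal standalone sketch; the master and app name are made up for the example:

from pyspark import SparkContext

sc = SparkContext("local", "quiet-app")  # master and app name are assumptions for this sketch
sc.setLogLevel("FATAL")  # one of: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
# ... run your job here ...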

Answer by user3827333

The way I do it is:

in the location where I run the spark-submit script, do

$ cp /etc/spark/conf/log4j.properties .
$ nano log4j.properties

change INFO to whatever level of logging you want, and then run your spark-submit

Answer by santifinland

If you want to keep using logging (the logging facility for Python), you can try splitting the configurations for your application and for Spark:

import logging

LoggerManager()  # the application's own logging setup, as in the original answer
logger = logging.getLogger(__name__)

# quiet the 'py4j' logger that PySpark's Java gateway uses
loggerSpark = logging.getLogger('py4j')
loggerSpark.setLevel('WARNING')

Answer by mdh

In Spark 2.0 you can also configure it dynamically for your application using setLogLevel:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.\
        master('local').\
        appName('foo').\
        getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

In the pyspark console, a default spark session will already be available.

pyspark控制台中,默认spark会话已经可用。