Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/2656159/


Configuring Hadoop logging to avoid too many log files

Tags: java, log4j, hadoop, mapreduce

Asked by Eric Wendelin

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories), which looks like the same problem as in this question: Error in Hadoop MapReduce

My question is: does anyone know how to configure Hadoop to roll the log dir or otherwise prevent this? I'm trying to avoid just setting the "mapred.userlog.retain.hours" and/or "mapred.userlog.limit.kb" properties because I want to actually keep the log files.

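For context, a minimal sketch of how the two properties mentioned above would be set in mapred-site.xml for Hadoop 0.20.x; the values are illustrative only, and the rest of the discussion refers back to them:

<property>
  <name>mapred.userlog.retain.hours</name>
  <value>24</value>
  <description>Delete per-task user logs this many hours after job completion.</description>
</property>
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>256</value>
  <description>Cap the size of each task log; 0 means no limit.</description>
</property>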

I was also hoping to configure this in log4j.properties, but looking at the Hadoop 0.20.2 source, it writes directly to logfiles instead of actually using log4j. Perhaps I don't understand how it's using log4j fully.

Any suggestions or clarifications would be greatly appreciated.

Accepted answer by Chase

Unfortunately, there isn't a configurable way to prevent that. Every task for a job gets one directory in history/userlogs, which will hold the stdout, stderr, and syslog task log output files. The retain-hours setting will help keep too many of those from accumulating, but you'd have to write a good log rotation tool to auto-tar them.

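A minimal cron-style sketch of such a rotation tool, assuming the default userlogs layout under $HADOOP_LOG_DIR; the fallback path and the 7-day retention are assumptions, not part of the original answer:

#!/bin/sh
# Tar up and remove per-task log directories older than 7 days (hypothetical paths).
USERLOGS="${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs"
ARCHIVE="${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs-archive"
mkdir -p "$ARCHIVE"
find "$USERLOGS" -mindepth 1 -maxdepth 1 -type d -mtime +7 | while read -r dir; do
  name=$(basename "$dir")
  tar czf "$ARCHIVE/$name.tar.gz" -C "$USERLOGS" "$name" && rm -rf "$dir"
done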

We had this problem too when we were writing to an NFS mount, because all nodes would share the same history/userlogs directory. This means one job with 30,000 tasks would be enough to break the FS. Logging locally is really the way to go when your cluster actually starts processing a lot of data.

If you are already logging locally and still manage to process 30,000+ tasks on one machine in less than a week, then you are probably creating too many small files, causing too many mappers to spawn for each job.

Answered by Jon Snyder

I had this same problem. Set the environment variable "HADOOP_ROOT_LOGGER=WARN,console" before starting Hadoop.

# Set the Hadoop root logger to WARN and send its output to the console
export HADOOP_ROOT_LOGGER="WARN,console"
hadoop jar start.jar

Answered by milan

Configuring Hadoop to use log4j and setting

log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10

as described on this wiki page doesn't work?

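For reference, a fuller log4j.properties sketch of a size-bounded rolling appender along those lines; the appender name FILE_AP1 follows the fragment above, and the path and sizes are assumptions. As the question points out, the per-task userlogs in 0.20 are written directly rather than through log4j, so a setup like this mainly bounds the daemon logs:

# Hypothetical log4j 1.x configuration with a rolling file appender
log4j.rootLogger=INFO,FILE_AP1
log4j.appender.FILE_AP1=org.apache.log4j.RollingFileAppender
log4j.appender.FILE_AP1.File=${hadoop.log.dir}/hadoop.log
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
log4j.appender.FILE_AP1.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_AP1.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n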

Looking at the LogLevel source code, it seems that Hadoop uses Commons Logging, which tries to use log4j by default and falls back to the JDK logger if log4j is not on the classpath.

Btw, it's possible to change log levels at runtime; take a look at the commands manual.

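For example, the daemonlog command from that manual can query or change a daemon's log level over its HTTP port; the host below is a placeholder, and 50060 is the default TaskTracker HTTP port in 0.20:

hadoop daemonlog -getlevel tasktracker-host:50060 org.apache.hadoop.mapred.TaskTracker
hadoop daemonlog -setlevel tasktracker-host:50060 org.apache.hadoop.mapred.TaskTracker WARN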

Answered by Stephen C

According to the documentation, Hadoop uses log4j for logging. Maybe you are looking in the wrong place ...

Answered by mountrix

I also ran into the same problem.... Hive produces a lot of logs, and when the node's disk is full, no more containers can be launched. In YARN, there is currently no option to disable logging. One particularly huge file is the syslog file, which generated GBs of logs in a few minutes in our case.

Setting the yarn.nodemanager.log.retain-seconds property in "yarn-site.xml" to a small value does not help. Setting "yarn.nodemanager.log-dirs" to "file:///dev/null" is not possible because a directory is needed. Removing the write permission (chmod -r /logs) did not work either.

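For reference, the settings this answer tried would look roughly like this in yarn-site.xml; the values are illustrative, and as noted above they did not solve the problem in the author's case:

<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>3600</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/log/hadoop-yarn/containers</value>
</property>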

One solution could be to use a "null blackhole" directory. Check here: https://unix.stackexchange.com/questions/9332/how-can-i-create-a-dev-null-like-blackhole-directory

Another solution that works for us is to disable the logs before running the jobs. For instance, in Hive, starting the script with the following lines works:

set yarn.app.mapreduce.am.log.level=OFF;
set mapreduce.map.log.level=OFF;
set mapreduce.reduce.log.level=OFF;
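If editing the script is not an option, the same properties can presumably also be passed on the command line with Hive's --hiveconf flag; the script name below is a placeholder:

hive --hiveconf yarn.app.mapreduce.am.log.level=OFF \
     --hiveconf mapreduce.map.log.level=OFF \
     --hiveconf mapreduce.reduce.log.level=OFF \
     -f my_script.hql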