Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/11697810/

Grep across multiple files in Hadoop Filesystem

Tags: bash, shell, unix, hadoop, grep

Asked by arsenal

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time

...which returns several entries like this:

-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab

How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.

Answered by phs

This is a hadoop "filesystem", not a POSIX one, so try this:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done

This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"

Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.

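For instance, a rough (purely optional) sketch is to time a few trial runs and keep the lowest -P value after which the wall-clock time stops improving; the path and pattern are the same as above, and the candidate -P values are arbitrary:

for p in 5 10 20; do
  echo "trying -P $p"
  time (hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
        xargs -n 1 -I ^ -P "$p" bash -c \
        "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^")
done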

EDIT: Given that you're on SunOS (which is slightly brain-dead) try this:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"; done

Answered by Gourav Goutam

To recursively find all files with a given extension (here .log) inside an HDFS location:

hadoop fs -find hdfs_loc_path -name "*.log"
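
Note that -name only matches file names, not file contents; to also grep the contents of whatever -find returns, one possible sketch (the name filter and pattern below are only illustrative) is to feed the matches into the same kind of loop as in the first answer:

hadoop fs -find /apps/hdmi-technology/b_dps/real-time -name "*.log" | \
while read -r f; do
  # keep only the files whose contents contain the pattern
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done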

Answered by D Xia

hadoop fs -find /apps/mdhi-technology/b_dps/real-time  -name "*bcd4bc3e1380a56108f486a4fffbc8dc*"

hadoop fs -find /apps/mdhi-technology/b_dps/real-time  -name "bcd4bc3e1380a56108f486a4fffbc8dc"

Answered by David Ongaro

Using hadoop fs -cat (or the more generic hadoop fs -text) might be feasible if you just have two 1 GB files. For 100 files though I would use the streaming API, because it can be used for ad-hoc queries without resorting to a full-fledged MapReduce job. E.g. in your case create a script get_filename_for_pattern.sh:

#!/bin/bash
# $1 is the search pattern; mapreduce_map_input_file is set by Hadoop streaming
grep -q "$1" && echo "$mapreduce_map_input_file"
cat >/dev/null # ignore the rest

Note that you have to read the whole input, in order to avoid getting java.io.IOException: Stream closed exceptions.

Then issue the commands

hadoop jar $HADOOP_HOME/hadoop-streaming.jar\
 -Dstream.non.zero.exit.is.failure=false\
 -files get_filename_for_pattern.sh\
 -numReduceTasks 1\
 -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc"\
 -reducer "uniq"\
 -input /apps/hdmi-technology/b_dps/real-time/*\
 -output /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc
hadoop fs -cat /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*

In newer distributions, mapred streaming instead of hadoop jar $HADOOP_HOME/hadoop-streaming.jar should work. In the latter case you have to set your $HADOOP_HOME correctly in order to find the jar (or provide the full path directly).

For simpler queries you don't even need a script; you can just provide the command to the -mapper parameter directly. But for anything slightly complex it's preferable to use a script, because getting the escaping right can be a chore.

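For example, a minimal sketch of such an inline mapper (assuming a recent distribution with the mapred streaming entry point; the output path is arbitrary). Note that plain grep as the mapper emits the matching lines, not the names of the files that contain them:

mapred streaming \
 -Dstream.non.zero.exit.is.failure=false \
 -numReduceTasks 0 \
 -mapper "grep bcd4bc3e1380a56108f486a4fffbc8dc" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/lines_matching_bcd4bc3e1380a56108f486a4fffbc8dc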

If you don't need a reduce phase, provide the symbolic NONE parameter to the respective -reducer option (or just use -numReduceTasks 0). But in your case it's useful to have a reduce phase in order to have the output consolidated into a single file.

Answered by Mukesh Gupta

If you are looking to apply the grep command on an HDFS folder:

hdfs dfs -cat /user/coupons/input/201807160000/* | grep -c null

Here cat goes through all the files in the folder, and grep -c counts the lines that contain "null".

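If a per-file count is more useful than a single total, a small variation on the same idea (same hypothetical path; the awk filter just skips the "Found N items" header line) would be:

hdfs dfs -ls /user/coupons/input/201807160000/ | awk 'NF>=8 {print $8}' | \
while read -r f; do
  # print each file's path followed by its count of matching lines
  echo "$f: $(hdfs dfs -cat "$f" | grep -c null)"
done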