Java: using Hadoop to find files that contain a particular string

Question

Asked by arsenal

I have around 1000 files, each about 1 GB in size. I need to search all 1000 files for a string and find out which files contain it. The files are all stored in the Hadoop File System (HDFS).

All 1000 files are under the real folder, so the command below lists all of them. I need to find which files under real contain a particular string, hello.

bash-3.00$ hadoop fs -ls /technology/dps/real

This is my data structure in HDFS:

row format delimited 
fields terminated by ''
collection items terminated by ','
map keys terminated by ':'
stored as textfile

How can I write a MapReduce job for this problem so that I can find which files contain a particular string? Any simple example would be a great help.

Update:

I can solve this with grep in Unix, but it is very slow and takes a long time to produce the output:

hadoop fs -ls /technology/dps/real | awk '{print }' | while read f; do hadoop fs -cat $f | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo $f; done

That is why I am looking for a MapReduce job to handle this kind of problem.

Answer 1

Answered by Josh Rosen

It sounds like you're looking for a grep-like program, which is easy to implement using Hadoop Streaming (the Hadoop Java API would work too):

First, write a mapper that outputs the name of the file being processed if the line being processed contains your search string. I used Python, but any language would work:

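#!/usr/bin/env python
import os
import sys

# The search term is passed to the job via -cmdenv SEARCH_STRING=...
SEARCH_STRING = os.environ["SEARCH_STRING"]
# Hadoop Streaming exports job properties as environment variables; older
# releases use map_input_file, newer ones mapreduce_map_input_file.
INPUT_FILE = os.environ.get("mapreduce_map_input_file") or os.environ["map_input_file"]

for line in sys.stdin:
    # Whole-token match: the line is split on whitespace first.
    if SEARCH_STRING in line.split():
        print(INPUT_FILE)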

This code reads the search string from the SEARCH_STRING environment variable. Here, I split the input line on whitespace and check whether the search string equals any of the resulting tokens; you could change this to perform a substring search or use regular expressions to check for matches.
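
The three matching strategies mentioned above behave differently on the same line, so here is a minimal sketch comparing them. The function names (token_match, substring_match, regex_match) and the sample line are made up for illustration and are not part of the original mapper; any one of them could replace the condition inside the mapper's loop.

```python
import re


def token_match(line, needle):
    """True if needle equals one whitespace-separated token of line."""
    return needle in line.split()


def substring_match(line, needle):
    """True if needle occurs anywhere in line."""
    return needle in line


def regex_match(line, pattern):
    """True if the regular expression pattern matches somewhere in line."""
    return re.search(pattern, line) is not None


if __name__ == "__main__":
    sample = "hello-world Apache2 README"  # hypothetical input line
    print(token_match(sample, "Apache"))         # no token is exactly "Apache"
    print(substring_match(sample, "Apache"))     # "Apache" occurs inside "Apache2"
    print(regex_match(sample, r"\bApache\d\b"))  # "Apache" followed by one digit
```

For a search key such as the hash in the question, substring matching is probably the closest equivalent of the plain grep call, since the key may sit inside a longer delimited field.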

Next, run a Hadoop streaming job using this mapper and no reducers:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input hdfs:///data \
    -mapper search.py \
    -file search.py \
    -output /search_results \
    -cmdenv SEARCH_STRING="Apache"

The output will be written in several parts; to obtain a list of matches, you can simply cat the files (provided they aren't too big):

$ bin/hadoop fs -cat /search_results/part-*
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/ivy.xml   
hdfs://localhost/data/README.txt
...
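
Because the mapper prints the file name once per matching line, the same path can appear many times, as CHANGES.txt does above. A minimal sketch of a post-processing step that keeps each path once, in first-seen order, might look like this; the script name unique_files.py and the "-" argument convention are assumptions for illustration, not part of the original answer.

```python
import sys


def unique_files(lines):
    """Return the distinct non-empty lines, stripped, in first-seen order."""
    seen = set()
    result = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result


# Read from standard input only when "-" is passed explicitly, so that
# importing this file never blocks waiting for input.
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    for name in unique_files(sys.stdin):
        print(name)
```

Under those assumptions it would be used as: bin/hadoop fs -cat /search_results/part-* | python unique_files.py -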

Answer 2

Answered by Donald Miner

To get the filename you are currently processing, do:

((FileSplit) context.getInputSplit()).getPath().getName()

As you scan each file record by record, emit the above path (and maybe the line or anything else) whenever you see hello.

Set the number of reducers to 0; they aren't doing anything here.

Does 'row format delimited' mean that lines are delimited by a newline? If so, TextInputFormat and LineRecordReader work fine here.

Answer 3

Answered by rtheunissen

You can try something like this, though I'm not sure if it's an efficient way to go about it. Let me know if it works - I haven't tested it or anything.

You can use it like this: java SearchFiles /technology/dps/real hello, making sure you run it from the appropriate directory, of course.

// Note: java.io.File walks the local file system, so this expects a locally readable path.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

public class SearchFiles {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        File searchDir = new File(args[0]);
        String searchString = args[1];
        ArrayList<File> matches = checkFiles(searchDir.listFiles(), searchString, new ArrayList<File>());
        System.out.println("These files contain '" + searchString + "':");
        for (File file : matches) {
            System.out.println(file.getPath());
        }
    }

    private static ArrayList<File> checkFiles(File[] files, String search, ArrayList<File> acc) throws IOException {
        for (File file : files) {
            if (file.isDirectory()) {
                checkFiles(file.listFiles(), search, acc);
            } else {
                if (fileContainsString(file, search)) {
                    acc.add(file);
                }
            }
        }
        return acc;
    }

    private static boolean fileContainsString(File file, String search) throws IOException {
        // try-with-resources closes the reader on every exit path, including exceptions
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(search)) {
                    return true;
                }
            }
        }
        return false;
    }
}
