Note: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/34791138/

Date: 2020-08-11 16:00:08  Source: igfitidea

Find pattern in files with java 8

Tags: java, regex, java-8

Asked by Emerson Cod

Consider a file like this (just an excerpt):

name: 'foobar'

I would like to retrieve foobar when I find the line containing name.

My current approach is:

Pattern m = Pattern.compile("name: '(.+)'");
try (Stream<String> lines = Files.lines(ruleFile)) {
    Optional<String> message = lines.filter(m.asPredicate()).findFirst();
    if (message.isPresent()) {
        Matcher matcher = m.matcher(message.get());
        matcher.find();
        String group = matcher.group(1);
        System.out.println(group);
    }
}

which does not look nice. Using both the pattern and the matcher this heavily seems wrong.

Is there an easier or better way? Especially if I have multiple keys that I would like to search for like this?

Accepted answer by khelwood

I would expect something more like this, to avoid matching the pattern twice:

Pattern p = Pattern.compile("name: '([^']*)'");
lines.map(p::matcher)
     .filter(Matcher::matches)
     .findFirst()
     .ifPresent(matcher -> System.out.println(matcher.group(1)));

That is, create a matcher for each string, take the first one that matches, and print its first group.
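
The snippet above assumes an existing lines stream. A minimal self-contained sketch of the same idea follows; the class name FirstNameFinder, the method firstName, and the temporary file contents are made up for illustration. Note that Matcher::matches requires the whole line to match the pattern, whereas the question's asPredicate() accepts a match anywhere in the line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical wrapper class, not part of the original answer.
public class FirstNameFinder {
    private static final Pattern NAME = Pattern.compile("name: '([^']*)'");

    // Returns group 1 of the first line that matches the pattern in full.
    public static Optional<String> firstName(Stream<String> lines) {
        return lines.map(NAME::matcher)        // one matcher per line
                    .filter(Matcher::matches)  // the match happens only once, here
                    .findFirst()
                    .map(m -> m.group(1));
    }

    public static void main(String[] args) throws IOException {
        // Sample file contents are an assumption for this demo.
        Path ruleFile = Files.createTempFile("rules", ".txt");
        Files.write(ruleFile, Arrays.asList("id: 42", "name: 'foobar'", "name: 'other'"));
        try (Stream<String> lines = Files.lines(ruleFile)) {
            System.out.println(firstName(lines).orElse("<none>")); // prints foobar
        } finally {
            Files.delete(ruleFile);
        }
    }
}
```

Returning an Optional instead of printing inside ifPresent keeps the lookup reusable; that is a design choice of this sketch, not something the answer prescribes.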

Answer by Holger

This is most likely what the Java 9 solution will look like:

Matcher m = Pattern.compile("name: '(.+)'").matcher("");
try(Stream<String> lines = Files.lines(ruleFile)) {
    lines.flatMap(line -> m.reset(line).results().limit(1))
         .forEach(mr -> System.out.println(mr.group(1)));
}

It uses the method Matcher.results(), which returns a stream of all matches. Combining a stream of lines with a stream of matches via flatMap allows us to process all matches in a file. Since your original code only processes the first match of a line, I simply added a limit(1) to the matches of each line to get the same behavior.
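
To illustrate the "all matches in a file" case, here is a small sketch that drops the limit(1) and collects every match. It requires Java 9 or later for Matcher.results(); the class name AllMatches, the method allNames, and the sample lines are assumptions for the demo, and the pattern uses [^']* so that several matches per line are possible.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; requires Java 9+ because of Matcher.results().
public class AllMatches {
    private static final Pattern NAME = Pattern.compile("name: '([^']*)'");

    // Flattens the per-line match streams into one stream of captured names.
    public static List<String> allNames(Stream<String> lines) {
        return lines.flatMap(line -> NAME.matcher(line).results())
                    .map(mr -> mr.group(1))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("name: 'a', name: 'b'", "no match here", "name: 'c'");
        System.out.println(allNames(lines)); // prints [a, b, c]
    }
}
```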

Unfortunately, this feature is missing in Java 8. However, peeking at the upcoming release gives an idea of what an interim solution may look like:

Matcher m = Pattern.compile("name: '(.+)'").matcher("");
try(Stream<String> lines = Files.lines(ruleFile)) {
    lines.flatMap(line -> m.reset(line).find()? Stream.of(m.toMatchResult()): null)
         .forEach(mr -> System.out.println(mr.group(1)));
}

To simplify the sub-stream creation, this solution takes advantage of the fact that only the first match is wanted and creates a single-element stream in the first place. Returning null for a non-matching line is allowed, because flatMap treats a null mapped stream as an empty stream.
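
For readers who prefer not to rely on flatMap's handling of null, a variant of the Java 8 snippet can return Stream.empty() explicitly. This is only a sketch: the class FirstMatchPerLine and its methods are hypothetical names, and the results are collected into a list rather than printed so they are easy to inspect.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; works on Java 8.
public class FirstMatchPerLine {
    // Zero-or-one element stream holding a snapshot of the first match on the line.
    private static Stream<MatchResult> firstMatch(Matcher m, String line) {
        return m.reset(line).find() ? Stream.of(m.toMatchResult()) : Stream.empty();
    }

    // Uses one shared matcher, so this is for sequential streams only.
    public static List<String> firstNames(Stream<String> lines) {
        Matcher m = Pattern.compile("name: '(.+)'").matcher("");
        return lines.flatMap(line -> firstMatch(m, line))
                    .map(mr -> mr.group(1))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstNames(Stream.of("name: 'foobar'", "id: 1", "name: 'baz'")));
        // prints [foobar, baz]
    }
}
```

The toMatchResult() call matters here: it takes an immutable snapshot before the shared matcher is reset for the next line.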

But note that with the question's pattern name: '(.+)' it doesn't matter whether we limit the number of matches, as .+ will greedily match all characters up to the last ' on the line, so another match is impossible. Things are different when using a reluctant quantifier, as with name: '(.*?)', which consumes only up to the next ' rather than the last one, or when explicitly forbidding skipping past a ', as with name: '([^']*)'.
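
The difference between the three patterns is easy to see on a line that contains two entries. The sketch below (class QuantifierDemo, method groups, and the sample line are made up for illustration) lists group 1 of every match for a given regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; works on Java 8.
public class QuantifierDemo {
    // Returns group 1 of every match of the regex in the input.
    public static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "name: 'foo' name: 'bar'";
        System.out.println(groups("name: '(.+)'", line));    // prints [foo' name: 'bar]
        System.out.println(groups("name: '(.*?)'", line));   // prints [foo, bar]
        System.out.println(groups("name: '([^']*)'", line)); // prints [foo, bar]
    }
}
```

With the greedy pattern there is exactly one match that swallows everything up to the last quote, so limit(1) changes nothing; with the other two patterns it does.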



The solutions above use a shared Matcher, which works well for single-threaded use (and this task is unlikely to ever benefit from parallel processing). But if you want to be on the thread-safe side, you can share only a Pattern and create a new Matcher for each line instead of calling m.reset(line):

Pattern pattern = Pattern.compile("name: '(.*)'");
try(Stream<String> lines = Files.lines(ruleFile)) {
    lines.flatMap(line -> pattern.matcher(line).results().limit(1))
         .forEach(mr -> System.out.println(mr.group(1)));
}

or, with Java 8:

try(Stream<String> lines = Files.lines(ruleFile)) {
    lines.flatMap(line -> {Matcher m=pattern.matcher(line);
                           return m.find()? Stream.of(m.toMatchResult()): null;})
         .forEach(mr -> System.out.println(mr.group(1)));
}

which isn't that concise due to the introduction of a local variable. This can be avoided with a preceding map operation, but at that point, as long as we only want a single match per line, we don't need flatMap at all:

try(Stream<String> lines = Files.lines(ruleFile)) {
    lines.map(pattern::matcher).filter(Matcher::find)
         .forEach(m -> System.out.println(m.group(1)));
}

Since each Matcher is used exactly once, in a non-interfering way, its mutable nature doesn't hurt here, and a conversion to an immutable MatchResult becomes unnecessary.

However, these solutions can't be scaled to process multiple matches per line, if that ever becomes necessary…
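
If multiple matches per line are ever needed on Java 8, one possible workaround is a small helper that stands in for Matcher.results(). The sketch below is an assumption of how that could look, not part of the original answer: the class MatchStreams and its methods are hypothetical, and unlike the Java 9 method it collects the matches eagerly.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical Java 8 helper, not a JDK API.
public class MatchStreams {
    // Eagerly snapshots every match of the pattern in the input.
    public static Stream<MatchResult> results(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        Stream.Builder<MatchResult> builder = Stream.builder();
        while (m.find()) {
            builder.add(m.toMatchResult()); // immutable copy of the current match
        }
        return builder.build();
    }

    // Collects group 1 of every match on every line.
    public static List<String> allNames(Stream<String> lines) {
        Pattern pattern = Pattern.compile("name: '([^']*)'");
        return lines.flatMap(line -> results(pattern, line))
                    .map(mr -> mr.group(1))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allNames(Stream.of("name: 'a' name: 'b'", "name: 'c'")));
        // prints [a, b, c]
    }
}
```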

Answer by AJNeufeld

The answer by @khelwood creates a new Matcher object for every line, which can be a source of inefficiency when scanning long files.

The following solution creates the matcher only once, and reuses it for each line in the file.

Pattern p = Pattern.compile("name: '([^']*)'");
Matcher matcher = p.matcher(""); // Create a matcher for the pattern

try (Stream<String> lines = Files.lines(ruleFile)) { // Close the file when done
    lines.map(matcher::reset)        // Reuse the matcher object
         .filter(Matcher::matches)
         .findFirst()
         .ifPresent(m -> System.out.println(m.group(1)));
}

Warning -- Suspicious Hack Ahead

The .map(matcher::reset) pipeline stage is where the magic/hack happens. It effectively calls matcher.reset(line), which resets matcher to perform the next match on the line just read from the file, and returns the matcher itself to allow chained calls. The .map(...) stream operator sees this as mapping from the line to a Matcher object, but in reality we keep mapping to the same matcher object each time, violating all sorts of rules about side effects, etc.

Of course, this cannot be used with parallel streams, but fortunately reading from a file is inherently sequential.
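
Even in a sequential stream the shared matcher has a sharp edge: it is only safe while each element is consumed before the next line is read, as findFirst() does. The following sketch shows what can go wrong when the matchers are collected first; the class SharedMatcherPitfall, its methods, and the sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; works on Java 8.
public class SharedMatcherPitfall {
    private static final Pattern NAME = Pattern.compile("name: '([^']*)'");

    // Collects the shared matcher once per matching line, then reads group 1 afterwards.
    public static List<String> withSharedMatcher(List<String> lines) {
        Matcher shared = NAME.matcher("");
        List<Matcher> kept = lines.stream()
                .map(shared::reset)          // always the same object
                .filter(Matcher::matches)
                .collect(Collectors.toList());
        return kept.stream().map(m -> m.group(1)).collect(Collectors.toList());
    }

    // Same pipeline, but with a fresh matcher per line.
    public static List<String> withFreshMatchers(List<String> lines) {
        List<Matcher> kept = lines.stream()
                .map(NAME::matcher)
                .filter(Matcher::matches)
                .collect(Collectors.toList());
        return kept.stream().map(m -> m.group(1)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("name: 'a'", "name: 'b'");
        System.out.println(withSharedMatcher(lines)); // prints [b, b]
        System.out.println(withFreshMatchers(lines)); // prints [a, b]
    }
}
```

The shared variant reports the last line twice because both list entries refer to the one matcher, which only remembers its most recent reset.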

Hack or Optimization? I suppose up/down votes will decide.