Java Apache Pig - MATCHES with multiple match criteria

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/18557928/

Time: 2020-08-12 09:02:01  Source: igfitidea

Apache Pig - MATCHES with multiple match criteria

java regex hadoop apache-pig

Asked by user2495234

I am trying to take logical match criteria such as:

(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ

and apply them as a match against a file in Pig using

result = filter inputfields by (text matches '<some regex here>');

The problem is that I have no idea how to turn the logical expression above into a regular expression for the matches operator.

I have fiddled around with various things and the closest I have come to is something like this:

((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)

Any ideas? I also need to do this conversion programmatically if possible.

Some examples:

a - The quick brown Foo jumped over the lazy test (This should pass as it contains foo and test)

b - the was something going on in TestZ (This passes also as it contains testZ)

c - the quick brown Foo jumped over the lazy dog (This should fail as it contains Foo but not test, testA or TestB)

Thanks

Accepted answer by jkovacs

Since you're using Pig, you don't actually need an involved regular expression. You can combine the boolean operators that Pig supplies with a couple of simple regular expressions. For example:

T = load 'matches.txt' as (str:chararray);
F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
dump F;
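
To check the filter's logic outside Pig, the same boolean combination can be sketched in plain Java. This is only an illustration: the class name `PigFilterSketch` and the method `keep` are made up for this example, and it relies on Java's `String.matches` requiring a whole-string match (hence the leading and trailing `.*`), just as the patterns in the Pig script above are written.

```java
import java.util.List;

public class PigFilterSketch {

    // Mirrors the Pig filter: (Foo group AND test group) OR TestZ.
    // String.matches must match the entire string, so each pattern is
    // wrapped in ".*" on both sides, exactly as in the Pig script.
    static boolean keep(String str) {
        return (str.matches(".*(Foo|Foo Bar|FooBar).*")
                && str.matches(".*(test|testA|TestB).*"))
                || str.matches(".*TestZ.*");
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "The quick brown Foo jumped over the lazy test",
                "the was something going on in TestZ",
                "the quick brown Foo jumped over the lazy dog");
        for (String row : rows) {
            System.out.println(keep(row) + " : " + row);
        }
    }
}
```

Note that these patterns have no word boundaries, so a term also matches inside a longer word (for example `test` inside `testing`).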

Answer by Pshemo

You can use this regex with the matches method:

^((?=.*\bTestZ\b)|(?=.*\b(FooBar|Foo Bar|Foo)\b)(?=.*\b(testA|testB|test)\b)).*
  • note that "Foo" OR "Foo Bar" OR "FooBar" should be written as FooBar|Foo Bar|Foo, not Foo|Foo Bar|FooBar, to prevent matching only Foo in a string containing FooBar or Foo Bar
  • also, since a look-ahead is zero-width, you need to put .* at the end of the regex so that matches can match the entire string.
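
The second point can be seen in a small Java sketch. The class and constant names here are hypothetical and exist only for this illustration; the input string `"Foo test"` is likewise just a sample.

```java
public class LookaheadWidthSketch {

    // Two look-aheads and nothing else: they assert but consume no characters.
    static final String LOOKAHEADS_ONLY = "(?=.*\\bFoo\\b)(?=.*\\btest\\b)";

    // The same look-aheads followed by ".*", which consumes the whole input.
    static final String WITH_TRAILING_DOT_STAR = LOOKAHEADS_ONLY + ".*";

    public static void main(String[] args) {
        String s = "Foo test";
        // Both look-aheads succeed at position 0, but the pattern then ends
        // while input remains, so a whole-string match fails.
        System.out.println(s.matches(LOOKAHEADS_ONLY));        // false
        // With ".*" appended, the rest of the string is consumed.
        System.out.println(s.matches(WITH_TRAILING_DOT_STAR)); // true
    }
}
```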

Demo

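public class Demo {
    public static void main(String[] args) {
        String[] data = { "The quick brown Foo jumped over the lazy test",
                "the was something going on in TestZ",
                "the quick brown Foo jumped over the lazy dog" };
        // In a Java string literal the word boundary must be written \\b;
        // a bare \b would be a backspace character.
        String regex = "^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|testB|test)\\b)).*";
        for (String s : data) {
            System.out.println(s.matches(regex) + " : " + s);
        }
    }
}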

output:

true : The quick brown Foo jumped over the lazy test
true : the was something going on in TestZ
false : the quick brown Foo jumped over the lazy dog
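
The question also asks about doing the conversion programmatically. One possible approach, sketched below, is to model the expression as an OR of alternatives, where each alternative is an AND of term groups, and to emit one look-ahead per group. Everything here (the class `LookaheadRegexBuilder`, the methods `anyOf` and `build`, and the nested-list representation) is an assumption made for illustration, not something from the answers above. Terms are sorted longest first and quoted with `Pattern.quote`. Any extra escaping that a Pig string literal may require is not handled.

```java
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LookaheadRegexBuilder {

    // One look-ahead asserting that at least one term occurs as a whole word.
    static String anyOf(List<String> terms) {
        String alternation = terms.stream()
                // longest terms first, following the ordering advice above
                .sorted(Comparator.comparingInt((String t) -> t.length()).reversed())
                // treat every term as literal text, not as regex syntax
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return "(?=.*\\b(?:" + alternation + ")\\b)";
    }

    // Outer list: OR alternatives. Each alternative: AND of term groups.
    static String build(List<List<List<String>>> orOfAnds) {
        String body = orOfAnds.stream()
                .map(andGroups -> andGroups.stream()
                        .map(LookaheadRegexBuilder::anyOf)
                        .collect(Collectors.joining()))   // look-aheads in sequence = AND
                .collect(Collectors.joining("|"));        // alternation = OR
        // Trailing ".*" so that a whole-string match can succeed.
        return "^(?:" + body + ").*";
    }

    public static void main(String[] args) {
        // (("Foo" OR "Foo Bar" OR "FooBar") AND ("test" OR "testA" OR "TestB")) OR "TestZ"
        String regex = build(List.of(
                List.of(List.of("Foo", "Foo Bar", "FooBar"),
                        List.of("test", "testA", "TestB")),
                List.of(List.of("TestZ"))));
        System.out.println(regex);
        for (String s : List.of(
                "The quick brown Foo jumped over the lazy test",
                "the was something going on in TestZ",
                "the quick brown Foo jumped over the lazy dog")) {
            System.out.println(s.matches(regex) + " : " + s);
        }
    }
}
```

This representation only covers two levels of nesting (OR of ANDs of term groups); a more deeply nested expression would need a small parser or a recursive data type instead.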