Original question: http://stackoverflow.com/questions/2469231/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How can I perform a partial match with java.util.regex.*?
Asked by amit.bhayani
I have been using the java.util.regex.* classes for regular expressions in Java, and all has been good so far. But today I have a different requirement. For example, consider the pattern "aabb". If the input string is "aa", it will definitely not match; however, it is still possible that if I append "bb" it becomes "aabb" and matches. If I had started with "cc", though, it would never match no matter what I append.
I have explored the Pattern and Matcher class but didn't find any way of achieving this.
The input will come from the user, and the system has to wait until either the pattern matches or it becomes clear that it can never match, regardless of any further input.
Any clue?
Thanks.
Answered by Alan Moore
You should have looked more closely at the Matcher API; the hitEnd() method works exactly as you described:
import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String[] ss = { "aabb", "aa", "cc", "aac" };
    Pattern p = Pattern.compile("aabb");
    Matcher m = p.matcher("");
    for (String s : ss) {
      m.reset(s);
      if (m.matches()) {
        System.out.printf("%-4s : match%n", s);
      }
      else if (m.hitEnd()) {
        System.out.printf("%-4s : partial match%n", s);
      }
      else {
        System.out.printf("%-4s : no match%n", s);
      }
    }
  }
}
Output:
aabb : match
aa   : partial match
cc   : no match
aac  : no match
As far as I know, Java is the only language that exposes this functionality. There's also the requireEnd() method, which tells you if more input could turn a match into a non-match, but I don't think it's relevant in your case.
Both methods were added to support the Scanner class, so it can apply regexes to a stream without requiring the whole stream to be read into memory.
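To tie this back to the question's scenario, where input arrives from the user piece by piece, here is a minimal sketch that wraps the matches()/hitEnd() check in a helper. The class name PartialMatchDemo, the Result enum, and the classify method are illustrative names of my own, not part of java.util.regex; only Pattern, Matcher, matches() and hitEnd() come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are assumptions for illustration.
class PartialMatchDemo {
    enum Result { MATCH, PARTIAL_MATCH, NO_MATCH }

    static Result classify(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        if (m.matches()) {
            return Result.MATCH;
        }
        // hitEnd() is true when the failed attempt ran into the end of the
        // input, meaning that more input could still change the outcome.
        return m.hitEnd() ? Result.PARTIAL_MATCH : Result.NO_MATCH;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("aabb");
        StringBuilder typed = new StringBuilder();
        // Simulate the user typing one character at a time.
        for (char c : "aabb".toCharArray()) {
            typed.append(c);
            System.out.println(typed + " -> " + classify(p, typed));
        }
        System.out.println("cc -> " + classify(p, "cc"));
    }
}
```

A caller can keep reading input while the result is PARTIAL_MATCH and stop as soon as it becomes MATCH or NO_MATCH.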
Answered by Jun D. Ouyang
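// expr and string are assumed to be defined elsewhere
Pattern p = Pattern.compile(expr);
Matcher m = p.matcher(string);
m.find(); // returns true if some part of the input matches the pattern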
Answered by Kilian Foth
So you want to know not whether a String s matches the regex, but whether there might be a longer String starting with s that would match? Sorry, regexes can't help you there because you get no access to the internal state of the matcher; you only get the boolean result and any groups you have defined, so you never know why a match failed.
If you're willing to hack the JDK libraries, you can extend (or probably fork) java.util.regex and give out more information about the matching process. If the match failed because the input was 'used up', the answer would be true; if it failed because of character discrimination or other checks, it would be false. That seems like a lot of work though, because your problem is completely the opposite of what regexes are supposed to do.
Another option: maybe you can simply redefine the task so that you can treat the input as the regexp and match aabb against aa.*? You have to be careful about regex metacharacters, though.
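The "treat the input as the regexp" idea only makes sense when the target is a fixed string such as aabb rather than a general pattern. A minimal sketch under that assumption, using Pattern.quote to neutralize metacharacters in the user's input (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical helper; assumes the target is a fixed string, not a general regex.
class InputAsRegexDemo {
    static boolean couldStillMatch(String target, String inputSoFar) {
        // Pattern.quote turns the input into a literal, so characters such as
        // '.' or '*' typed by the user are not interpreted as regex syntax.
        String regex = Pattern.quote(inputSoFar) + ".*";
        return Pattern.matches(regex, target);
    }

    public static void main(String[] args) {
        System.out.println(couldStillMatch("aabb", "aa"));
        System.out.println(couldStillMatch("aabb", "cc"));
        System.out.println(couldStillMatch("a.bb", "ax"));
    }
}
```

For a literal target without line breaks this gives the same answer as target.startsWith(inputSoFar), which is a hint that the trick does not extend to real patterns.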
Answered by M. Jessup
For the example you give, you could try to use an anti-pattern to disqualify invalid results. For example, "^[^a]" would tell you that your input "c..." can't match your example pattern of "aabb".
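The anti-pattern check described above could look roughly like this. It is only a sketch for the "aabb" example; the class and method names are assumptions of mine.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the target pattern "aabb".
class AntiPatternDemo {
    // Anti-pattern: any input whose first character is not 'a'.
    private static final Pattern NOT_STARTING_WITH_A = Pattern.compile("^[^a]");

    static boolean isRuledOut(String input) {
        // find() is enough here because the anti-pattern is anchored with '^'
        // and only inspects the first character.
        return NOT_STARTING_WITH_A.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println("cc ruled out? " + isRuledOut("cc"));
        System.out.println("aa ruled out? " + isRuledOut("aa"));
        // "ac" is not caught: this anti-pattern only covers the first position.
        System.out.println("ac ruled out? " + isRuledOut("ac"));
    }
}
```

A single anti-pattern like this only covers one position, so an input such as "ac" slips through even though it can never become "aabb".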
Depending on your pattern, you may be able to break it up into smaller patterns and check them with multiple matchers, setting each matcher's bounds as one match occurs and you move on to the next. This approach may work, but if your pattern is complex and can have variable-length sub-parts, you may end up reimplementing part of the matcher in your own code to adjust the possible bounds of each match and make it more or less greedy. A rough sketch of the general idea:
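// Sketch in Java (needs java.util.regex.Pattern and java.util.regex.Matcher).
// Tries to match subpatterns[index..] back to back against input[matchStart..matchEnd).
static boolean match(String input, Pattern[] subpatterns, int index, int matchStart, int matchEnd) {
    if (index == subpatterns.length) {
        return matchStart == matchEnd; // every sub-pattern matched and the input is used up
    }
    Matcher matcher = subpatterns[index].matcher(input);
    // Start with the longest candidate, then shrink it to make this match less greedy.
    for (int stop = matchEnd; stop >= matchStart; stop--) {
        matcher.region(matchStart, stop);
        if (matcher.matches() && match(input, subpatterns, index + 1, stop, matchEnd)) {
            return true;
        }
    }
    // no match
    return false;
}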
You could then merge this idea with the anti-patterns: define an anti-subpattern for each step, and after each subpattern match, check the next anti-pattern. If it matches, you know you have failed; otherwise, continue with the matching pattern. You would likely want to return something like an enum instead of a boolean (e.g. ALL_MATCHED, PARTIAL_MATCH, ANTI_PATTERN_MATCH, ...).
Again, depending on the complexity of the actual pattern you are trying to match, writing the appropriate sub-patterns and anti-patterns may be difficult, if not impossible.
Answered by Stephen C
One way to do this is to parse your regex into a sequence of sub-regexes, and then reassemble them in a way that allows you to do partial matches; e.g. "abc" has 3 sub-regexes "a", "b" and "c", which you can then reassemble as "a(b(c)?)?".
Things get more complicated when the input regex contains alternation and groups, but the same general approach should work.
The problem with this approach is that the resulting regex is more complicated, and could potentially lead to excessive backtracking for complex input regexes.
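For the simple case of a plain sequence of sub-regexes, the reassembly step might be sketched as follows. The class and method names are hypothetical, the sub-regexes are assumed to be already split out and self-contained (a single character or an already-grouped expression), and non-capturing groups are used so the original group numbering is not disturbed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; handles only a flat sequence of sub-regexes.
class PrefixRegexDemo {
    // Nests each later sub-regex inside an optional group:
    // [a, b, c] becomes a(?:b(?:c)?)?
    static String toPrefixRegex(List<String> subRegexes) {
        String result = "";
        for (int i = subRegexes.size() - 1; i >= 0; i--) {
            String tail = result.isEmpty() ? "" : "(?:" + result + ")?";
            result = subRegexes.get(i) + tail;
        }
        return result;
    }

    public static void main(String[] args) {
        String regex = toPrefixRegex(Arrays.asList("a", "b", "c"));
        System.out.println(regex);
        for (String s : new String[] { "a", "ab", "abc", "ac" }) {
            System.out.println(s + " -> " + Pattern.matches(regex, s));
        }
    }
}
```

Alternation and quantified groups would need a real regex parser, which this sketch does not attempt.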
Answered by ddimitrov
If you make each character of the regex optional and relax the multiplicity constraints, you kinda get what you want. For example, if you have a matching pattern "aa(abc)+bbbb", you can have a 'possible match' pattern "a?a?(a?b?c?)*b?b?b?b?".
This mechanical way of producing a possible-match pattern does not cover advanced constructs like forward and backward references, though.
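A quick way to try the relaxed pattern from this answer is shown below (the class and method names are illustrative assumptions). Note that the relaxed pattern over-accepts: because every element is optional, it also matches inputs such as "cc" that can never grow into a full match, so it is only an approximation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the "make everything optional" idea.
class RelaxedPatternDemo {
    static final Pattern FULL = Pattern.compile("aa(abc)+bbbb");
    static final Pattern POSSIBLE = Pattern.compile("a?a?(a?b?c?)*b?b?b?b?");

    static boolean isFullMatch(String s) {
        return FULL.matcher(s).matches();
    }

    static boolean isPossibleMatch(String s) {
        return POSSIBLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "aaabcbbbb", "aaab", "cc", "aad" }) {
            System.out.println(s + " full=" + isFullMatch(s)
                    + " possible=" + isPossibleMatch(s));
        }
    }
}
```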
Answered by brainimus
You might be able to accomplish this with a state machine (http://en.wikipedia.org/wiki/State_machine). Have your states and transitions represent valid input, plus one error state. You can then feed the state machine one character (or possibly a substring, depending on your data) at a time. At any point you can check whether your state machine is in the error state. If it is not, you know that future input may still match. If it is, you know something previously failed and no future input will make the string valid.
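As a minimal sketch of this idea for the fixed target "aabb" from the question (the class name, the state encoding, and the method names are all assumptions for illustration; a general regex would need a properly constructed automaton):

```java
// Hypothetical hand-written state machine for the fixed target "aabb".
class AabbStateMachine {
    private static final String TARGET = "aabb";
    private static final int ERROR = -1;

    // Number of target characters matched so far, or ERROR.
    private int state = 0;

    void feed(char c) {
        if (state == ERROR) {
            return; // the error state is absorbing
        }
        if (state < TARGET.length() && TARGET.charAt(state) == c) {
            state++;
        } else {
            state = ERROR; // wrong character, or extra input after a full match
        }
    }

    boolean isError() {
        return state == ERROR;
    }

    boolean isAccepting() {
        return state == TARGET.length();
    }

    public static void main(String[] args) {
        AabbStateMachine sm = new AabbStateMachine();
        for (char c : "aab".toCharArray()) {
            sm.feed(c);
        }
        System.out.println("error=" + sm.isError() + " accepting=" + sm.isAccepting());
    }
}
```

While isError() is false the caller keeps waiting for input; once it turns true, no further input can help.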

