How do I tokenize input using Java's Scanner class and regular expressions?
Disclaimer: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/244115/
Asked by eplawless
Just for my own purposes, I'm trying to build a tokenizer in Java where I can define a regular grammar and have it tokenize input based on that. The StringTokenizer class is deprecated, and I've found a couple functions in Scanner that hint towards what I want to do, but no luck yet. Anyone know a good way of going about this?
Answered by Alan Moore
The name "Scanner" is a bit misleading, because the word is often used to mean a lexical analyzer, and that's not what Scanner is for. All it is is a substitute for the scanf()function you find in C, Perl, et al. Like StringTokenizer and split(), it's designed to scan ahead until it finds a match for a given pattern, and whatever it skipped over on the way is returned as a token.
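A minimal sketch of that delimiter-driven "scan ahead" behaviour (not from the original answer; the input string and class name are made up for illustration):
import java.util.Scanner;

// Scanner skips ahead to each match of the delimiter pattern and returns
// whatever it passed over as the next token.
public class ScannerDelimiterDemo
{
    public static void main(String[] args)
    {
        Scanner s = new Scanner("alpha, beta ,gamma");
        s.useDelimiter("\\s*,\\s*"); // a comma with optional surrounding whitespace
        while (s.hasNext())
        {
            System.out.println(s.next()); // prints "alpha", "beta", "gamma"
        }
        s.close();
    }
}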
A lexical analyzer, on the other hand, has to examine and classify every character, even if it's only to decide whether it can safely ignore them. That means, after each match, it may apply several patterns until it finds one that matches starting at that point. Otherwise, it may find the sequence "//" and think it's found the beginning of a comment, when it's really inside a string literal and it just failed to notice the opening quotation mark.
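As a tiny illustration of that pitfall (not in the original answer; the input line is made up):
// A naive search for "//" finds the slashes inside the quoted URL, not the real comment.
public class NaiveCommentScan
{
    public static void main(String[] args)
    {
        String line = "url = \"http://example.com\"; // trailing comment";
        System.out.println("first \"//\" found at index " + line.indexOf("//"));
        // A rule-based lexer avoids this by matching a QUOTED rule at the opening
        // quotation mark and consuming the whole literal before looking for comments.
    }
}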
It's actually much more complicated than that, of course, but I'm just illustrating why the built-in tools like StringTokenizer, split(), and Scanner aren't suitable for this kind of task. It is, however, possible to use Java's regex classes for a limited form of lexical analysis. In fact, the addition of the Scanner class made it much easier, because of the new Matcher API that was added to support it, i.e., regions and the usePattern() method. Here's an example of a rudimentary scanner built on top of Java's regex classes.
import java.util.*;
import java.util.regex.*;

public class RETokenizer
{
    static List<Token> tokenize(String source, List<Rule> rules)
    {
        List<Token> tokens = new ArrayList<Token>();
        int pos = 0;
        final int end = source.length();
        // The "dummy" pattern is just a placeholder; usePattern() swaps in the
        // real pattern for each rule inside the loop.
        Matcher m = Pattern.compile("dummy").matcher(source);
        m.useTransparentBounds(true).useAnchoringBounds(false);
        while (pos < end)
        {
            m.region(pos, end);
            boolean matched = false;
            for (Rule r : rules)
            {
                // lookingAt() anchors the match attempt at the start of the region (pos).
                if (m.usePattern(r.pattern).lookingAt())
                {
                    tokens.add(new Token(r.name, m.start(), m.end()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                pos++; // bump along only when no rule matched, so no character after a token is skipped
            }
        }
        return tokens;
    }

    static class Rule
    {
        final String name;
        final Pattern pattern;

        Rule(String name, String regex)
        {
            this.name = name;
            pattern = Pattern.compile(regex);
        }
    }

    static class Token
    {
        final String name;
        final int startPos;
        final int endPos;

        Token(String name, int startPos, int endPos)
        {
            this.name = name;
            this.startPos = startPos;
            this.endPos = endPos;
        }

        @Override
        public String toString()
        {
            return String.format("Token [%2d, %2d, %s]", startPos, endPos, name);
        }
    }

    public static void main(String[] args) throws Exception
    {
        List<Rule> rules = new ArrayList<Rule>();
        rules.add(new Rule("WORD", "[A-Za-z]+"));
        rules.add(new Rule("QUOTED", "\"[^\"]*+\""));
        rules.add(new Rule("COMMENT", "//.*"));
        rules.add(new Rule("WHITESPACE", "\\s+"));

        String str = "foo //in \"comment\"\nbar \"no //comment\" end";
        List<Token> result = RETokenizer.tokenize(str, rules);
        for (Token t : result)
        {
            System.out.println(t);
        }
    }
}
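For reference, with the rules above (and the position only bumped when no rule matches), the demo string should come out as WORD, WHITESPACE, COMMENT, WHITESPACE, WORD, WHITESPACE, QUOTED, WHITESPACE, WORD; a caller would typically filter out the WHITESPACE tokens. Note also that rule order matters: a rule listed earlier wins whenever two rules could match at the same position.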
This, by the way, is the only good use I've ever found for the lookingAt() method. :D
Answered by Balint Pato
If I understand your question correctly, here are two example methods for tokenizing a string. You don't even need the Scanner class unless you want to pre-cast the tokens or iterate through them in a more sophisticated way than an array allows. If an array is enough, just use String.split() as shown below.
Please share more details about your requirements to enable more precise answers.
import java.util.Scanner;

public class Main {

    public static void main(String[] args) {
        String textToTokenize = "This is a text that will be tokenized. I will use 1-2 methods.";

        // Method 1: Scanner with a custom delimiter. The delimiter is a regex:
        // "i." means the letter 'i' followed by any single character.
        Scanner scanner = new Scanner(textToTokenize);
        scanner.useDelimiter("i.");
        while (scanner.hasNext()) {
            System.out.println(scanner.next());
        }
        scanner.close();

        System.out.println(" **************** ");

        // Method 2: String.split() with the same regex; returns all tokens at once.
        String[] sSplit = textToTokenize.split("i.");
        for (String token : sSplit) {
            System.out.println(token);
        }
    }
}
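For comparison, here is a small companion sketch (not part of the original answer; the class name is made up) that splits the same text on runs of whitespace, which is the more common use of String.split():
import java.util.Arrays;

// Hypothetical companion example: split on whitespace instead of the regex "i.".
public class WhitespaceSplitDemo {
    public static void main(String[] args) {
        String text = "This is a text that will be tokenized. I will use 1-2 methods.";
        String[] tokens = text.split("\\s+"); // "\\s+" = one or more whitespace characters
        System.out.println(Arrays.toString(tokens));
    }
}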
Answered by ra9r
Most of the answers here are already excellent, but I would be remiss if I didn't point out ANTLR. I've created entire compilers around this excellent tool. Version 3 has some amazing features, and I'd recommend it for any project that requires you to parse input based on a well-defined grammar.
Answered by Michael Myers
If this is for a simple project (for learning how things work), then go with what Balint Pato said.
If this is for a larger project, consider using a scanner generator like JFlex instead. Somewhat more complicated, but faster and more powerful.

