Java: Split a string into sentences
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/2687012/
Asked by leba-lev
I have written this piece of code, which splits a string and stores the result in a string array:
String[] sSentence = sResult.split("[a-z]\\.\\s+");
I added the [a-z] because I wanted to deal with part of the abbreviation problem. But then my result shows up like this:
Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv
I see that I lose the pattern specified in the split function. It's okay for me to lose the period, but losing the last letter of the word disturbs its meaning.
Could someone help me with this, and in addition, could someone help me with dealing with abbreviations? For example, because I split the string based on periods, I do not want to lose the abbreviations.
Accepted answer by Julien Silland
Parsing sentences is far from a trivial task, even for languages written in the Latin script, like English. A naive approach like the one you outline in your question will fail often enough to be useless in practice.
A better approach is to use a BreakIterator configured with the right Locale.
import java.text.BreakIterator;
import java.util.Locale;

// Sentence-aware iterator for US English text
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
// Each next() call returns the end offset of the next sentence
for (int end = iterator.next();
     end != BreakIterator.DONE;
     start = end, end = iterator.next()) {
    System.out.println(source.substring(start, end));
}
Yields the following result:
- This is a test.
- This is a T.L.A. test.
- Now with a Dr. in it.
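The loop above can be wrapped in a small reusable helper. The following is only a minimal sketch: the class name `SentenceSplitter` and the method `split` are hypothetical names chosen for illustration, and trimming each sentence is an added assumption (BreakIterator itself leaves the trailing whitespace attached to each sentence).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original answer.
class SentenceSplitter {

    // Returns the sentences of `text` in order, with surrounding whitespace trimmed.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Assumption: callers want trimmed sentences and no empty entries.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. How are you? Fine!", Locale.US)) {
            System.out.println(s);
        }
    }
}
```

Unlike the regular expression in the question, this also treats question marks and exclamation marks as sentence terminators.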
Answer by Mark Byers
It will be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:
String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");
Result:
This is a test
This is a T.L.A. test.
Note that there are abbreviations that do not end with capital letters, such as abbrev., Mr., etc. And there are also sentences that don't end in periods!
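To make the effect of the lookbehind concrete, here is a small sketch that compares two patterns. The class and method names are hypothetical, and the second pattern is a variant that does not appear in the answer: it moves the period inside the lookbehind, so only the whitespace is consumed and the period stays attached to the sentence.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class LookbehindSplit {

    // Pattern from the answer: the lookbehind keeps the last letter,
    // but the period and the following whitespace are consumed by the split.
    static String[] dropPeriod(String text) {
        return text.split("(?<=[a-z])\\.\\s+");
    }

    // Variant (an assumption): the period is part of the lookbehind,
    // so only the whitespace is consumed and the period is kept.
    static String[] keepPeriod(String text) {
        return text.split("(?<=[a-z]\\.)\\s+");
    }

    public static void main(String[] args) {
        String s = "This is a test. This is a T.L.A. test.";
        System.out.println(Arrays.toString(dropPeriod(s)));
        System.out.println(Arrays.toString(keepPeriod(s)));
    }
}
```

Both patterns still split after lowercase abbreviations such as "Mr.", so they share the limitation described above.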
Answer by João Silva
If you can, use a natural language processing tool, such as LingPipe. There are many subtleties that are very hard to catch with regular expressions, e.g., emoticons inside parentheses (e.g. :-)), Mr., abbreviations, ellipses (...), et cetera.
There is a very easy-to-follow tutorial on Sentence Detection on the LingPipe website.