Java: How to split a paragraph into sentences?
Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21430447/
How to split paragraphs into sentences?
Asked by Lemon Juice
Please have a look at the following.
String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\.");
This is how I tried to split a paragraph into sentences. But there is a problem: my paragraph includes dates like Jan. 13, 2014, words like U.S and numbers like 2.2. They all got split by the above code. So basically, this code splits on many dots, whether or not they are full stops.
I also tried
String[] sentenceHolder = titleAndBodyContainer.split(".\n");
and
String[] sentenceHolder = titleAndBodyContainer.split("\\.");
Both failed.
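To make the failure concrete, here is a minimal sketch that runs two of the patterns from the question on a short sample. The class name, method names and sample text are assumptions made up for illustration; only the two regular expressions come from the question.

```java
import java.util.Arrays;

public class NaiveSplitDemo {
    // Pattern from the question: a newline, or any dot that is not
    // both preceded and followed by a digit.
    static String[] splitOriginal(String text) {
        return text.split("\n|\\.(?!\\d)|(?<!\\d)\\.");
    }

    // Second attempt from the question: split on every literal dot.
    static String[] splitEveryDot(String text) {
        return text.split("\\.");
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the question.
        String sample = "It was Jan. 13, 2014. The U.S grew 2.2 percent.";
        System.out.println(Arrays.toString(splitOriginal(sample)));
        System.out.println(Arrays.toString(splitEveryDot(sample)));
    }
}
```

On this sample the first pattern leaves the decimal point in 2.2 alone but still cuts after "Jan" and inside "U.S", while the second pattern cuts at every dot, including the decimal point.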
How can I split a paragraph into sentences "properly"?
Answered by Sathesh
String[] sentenceHolder = titleAndBodyContainer.split("(?i)(?<=[.?!])\\S+(?=[a-z])");
Try this; it worked for me.
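One caveat worth checking: the pattern uses \S+ (a run of non-whitespace), so when a space follows each full stop it may find nothing to split on. The sketch below compares it with a variant that uses \s+ (a run of whitespace) instead. That variant is an assumption about what was intended, not something the answer states, and the class, constant and method names are made up for illustration.

```java
import java.util.Arrays;

public class WhitespaceSplitDemo {
    // Pattern as it appears in the answer: consumes non-whitespace (\S+).
    static final String AS_GIVEN = "(?i)(?<=[.?!])\\S+(?=[a-z])";

    // Assumed variant (not from the answer): consumes the whitespace that
    // follows . ? or ! when the next character is a letter.
    static final String ASSUMED_VARIANT = "(?i)(?<=[.?!])\\s+(?=[a-z])";

    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        String sample = "It was Jan. 13, 2014. The U.S grew 2.2 percent. Really? Yes!";
        System.out.println(Arrays.toString(split(sample, AS_GIVEN)));
        System.out.println(Arrays.toString(split(sample, ASSUMED_VARIANT)));
    }
}
```

The assumed variant keeps "Jan. 13", "U.S" and "2.2" intact because it only splits where whitespace after a terminator is followed by a letter. It would still split after an abbreviation that is followed by a word (for example "Jan. was cold"), and it would not split before a sentence that starts with a digit.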
Answered by Ruchira Gayan Ranaweera
You can try this
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";
// Match whole sentences: a ., ! or ? ends a sentence only when it is followed
// by an optional quote and then whitespace or the end of the input.
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(str);
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}
Output:
This is how I tried to split a paragraph into a sentence.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2.
They all got split by the above code.
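For reuse, the same pattern can be wrapped in a small helper that returns a list. This is only a sketch: the class and method names are hypothetical, and the regular expression is the one from this answer. Note that the sample above writes "Jan.13" without a space; with "Jan. 13" the dot is followed by whitespace, so this pattern treats it as the end of a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceMatcherDemo {
    // Same pattern and flags as in the answer above.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE | Pattern.COMMENTS);

    // Collects every match of the pattern, in order of appearance.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical inputs showing the effect of a space after "Jan."
        System.out.println(sentences("Jan.13, 2014 is fine. OK."));
        System.out.println(sentences("Dates like Jan. 13, 2014 are common. OK."));
    }
}
```

Handling abbreviations that are followed by a space would need extra knowledge, such as a list of known abbreviations, which this pattern does not have.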
Answered by Manas Kandekar
This will split the paragraph on any of the characters . ? !:
String[] a = str.split("\\.|\\?|\\!");
You can put any symbol you want to split on after \\ and use | to separate the alternatives.
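A minimal sketch of this approach, using the character class [.?!] (which matches the same three characters as the alternation above) and trimming the pieces. The class and method names and the sample text are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSplitDemo {
    // Splits on every '.', '?' or '!', trims each piece and drops empty ones.
    static List<String> splitOnPunctuation(String text) {
        List<String> parts = new ArrayList<>();
        for (String piece : text.split("[.?!]")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        System.out.println(splitOnPunctuation("Is it 2.2? Yes. Done!"));
    }
}
```

Because every dot is a split point, this also cuts decimals and abbreviations such as 2.2 and U.S, which is the behaviour the question was trying to avoid. The terminators themselves are removed from the pieces.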