Java: How to split paragraphs into sentences?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/21430447/

Time: 2020-08-13 08:41:44  Source: igfitidea

How to split paragraphs into sentences?

Tags: java, regex, string, split, text-segmentation

Asked by Lemon Juice

Please have a look at the following.

String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\.");

This is how I tried to split a paragraph into sentences. But there is a problem: my paragraph includes dates like Jan. 13, 2014, words like U.S and numbers like 2.2, and they all got split by the above code. So basically, this code splits on many dots whether or not they are full stops.

I also tried String[] sentenceHolder = titleAndBodyContainer.split(".\n"); and String[] sentenceHolder = titleAndBodyContainer.split("\\."); All failed.
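
To make the failure mode concrete, here is a minimal sketch that runs the first pattern from the question next to the plain `\\.` split on a made-up sentence. The class name `OverSplitDemo`, the method names, and the sample text are assumptions for illustration, not part of the question. In this sketch the lookarounds protect only a dot that has a digit on both sides (as in 2.2); the dots in "Jan." and "U.S" are still treated as sentence ends.

```java
import java.util.Arrays;

// Hypothetical demo class; the names and the sample text are illustrative only.
final class OverSplitDemo {

    // First pattern from the question, with backslashes escaped for a Java string literal.
    static String[] splitWithLookarounds(String text) {
        return text.split("\n|\\.(?!\\d)|(?<!\\d)\\.");
    }

    // Plain split on every dot, as in the second attempt.
    static String[] splitOnEveryDot(String text) {
        return text.split("\\.");
    }

    public static void main(String[] args) {
        String sample = "On Jan. 13 the U.S rate was 2.2 percent.";
        // The abbreviation dots are split in both cases; only 2.2 differs.
        System.out.println(Arrays.toString(splitWithLookarounds(sample)));
        System.out.println(Arrays.toString(splitOnEveryDot(sample)));
    }
}
```

Both variants cut the text at "Jan." and "U.S", which suggests that lookarounds on digits alone cannot tell an abbreviation from the end of a sentence.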

How can I split a paragraph into sentences "properly"?

Answered by Sathesh

String[] sentenceHolder = titleAndBodyContainer.split("(?i)(?<=[.?!])\\S+(?=[a-z])");

Try this; it worked for me.

Answered by Ruchira Gayan Ranaweera

You can try this:

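import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";

// A sentence ends at a terminator (. ! ?) that is followed by whitespace or the end of input;
// a dot followed directly by another character (Jan.13, U.S, 2.2) does not end it.
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(str);
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}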

Output:

This is how I tried to split a paragraph into a sentence.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2.
They all got split by the above code.
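
If the sentences are needed as a collection rather than printed, the same matching loop can be wrapped in a helper. This is only a sketch under assumptions: the class `SentenceMatcherSketch` and the method `splitSentences` are hypothetical names, and the pattern is the one from this answer with its backslashes escaped for a Java string literal. It inherits the same limit: a dot followed by whitespace, as in "Jan. 13", still ends a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer above.
final class SentenceMatcherSketch {

    // A terminator counts as a sentence end only when whitespace or end of input follows.
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
            Pattern.MULTILINE);

    // Collects every match of the pattern, in order of appearance.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "It costs 2.2 dollars in the U.S today. Is that fine? Yes!";
        for (String sentence : splitSentences(sample)) {
            System.out.println(sentence);
        }
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which may matter when many paragraphs are processed.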

Answered by Manas Kandekar

This will split the paragraph on ., ? and !:

String[] a = str.split("\\.|\\?|\\!");

You can put any symbol you want to split on after \\ and use | to separate the alternatives.
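
The same three characters can also be written as the character class `[.?!]`. As a hedged sketch (the class and method names below are assumptions for illustration), the first helper drops the punctuation just as a split on the alternation would, while the second uses a lookbehind so that each piece keeps its terminator and the whitespace between sentences is consumed instead. Neither variant knows about abbreviations or decimal numbers.

```java
import java.util.Arrays;

// Hypothetical helpers illustrating two ways to split on sentence terminators.
final class TerminatorSplitSketch {

    // Splits on '.', '?' or '!' and discards the terminator itself.
    static String[] splitDroppingTerminators(String text) {
        return text.split("[.?!]");
    }

    // Splits on whitespace that follows a terminator, so the terminator stays attached.
    static String[] splitKeepingTerminators(String text) {
        return text.split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        String sample = "Hi there. How are you? Fine!";
        System.out.println(Arrays.toString(splitDroppingTerminators(sample)));
        System.out.println(Arrays.toString(splitKeepingTerminators(sample)));
    }
}
```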
