Java: how to parse text into sentences
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/4373612/
How to parse text into sentences
Asked by user533203
I'm trying to break up a paragraph into sentences. Here is my code so far:
import java.util.*;

public class StringSplit {
    public static void main(String[] args) throws Exception {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        // Split on every '.', '!' or '?' (inside a character class these need no escaping)
        String[] sentences = testString.split("[.!?]");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i);
            System.out.println(sentences[i]);
        }
    }
}
Two problems were found:
- The code splits anytime it comes to a period (".") symbol, even when it's actually one sentence. How do I prevent this?
- Each sentence that is split starts with a space. How do I delete the redundant space?
Answered by Favonius
The problem you mentioned is an NLP (Natural Language Processing) problem. It is fine to write a crude rule engine, but it might not scale up to support full English text.
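As a rough illustration of what such a crude rule engine could look like (a sketch only, not Favonius's code): the class name `CrudeSentenceSplitter`, the tiny abbreviation list, and the "single capital letter plus period is an initial" rule are all assumptions made up for this example. It handles the sample paragraph from the question, but it will misjudge plenty of real text, which is exactly the scaling problem described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrudeSentenceSplitter {
    // Hypothetical, deliberately tiny abbreviation list; real text needs far more entries.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList(
            "Mr.", "Mrs.", "Dr.", "Jan.", "Feb.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."));

    // Walks the whitespace-separated tokens and closes a sentence at each token
    // that looks like a real sentence end.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // only happens for empty input
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            if (endsSentence(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without end punctuation
        }
        return sentences;
    }

    private static boolean endsSentence(String token) {
        char last = token.charAt(token.length() - 1);
        if (last == '!' || last == '?') {
            return true;
        }
        if (last != '.') {
            return false;
        }
        if (ABBREVIATIONS.contains(token)) {
            return false; // e.g. "Dec."
        }
        // Assumed rule: a single capital letter plus a period is an initial, e.g. "W."
        return !(token.length() == 2 && Character.isUpperCase(token.charAt(0)));
    }

    public static void main(String[] args) {
        String text = "President George W. Bush left on Dec. 31. Rates rise on Jan. 1. That could hurt sales!";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Because tokens are re-joined with single spaces, the sentences come out without leading whitespace as a side effect.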
For deeper insight and a Java library, check out http://nlp.stanford.edu/software/lex-parser.shtml and http://nlp.stanford.edu:8080/parser/index.jsp, as well as the similar question for the Ruby language: How do you parse a paragraph of text into sentences? (perferrably in Ruby)
For example, the text:
The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.
after tagging becomes:
The/DT outcome/NN of/IN the/DT negotiations/NNS is/VBZ vital/JJ ,/, because/IN the/DT current/JJ tax/NN levels/NNS signed/VBN into/IN law/NN by/IN President/NNP George/NNP W./NNP Bush/NNP expire/VBP on/RP Dec./NNP 31/CD ./. Unless/IN Congress/NNP acts/VBZ ,/, tax/NN rates/NNS on/IN virtually/RB all/RB Americans/NNPS who/WP pay/VBP income/NN taxes/NNS will/MD rise/VB on/IN Jan./NNP 1/CD ./. That/DT could/MD affect/VB economic/JJ growth/NN and/CC even/RB holiday/NN sales/NNS ./.
Note how the tagger distinguishes the sentence-ending full stop (tagged ./.) from the period in the abbreviation Dec., which stays attached to its token (Dec./NNP).
Answered by darioo
The first one is a pretty hard problem to do properly, since you'd have to implement sentence detection. I suggest you don't do that, and just separate sentences with two blank lines after a punctuation mark. For example:
"The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales."
The second one can be solved using String.trim().
Example:
String one = " and now... ";
String two = one.trim();
System.out.println(two); // output: "and now..."
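To apply the same idea to the split from the question, one option is to split on the whitespace that follows the punctuation, so that no piece starts with a space in the first place. The following is a minimal sketch; the class name `TrimmedSplit` and the lookbehind regex are choices made for this illustration, not part of the answer. It only addresses the leading-space problem and still splits after abbreviations such as "Dec.".

```java
import java.util.ArrayList;
import java.util.List;

class TrimmedSplit {
    // Splits after '.', '!' or '?' when followed by whitespace.
    // The lookbehind keeps the punctuation with its sentence, and the
    // whitespace itself is consumed, so no piece starts with a space.
    static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.!?])\\s+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : split("  Hello there.   How are you? Fine! ")) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

If you keep the original `split("[.!?]")` call instead, calling `trim()` on each array element has the same effect on the spaces, although the punctuation is lost.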
Answered by Jay Weinberg
You can try using the java.text.BreakIterator class to parse sentences. For example:
import java.text.BreakIterator;
import java.util.Locale;

// text is the String to split into sentences
BreakIterator border = BreakIterator.getSentenceInstance(Locale.US);
border.setText(text);
int start = border.first();
// Iterate, printing the substring between each pair of consecutive boundaries
for (int end = border.next(); end != BreakIterator.DONE; start = end, end = border.next()) {
    System.out.println(text.substring(start, end));
}
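For completeness, here is the same loop wrapped into a self-contained class that can be compiled and run as-is. The class name `BreakIteratorSentences` and the `split` helper are invented for this sketch. Each piece is trimmed because the boundaries reported by `BreakIterator` include the whitespace that follows a sentence. How abbreviations such as "Dec." or initials such as "W." are handled depends on the JDK's built-in rules, so it is worth checking the output on your own text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {
    // Collects the text between consecutive sentence boundaries.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator border = BreakIterator.getSentenceInstance(Locale.US);
        border.setText(text);
        int start = border.first();
        for (int end = border.next(); end != BreakIterator.DONE; start = end, end = border.next()) {
            // Boundaries include trailing whitespace, so trim each piece
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("Hello world. How are you? Fine!")) {
            System.out.println(sentence);
        }
    }
}
```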
Answered by Pooja N Babu
Answered by Vijay Mathew
Given the current input format, it will be difficult to split the text into sentences. You have to impose an additional rule, beyond the period itself, to identify the end of a sentence. For instance, the rule could be "a sentence should end with a period (.) followed by two spaces". (This is how the UNIX tool grep identifies sentences.)
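If the input can be made to follow that convention, the rule is a one-line regular expression. This is a minimal sketch under that assumption; the class name `TwoSpaceSplit` is invented here, and accepting "!" and "?" in addition to the period is an extension of the rule as stated, borrowed from the question's own code.

```java
class TwoSpaceSplit {
    // A sentence ends at '.', '!' or '?' followed by two or more spaces.
    // A period followed by a single space (as in "Dec. 31") is left alone.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?]) {2,}");
    }

    public static void main(String[] args) {
        String text = "It expires on Dec. 31.  Rates rise on Jan. 1.  That could hurt sales.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

The trade-off is that the rule depends entirely on how the text was typed: the paragraph in the question uses single spaces throughout, so it would come back as one sentence.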
Answered by Jimit Tank
First trim() your String, then see these links:
http://www.java-examples.com/java-string-split-example and http://www.rgagnon.com/javadetails/java-0438.html
You can also use the StringBuffer class; just use this link. I hope it helps.