Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2103598/

Time: 2020-10-29 19:28:21  Source: igfitidea

Java simple sentence parser

java parsing nlp

Asked by mika

Is there a simple way to create a sentence parser in plain Java, without adding any libs or jars?

The parser should not only handle the blanks between words; it should be smarter and also parse . ! ? so that it recognizes where a sentence ends, etc.

After parsing, only real words should be stored in a db or file, not any special chars.

Thank you all very much in advance :)

Answered by

You might want to start by looking at the BreakIterator class.

From the JavaDoc:

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.

You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.

Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.

BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.

See demo: BreakIteratorDemo.java
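
As a minimal sketch of the word boundary analysis described in the JavaDoc above, the following uses BreakIterator.getWordInstance to keep only "real" words, which is what the question asks to store. The class name WordExtractor and the method words are hypothetical names chosen for this illustration, and the letter-or-digit filter is an assumption about what counts as a real word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original answer.
public class WordExtractor {

    // Returns only the tokens that contain at least one letter or digit.
    public static List<String> words(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Spaces and punctuation come back as their own tokens; skip them.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Java! simple ?sentence parser.", Locale.ENGLISH));
    }
}
```

For the sample string this should yield the four words Java, simple, sentence and parser, with the punctuation and spaces dropped. A sentence iterator could be run first if the words also need to be grouped by sentence.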

Answered by indusBull

Based on @Jarrod Roberson's answer, I have created a util method that uses BreakIterator and returns the list of sentences.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Splits text into sentences using the sentence rules of the given locale.
public static List<String> tokenize(String text, String language, String country) {
    List<String> sentences = new ArrayList<>();
    Locale currentLocale = new Locale(language, country);
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
    sentenceIterator.setText(text);
    int start = sentenceIterator.first();
    // Each call to next() returns the end of the current sentence, or DONE.
    for (int end = sentenceIterator.next(); end != BreakIterator.DONE; end = sentenceIterator.next()) {
        sentences.add(text.substring(start, end));
        start = end;
    }
    return sentences;
}

Answered by Crozin

Just use a regular expression (\s+ matches one or more whitespace characters: spaces, tabs, etc.) to split the String into an array.

Then you can iterate over that array and check whether a word ends with ., ? or ! (using String.endsWith()) to find the ends of sentences.

Before saving any word, use a regular expression once again to remove every non-alphanumeric character.
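
The three steps in this answer could be sketched roughly as follows. The class name RegexSentenceSplitter and the method parse are hypothetical, and the character class [^\p{L}\p{N}] is one assumed reading of "non-alphanumeric" that also keeps non-ASCII letters. Note that this simple approach treats every trailing period as a sentence end, so abbreviations such as "e.g." would split a sentence early.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the split / endsWith / clean-up steps.
public class RegexSentenceSplitter {

    // Returns one list of cleaned words per detected sentence.
    public static List<List<String>> parse(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        // Step 1: split on runs of whitespace.
        for (String raw : text.trim().split("\\s+")) {
            // Step 3: strip everything that is not a letter or digit.
            String word = raw.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!word.isEmpty()) {
                current.add(word);
            }
            // Step 2: a trailing '.', '?' or '!' closes the current sentence.
            boolean endsSentence = raw.endsWith(".") || raw.endsWith("?") || raw.endsWith("!");
            if (endsSentence && !current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing text without closing punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(parse("Hello there! How are you? Fine."));
    }
}
```

Each inner list can then be written to a db or file as-is, since the punctuation has already been removed.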

Answered by stacker

Of course, use StringTokenizer

import java.util.StringTokenizer;

public class Token {
    public static void main(String[] args) {

        String sentence = "Java! simple ?sentence parser.";
        String separator = "!?.";

        // Passing true makes the tokenizer return the separators as tokens too.
        StringTokenizer st = new StringTokenizer(sentence, separator, true);

        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // A one-character token found in the separator set is punctuation.
            if (token.length() == 1 && separator.indexOf(token.charAt(0)) >= 0) {
                System.out.println("special char:" + token);
            } else {
                // Everything else is the text between separators, spaces included.
                System.out.println("word :" + token);
            }
        }
    }
}

Answered by Holograham

StringTokenizer

Scanner

Example:

StringTokenizer tokenizer = new StringTokenizer(input, " !?.");
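
The answer mentions Scanner without showing it. A possible equivalent, assuming the same delimiter set as the StringTokenizer line above (whitespace plus ! ? .), might look like this. The class name ScannerWords and the method words are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper showing the Scanner alternative.
public class ScannerWords {

    public static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        // Treat any run of whitespace or sentence punctuation as one delimiter.
        try (Scanner scanner = new Scanner(input).useDelimiter("[\\s!?.]+")) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Java! simple ?sentence parser."));
    }
}
```

Unlike the StringTokenizer variant that returns delimiters, this discards the punctuation entirely, so it yields the words but cannot tell where a sentence ended.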