Java: simple sentence parser

Question

Asked by mika

Is there a simple way to write a sentence parser in plain Java, without adding any libraries or JARs?

The parser should not only handle the blanks between words; it should be smarter and also parse . ! ?, recognize where a sentence ends, and so on.

After parsing, only the real words should be stored in a database or file, not any special characters.

Thank you all very much in advance :)

Answer 1

Answered by

You might want to start by looking at the BreakIterator class.

From the Javadoc:

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.
You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.
Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.
Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.
Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.
BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.

See demo: BreakIteratorDemo.java
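
Since the question asks for only the real words to be stored, here is a minimal sketch of word boundary analysis with BreakIterator. The class name WordExtractor and the method words are hypothetical names chosen for this illustration, and treating "contains at least one letter or digit" as the test for a real word is an assumption, not something the Javadoc prescribes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordExtractor {

    // Hypothetical helper: returns the tokens that contain at least one letter or digit.
    static List<String> words(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        // next() returns the following boundary, or DONE at the end of the text.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Spaces and punctuation come back as separate tokens; skip them.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! Is this a sentence?", Locale.ENGLISH));
    }
}
```

The word iterator reports every boundary, including those around spaces and punctuation marks, so the filter is what keeps special characters out of the result.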

Answer 2

Answered by indusBull

Based on @Jarrod Roberson's answer, I created a utility method that uses BreakIterator and returns a list of sentences.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Splits text into sentences using the sentence rules of the given locale.
// Each sentence keeps its trailing whitespace; call trim() if that is unwanted.
public static List<String> tokenize(String text, String language, String country) {
    List<String> sentences = new ArrayList<>();
    Locale currentLocale = new Locale(language, country);
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
    sentenceIterator.setText(text);
    int start = sentenceIterator.first();
    // next() returns the following boundary, or DONE at the end of the text.
    for (int end = sentenceIterator.next(); end != BreakIterator.DONE;
            start = end, end = sentenceIterator.next()) {
        sentences.add(text.substring(start, end));
    }
    return sentences;
}

Answer 3

Answered by Crozin

Just use a regular expression (\s+ matches one or more whitespace characters: spaces, tabs, etc.) to split the String into an array.

Then you can iterate over that array and check whether a word ends with ., ? or ! (String.endsWith()) to find the end of each sentence.

Before saving any word, use a regular expression once more to remove every non-alphanumeric character.
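
The three steps above could be sketched roughly as follows. The class name RegexSentenceSplitter and the method parse are hypothetical, and the character class [^\p{L}\p{N}] is one possible reading of "non-alphanumeric" (it keeps Unicode letters and digits); the answer itself does not specify a pattern.

```java
import java.util.ArrayList;
import java.util.List;

public class RegexSentenceSplitter {

    // Hypothetical helper: returns one list of cleaned words per sentence.
    static List<List<String>> parse(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        // Step 1: split on runs of whitespace.
        for (String token : text.trim().split("\\s+")) {
            // Step 2: a token ending in . ? or ! closes the current sentence.
            boolean endsSentence =
                    token.endsWith(".") || token.endsWith("?") || token.endsWith("!");
            // Step 3: strip everything that is not a letter or digit.
            String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!word.isEmpty()) {
                current.add(word);
            }
            if (endsSentence && !current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing text without end punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints [[Java], [A, simple, sentence, parser], [Does, it, work]]
        System.out.println(parse("Java! A simple sentence parser. Does it work?"));
    }
}
```

This approach is deliberately naive: an abbreviation such as "e.g." is treated as the end of a sentence, and "don't" is stored as "dont".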

Answer 4

Answered by stacker

Of course, use StringTokenizer:

import java.util.StringTokenizer;

public class Token {
    public static void main(String[] args) {
        String sentence = "Java! simple ?sentence parser.";
        String separator = "!?.";

        // The third argument (true) makes the tokenizer return the delimiters as tokens too.
        // Spaces are not delimiters here, so words keep their surrounding whitespace.
        StringTokenizer st = new StringTokenizer(sentence, separator, true);

        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // A one-character token that appears in the separator string is a delimiter.
            if (token.length() == 1 && separator.indexOf(token.charAt(0)) >= 0) {
                System.out.println("special char:" + token);
            } else {
                System.out.println("word :" + token);
            }
        }
    }
}

Answer 5

Answered by Holograham

For example:

StringTokenizer tokenizer = new StringTokenizer(input, " !?.");
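
A runnable version of that one-liner might look like the sketch below. The class name TokenizerWords and the method words are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerWords {

    // Hypothetical helper: collects every token between the delimiters.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        // The space and the three sentence-ending marks are delimiters and are dropped.
        StringTokenizer tokenizer = new StringTokenizer(input, " !?.");
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Java, simple, sentence, parser]
        System.out.println(words("Java! simple ?sentence parser."));
    }
}
```

Because the delimiters are discarded, this yields the words but loses the information about where each sentence ended. Characters not listed as delimiters, such as commas or tabs, stay attached to the tokens.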
