What character can be used to parse for paragraphs with Java?
Note: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): Stack Overflow
Original address: http://stackoverflow.com/questions/2188265/
Asked by canadiancreed
I'm sure folks will get a good laugh out of this one, but for the life of me I cannot find a separator that indicates when a new paragraph has begun in a string of text. Word and line? Easy peasy, but paragraph seems to be much harder to find. I've tried two line breaks in a row, and the Unicode representations of paragraph break and line break, with no luck.
EDIT: I apologize for the vagueness of my original question. To answer some of the questions, it is a basic text file originally created on Windows. I'm testing some code for opening and analyzing its contents with the Blackberry JDE 4.5 using the RIM Eclipse plugin. While the source of the files will be Windows (at least for the foreseeable future) and they will be basic text, I have no control over how they are created (it's a third-party source, and I don't have access to the way the files are created).
Accepted answer by Stephen C
There is no such paragraph break character in common usage.
You might be able to get away with assuming that two or more line breaks in a row (with optional horizontal whitespace) indicate a paragraph break. But there are numerous exceptions to this "rule". For example, when a paragraph
- is interrupted by a floating figure, or
- contains bullet points
and then continues on ... like this one. For that kind of thing, there is probably no solution.
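The "two or more line breaks in a row" heuristic can be sketched roughly as follows. This is only an illustration and assumes Java SE, where java.util.regex is available (which is not the case on the asker's Java ME platform). The class name BlankLineParagraphs and its split method are made up for this example. The atomic groups keep a single "\r\n" from being read as two separate line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class BlankLineParagraphs {
    // A line break, followed by one or more "optional spaces/tabs + line break".
    // (?>...) is atomic, so "\r\n" is always consumed as a single break.
    private static final Pattern PARAGRAPH_BREAK =
        Pattern.compile("(?>\\r\\n|\\r|\\n)(?:[ \\t]*(?>\\r\\n|\\r|\\n))+");

    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : PARAGRAPH_BREAK.split(text)) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {      // skip leading/trailing blank runs
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "Para one, line one.\r\nPara one, line two.\r\n\r\nPara two.";
        System.out.println(split(text).size()); // prints 2
    }
}
```

The exceptions listed above (figures, bullet points) still apply: this only detects blank-line gaps, not the author's intent.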
EDIT per @Aiden's comment below. (It is now clear that this is not relevant to the OP, but it may be relevant to others who find the question via Google, etc.)
Instead of trying to reverse engineer paragraphs from text, perhaps you should consider specifying that your input should be in (for example) Markdown syntax; i.e. as supported by StackOverflow. The Markdown Wiki includes links to Markdown parser implementations in many languages, including Java.
(This assumes that you have some control over the input format of the text you are trying to parse into paragraphs, etcetera.)
Answer by Ofir
It is possible that instead of a line feed you need to look for a CR LF sequence (\r\n). Obviously the answer depends on the text format.
Answer by Alan Moore
Paragraphs in plain text documents are usually separated by two or more line separators. A line separator may be a linefeed (\n), a carriage-return (\r), or a carriage-return followed by a linefeed (\r\n). These three kinds of separator are typically associated with operating systems, but any application is free to write text using any kind of line separator. In fact, text that's been assembled from diverse sources (like a web page) may well contain two or more kinds of separator. When your app reads text, no matter what platform it's running on, it should always check for all three kinds of line separator.
BufferedReader#readLine() does that, but of course it only reads one line at a time. Simple prose will usually be returned as an alternating sequence of non-empty lines representing paragraphs, and empty lines representing the spaces between them. But don't count on it; watch for multiple empty lines, and be aware that "empty" lines may in fact contain whitespace characters like space (\u0020) and TAB (\u0009).
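As a rough sketch of the readLine() approach, the following groups consecutive non-blank lines into paragraphs and treats any run of empty or whitespace-only lines as a single gap. It assumes Java SE; the class name LineGroupingParagraphs and the choice to join wrapped lines with a single space are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class LineGroupingParagraphs {
    static List<String> read(Reader source) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() accepts \n, \r and \r\n as line terminators.
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    // Blank (or whitespace-only) line: close the open paragraph, if any.
                    if (current.length() > 0) {
                        paragraphs.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    if (current.length() > 0) {
                        current.append(' ');
                    }
                    current.append(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "One\r\ntwo\r\n \t\r\n\r\nThree\rfour\n";
        System.out.println(read(new StringReader(text)));
    }
}
```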
If you choose not to go with a BufferedReader, you may have to write the detection code from scratch. Java ME doesn't include regex support, so split() and java.util.Scanner are not available; and StringTokenizer makes no distinction between a single delimiter character and several in a row unless you use the returnDelims option. Then it returns the delimiters one character at a time, so you still have to write your own code to figure out what kind of separator you're looking at, if any.
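Hand-written detection of this kind might look like the sketch below, which uses only String.charAt and no regex. It is written and checked as ordinary Java, not against the BlackBerry JDE, so treat its fit for Java ME as an assumption. The class name ManualParagraphCounter is made up for this example. A paragraph break here means two or more line breaks with nothing but spaces or tabs between them, and "\r\n" counts as one line break.

```java
// Hypothetical helper, not part of any library.
class ManualParagraphCounter {
    static int count(String text) {
        int paragraphs = 0;
        boolean inParagraph = false; // visible text seen since the last paragraph break?
        int lineBreaks = 0;          // consecutive line breaks, ignoring spaces and tabs

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                // Treat CR LF as a single line break.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                lineBreaks++;
                if (lineBreaks >= 2) {
                    inParagraph = false;
                }
            } else if (c == ' ' || c == '\t') {
                // Horizontal whitespace neither starts a paragraph nor resets the break count.
            } else {
                if (!inParagraph) {
                    paragraphs++;
                    inParagraph = true;
                }
                lineBreaks = 0;
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        System.out.println(count("One\r\ntwo\r\n \t\r\n\r\nThree\rfour\n")); // prints 2
    }
}
```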
Answer by SEK
First, your best bet would be to define what a paragraph is: whether it is marked by a line break, a double line break, or a line break followed by a tab. Assuming that you have no control over the input and want to determine the number of paragraphs in various samples of text, any of these situations may exist. Furthermore, they might be used for the same purpose within the same document. So some analysis is needed for this, and keep in mind it won't be 100% accurate all the time.
Start by initializing the various possible paragraph breaks:
- "\r"
- "\r\n"
- "\n"
- System.getProperty("line.separator")
and all of those again, doubled, and all of those variations with an additional tab character ('\t') on the end.
The inefficient way to do this would be to load the input into a string and then call buffer.split().length to determine how many paragraphs there were. The efficient, scalable way would be to use a stream and go over the input, taking into account how long each paragraph is, and throwing out those paragraphs beneath a given "threshold". A more advanced algorithm might even switch what it considers to be a paragraph after it encounters a switch in the way line breaks are handled (several very short lines, or several very long ones, for example).
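A minimal sketch of the streaming idea, under some assumptions of my own: a paragraph break is taken to be two or more consecutive line breaks, a paragraph's "length" is its count of characters other than spaces, tabs and line breaks, and candidates shorter than a caller-supplied minLength are thrown out. The class name StreamingParagraphCounter and the minLength parameter are hypothetical, and Java SE is assumed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of any library.
class StreamingParagraphCounter {
    static int count(Reader in, int minLength) {
        int threshold = Math.max(1, minLength); // never count empty candidates
        int paragraphs = 0;
        int length = 0;     // visible characters in the current candidate paragraph
        int lineBreaks = 0; // consecutive line breaks seen
        int prev = -1;
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n' && prev == '\r') {
                    prev = c;   // second half of CR LF, already counted
                    continue;
                }
                if (c == '\r' || c == '\n') {
                    lineBreaks++;
                    if (lineBreaks == 2) {
                        // Paragraph break: keep the candidate only if it is long enough.
                        if (length >= threshold) {
                            paragraphs++;
                        }
                        length = 0;
                    }
                } else if (c != ' ' && c != '\t') {
                    lineBreaks = 0;
                    length++;
                }
                prev = c;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (length >= threshold) {
            paragraphs++; // last paragraph has no trailing break
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "Title\n\nA longer paragraph of text.\n\nok\n";
        System.out.println(count(new StringReader(text), 1)); // prints 3
        System.out.println(count(new StringReader(text), 6)); // prints 1
    }
}
```

The input is read one character at a time, so memory use does not grow with the size of the text; wrapping the Reader in a BufferedReader would be the usual way to keep that efficient for files.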
And all of this assumes that you are dealing with unformatted text without section titles, etc. What it comes down to is that asking how many paragraphs are in a particular piece of text is like asking how many weeks are in a year. It's not exactly 52, but it's around there.
Answer by BalusC
String lineSeparator = System.getProperty("line.separator");
This returns the platform's default line separator.
Thus, e.g. the following should work:
String[] paragraphs = text.split(lineSeparator);
Answer by Gladwin Burboz
I assume you have a text file and not a complex document like MS-Word or RTF.
The concept of a paragraph in a text document is not well defined. In most cases a new paragraph is recognized by the fact that, when you open the document in a text editor, you see the next block of text starting on a new line.
There are two special characters, viz. new-line (LF - '\n') and carriage-return (CR - '\r'), that cause the text to start on the next line. Which character is used depends on the operating system you use. Furthermore, sometimes a combination of both is used, like CRLF ('\r\n').
In Java you can determine the character or set of characters used to separate lines/paragraphs using System.getProperty("line.separator");. But this brings in a new problem. What if you create a text file in MS Windows and then open it in Unix? The line separator in the text file in this case is that of Windows, but Java is running on Unix.
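One common way around the platform mismatch, offered here only as a sketch and not as part of the original answer, is to ignore line.separator for reading and normalize every line-ending style to '\n' before looking for paragraphs. The class name LineEndings and the normalize method are made up for this example.

```java
// Hypothetical helper, not part of any library.
class LineEndings {
    static String normalize(String text) {
        // Replace CR LF first so it does not turn into two line feeds,
        // then map any remaining lone CR to LF.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsText = "First line\r\nSecond line\r\n\r\nNew paragraph";
        String normalized = normalize(windowsText);
        System.out.println(normalized.indexOf('\r') == -1); // prints "true"
    }
}
```

After normalization, the same paragraph logic works regardless of where the file was created or where the JVM runs.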
My recommendation is:
IF the length of the text (document) is zero, THEN paragraphs = 0.
IF the length of the text (document) is NOT zero, THEN
- Consider '\n' and '\r' as line break characters.
- Scan your text for the above line break characters.
- Any continuous run of line break characters, in any order, should be considered one paragraph break.
- Number of paragraphs = 1 + (count of paragraph breaks)
Note that the exceptions pointed out by Stephen still apply here as well.
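    import java.util.StringTokenizer;

    public class ParagraphTest {

        public static void main(String[] args) {
            // Sample text mixing several kinds of line break.
            String document =
                "Hello world.\n" +
                "This is line 2.\n\r" +
                "Line 3 here.\r" +
                "Yet another line 4.\n\r\n\r" +
                "Few more lines 5.\r";
            printParaCount(document);
        }

        public static void printParaCount(String document) {
            // Each character in this string is a delimiter; a run of
            // consecutive delimiters is treated as a single break.
            String lineBreakCharacters = "\r\n";
            StringTokenizer st = new StringTokenizer(
                document, lineBreakCharacters);
            // countTokens() is the number of non-empty pieces between breaks.
            System.out.println("ParaCount: " + st.countTokens());
        }
    }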
Output
ParaCount: 5

