原文地址: http://stackoverflow.com/questions/4000392/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to parse Word-created special chars in Java
Asked by Derek
I am trying to parse some Word documents in Java. Some of the values are things like a date range, and instead of showing up as Startdate - endDate, I am getting some funky characters like so:
StartDate Γ?? EndDate
This is where Word puts in a special hyphen character. Can I search for these characters and replace them with a regular - or something similar in the string, so that I can then tokenize on a "-"? And what is that character: ASCII, Unicode, or what?
Edited to add some code:
String projDateString = "08/2010 Γ?? Present"
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
println ("S: " + s)
println("projDatestring: " + projDateString)
Outputs the following:
S: 08/2010 Γ?? Present
projDatestring: 08/2010 Γ?? Present
Also, using the same projDateString above, if I do:
projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");
and then print out projDateString, it still prints as
projDatestring: 08/2010 Γ?? Present
Answered by Stephen P
You are probably getting Windows-1252, which is a single-byte character set (a Windows code page), not a Unicode encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)
Windows-1252, also known as "Cp1252", almost matches the first 256 Unicode code points (ISO-8859-1), but it puts printable characters in the 0x80 to 0x9f range. The En Dash is character 150 (0x96), which falls within the Unicode C1 reserved control character range and shouldn't be there.
You can search for char 150 and replace it with \u2013, which is the proper Unicode code point for En Dash.
There are quite a few other characters that MS has in the 0x80 to 0x9f range, which is reserved for control characters in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.
Edit: By the way, Java represents characters internally as Unicode (a String is a sequence of UTF-16 code units). UTF-8 is an encoding. Java 18 and later use it as the default encoding when writing Strings to files or network connections; older versions use the platform default, which is often Cp1252 on Windows.
Say you have
String stuff = MSWordUtil.getNextChunkOfText();
where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to
File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file
By default, as you read byte buffers from the file and make Strings out of them, Java will decode the bytes with its default charset (UTF-8 on Java 18 and later, the platform default on older versions). There are ways, as Lord Torgamus says, to tell Java what encoding should be used. Without doing that, if the bytes end up decoded as ISO-8859-1, where each byte becomes the Unicode code point of the same value, the result is pretty close to Windows-1252, except for those pesky characters that land in the C1 control range.
After getting some String like stuff above, you won't find \u2013 or \u2014 in it; you'll find 0x96 and 0x97 instead.
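To make the difference concrete, here is a minimal sketch that decodes the single byte 0x96 two ways. The class name DecodeDemo and the helper decodeOne are hypothetical, made up for illustration; the sketch assumes the text really arrived as Windows-1252 bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes one byte with the given charset and returns the resulting char value.
    static int decodeOne(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).charAt(0);
    }

    public static void main(String[] args) {
        byte enDashByte = (byte) 0x96; // Windows-1252 byte for En Dash

        // ISO-8859-1 maps every byte to the code point of the same value,
        // so the byte lands in the C1 control range.
        int asLatin1 = decodeOne(enDashByte, StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset yields the real En Dash.
        int as1252 = decodeOne(enDashByte, Charset.forName("windows-1252"));

        System.out.printf("ISO-8859-1:   U+%04X%n", asLatin1); // U+0096
        System.out.printf("windows-1252: U+%04X%n", as1252);   // U+2013
    }
}
```

If you control the point where bytes become a String, decoding with windows-1252 there avoids the later search-and-replace altogether.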
At that point you should be able to do
stuff = stuff.replaceAll("\u0096", "\u2013"); // Strings are immutable, so keep the returned value
I don't do that in my own code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f whether it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
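The loop-and-lookup approach described above might look roughly like the following. This is a sketch, not the code from the answer: the class name C1Fixer, the method fix, and the table name CP1252_C1 are all hypothetical. The table holds the Windows-1252 assignments for 0x80 to 0x9f, and the five slots that Windows-1252 leaves unassigned are passed through unchanged.

```java
public class C1Fixer {
    // Windows-1252 characters for 0x80..0x9F, indexed by (value - 0x80).
    // Unassigned slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to themselves.
    private static final char[] CP1252_C1 = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };

    // Replaces every char in the C1 range with its Windows-1252 meaning.
    static String fix(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(c >= 0x80 && c <= 0x9F ? CP1252_C1[c - 0x80] : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = fix("08/2010 \u0096 Present");
        System.out.println(fixed.equals("08/2010 \u2013 Present")); // true
    }
}
```

A table like this handles the smart quotes, bullets, and both dashes in one pass, at the cost of maintaining the 32 entries.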
Answered by Misa
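// Replace both the Windows-1252 values and the Unicode code points with plain ASCII.
s = s.replace((char) 145, '\'');   // Windows-1252 left single quote (0x91)
s = s.replace((char) 8216, '\'');  // Unicode left single quote (U+2018)
s = s.replace((char) 146, '\'');   // Windows-1252 right single quote (0x92)
s = s.replace((char) 8217, '\'');  // Unicode right single quote (U+2019)
s = s.replace((char) 147, '"');    // Windows-1252 left double quote (0x93)
s = s.replace((char) 148, '"');    // Windows-1252 right double quote (0x94)
s = s.replace((char) 8220, '"');   // Unicode left double quote (U+201C)
s = s.replace((char) 8221, '"');   // Unicode right double quote (U+201D)
s = s.replace((char) 8211, '-');   // Unicode en dash (U+2013), not em dash
s = s.replace((char) 150, '-');    // Windows-1252 en dash (0x96)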
Answered by Pops
Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,
Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.
Windows-1252 is recognized by the IANA Charset Registry:
Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None
so it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
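For what it's worth, a minimal sketch of that constructor in use could look like this. The class name Cp1252ToUtf8 and both helper methods are hypothetical, and the byte array stands in for text bytes that are assumed to be Windows-1252; a real .doc file is a binary format and would first need a parser to extract its text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252ToUtf8 {
    // Decodes raw bytes as Windows-1252 using String(byte[], Charset).
    static String decodeCp1252(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    // Re-encodes the decoded text as UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "A", space, Windows-1252 En Dash byte, space, "B"
        byte[] raw = { 'A', ' ', (byte) 0x96, ' ', 'B' };

        String text = decodeCp1252(raw);
        System.out.println(text.charAt(2) == '\u2013'); // true: a real En Dash

        // The En Dash takes three bytes in UTF-8 (E2 80 93), so 5 bytes become 7.
        System.out.println(toUtf8(text).length); // 7
    }
}
```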
Answered by Giulio Piancastelli
Probably, that character is an en dash, and the strange garbage you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
If I remember correctly from when I did some work on character encodings in Java, String instances always hold Unicode text internally (as UTF-16 code units), whatever encoding the bytes came from; so, within such an instance, you may search for and replace a single character by its Unicode escape. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write
s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');
where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to Wikipedia, the Unicode code point for the en dash is 2013.
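Applied to the date range in the question, the same replace call can normalize the dashes before tokenizing. This is only a sketch: the class name DateRangeSplitter and the method splitRange are hypothetical, and it assumes the String was decoded correctly so that it actually contains U+2013 or U+2014.

```java
public class DateRangeSplitter {
    // Turns en and em dashes into '-', then splits into at most two parts,
    // dropping the whitespace around the dash.
    static String[] splitRange(String s) {
        String normalized = s.replace('\u2013', '-').replace('\u2014', '-');
        return normalized.split("\\s*-\\s*", 2);
    }

    public static void main(String[] args) {
        String[] parts = splitRange("08/2010 \u2013 Present");
        System.out.println(parts[0]); // 08/2010
        System.out.println(parts[1]); // Present
    }
}
```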