Java: How to convert a UTF-8 character to ISO Latin 1?
Original URL: http://stackoverflow.com/questions/634727/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to convert UTF-8 character to ISO Latin 1?
Asked by
I need to convert a UTF-8 trademark sign to ISO Latin 1 and save it into a database, which is also ISO Latin 1 encoded.
How can I do that in Java?
I've tried something like
String s2 = new String(s1.getBytes("ISO-8859-1"), "utf-8");
but it doesn't seem to work as I expected.
Answer by Jon Skeet
A string in Java is always in Unicode (UTF-16, effectively). Conversions are only necessary when you're trying to go from text to a binary encoding or vice versa.
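To illustrate the point, here is a minimal sketch (the class name `EncodingDemo`, the helper `byteLengths`, and the sample text are made up for illustration, and `StandardCharsets` assumes Java 7 or later). It shows that an encoding only comes into play at the text-to-bytes boundary, where the same String yields different byte arrays depending on the charset:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    // Encodes the same text with both charsets and returns the two byte counts
    // as { UTF-8 length, ISO-8859-1 length }.
    static int[] byteLengths(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new int[] { utf8.length, latin1.length };
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e (U+00E9), which exists in Latin-1.
        String text = "caf\u00e9";
        // UTF-8 needs two bytes for U+00E9: prints [99, 97, 102, -61, -87]
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.UTF_8)));
        // ISO-8859-1 needs one byte: prints [99, 97, 102, -23]
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The String itself is identical in both calls; only the byte representation differs.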
What's the character involved? Are you sure it's even present in ISO Latin 1? If it is, I'd expect that character to be stored by your database without any problem. There's no such thing as a "UTF-8 trademark sign". You could have "the bytes representing the trademark sign UTF-8 encoded" but that would be a byte array, not a string.
EDIT: If you mean the Unicode trademark character U+2122, that's outside the range of ISO-Latin-1. There's the registered trademark character U+00AE, which isn't the same thing (either in appearance or in legal meaning, IIRC) but may be better than nothing. If you want to use that, then just use:
String replaced = original.replace('\u2122', '\u00ae');
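Whether a given character fits into ISO Latin 1 can be checked before writing to the database. The following is a sketch only: the class `TrademarkCheck`, the helper `fitsLatin1`, and the sample text "Acme" are hypothetical names chosen for illustration, and it relies on the standard `CharsetEncoder.canEncode` method.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class TrademarkCheck {
    // Returns true if every character of the text can be encoded in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String original = "Acme\u2122";
        String replaced = original.replace('\u2122', '\u00ae');

        System.out.println(fitsLatin1(original)); // prints false
        System.out.println(fitsLatin1(replaced)); // prints true

        // After the replacement the last byte is the registered sign.
        byte[] bytes = replaced.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes[bytes.length - 1] & 0xFF); // prints 174 (0xAE)
    }
}
```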
Answer by Joachim Sauer
- Read what Jon Skeet told you. The code you posted is rubbish (it encodes your String as ISO-8859-1 and then interprets those bytes as if they were UTF-8; this accomplishes nothing useful).
- The ISO-8859-1 encoding (a.k.a. Latin-1) doesn't contain the trademark character "™".
Answer by juwens
I had a similar problem and solved it by converting the non-translatable characters into entities. If you display the information later as HTML, you are fine anyway.
If not, you could try to convert them back to Unicode.
Example in Python with the trademark sign:
s = u'yellow bananas\u2122'.encode('latin1', 'xmlcharrefreplace')
# s is 'yellow bananas&#8482;' (a bytes object, b'...', in Python 3)
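Since the question is about Java, here is a rough Java counterpart of the Python example above. It is only a sketch: Java has no built-in equivalent of `xmlcharrefreplace` that I am certain of, so the class `Latin1Entities` and the helper `escapeNonLatin1` are hypothetical and written by hand (Java 8 or later is assumed for `codePoints()`).

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Entities {
    // Replaces every code point that ISO-8859-1 cannot encode with a
    // decimal numeric character reference, for example &#8482;.
    static String escapeNonLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp <= 0xFFFF && encoder.canEncode((char) cp)) {
                out.appendCodePoint(cp);                 // keep encodable characters as-is
            } else {
                out.append("&#").append(cp).append(';'); // cp is appended as a decimal number
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: yellow bananas&#8482;
        System.out.println(escapeNonLatin1("yellow bananas\u2122"));
    }
}
```

The result contains only Latin-1 characters, so it can be stored in the database and rendered correctly when output as HTML.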
Answer by Myobis
As far as I understand, you are trying to store a string (s1) that contains non-Latin-1 characters in a DB that only supports ISO-8859-1.
First, I agree with the others that this is a dirty idea.
Note that CP1252 is close to ISO-8859-1 (1 byte per character) and includes ™.
Now, to answer your question, I think you did the opposite.
You want to store the UTF-8 bytes as ISO-8859-1 characters:
String s2 = new String(s1.getBytes("UTF-8"), "ISO-8859-1");
This way, s2 is a character String that, once encoded in ISO-8859-1, returns a byte array containing the original UTF-8 bytes. To retrieve the original string, you would do
String s1 = new String(s2.getBytes("ISO-8859-1"), "UTF-8");
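The round trip described here can be sketched as follows. The class `MojibakeRoundTrip` and the helpers `pack` and `unpack` are made-up names for illustration, and `StandardCharsets` is used instead of charset name strings to avoid the checked `UnsupportedEncodingException`.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRoundTrip {
    // Reinterprets the UTF-8 bytes of the text as ISO-8859-1 characters.
    static String pack(String s1) {
        return new String(s1.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses pack(): recovers the bytes and decodes them as UTF-8.
    static String unpack(String s2) {
        return new String(s2.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s1 = "yellow bananas\u2122";
        String s2 = pack(s1);
        // The trademark sign takes 3 bytes in UTF-8, so s2 has 14 + 3 chars.
        System.out.println(s2.length());           // prints 17
        System.out.println(unpack(s2).equals(s1)); // prints true
    }
}
```

Note that s2 is not readable text: anything that reads the database column directly sees three unrelated Latin-1 characters in place of the trademark sign.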
BUT WAIT! When doing this, you hope that any byte can be decoded with ISO-8859-1, and that your DB will accept such data, etc.
In fact, this is not guaranteed, because officially ISO-8859-1 does not assign characters to every byte value, for instance 0x80 to 0x9F.
Then,
byte[] b = { -97, -100, -128 };
System.out.println( new String(b,"ISO-8859-1") );
would display ???
However, in Java, s.getBytes("ISO-8859-1") does restore the initial array.
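This last observation can be checked for all 256 byte values at once. The following is a small sketch (the class `Latin1ByteRoundTrip` and the helper `roundTripsAllBytes` are hypothetical names); it only covers Java's own ISO-8859-1 codec and says nothing about what a particular database will accept.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1ByteRoundTrip {
    // Decodes every possible byte value as ISO-8859-1, encodes the result
    // again, and reports whether the original bytes come back unchanged.
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes()); // prints true
    }
}
```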