Converting UTF-8 to ISO-8859-1 in Java - how to keep it as single byte
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/655891/
Question by
I am trying to convert a string encoded in Java in UTF-8 to ISO-8859-1. For example, in the string 'âabcd', 'â' is represented in ISO-8859-1 as E2. In UTF-8 it is represented as two bytes, C3 A2 I believe. When I do a getBytes(encoding) and then create a new string with the bytes in ISO-8859-1 encoding, I get two different chars: Ã¢. Is there any other way to do this so as to keep the character the same, i.e. âabcd?
Answer by Joachim Sauer
byte[] iso88591Data = theString.getBytes("ISO-8859-1");
Will do the trick. From your description it seems as if you're trying to "store an ISO-8859-1 String". String objects in Java are always implicitly encoded in UTF-16. There's no way to change that encoding.
What you can do, though, is get the bytes that constitute some other encoding of it (using the .getBytes() method as shown above).
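As a minimal sketch of the point above (the class name SingleByteDemo and the helper toLatin1 are made up for illustration), the same String yields one byte for 'â' when encoded as ISO-8859-1 and two bytes when encoded as UTF-8. The sketch uses the StandardCharsets constants, which avoid the checked UnsupportedEncodingException that the charset-name overload declares.

```java
import java.nio.charset.StandardCharsets;

public class SingleByteDemo {
    // Hypothetical helper: encodes a String as ISO-8859-1 bytes.
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "âabcd", written with an escape so the source file's own encoding does not matter
        String s = "\u00e2abcd";
        System.out.println(toLatin1(s).length);                        // 5: one byte per character
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6: 'â' takes two bytes
    }
}
```

The String itself never changes; only the byte arrays produced from it differ.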
Answer by Adam Rosenfield
If you're dealing with character encodings other than UTF-16, you shouldn't be using java.lang.String or the char primitive -- you should only be using byte[] arrays or ByteBuffer objects. Then, you can use java.nio.charset.Charset to convert between encodings:
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte)0xC3, (byte)0xA2});
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
Answer by Pete Kirkham
Start with a set of bytes which encode a string using UTF-8, create a string from that data, then get the bytes encoding that string in a different encoding:
byte[] utf8bytes = { (byte)0xc3, (byte)0xa2, 0x61, 0x62, 0x63, 0x64 };
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
String string = new String ( utf8bytes, utf8charset );
System.out.println(string);
// "When I do a getbytes(encoding) and "
byte[] iso88591bytes = string.getBytes(iso88591charset);
for ( byte b : iso88591bytes )
System.out.printf("%02x ", b);
System.out.println();
// "then create a new string with the bytes in ISO-8859-1 encoding"
String string2 = new String ( iso88591bytes, iso88591charset );
// "I get a two different chars"
System.out.println(string2);
This outputs the strings and the ISO-8859-1 bytes correctly:
âabcd
e2 61 62 63 64
âabcd
So your byte array wasn't paired with the correct encoding:
String failString = new String ( utf8bytes, iso88591charset );
System.out.println(failString);
Outputs
Ã¢abcd
(either that, or you just wrote the UTF-8 bytes to a file and read them elsewhere as ISO-8859-1)
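If the wrongly decoded string is all you have left, the mistake can often be reversed. The following is a sketch only (MojibakeDemo and repair are hypothetical names): it re-encodes the garbled string as ISO-8859-1 to recover the original bytes, then decodes those bytes as UTF-8. This relies on ISO-8859-1 mapping every byte value to a character; it may not work for a charset with unassigned byte values.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: undoes a wrong ISO-8859-1 decode of UTF-8 bytes.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8bytes = { (byte) 0xc3, (byte) 0xa2, 0x61, 0x62, 0x63, 0x64 };
        // Wrong pairing: UTF-8 bytes decoded as ISO-8859-1 give two characters for 'â'.
        String garbled = new String(utf8bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());         // 6
        System.out.println(repair(garbled).length()); // 5
    }
}
```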
Answer by bcros
Evict non-ISO-8859-1 characters; they will be replaced by '?' (for example, before sending to an ISO-8859-1 DB):
utf8String = new String ( utf8String.getBytes("ISO-8859-1"), "ISO-8859-1" );
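Silent replacement can hide data loss, so it may be worth checking first whether the string fits the target charset. A small sketch, assuming the hypothetical names Latin1Check, fitsLatin1 and evict: CharsetEncoder.canEncode reports whether every character is representable, and String.getBytes substitutes the charset's replacement byte (a '?' for ISO-8859-1) for anything that is not.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Hypothetical helper: true if every character of s exists in ISO-8859-1.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    // Hypothetical helper: replaces characters outside ISO-8859-1 with '?'.
    static String evict(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String ok = "\u00e2abcd";  // 'â' is part of ISO-8859-1
        String bad = "\u20ac5";    // the euro sign is not
        System.out.println(fitsLatin1(ok));  // true
        System.out.println(fitsLatin1(bad)); // false
        System.out.println(evict(bad));      // ?5
    }
}
```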
Answer by Paul Vargas
If the string was decoded with the correct encoding, you need do nothing more than ask for its bytes in another encoding.
public static void main(String[] args) throws Exception {
    printBytes("â");
    System.out.println(
            new String(new byte[] { (byte) 0xE2 }, "ISO-8859-1"));
    System.out.println(
            new String(new byte[] { (byte) 0xC3, (byte) 0xA2 }, "UTF-8"));
}
private static void printBytes(String str) {
    System.out.println("Bytes in " + str + " with ISO-8859-1");
    for (byte b : str.getBytes(StandardCharsets.ISO_8859_1)) {
        System.out.printf("%3X", b);
    }
    System.out.println();
    System.out.println("Bytes in " + str + " with UTF-8");
    for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
        System.out.printf("%3X", b);
    }
    System.out.println();
}
Output:
Bytes in â with ISO-8859-1
 E2
Bytes in â with UTF-8
 C3 A2
â
â
Answer by Frizz1977
For converting the encoding of a file...
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
public class FRomUtf8ToIso {
    static final Charset ISO_8859_15 = Charset.forName("ISO-8859-15");
    static File input = new File("C:/Users/admin/Desktop/pippo.txt");
    static File output = new File("C:/Users/admin/Desktop/ciccio.txt");
    public static void main(String[] args) {
        // Name the input charset explicitly; FileReader and FileWriter would use the platform default.
        try (BufferedReader br = new BufferedReader(
                     new InputStreamReader(new FileInputStream(input), StandardCharsets.UTF_8));
             OutputStream out = new FileOutputStream(output)) {
            String sCurrentLine;
            int i = 0;
            while ((sCurrentLine = br.readLine()) != null) {
                // Write the encoded bytes directly rather than going back through a String.
                out.write(encode(sCurrentLine.getBytes(StandardCharsets.UTF_8)));
                out.write('\n');
                System.out.println(i++);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    static byte[] encode(byte[] arr) {
        ByteBuffer inputBuffer = ByteBuffer.wrap(arr);
        // decode UTF-8
        CharBuffer data = StandardCharsets.UTF_8.decode(inputBuffer);
        // encode ISO-8859-15
        ByteBuffer outputBuffer = ISO_8859_15.encode(data);
        // Copy only the encoded bytes; array() may expose unused trailing capacity.
        byte[] outputData = new byte[outputBuffer.remaining()];
        outputBuffer.get(outputData);
        return outputData;
    }
}
Answer by Chadi
In addition to Adam Rosenfield's answer, I would like to add that ByteBuffer.array() returns the buffer's underlying byte array, which is not necessarily "trimmed" up to the last character. Extra manipulation will be needed, such as the ones mentioned in this answer; in particular:
byte[] b = new byte[bb.remaining()];
bb.get(b);
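To make the difference visible, here is a small sketch with a deliberately oversized buffer (BufferTrimDemo and toArray are hypothetical names). Whether Charset.encode actually returns a buffer with spare capacity depends on the charset and the input, so the sketch allocates the buffer by hand instead.

```java
import java.nio.ByteBuffer;

public class BufferTrimDemo {
    // Hypothetical helper: copies only the bytes between position and limit.
    static byte[] toArray(ByteBuffer bb) {
        byte[] b = new byte[bb.remaining()];
        bb.duplicate().get(b); // duplicate() so the caller's position is left untouched
        return b;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(16); // backing array larger than the content
        bb.put((byte) 0xE2).put((byte) 0x61);
        bb.flip();
        System.out.println(bb.array().length);  // 16: the whole backing array
        System.out.println(toArray(bb).length); // 2: only the bytes that were written
    }
}
```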
Answer by K?re Jonsson
This is what I needed:
public static byte[] encode(byte[] arr, String fromCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName("UTF-8"));
}
public static byte[] encode(byte[] arr, String fromCharsetName, String targetCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName(targetCharsetName));
}
public static byte[] encode(byte[] arr, Charset sourceCharset, Charset targetCharset) {
    ByteBuffer inputBuffer = ByteBuffer.wrap(arr);
    CharBuffer data = sourceCharset.decode(inputBuffer);
    ByteBuffer outputBuffer = targetCharset.encode(data);
    // Copy only the encoded bytes; array() may include unused trailing capacity.
    byte[] outputData = new byte[outputBuffer.remaining()];
    outputBuffer.get(outputData);
    return outputData;
}