Converting UTF-8 to ISO-8859-1 in Java - how to keep it as a single byte

Question

I am trying to convert a string encoded in UTF-8 in Java to ISO-8859-1. For example, in the string 'âabcd', 'â' is represented in ISO-8859-1 as E2. In UTF-8 it is represented as two bytes, C3 A2 I believe. When I do a getbytes(encoding) and then create a new string with the bytes in ISO-8859-1 encoding, I get two different chars: Ã¢. Is there any other way to do this so as to keep the character the same, i.e. âabcd?

Answer 1

Answered by Joachim Sauer

byte[] iso88591Data = theString.getBytes("ISO-8859-1");

Will do the trick. From your description it seems as if you're trying to "store an ISO-8859-1 String". String objects in Java are always implicitly encoded in UTF-16. There's no way to change that encoding.

What you can do, though, is get the bytes that constitute some other encoding of it (using the .getBytes() method as shown above).

Answer 2

Answered by Adam Rosenfield

If you're dealing with character encodings other than UTF-16, you shouldn't be using java.lang.String or the char primitive -- you should only be using byte[] arrays or ByteBuffer objects. Then, you can use java.nio.charset.Charset to convert between encodings:

Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");

ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte)0xC3, (byte)0xA2});

// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);

// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();

Answer 3

Answered by Pete Kirkham

Start with a set of bytes which encode a string using UTF-8, create a string from that data, then get the bytes that encode the string in a different encoding:

    byte[] utf8bytes = { (byte)0xc3, (byte)0xa2, 0x61, 0x62, 0x63, 0x64 };
    Charset utf8charset = Charset.forName("UTF-8");
    Charset iso88591charset = Charset.forName("ISO-8859-1");

    String string = new String ( utf8bytes, utf8charset );

    System.out.println(string);

    // "When I do a getbytes(encoding) and "
    byte[] iso88591bytes = string.getBytes(iso88591charset);

    for ( byte b : iso88591bytes )
        System.out.printf("%02x ", b);

    System.out.println();

    // "then create a new string with the bytes in ISO-8859-1 encoding"
    String string2 = new String ( iso88591bytes, iso88591charset );

    // "I get a two different chars"
    System.out.println(string2);

This outputs the strings and the ISO-8859-1 bytes correctly:

âabcd
e2 61 62 63 64 
âabcd

So your byte array wasn't paired with the correct encoding:

    String failString = new String ( utf8bytes, iso88591charset );

    System.out.println(failString);

Outputs

Ã¢abcd

(either that, or you just wrote the UTF-8 bytes to a file and read them elsewhere as ISO-8859-1)
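If the only thing left is a string that was already decoded with the wrong charset, the mismatch can usually be undone, because Java's ISO-8859-1 decoder maps every byte value to exactly one char. The following is a minimal sketch of that idea; the class and method names (MojibakeDemo, misdecode, repair) are made up for illustration and do not come from the answers here.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decodes UTF-8 bytes with the wrong charset, reproducing the garbled result.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the wrong decode: ISO-8859-1 maps each char U+0000..U+00FF back to
    // the same single byte, so the original UTF-8 bytes are recovered and can
    // then be decoded with the charset they were actually written in.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA2, 0x61, 0x62, 0x63, 0x64 };
        String garbled = misdecode(utf8Bytes);
        // Lengths are printed instead of the text so the result does not depend
        // on the console encoding.
        System.out.println(garbled.length());         // 6
        System.out.println(repair(garbled).length()); // 5
    }
}
```

This only works while the garbled string has not been through a lossy step in between, such as a charset that replaces unknown bytes with '?'.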

Answer 4

Answered by bcros

To evict non-ISO-8859-1 characters, which will be replaced by '?' (for example, before sending to an ISO-8859-1 DB):

utf8String = new String ( utf8String.getBytes("ISO-8859-1"), "ISO-8859-1" );
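For the replacement to happen, the bytes have to be produced with the ISO-8859-1 charset itself; String.getBytes(Charset) substitutes the charset's replacement byte, '?' for ISO-8859-1, for anything it cannot encode. Below is a small sketch of that round trip, plus a check that detects the loss beforehand. The names Latin1Filter, fitsLatin1 and replaceNonLatin1 are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Filter {
    // Returns true if every character of s has an ISO-8859-1 representation.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    // Round-trips through ISO-8859-1; String.getBytes substitutes '?' for
    // characters the charset cannot encode.
    static String replaceNonLatin1(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00E2" is â (in ISO-8859-1); "\u20AC" is the euro sign (not in ISO-8859-1).
        String input = "\u00E2bc \u20AC";
        System.out.println(fitsLatin1(input));                // false
        System.out.println(replaceNonLatin1(input).length()); // 5, the euro sign became '?'
    }
}
```

Checking with fitsLatin1 first makes it possible to reject or log the input instead of silently storing question marks.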

Answer 5

Answered by Paul Vargas

If you have the correct encoding in the string, you need not do more to get the bytes for another encoding.

public static void main(String[] args) throws Exception {
    printBytes("â");
    System.out.println(
            new String(new byte[] { (byte) 0xE2 }, "ISO-8859-1"));
    System.out.println(
            new String(new byte[] { (byte) 0xC3, (byte) 0xA2 }, "UTF-8"));
}

private static void printBytes(String str) {
    System.out.println("Bytes in " + str + " with ISO-8859-1");
    for (byte b : str.getBytes(StandardCharsets.ISO_8859_1)) {
        System.out.printf("%3X", b);
    }
    System.out.println();
    System.out.println("Bytes in " + str + " with UTF-8");
    for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
        System.out.printf("%3X", b);
    }
    System.out.println();
}

Output:

Bytes in â with ISO-8859-1
 E2
Bytes in â with UTF-8
 C3 A2
â
â

Answer 6

Answered by Frizz1977

For file encoding...

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FRomUtf8ToIso {
    static File input = new File("C:/Users/admin/Desktop/pippo.txt");
    static File output = new File("C:/Users/admin/Desktop/ciccio.txt");

    public static void main(String[] args) throws IOException {
        // Note: this answer targets ISO-8859-15 (Latin-9), not ISO-8859-1.
        Charset targetCharset = Charset.forName("ISO-8859-15");

        // Decode the input explicitly as UTF-8 and encode the output explicitly
        // in the target charset, instead of relying on the platform default
        // charset used by FileReader and FileWriter.
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(input), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(output), targetCharset))) {

            String sCurrentLine;
            int i = 0;
            while ((sCurrentLine = br.readLine()) != null) {
                // Characters the target charset cannot represent are written as '?'.
                writer.write(sCurrentLine);
                writer.write("\n");
                System.out.println(i++);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Answer 7

Answered by Chadi

In addition to Adam Rosenfield's answer, I would like to add that ByteBuffer.array() returns the buffer's underlying byte array, which is not necessarily "trimmed" to the last encoded byte. Extra manipulation will be needed, such as the ones mentioned in this answer; in particular:

byte[] b = new byte[bb.remaining()];
bb.get(b);
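Put together with the decode/encode steps from the earlier answer, the trimming might look like the sketch below. The class and method names (TrimmedEncodeDemo, utf8ToLatin1) are invented for this example; whether array() actually returns extra bytes depends on the charset and the JDK implementation, so copying the remaining bytes is simply the safe route.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

final class TrimmedEncodeDemo {
    // Transcodes UTF-8 bytes to ISO-8859-1 and returns exactly the encoded bytes.
    static byte[] utf8ToLatin1(byte[] utf8Bytes) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8Bytes));
        ByteBuffer encoded = StandardCharsets.ISO_8859_1.encode(chars);

        // Copy only the bytes between position and limit; the backing array
        // returned by array() may be longer than that.
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        byte[] latin1 = utf8ToLatin1(new byte[] { (byte) 0xC3, (byte) 0xA2 });
        for (byte b : latin1) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```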

Answer 8

Answered by K?re Jonsson

This is what I needed:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Converts from the named charset to UTF-8.
public static byte[] encode(byte[] arr, String fromCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName("UTF-8"));
}

public static byte[] encode(byte[] arr, String fromCharsetName, String targetCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName(targetCharsetName));
}

public static byte[] encode(byte[] arr, Charset sourceCharset, Charset targetCharset) {

    ByteBuffer inputBuffer = ByteBuffer.wrap( arr );

    CharBuffer data = sourceCharset.decode(inputBuffer);

    ByteBuffer outputBuffer = targetCharset.encode(data);

    // Copy only the encoded bytes; array() may expose a longer backing array.
    byte[] outputData = new byte[outputBuffer.remaining()];
    outputBuffer.get(outputData);

    return outputData;
}
