Convert from Codepage 1252 (Windows) to Java, in Java

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/577846/

Date: 2020-09-15 12:02:08  Source: igfitidea

Tags: java, windows, unicode, codepages

Asked by Jakob Eriksson

I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.

To clarify: the strings obtained from the Excel file apparently already look like some kind of Unicode.

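import java.io.File;
import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;
import jxl.WorkbookSettings;

WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger); // someInteger: the character-set code being tried
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
Cell[] row = s.getRow(4);
String contents = row[0].getContents();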

This is where contents seems to contain something Unicode: the non-ASCII characters are multibyte, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).

[edit]

I have tried the suggestions given below with different codepages, including converting from Cp1252. Some kind of conversion did take place, because I got a different kind of gibberish instead. As a reference I always printed an "ö" string hand-coded into the source code, to verify that nothing was wrong with my terminal, typefaces or anything else. The manually typed "ö" always worked.

[edit]

I also tried WorkbookSettings as suggested in the comments, but I looked in the JXL code and characterSet seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.

Answered by (name not given)

WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");

Worked for me.

Answered by lxndr

If none of the answers above solves the problem, the following trick might work:

String myOutput = new String(myInput, "UTF-8"); // myInput must be a byte[] holding the raw bytes

This should decode the incoming data, provided the bytes really are UTF-8; it is not a universal fix for any format.
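As a minimal sketch of that trick, assuming the raw bytes are available: the first method decodes UTF-8 bytes directly, and the second tries to repair a String that was already decoded as windows-1252 by mistake. The class and method names (Utf8DecodeSketch, decodeUtf8, repairMojibake) are hypothetical, and the repair only works when every original byte has a windows-1252 mapping.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8DecodeSketch {
    // Decode raw bytes that are known to be UTF-8.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Attempt to repair a String that came from decoding UTF-8 bytes as windows-1252:
    // turn the chars back into the original bytes, then decode them as UTF-8.
    static String repairMojibake(String misdecoded) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(misdecoded.getBytes(cp1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0xB6}; // UTF-8 bytes of U+00F6 (o with diaeresis)
        System.out.println(decodeUtf8(raw).equals("\u00F6"));
        // "\u00C3\u00B6" is what those two bytes look like when read as windows-1252
        System.out.println(repairMojibake("\u00C3\u00B6").equals("\u00F6"));
    }
}
```

Unicode escapes are used in the literals so the example does not depend on the encoding of the source file itself.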

Answered by Michael Borgwardt

You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.

JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
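To illustrate why a String built with the wrong encoding can be "too late" to fix, here is a small sketch using only the standard library (not JXL). US-ASCII is chosen as the wrong charset purely for illustration, and the class name LossyDecodeSketch is hypothetical. Bytes that the charset cannot decode are replaced with U+FFFD, so the original byte values are gone.

```java
import java.nio.charset.StandardCharsets;

final class LossyDecodeSketch {
    // Decode with a charset that cannot represent the input bytes.
    static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xB6}; // UTF-8 bytes of U+00F6
        String wrong = decodeAsAscii(utf8);
        // Each undecodable byte became the replacement character U+FFFD.
        System.out.println(wrong.equals("\uFFFD\uFFFD"));
        // Encoding back yields question marks, not the original bytes.
        byte[] back = wrong.getBytes(StandardCharsets.US_ASCII);
        System.out.println(back[0] == '?' && back[1] == '?');
    }
}
```

Some wrong decodings happen to be reversible (windows-1252 maps most byte values to some character), but that cannot be relied on in general.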

Answered by u7867

When Java parses a file it uses some encoding to read the bytes on the disk and create characters in memory. The default encoding varies from platform to platform. Java's internal String representation is already Unicode, so if the file is parsed with the right encoding you are done; just write out the data in any encoding you want.
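The two steps described above (decode with the source encoding, then encode with the target encoding) can be sketched as follows. The class and method names are hypothetical, and the sketch assumes the JRE supports "windows-1252", which is common but not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {
    // Step 1: decode the bytes with the encoding they were written in.
    // Step 2: encode the resulting String in whatever encoding is wanted.
    static byte[] cp1252ToUtf8(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xF6 is the o with diaeresis in windows-1252
        byte[] utf8 = cp1252ToUtf8(new byte[] {(byte) 0xF6});
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The String in the middle has no "codepage" of its own; the encoding only matters at the byte boundaries on either side.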

If your strings appear corrupted when you look at them in Java, it is probably because the wrong encoding was used to read the data. Excel is probably using UTF-16 (little-endian, I think), but I'd expect a library like JXL to be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't appear to do anything with character encodings. I imagine it auto-detects any encodings as it needs to.

Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

String text = getCP1252Text(); // the original encoding doesn't matter; a Java String is always Unicode
// try-with-resources closes the writer and the underlying streams, even on failure
try (PrintWriter pw = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-16"))) { // specify the output encoding
    pw.print(text); // repeat as needed
}

If your problem is something else please edit your question and provide more details.

Answered by vartec

FileInputStream fis = new FileInputStream(yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "CP1250"));

Then do with the reader whatever you would do directly with the file.
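A self-contained sketch of the reader approach, under a few assumptions: it writes a small temporary file containing windows-1252 bytes and reads it back with an explicit charset name. It uses "windows-1252" (the codepage from the question) rather than CP1250, and the names ReaderSketch, readFirstLine and demo are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReaderSketch {
    // Read the first line of a file, decoding its bytes with the named charset.
    static String readFirstLine(Path file, String charsetName) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charsetName))) {
            return reader.readLine();
        }
    }

    // Write "hej " followed by byte 0xF6 to a temp file, then read it back as windows-1252.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("cp1252-sample", ".txt");
            try {
                Files.write(tmp, new byte[] {'h', 'e', 'j', ' ', (byte) 0xF6});
                return readFirstLine(tmp, "windows-1252");
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo().equals("hej \u00F6"));
    }
}
```

This applies to plain text files; a binary XLS file still needs a spreadsheet library to parse it.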

Answered by Tom Hawtin - tackline

"windows-1252"/"Cp1252" is not required to be supported by JREs, but it is supported by Sun's (and presumably most others). See "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
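Whether a given JRE supports the charset can also be checked at runtime. The following is a minimal sketch using java.nio.charset.Charset; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetSupportSketch {
    // Returns true if this JRE can provide a charset with the given name or alias.
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that are not legal in a charset name
        }
    }

    public static void main(String[] args) {
        System.out.println("windows-1252 supported: " + isAvailable("windows-1252"));
        System.out.println("Cp1252 supported: " + isAvailable("Cp1252"));
        if (isAvailable("Cp1252")) {
            // Print the canonical name this JRE uses for the alias.
            System.out.println("canonical name: " + Charset.forName("Cp1252").name());
        }
    }
}
```

Only a handful of charsets such as UTF-8, UTF-16, ISO-8859-1 and US-ASCII are guaranteed on every Java platform, so a check like this is a reasonable precaution before relying on a Windows codepage.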

Answered by Seth

Your description indicates that the encoding is UTF-8, and indeed C3 B6 is the UTF-8 encoding for 'ö'.
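That observation is easy to check by dumping the bytes of 'ö' (U+00F6) in a few encodings. This is a small sketch with hypothetical names (EncodingHexSketch, hex):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingHexSketch {
    // Encode the string with the given charset and format the bytes as space-separated hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00F6";
        System.out.println("UTF-8:      " + hex(oUmlaut, StandardCharsets.UTF_8));      // C3 B6
        System.out.println("ISO-8859-1: " + hex(oUmlaut, StandardCharsets.ISO_8859_1)); // F6
        System.out.println("UTF-16BE:   " + hex(oUmlaut, StandardCharsets.UTF_16BE));   // 00 F6
    }
}
```

Only the UTF-8 line matches the two bytes seen in hello.txt; a single-byte encoding such as Latin-1 would have produced one byte, F6.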