Convert from Codepage 1252 (Windows) to Java, in Java

Question

Asked by Jakob Eriksson

I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.

To clarify: the strings obtained from the Excel file apparently already look like some kind of Unicode.

WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
Cell[] row = s.getRow(4);
String contents = row[0].getContents();

This is where contents seems to contain something Unicode: the non-ASCII characters are multibyte, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented by two bytes, C3 B6 in hex (195 and 182 in decimal).

[edit]

I have tried the suggestions given below with different codepages, such as converting from Cp1252. Some kind of conversion did happen, because I got a different kind of gibberish instead. As a reference I always printed an "ö" string hand-coded into the source code, to verify that nothing was wrong with my terminal or typefaces. The manually typed "ö" always worked.

[edit]

I also tried WorkbookSettings as suggested in the comments, but I looked in the JXL code and characterSet seems to be ignored by the parsing code. I think the parsing code just uses whatever encoding the XLS file is supposed to be in.

Answer 1

Answered by

WorkbookSettings ws = new WorkbookSettings();

ws.setEncoding("CP1250");

Worked for me.

Answer 2

Answered by lxndr

If none of the answers above solves the problem, this might do the trick:

String myOutput = new String(myInput, "UTF-8");

This decodes the incoming bytes as UTF-8, so it assumes myInput is a byte[] that really is UTF-8 encoded; it will not repair data in some other encoding.
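Whether a given byte array really is UTF-8 can be checked before trusting the result. The following is only a sketch: the class name StrictUtf8 and its helper methods are made up for illustration, and it relies solely on the standard java.nio.charset API. A strict decoder reports malformed input instead of silently substituting replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JXL or the JDK.
class StrictUtf8 {
    // Decodes bytes as UTF-8 and throws if the input is not valid UTF-8.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Returns true only if the bytes form a valid UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xB6}; // UTF-8 encoding of U+00F6
        byte[] cp1252 = {(byte) 0xF6};            // windows-1252 encoding of U+00F6
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(cp1252));
    }
}
```

A lone F6 byte is not a valid UTF-8 sequence, so data that is really windows-1252 will usually fail this check as soon as it contains a non-ASCII character.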

Answer 3

Answered by Michael Borgwardt

You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.

JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
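To illustrate why the encoding has to be right at parse time, here is a small sketch using only the JDK (no JXL involved; the class name MojibakeDemo is hypothetical). It decodes the same two bytes once as UTF-8 and once as windows-1252, and assumes the JRE supports windows-1252, which common JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing what a wrong decoding step produces.
class MojibakeDemo {
    // Decodes the given bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // UTF-8 bytes of U+00F6 (o with diaeresis): C3 B6
        byte[] utf8 = "\u00F6".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, "UTF-8");        // one char: U+00F6
        String wrong = decode(utf8, "windows-1252"); // two chars: U+00C3 U+00B6

        System.out.println(right.length());
        System.out.println(wrong.length());
    }
}
```

The wrongly decoded String is a perfectly valid Java String; it just holds the wrong characters, which is why the fix belongs at the point where the bytes are decoded.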

Answer 4

Answered by u7867

When Java parses a file it uses some encoding to read the bytes on the disk and create characters in memory. The default encoding varies from platform to platform. Java's internal String representation is already Unicode, so if the file is parsed with the right encoding you are done; just write out the data in any encoding you want.

If your strings appear corrupted when you look at them in Java, it is probably because the wrong encoding was used to read the data. Excel is probably using UTF-16 (little-endian, I think), but I'd expect a library like JXL to be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.

Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:

// Needs java.io.FileOutputStream, java.io.OutputStreamWriter and java.io.PrintWriter
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
try (PrintWriter pw = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-16"))) { // specify character encoding
    pw.print(text); // repeat as needed
} // closing the PrintWriter also closes the wrapped streams

If your problem is something else, please edit your question and provide more details.

Answer 5

Answered by vartec

FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));

Then do with the reader whatever you would have done with the file directly.

Answer 6

Answered by Tom Hawtin - tackline

"windows-1252"/"Cp1252" is not required to be supported by JREs, but it is supported by Sun's (and presumably most others). See "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
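A minimal sketch of that approach, assuming the raw windows-1252 bytes are available as a byte[] (the class name Cp1252Decode and the method decodeCp1252 are made up for this example). It first checks that the charset is available, since support is optional:

```java
import java.nio.charset.Charset;

// Hypothetical helper; uses only the standard java.nio.charset API.
class Cp1252Decode {
    // Decodes windows-1252 bytes into a Java String (UTF-16 internally).
    static String decodeCp1252(byte[] raw) {
        if (!Charset.isSupported("windows-1252")) {
            throw new IllegalStateException("windows-1252 is not supported by this JRE");
        }
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // F6 is U+00F6 and 80 is the euro sign (U+20AC) in windows-1252
        byte[] raw = {(byte) 0xF6, (byte) 0x80};
        String text = decodeCp1252(raw);
        System.out.println(text.length());
    }
}
```

The euro sign at byte 80 is one of the places where windows-1252 differs from ISO-8859-1, so it makes a handy check that the intended charset is really in use.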

Answer 7

Answered by Seth

Your description indicates that the encoding is UTF-8, and indeed C3 B6 is the UTF-8 encoding of 'ö'.
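This is easy to verify by dumping the bytes of the character under a few encodings. The snippet below is just an illustrative sketch (the class HexDump and its hex method are hypothetical names); it writes the letter as the escape \u00F6 so the result does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting how a String is encoded.
class HexDump {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00F6";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));            // C3 B6
        System.out.println(hex(s.getBytes(Charset.forName("windows-1252")))); // F6
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));         // F6 00
    }
}
```

Seeing C3 B6 in the redirected output therefore suggests the String already held the correct character and was simply written out as UTF-8.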
