How to convert a string of Russian Cyrillic letters?
Original question: http://stackoverflow.com/questions/6017004/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Mediator
I am parsing MP3 tags.
String artist
- I do not know what encoding it was in
???í? ?e? íà????ó
- an example value; the Russian text should be "Песня про надежду"
I use http://code.google.com/p/juniversalchardet/
Code:
String GetEncoding(String text) throws IOException {
    byte[] buf = new byte[4096];
    InputStream fis = new ByteArrayInputStream(text.getBytes());
    UniversalDetector detector = new UniversalDetector(null);
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding;
}
And convert:
new String(text.getBytes(encoding), "cp1251");
- but this does not work.
If I use UTF-16:
new String(text.getBytes("UTF-16"), "cp1251")
returns "юя П е с н я п р о н а д е ж д у", where the gaps between the letters are not actual space characters.
EDIT:
This is where I first read the bytes:
byte[] abyFrameData = new byte[iTagSize];
oID3DIS.readFully(abyFrameData);
ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);
String s = new String(abyFrameData, "????");
Accepted answer by McDowell
Java strings are UTF-16. All other encodings can be represented using byte sequences. To decode character data, you must provide the encoding when you first create the string. If you have a corrupted string, it is already too late.
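To illustrate why the charset has to be right at the moment the bytes become a String, here is a minimal sketch. It assumes the tag bytes are windows-1251 (cp1251); the class name DecodeOnce and the helper decode are made up for this example and are not part of any library. Decoding the same bytes as US-ASCII replaces every Cyrillic byte with U+FFFD, and the original text cannot be recovered from that String afterwards.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeOnce {
    // Assumption for this sketch: the raw tag bytes are windows-1251.
    static final Charset CP1251 = Charset.forName("windows-1251");

    /** Hypothetical helper: decode raw bytes with an explicitly chosen charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String text = "\u041F\u0435\u0441\u043D\u044F"; // "Песня", escaped to be source-encoding safe
        byte[] raw = text.getBytes(CP1251);             // stands in for bytes read from the file

        String good = decode(raw, CP1251);                   // correct charset
        String bad = decode(raw, StandardCharsets.US_ASCII); // wrong charset, lossy

        System.out.println(good.equals(text));
        // Every byte >= 0x80 became U+FFFD, so the original bytes are gone.
        System.out.println(bad.indexOf('\uFFFD') >= 0);
        System.out.println(Arrays.equals(bad.getBytes(StandardCharsets.US_ASCII), raw));
    }
}
```

The practical consequence is to keep the byte[] around until the encoding is known, rather than creating a String first and trying to fix it later.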
Assuming ID3, the specifications define the rules for encoding. For example, ID3v2.4.0 might restrict the encodings used via an extended header:
q - Text encoding restrictions
    0 No restrictions
    1 Strings are only encoded with ISO-8859-1 [ISO-8859-1] or UTF-8 [UTF-8].
Encoding handling is defined further down the document:
If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said newline character is forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.
Frames that allow different types of text encoding contains a text encoding description byte. Possible encodings:
    $00 ISO-8859-1 [ISO-8859-1]. Terminated with $00.
    $01 UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All strings in the same frame SHALL have the same byteorder. Terminated with $00 00.
    $02 UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM. Terminated with $00 00.
    $03 UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.
Use transcoding classes like InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also Java: a rough guide to character encoding.
Parsing the string components of an ID3v2.4.0 data structure would look something like this:
    //untested code
    public String parseID3String(DataInputStream in) throws IOException {
        // The array index is the value of the text encoding description byte ($00 - $03)
        String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
        String encoding = encodings[in.read()];
        // UTF-16 variants are terminated with $00 00, the others with $00
        byte[] terminator = encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
        byte[] buf = terminator.clone();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        while (true) {
            in.readFully(buf);
            if (Arrays.equals(terminator, buf)) {
                break; // stop at the terminator and leave it out of the result
            }
            buffer.write(buf);
        }
        return new String(buffer.toByteArray(), encoding);
    }
Answer by Nik
This works for me:
    byte[] bytes = s.getBytes("ISO-8859-1");
    UniversalDetector encDetector = new UniversalDetector(null);
    encDetector.handleData(bytes, 0, bytes.length);
    encDetector.dataEnd();
    String encoding = encDetector.getDetectedCharset();
    if (encoding != null) s = new String(bytes, encoding);
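The reason this approach can work is that ISO-8859-1 maps every byte value 0x00 to 0xFF to exactly one character, so a String that was wrongly decoded as ISO-8859-1 can be turned back into the original bytes without loss. The sketch below shows only that round trip using the standard library. It does not call juniversalchardet; instead it assumes the real encoding is already known to be windows-1251. The names MojibakeRepair and redecode are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    /**
     * Hypothetical helper: re-interprets a String that was decoded as
     * ISO-8859-1 although its bytes were really in another charset.
     */
    static String redecode(String garbled, Charset actual) {
        // ISO-8859-1 is a 1:1 byte-to-char mapping, so this restores the raw bytes.
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251"); // assumed real encoding
        String text = "\u041F\u0435\u0441\u043D\u044F";   // "Песня"

        // Simulate the bug: cp1251 bytes decoded as ISO-8859-1.
        String garbled = new String(text.getBytes(cp1251), StandardCharsets.ISO_8859_1);

        System.out.println(redecode(garbled, cp1251).equals(text));
    }
}
```

This only helps when the wrong decoding step really was ISO-8859-1. If the String was produced with a charset that cannot represent every byte, some bytes may already have been replaced and the round trip will not restore them.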