Java: how do I convert a string of Russian Cyrillic letters?

Question

Asked by Mediator

I am parsing MP3 tags.

I have a String artist, but I do not know which encoding it was in.

For example, I get ???í? ?e? íà????ó where the Russian text should be "Песня про надежду".

I use http://code.google.com/p/juniversalchardet/

Code:

String GetEncoding(String text) throws IOException {
    byte[] buf = new byte[4096];
    InputStream fis = new ByteArrayInputStream(text.getBytes());
    UniversalDetector detector = new UniversalDetector(null);

    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding;
}

And then I convert:

new String(text.getBytes(encoding), "cp1251"); but this does not work.

If I use UTF-16:

new String(text.getBytes("UTF-16"), "cp1251") returns "юя П е с н я п р о н а д е ж д у", where the gaps between the letters are not actual space characters.

EDIT:

This is where the bytes are first read:

byte[] abyFrameData = new byte[iTagSize];
oID3DIS.readFully(abyFrameData);
ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);

String s = new String(abyFrameData, "????");

Answer 1

Accepted answer by McDowell

Java strings are UTF-16. All other encodings can be represented using byte sequences. To decode character data, you must provide the encoding when you first create the string. If you have a corrupted string, it is already too late.
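
To illustrate why the charset has to be supplied when the string is first created, here is a minimal sketch. The class name DecodeAtCreation and the helper decode are hypothetical, and the sketch assumes the tag bytes are windows-1251 (cp1251) and that the JRE provides that charset. The title from the question is written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodeAtCreation {
    // Assumption: the runtime provides windows-1251 (standard JDK builds do).
    static final Charset CP1251 = Charset.forName("windows-1251");

    // The example title from the question, written with Unicode escapes.
    static final String TITLE =
        "\u041F\u0435\u0441\u043D\u044F \u043F\u0440\u043E "
        + "\u043D\u0430\u0434\u0435\u0436\u0434\u0443";

    /** Decodes raw bytes with an explicitly chosen charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from a file: the title encoded as windows-1251.
        byte[] raw = TITLE.getBytes(CP1251);

        // Correct charset at creation time: the text survives.
        String good = decode(raw, CP1251);
        System.out.println(good.equals(TITLE));

        // Wrong charset that cannot represent the bytes: each one becomes U+FFFD.
        String lossy = decode(raw, StandardCharsets.US_ASCII);

        // Re-encoding the lossy string does not give the original bytes back.
        byte[] reencoded = lossy.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(raw, reencoded));
    }
}
```

Once the bytes have been replaced during decoding, no later call to getBytes can restore them, which is why the fix belongs at the point where the byte array is turned into a String.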

Assuming ID3, the specifications define the rules for encoding. For example, ID3v2.4.0 might restrict the encodings used via an extended header:

q - Text encoding restrictions

   0    No restrictions
   1    Strings are only encoded with ISO-8859-1 [ISO-8859-1] or
        UTF-8 [UTF-8].

Encoding handling is defined further down the document:

If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said newline character is forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.

Frames that allow different types of text encoding contains a text encoding description byte. Possible encodings:

  $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
  $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
        strings in the same frame SHALL have the same byteorder.
        Terminated with $00 00.
  $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
        Terminated with $00 00.
  $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with
        $00.

Use transcoding classes like InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also Java: a rough guide to character encoding.

Parsing the string components of an ID3v2.4.0 data structure would look something like this:

// untested code
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

public String parseID3String(DataInputStream in) throws IOException {
  // Index in this array matches the text encoding description byte ($00 - $03)
  String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
  String encoding = encodings[in.read()];
  // UTF-16 variants end with two zero bytes, the others with one
  byte[] terminator =
      encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
  byte[] buf = terminator.clone();
  ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  while (true) {
    in.readFully(buf);
    if (Arrays.equals(terminator, buf)) break; // keep the terminator out of the result
    buffer.write(buf);
  }
  return new String(buffer.toByteArray(), encoding);
}

Answer 2

Answered by Nik

This works for me:

import org.mozilla.universalchardet.UniversalDetector;

// Recover the raw bytes of a string that was decoded as ISO-8859-1
byte[] bytes = s.getBytes("ISO-8859-1");
// Let juniversalchardet guess the real encoding of those bytes
UniversalDetector encDetector = new UniversalDetector(null);
encDetector.handleData(bytes, 0, bytes.length);
encDetector.dataEnd();
String encoding = encDetector.getDetectedCharset();
// Re-decode only if the detector reached a conclusion
if (encoding != null) s = new String(bytes, encoding);
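
Re-encoding a string as ISO-8859-1 recovers the bytes it was decoded from, because ISO-8859-1 maps every byte value to exactly one character. That only helps when the string really was decoded as ISO-8859-1 (the ID3 default) and no characters were replaced along the way. The following is a minimal sketch of that round trip with the detection step left out: the class name Latin1RoundTrip and the helper redecode are hypothetical, and the real charset is assumed to be windows-1251.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Assumption: the tag bytes are really windows-1251 and the JRE provides it.
    static final Charset CP1251 = Charset.forName("windows-1251");

    /**
     * Takes a string that was wrongly decoded as ISO-8859-1, recovers the
     * original bytes, and decodes them again with the actual charset.
     */
    static String redecode(String misdecoded, Charset actual) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // The example title from the question, written with Unicode escapes.
        String title =
            "\u041F\u0435\u0441\u043D\u044F \u043F\u0440\u043E "
            + "\u043D\u0430\u0434\u0435\u0436\u0434\u0443";

        // Simulate a tag reader that decodes windows-1251 bytes as ISO-8859-1.
        String misdecoded =
            new String(title.getBytes(CP1251), StandardCharsets.ISO_8859_1);

        // The round trip restores the Cyrillic text.
        System.out.println(redecode(misdecoded, CP1251).equals(title));
    }
}
```

If the string already contains literal question marks in place of letters, the original bytes are gone and this round trip cannot bring them back.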
