Note: this page is a bilingual (Chinese/English) translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/12717363/

Date: 2020-09-10 04:32:02 · Source: igfitidea

How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?

Tags: oracle, unicode, utf-8, character-encoding

Asked by theory

For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database (where the character encoding is set to AL32UTF8) that contains some non-UTF-8 values. The values are always in one of these character sets:

  • US-ASCII
  • UTF-8
  • CP1252
  • Latin-1

I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.
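
The Perl function itself is not shown in the question, so the following is only a minimal Python sketch of the loop described above, not the author's code. The name `to_text` and the candidate order are assumptions for illustration; US-ASCII is left out because every ASCII string is already valid UTF-8.

```python
from typing import Optional, Sequence

# Candidate encodings, strictest first (hypothetical order, mirroring the list above).
CANDIDATES: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def to_text(raw: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Decode raw with the first encoding that accepts it; return None if none does."""
    for name in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError instead of substituting characters.
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None


if __name__ == "__main__":
    print(to_text(b"caf\xc3\xa9"))  # already valid UTF-8
    print(to_text(b"caf\xe9"))      # invalid as UTF-8, so a later candidate is used
```

The key point is that the decoder must signal failure (here by raising an exception) for the fallback to work; a converter that silently substitutes a replacement character, as described below for CONVERT, gives the loop nothing to react to.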

However, all I can find for this is the CONVERT function, which never fails but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.

Therefore, I have two questions:

  1. Is there an existing interface that tries to convert a string using each of a list of encodings, returning the first conversion that succeeds?
  2. If not, is there some other interface that indicates failure when it cannot convert a string using a given encoding? If so, I could write the function from question 1 myself.


UPDATE:

For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:

CREATE OR REPLACE FUNCTION encoding_utf8(
    bytea
) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
    encoding TEXT;
BEGIN
    -- Try each candidate encoding in order; the first clean conversion wins.
    FOREACH encoding IN ARRAY ARRAY[
        'UTF8',
        'WIN1252',
        'LATIN1'
    ] LOOP
        BEGIN
            RETURN convert_from($1, encoding);
        EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
            -- This encoding rejected the bytes; move on to the next one.
            CONTINUE;
        END;
    END LOOP;
END;
$$;

I'd dearly love to know how to do the equivalent in Oracle.

Accepted answer by theory

Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:

CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
    encoded VARCHAR2(32767);
    type  array_t IS varray(3) OF VARCHAR2(15);
    array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
    -- Try each candidate character set in order.
    FOR i IN 1..array.count LOOP
        -- The database charset is already AL32UTF8, so that candidate is used as-is.
        encoded := CASE array(i)
            WHEN 'AL32UTF8' THEN string
            ELSE CONVERT(string, 'AL32UTF8', array(i))
        END;
        -- Accept the candidate unless its bytes contain EF BF BD (U+FFFD in UTF-8).
        IF instr(
            rawtohex(
                utl_raw.cast_to_raw(
                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
                )
            ),
            'EFBFBD'
        ) = 0 THEN
            RETURN encoded;
        END IF;
    END LOOP;
    -- No candidate produced a clean value.
    RAISE VALUE_ERROR;
END;

Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.
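
One possible explanation for the difference, offered here as an assumption and not confirmed anywhere in this thread, is that stricter CP1252 codecs leave a handful of byte values undefined while Oracle's WE8MSWIN1252 accepts every byte. The sketch below only demonstrates the strict side, using Python's `cp1252` codec as a stand-in for the Perl and PostgreSQL behavior; the helper name `undecodable_bytes` is made up for illustration.

```python
from typing import List


def undecodable_bytes(encoding: str) -> List[int]:
    """Return every single-byte value that the strict codec refuses to decode."""
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


if __name__ == "__main__":
    for name in ("cp1252", "latin-1"):
        # Print the rejected byte values in hex for each codec.
        print(name, [f"0x{value:02X}" for value in undecodable_bytes(name)])
```

In Python, `cp1252` rejects five byte values and `latin-1` rejects none, which matches the pattern described above where CP1252 fails for some values but ISO-8859-1 succeeds.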

Answer by collapsar

to check whether your database column contains invalid utf-8 use the following query:

 select CASE
            INSTR (
                  RAWTOHEX (
                      utl_raw.cast_to_raw (
                          utl_i18n.raw_to_char (
                                utl_raw.cast_to_raw ( <your_column> )
                              , 'utf8'
                          )
                      )
                  )
                , 'EFBFBD'
            )
        WHEN 0 THEN 'OK'
        ELSE 'FAIL' 
        END
   from <your_table>
      ;

given that your db charset is al32utf8.

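note that EF BF BD is the utf-8 encoding of U+FFFD, the unicode replacement character; finding it in the converted value indicates that the original contained an illegal utf-8 encoding.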
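
The same check can be illustrated outside the database. This is a minimal Python sketch, not Oracle code: it assumes a decoder that substitutes U+FFFD for invalid input (Python's `errors="replace"`), and the helper name `has_invalid_utf8` is hypothetical.

```python
# UTF-8 bytes of U+FFFD, the Unicode replacement character.
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")


def has_invalid_utf8(raw: bytes) -> bool:
    """Decode leniently, re-encode, and look for the replacement character's bytes."""
    decoded = raw.decode("utf-8", errors="replace")
    return REPLACEMENT_BYTES in decoded.encode("utf-8")


if __name__ == "__main__":
    print(REPLACEMENT_BYTES.hex().upper())
    print(has_invalid_utf8("caf\u00e9".encode("utf-8")))
    print(has_invalid_utf8(b"caf\xe9"))
```

One caveat of this style of check: a value that legitimately contains U+FFFD is flagged as well, because the marker is indistinguishable from a substituted one.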

as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.
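
a small python sketch of that ambiguity (illustrative only; the helper name `code_points` is made up): the same two bytes decode without error under both single-byte charsets, but to different code points, so success alone does not identify the source charset.

```python
from typing import List


def code_points(raw: bytes, encoding: str) -> List[str]:
    """Decode raw with the given charset and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]


if __name__ == "__main__":
    raw = b"\x80\x93"
    for name in ("cp1252", "latin-1"):
        # Both decodes succeed; only the resulting characters differ.
        print(name, code_points(raw, name))
```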

best regards, carsten

ps: oracle names for charsets: CP1252 -> WE8MSWIN1252, LATIN-1 -> WE8ISO8859P1