Original question: http://stackoverflow.com/questions/12717363/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?
Asked by theory
For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database (where the character encoding is set to AL32UTF8) that contains some non-UTF-8 values. The values are always in one of these character sets:
- US-ASCII
- UTF-8
- CP1252
- Latin-1
I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.
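The fallback loop described above can be sketched as follows. This is not the original Perl function, which is not shown in the question; it is a minimal Python illustration of the same idea. The name `to_utf8` and the tuple `CANDIDATE_ENCODINGS` are made up for this sketch, and the encodings are Python codec names, not Oracle charset names. US-ASCII is left out because every ASCII value is already valid UTF-8.

```python
# Hypothetical helper for illustration only; names are not from the question.
# Python codec names stand in for UTF-8, CP1252 and Latin-1.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def to_utf8(raw: bytes, encodings=CANDIDATE_ENCODINGS) -> str:
    """Return the text from the first encoding that decodes raw without error."""
    for name in encodings:
        try:
            # Strict decoding raises instead of inserting a replacement character.
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("no candidate encoding could decode the value")
```

The order matters: UTF-8 goes first because it is the strictest, and Latin-1 goes last because it accepts every byte and so acts as a catch-all.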
However, all I can find for this is the CONVERT function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.
Therefore, I have two questions:
- Is there an existing interface that tries to convert a string using each of a list of encodings, returning the first that succeeds?
- If not, is there some other interface that indicates failure when it cannot convert a string using a given encoding? If so, I could write the function described above myself.
UPDATE:
For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:
CREATE OR REPLACE FUNCTION encoding_utf8(
    bytea
) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
    encoding TEXT;
BEGIN
    -- Try each candidate encoding in order; the first clean conversion wins.
    FOREACH encoding IN ARRAY ARRAY[
        'UTF8',
        'WIN1252',
        'LATIN1'
    ] LOOP
        BEGIN
            RETURN convert_from($1, encoding);
        EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
            -- Conversion failed for this encoding; move on to the next one.
            CONTINUE;
        END;
    END LOOP;
END;
$$;
I'd dearly love to know how to do the equivalent in Oracle.
Accepted answer by theory
Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:
CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
    encoded VARCHAR2(32767);
    TYPE array_t IS VARRAY(3) OF VARCHAR2(15);
    -- Candidate source encodings, tried in order.
    array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
    FOR i IN 1..array.count LOOP
        -- The database charset is already AL32UTF8, so only convert the others.
        encoded := CASE array(i)
            WHEN 'AL32UTF8' THEN string
            ELSE CONVERT(string, 'AL32UTF8', array(i))
        END;
        -- Accept the result only if it contains no U+FFFD (EF BF BD in UTF-8).
        IF instr(
            rawtohex(
                utl_raw.cast_to_raw(
                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
                )
            ),
            'EFBFBD'
        ) = 0 THEN
            RETURN encoded;
        END IF;
    END LOOP;
    -- No candidate encoding produced clean UTF-8.
    RAISE VALUE_ERROR;
END;
Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.
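One plausible reason for the difference, offered as a guess and not verified against Oracle: some CP1252 implementations leave a handful of byte values undefined and reject them, while others map them to control characters and so never fail. Python's `cp1252` codec is of the strict kind, so it can be used to list the bytes in question. The helper name `undefined_bytes` is made up for this sketch, and whether WE8MSWIN1252 really maps these bytes to something is an assumption.

```python
def undefined_bytes(codec: str) -> list[int]:
    """List the single-byte values that the given codec refuses to decode."""
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


# In Python's tables, CP1252 has five undefined bytes; Latin-1 has none.
print([hex(b) for b in undefined_bytes("cp1252")])
print(undefined_bytes("latin-1"))
```

If the bad values contain any of those five bytes, a strict CP1252 decoder would fall through to ISO-8859-1, which matches what the Perl and PostgreSQL versions do.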
Answer by collapsar
to check whether your database column contains invalid utf-8, use the following query:
select CASE
         INSTR (
             RAWTOHEX (
                 utl_raw.cast_to_raw (
                     utl_i18n.raw_to_char (
                         utl_raw.cast_to_raw ( <your_column> )
                       , 'utf8'
                     )
                 )
             )
           , 'EFBFBD'
         )
         WHEN 0 THEN 'OK'
         ELSE 'FAIL'
       END
  from <your_table>
;
given that your db charset is al32utf8.
note that EF BF BD is the utf-8 encoding of U+FFFD, the unicode replacement character, which is substituted for an illegal byte sequence in utf-8.
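The same check can be sketched outside the database. This is a Python analogue of the idea behind the query, not a description of what `utl_i18n.raw_to_char` does internally; the names `REPLACEMENT_UTF8` and `has_invalid_utf8` are made up for the illustration.

```python
# U+FFFD (REPLACEMENT CHARACTER) encodes to the bytes EF BF BD in UTF-8.
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def has_invalid_utf8(raw: bytes) -> bool:
    """Decode leniently, re-encode, and look for the replacement marker."""
    round_tripped = raw.decode("utf-8", errors="replace").encode("utf-8")
    return REPLACEMENT_UTF8 in round_tripped
```

One caveat applies to this sketch and, presumably, to the query as well: a value that legitimately contains U+FFFD is reported as invalid, because the marker is itself valid UTF-8.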
as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.
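A small illustration of that point, using Python codecs as stand-ins for the Oracle charsets (an assumption; the tables may differ in detail): the same byte decodes without error under both single-byte charsets but yields different code points, so success alone does not tell you which charset was the right one. The helper name `decode_all` is made up for this sketch.

```python
def decode_all(raw: bytes, encodings=("cp1252", "latin-1")) -> dict:
    """Decode raw with every candidate; None marks a codec that rejected it."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Byte 0x80 is the euro sign in CP1252 but the control character U+0080 in Latin-1.
for name, text in decode_all(b"\x80").items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

This is why a first-success loop ends up preferring whichever single-byte charset is listed first.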
best regards, carsten
ps: oracle names for charsets:
CP1252 -> WE8MSWIN1252
LATIN-1 -> WE8ISO8859P1