Java: Remove characters not suitable for UTF-8 encoding from a String
Source: http://stackoverflow.com/questions/27794993/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the source address, and attribute it to the original authors on Stack Overflow.
Asked by Abhi
I have a text area on a website where users can write anything. The problem occurs when a user copies and pastes text that contains non-UTF-8 characters and submits it to the server.
Java handles it successfully, since it supports UTF-16, but my MySQL table supports UTF-8, so the insertion fails.
I was trying to implement something in the business logic itself to remove any characters that are not suitable for UTF-8 encoding.
Currently I am using this code:
new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array());
But it replaces characters not suitable for UTF-8 with other obscure characters, which also does not look good to the end user. Could someone please shed some light on a possible way to tackle this in Java?
EDIT: For example, these are exceptions I got while inserting such values:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A\x0D\x0A...' for column
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x80\xF0\x9F...' for column
Accepted answer by icza
UTF-8 is not a character set; it is a character encoding, just like UTF-16.
UTF-8 can encode any Unicode character and any Unicode text as a sequence of bytes, so there is no such thing as a character not suitable for UTF-8.
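To make that concrete, here is a minimal sketch that encodes U+1F60A, the code point whose UTF-8 bytes appear in the first exception in the question. The class name EncodeDemo and the helper hex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "F0 9F 98 8A".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F60A written as a surrogate pair, so the source file encoding does not matter.
        String smiley = "\uD83D\uDE0A";
        // The same code point in both encodings: 4 bytes each.
        System.out.println(hex(smiley.getBytes(StandardCharsets.UTF_8)));    // F0 9F 98 8A
        System.out.println(hex(smiley.getBytes(StandardCharsets.UTF_16BE))); // D8 3D DE 0A
    }
}
```

The bytes \xF0\x9F\x98\x8A reported by the exception are therefore valid UTF-8; the string itself was encoded correctly.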
You're using the String constructor that takes only a byte array (String(byte[] bytes)), which according to the javadocs:
Constructs a new String by decoding the specified array of bytes using the platform's default charset.
It uses the platform's default charset to interpret the bytes (to convert the bytes to characters). Do not use this. Instead, when converting a byte array to a String, specify the encoding explicitly with the String(byte[] bytes, Charset charset) constructor.
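A minimal sketch of that advice, assuming the goal is simply a UTF-8 round trip; the class and method names are hypothetical. It uses String.getBytes(Charset) instead of Charset.encode(...).array(), because array() returns the buffer's whole backing array, which may be longer than the encoded content.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Encodes to UTF-8 and decodes again, naming the charset at both steps
    // so the platform default is never consulted.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An accented letter plus U+1F60A (as a surrogate pair).
        String original = "caf\u00e9 \uD83D\uDE0A";
        System.out.println(roundTrip(original).equals(original)); // true
    }
}
```

For a valid Java string the round trip returns the same text, so it shows that nothing is lost, but it does not remove anything either.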
If you have issues with certain characters, that is most likely due to using different character sets or encodings on the server side and the client side (browser + HTML). Make sure you use UTF-8 everywhere, do not mix encodings, and do not use the platform's default encoding.
Some reading on how to achieve this:
Answer by Erwin Bolwidt
The problem in your code is that you are calling new String on a byte[]. The result of encode is a ByteBuffer, and the result of array on a ByteBuffer is a byte[].
The constructor new String(byte[]) will use the platform default encoding of your computer; it can be different on each computer you run on, so that's not something you want.
You should at least pass a character set as the second argument to the String constructor, although I'm not sure which character set you would have in mind.
I'm not sure why you're doing it: if your database uses UTF-8, it will do the encoding for you. You just need to pass un-encoded strings into it.
UTF-8 and UTF-16 can both encode the entire Unicode 6 character set; there are no characters that can be encoded by UTF-16 but not by UTF-8. So that part of your question is unfortunately unanswerable.
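The question does not say which MySQL character set the column uses. If it is MySQL's 3-byte utf8 (utf8mb3), which is an assumption here, then code points above U+FFFF need 4 bytes in UTF-8 and cannot be stored, which would match the \xF0\x9F... bytes in the exceptions. Changing the column to utf8mb4 is the usual fix; if that is not possible, a sketch like the following drops such code points before insertion. The class and method names are hypothetical.

```java
class BmpFilter {
    // Returns a copy of s without supplementary code points (above U+FFFF),
    // i.e. the ones that take 4 bytes in UTF-8. Assumes the target column is
    // limited to 3-byte UTF-8 sequences; this is not stated in the question.
    static String stripSupplementary(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !Character.isSupplementaryCodePoint(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "hi " + U+1F60A (as a surrogate pair) + "!"
        System.out.println(stripSupplementary("hi \uD83D\uDE0A!")); // hi !
    }
}
```

Iterating by code point rather than by char keeps surrogate pairs together, so a pair is either kept whole or removed whole.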
For some background:
Answer by gclaussn
Maybe the answer using CharsetDecoder in this question helps. You could change the CodingErrorAction to REPLACE and set a replacement, in my example "?". This will output the given replacement string for invalid byte sequences. In this example, a UTF-8 decoder capability and stress test file is read and decoded:
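import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Decoder that substitutes "?" for malformed or unmappable input
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf8Decoder.replaceWith("?");
// Read stress file
Path path = Paths.get("<path>/UTF-8-test.txt");
byte[] data = Files.readAllBytes(path);
ByteBuffer input = ByteBuffer.wrap(data);
// UTF-8 decoding
CharBuffer output = utf8Decoder.decode(input);
// Char buffer to string
String outputString = output.toString();
System.out.println(outputString);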
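The same decoder setup can be tried without the stress test file. This is a minimal sketch using a hand-made byte array; the class name, the method name, and the sample bytes are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReplaceInvalid {
    // Decodes bytes as UTF-8, substituting "?" for each invalid byte sequence.
    static String decodeLenient(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 0xFF can never appear in valid UTF-8, so it is replaced.
        byte[] data = {'a', (byte) 0xFF, 'b'};
        System.out.println(decodeLenient(data)); // a?b
    }
}
```

Note that this only affects bytes that are not valid UTF-8. A Java String that is already valid, emoji included, passes through unchanged.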
Answer by Vanaja Jayaraman
I think this may be useful to you: "Easy way to remove UTF-8 accents from a string?"
Try using Normalizer like this:
s = Normalizer.normalize(s, Normalizer.Form.NFD);
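On its own, NFD normalization only decomposes accented letters into a base letter plus combining marks. A minimal sketch of the full accent-stripping idea follows, assuming the combining marks should then be removed with a regex; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class StripAccents {
    // Decomposes accented letters (NFD), then removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9 (e with acute accent)
        System.out.println(stripAccents("caf\u00e9")); // cafe
    }
}
```

This removes accents only. It leaves characters such as emoji untouched, so it may not address the insertion error from the question.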