Java: Remove characters not suitable for UTF-8 encoding from a string

Question

Asked by Abhi

I have a text area on a website where users can write anything. A problem occurs when a user copies and pastes text containing non-UTF-8 characters and submits it to the server.

Java handles it successfully, since it supports UTF-16, but my MySQL table supports UTF-8, and thus the insertion fails.

I was trying to implement some way, in the business logic itself, to remove any characters that are not suitable for UTF-8 encoding.

Currently I am using this code:

new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array());

But it replaces characters not suitable for UTF-8 with some other obscure characters, which also does not look good to the end user. Could someone please shed some light on a possible solution to tackle this using Java code?

EDIT: For example, these are exceptions I got while inserting such values:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A\x0D\x0A...' for column

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x80\xF0\x9F...' for column

Answer 1

Accepted answer by icza

UTF-8 is not a character set, it's a character encoding, just like UTF-16.

UTF-8 can encode any Unicode character and any Unicode text as a sequence of bytes, so there is no such thing as a character not suitable for UTF-8.
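
As a small illustration of that claim, the sketch below encodes U+1F60A (a smiley outside the Basic Multilingual Plane) to UTF-8 and decodes it again. The class name `Utf8RoundTrip` and the `hex` helper are made up for this example; the helper only formats bytes in the same `\xNN` style as the error message in the question.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    /** Formats bytes in the \xNN style used by the MySQL error message. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F60A needs a surrogate pair in UTF-16 and four bytes in UTF-8.
        String smiley = new String(Character.toChars(0x1F60A));

        byte[] utf8 = smiley.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // prints \xF0\x9F\x98\x8A

        // Decoding with the same explicit charset gives the original string back.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(smiley)); // prints true
    }
}
```

The four bytes printed here are the same ones that start the first "Incorrect string value" message in the question, so the text itself is valid UTF-8.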

You're using a constructor of String that takes only a byte array (String(byte[] bytes)), which according to the javadocs:

Constructs a new String by decoding the specified array of bytes using the platform's default charset.

It uses the default charset of the platform to interpret the bytes (to convert the bytes to characters). Do not use this. Instead, when converting a byte array to a String, explicitly specify the encoding you wish to use with the String(byte[] bytes, Charset charset) constructor.
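
A minimal sketch of that advice, assuming UTF-8 is the encoding wanted on both sides. The class and method names (`ExplicitCharset`, `toUtf8`, `fromUtf8`) are placeholders for this example, not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    /** Encodes a String to bytes with an explicitly named charset. */
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes with the same explicit charset, never the platform default. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Mixed input: an accented letter and a supplementary character (U+1F60A).
        String original = "h\u00E9llo \uD83D\uDE0A";
        String roundTripped = fromUtf8(toUtf8(original));
        System.out.println(roundTripped.equals(original)); // prints true
    }
}
```

Because both directions name the charset, the result does not depend on the machine the code runs on.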

If you have issues with certain characters, that is most likely due to using different character sets or encodings on the server side and on the client side (browser + HTML). Make sure you use UTF-8 everywhere, do not mix encodings, and do not use the default encoding of the platform.

Some reading on how to achieve this:

How to get UTF-8 working in Java webapps?

Answer 2

Answer by Erwin Bolwidt

The problem in your code is that you are calling new String on a byte[]. The result of encode is a ByteBuffer, and the result of array on a ByteBuffer is a byte[]. The constructor new String(byte[]) will use the platform default encoding of your computer; it can be different on each computer that you run on, so that's not something that you want. You should at least pass in a character set as the second argument to the String constructor, although I'm not sure which character set you would have in mind.

I'm not sure why you're doing it: if your database uses UTF-8, it will do the encoding for you. You just need to pass un-encoded strings into it.

UTF-8 and UTF-16 can both encode the entire Unicode 6 character set; there are no characters that can be encoded by UTF-16 but not by UTF-8. So that part of your question is unfortunately unanswerable.
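
If the practical goal is still to drop the characters that trigger the error, one possible reading is that the column cannot store characters needing four bytes in UTF-8 (the `\xF0\x9F...` sequences in the exception are four-byte sequences). That cause is an assumption here, not something stated in the question; it would apply, for example, to a column using MySQL's three-byte `utf8` (`utf8mb3`) character set. Under that assumption, a rough sketch is to filter out supplementary code points (those above U+FFFF). The class and method names are hypothetical.

```java
final class BmpFilter {
    /**
     * Returns the input without supplementary code points (above U+FFFF),
     * i.e. without the characters that need four bytes in UTF-8.
     */
    static String stripSupplementary(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !Character.isSupplementaryCodePoint(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE0A" is the surrogate pair for U+1F60A.
        System.out.println(stripSupplementary("hi \uD83D\uDE0A!")); // prints hi !
    }
}
```

This silently discards user input, so it is a workaround rather than a fix; storing the text in a column whose character set covers all of Unicode avoids the data loss.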

For some background:

http://unicodebook.readthedocs.org/en/latest/unicode_encodings.html

Answer 3

Answer by gclaussn

Maybe the answer using CharsetDecoder in this question helps. You could change the CodingErrorAction to REPLACE and set a replacement, "?" in my example. This outputs the given replacement string for invalid byte sequences. In this example, a UTF-8 decoder capability and stress test file is read and decoded:

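import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Decoder that substitutes "?" for malformed or unmappable input
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf8Decoder.replaceWith("?");

// Read stress file
Path path = Paths.get("<path>/UTF-8-test.txt");
byte[] data = Files.readAllBytes(path);
ByteBuffer input = ByteBuffer.wrap(data);

// UTF-8 decoding
CharBuffer output = utf8Decoder.decode(input);

// Char buffer to string
String outputString = output.toString();

System.out.println(outputString);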
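
For trying this without the stress test file, here is a self-contained variant of the same idea that decodes an in-memory byte array. The class name `LenientUtf8` and the sample bytes are assumptions made for this sketch; the byte 0xFF is used because it never occurs in valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8 {
    /** Decodes bytes as UTF-8, substituting "?" for invalid byte sequences. */
    static String decode(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares the exception.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {'a', (byte) 0xFF, 'b'}; // 0xFF is malformed in UTF-8
        System.out.println(decode(data)); // prints a?b
    }
}
```

Note that this only affects bytes that are not valid UTF-8; a correctly encoded character such as an emoji passes through unchanged.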

Answer 4

Answer by Vanaja Jayaraman

I think this may be useful to you: "Easy way to remove UTF-8 accents from a string?"

Try using Normalizer like this:

s = Normalizer.normalize(s, Normalizer.Form.NFD);
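
On its own, NFD normalization only splits accented letters into a base letter plus combining marks; it does not remove anything. A common follow-up step, added here as an assumption since the answer shows only the normalize call, is to strip the combining marks afterwards. The class name `AccentStripper` is made up for this sketch.

```java
import java.text.Normalizer;

final class AccentStripper {
    /** Decomposes accented letters (NFD) and then drops the combining marks. */
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00E9")); // prints cafe
    }
}
```

This handles accents only; characters such as emoji have no decomposition and are left as they are.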
