Replacing Unicode control characters in Java

Question

Asked by Cyril Gandon

I need to replace all control characters in a string in Java.

I want to query the Google Maps API v3, and Google doesn't seem to like these characters.

Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F

This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm

I receive some data that I need to geocode. I know that some characters will not pass the geocoding, but I don't know the exact list.

I was not able to find any documentation about this issue, so I think the list of characters that Google doesn't like is this one: http://www.fileformat.info/info/unicode/category/Cc/list.htm

Is there an existing function that gets rid of these characters, or do I have to build a new one that replaces them one by one?

Or is there a good regexp to get the job done?

And does somebody know the exact list of characters Google doesn't like?

Edit: Google has created a webpage for this:

https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs

Answer 1

Answered by polygenelubricants

If you want to delete all characters in Other/Control Unicode category, you can do something like this:

    // The backslash must be doubled inside a Java string literal.
    System.out.println(
        "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
    ); // abcd

Note that this removes (among others) the '\u008f' Unicode character itself from the string, not its escaped form, the string "%8F".
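
To make that distinction concrete, here is a minimal sketch that strips control characters from the raw string before it is URL-encoded. The class and method names (`ControlCharUrlDemo`, `stripControl`, `encodeForQuery`) are made up for illustration, and it assumes the address is encoded as UTF-8 with `java.net.URLEncoder`, which writes spaces as `+` rather than `%20`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class for illustration; not part of the original answer.
class ControlCharUrlDemo {

    // Removes every character in the Unicode Cc (control) category.
    static String stripControl(String raw) {
        return raw.replaceAll("\\p{Cc}", "");
    }

    // Strips control characters first, then percent-encodes the rest as UTF-8.
    static String encodeForQuery(String raw) {
        try {
            return URLEncoder.encode(stripControl(raw), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String raw = "NEW YORK\u008f";
        // Encoding the raw string keeps the control character as escaped bytes.
        System.out.println(URLEncoder.encode(raw, "UTF-8")); // NEW+YORK%C2%8F
        // Stripping before encoding leaves only the visible text.
        System.out.println(encodeForQuery(raw));             // NEW+YORK
    }
}
```

The cleaning has to happen on the decoded string; running the same regex over an already encoded URL would leave `%C2%8F` untouched.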

If the blacklist is not neatly captured by a single Unicode block or category, Java has powerful character class arithmetic, including intersection and subtraction, that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which ones are legal, and everything else becomes illegal.

Examples

Here's a subtraction example:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[a-z&&[^aeiou]]", "_")
    );
    //   _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

[a-z&&[^aeiou]] matches [a-z] minus [aeiou], i.e. all lowercase consonants.

The next example shows the negated whitelist approach:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[^a-z]", "_")
    );
    //   regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; everything else is illegal.
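
The same class arithmetic can be applied back to the control-character problem from the question. As a sketch, suppose you want to drop control characters but keep tabs and line breaks (an assumption made here for illustration, not something the question asks for): the `Cc` category can be intersected with a negated class. `ControlCharFilter` and `stripExceptWhitespace` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original answer.
class ControlCharFilter {

    // The Cc category minus tab, line feed and carriage return.
    private static final Pattern CONTROL_EXCEPT_WHITESPACE =
            Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    // Removes control characters but leaves tabs and line breaks in place.
    static String stripExceptWhitespace(String input) {
        return CONTROL_EXCEPT_WHITESPACE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "a\u0000b\tc\u0007d\ne\u008ff";
        String cleaned = stripExceptWhitespace(input);
        System.out.println(cleaned.equals("ab\tcd\nef")); // true
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.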
