Java: Replace Unicode Control Characters

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3438854/

Date: 2020-10-30 01:52:08  Source: igfitidea

Replace Unicode Control Characters

Tags: java, regex, google-maps, unicode, character-properties

Asked by Cyril Gandon

I need to replace all special control characters in a string in Java.

I want to query the Google Maps API v3, and Google doesn't seem to like these characters.

Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F

This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm

I receive some data that I need to geocode. I know some characters will not pass the geocoding, but I don't know the exact list.

I was not able to find any documentation about this issue, so I assume the list of characters that Google doesn't like is this one: http://www.fileformat.info/info/unicode/category/Cc/list.htm

Is there an existing function to get rid of these characters, or do I have to write my own that replaces them one by one?

Or is there a good regexp to get the job done?

And does anybody know the exact list of characters Google doesn't like?

Edit: Google has created a web page for this:

https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs

Answered by polygenelubricants

If you want to delete all characters in the Other/Control Unicode category (Cc), you can do something like this:

    System.out.println(
        "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
    ); // abcd

Note that this removes (among others) the '\u008f' Unicode character from the string itself, not the escaped form "%8F" in a string.
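To tie this back to the geocoding URL in the question, here is a minimal sketch that strips Cc characters from an address before percent-encoding it. The class and method names (`GeocodeQuery`, `stripControlChars`, `buildUrl`) are hypothetical, and the sketch assumes Java 10 or later for `URLEncoder.encode(String, Charset)`. Whether removing Cc characters is enough to satisfy the geocoder is the asker's assumption, not something confirmed here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative and not part of any Google API.
class GeocodeQuery {
    // Removes every character in the Unicode Cc (control) category.
    static String stripControlChars(String input) {
        return input.replaceAll("\\p{Cc}", "");
    }

    // Cleans the address first, then percent-encodes it as a query parameter.
    static String buildUrl(String address) {
        String cleaned = stripControlChars(address);
        // URLEncoder writes spaces as '+'; convert them to %20 to match the URL style in the question.
        String encoded = URLEncoder.encode(cleaned, StandardCharsets.UTF_8).replace("+", "%20");
        return "http://www.google.com/maps/api/geocode/json?sensor=false&address=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("NEW YORK\u008f"));
    }
}
```

Cleaning before encoding matters: once the string is percent-encoded, the control character is represented by ordinary ASCII characters and `\p{Cc}` no longer matches it.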

If the blacklist is not neatly captured by one Unicode block/category, Java supports powerful character class arithmetic, including intersection and subtraction, that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else then becomes illegal.

API links



Examples

Here's a subtraction example:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[a-z&&[^aeiou]]", "_")
    );
    //   _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

[a-z&&[^aeiou]] matches [a-z] with [aeiou] subtracted, i.e. all lowercase consonants.
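The same subtraction idiom can be applied to the control-character case. The sketch below removes every Cc character except tab, line feed, and carriage return. Which control characters are worth keeping is an assumption made for illustration, and the class name `ControlCharFilter` is hypothetical.

```java
// Hypothetical helper; the set of characters kept is an illustrative assumption.
class ControlCharFilter {
    // Cc characters intersected with "not tab, not LF, not CR",
    // i.e. the Cc category with those three characters subtracted.
    static final String CONTROL_EXCEPT_WHITESPACE = "[\\p{Cc}&&[^\\t\\n\\r]]";

    static String clean(String input) {
        return input.replaceAll(CONTROL_EXCEPT_WHITESPACE, "");
    }

    public static void main(String[] args) {
        // The NUL and BEL characters are removed; the tab is kept.
        System.out.println(clean("a\u0000b\tc\u0007d"));
    }
}
```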

The next example shows the negated whitelist approach:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[^a-z]", "_")
    );
    //   regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; everything else is illegal.
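For address-like data, a negated whitelist could be built from Unicode categories rather than a-z. The following is only a sketch: the allowed set (letters, digits, space, and a few punctuation marks) is an assumption chosen for illustration, not the list of characters Google accepts, and the class name `AddressWhitelist` is hypothetical.

```java
// Hypothetical helper; the whitelist below is an illustrative guess, not Google's rules.
class AddressWhitelist {
    // Anything that is NOT a letter, a digit, a space, or one of , . ' - is removed.
    static final String NOT_ALLOWED = "[^\\p{L}\\p{N} ,.'-]";

    static String clean(String input) {
        return input.replaceAll(NOT_ALLOWED, "");
    }

    public static void main(String[] args) {
        System.out.println(clean("221B Baker St\u008f, London\u0007"));
    }
}
```

A whitelist like this is stricter than removing `\p{Cc}` alone: it also drops symbols and any other character outside the allowed set, so it should be widened if legitimate input is being lost.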