Java: Replace Unicode Control Characters
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/3438854/
Replace Unicode Control Characters
Asked by Cyril Gandon
I need to replace all special control characters in a string in Java.
I want to query the Google Maps API v3, and Google doesn't seem to like these characters.
Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F
This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm
So I receive some data, and I need to geocode it. I know some characters will not pass the geocoding, but I don't know the exact list.
I was not able to find any documentation about this issue, so I assume the list of characters that Google doesn't like is this one: http://www.fileformat.info/info/unicode/category/Cc/list.htm
Is there an existing function to get rid of these characters, or do I have to build a new one that replaces them one by one?
Or is there a good regexp to do the job?
And does somebody know the exact list of characters Google doesn't like?
Edit: Google has created a web page for this:
https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs
Answered by polygenelubricants
If you want to delete all characters in Other/Control Unicode category, you can do something like this:
System.out.println(
    // The backslash must be doubled inside a Java string literal.
    "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
); // abcd
Note that this actually removes (among others) the '\u008f' Unicode character from the string, not the escaped form "%8F".
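Since the address in the question arrives percent-encoded ("NEW%20YORK%C2%8F"), one way to apply this is to decode first, strip the control characters, and encode again. The following is only a sketch under that assumption; the class and method names (GeocodeAddressCleaner, stripControlChars, cleanEncoded) are made up for illustration and are not part of any library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
class GeocodeAddressCleaner {

    // Removes every character in the Unicode Cc (control) category.
    static String stripControlChars(String text) {
        return text.replaceAll("\\p{Cc}", "");
    }

    // Decodes a percent-encoded value, strips control characters,
    // and encodes the cleaned text again.
    static String cleanEncoded(String encoded) {
        try {
            String decoded = URLDecoder.decode(encoded, "UTF-8");
            return URLEncoder.encode(stripControlChars(decoded), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanEncoded("NEW%20YORK%C2%8F")); // NEW+YORK
    }
}
```

Note that URLEncoder produces form encoding, so the space comes back as "+" rather than "%20". If the raw, unencoded text is available, calling stripControlChars on it before building the URL is simpler.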
If the blacklist is not nicely captured by one Unicode block/category, Java does have powerful character class arithmetic featuring union, intersection, subtraction, etc. that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else becomes illegal.
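As a sketch of how that arithmetic could combine with Unicode categories, the snippet below subtracts tab, line feed and carriage return from the control category, and separately unions the control category with the format category (Cf). Which characters to keep or drop is an assumption made for illustration, and the class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only.
class ControlCharFilter {

    // Subtraction: control characters (Cc) minus tab, line feed and carriage return.
    static String stripControlsKeepWhitespace(String text) {
        return text.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
    }

    // Union: control characters (Cc) plus format characters (Cf),
    // such as U+200B ZERO WIDTH SPACE.
    static String stripControlAndFormat(String text) {
        return text.replaceAll("[\\p{Cc}\\p{Cf}]", "");
    }

    public static void main(String[] args) {
        // The tab and the trailing line feed survive; U+0007 and U+008F are removed.
        System.out.println(stripControlsKeepWhitespace("a\tb\u0007c\u008fd\n"));
        // Both the zero width space and U+0000 are removed.
        System.out.println(stripControlAndFormat("a\u200Bb\u0000c")); // abc
    }
}
```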
Examples
Here's a subtraction example:
System.out.println(
"regular expressions: now you have two problems!!"
.replaceAll("[a-z&&[^aeiou]]", "_")
);
// _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!
The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.
[a-z&&[^aeiou]] matches [a-z] minus [aeiou], i.e. all lowercase consonants.
The next example shows the negated whitelist approach:
System.out.println(
"regular expressions: now you have two problems!!"
.replaceAll("[^a-z]", "_")
);
// regular_expressions__now_you_have_two_problems__
Only lowercase letters a-z are legal; everything else is illegal.
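For the geocoding case in the question, a negated whitelist could be built from Unicode categories rather than a-z, so that accented letters survive. This is only a sketch: the set of allowed characters (letters, digits, space and a few punctuation marks) is an assumption, not the list Google actually accepts, and the class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only.
class AddressWhitelist {

    // Negated whitelist: removes anything that is not a letter (\p{L}),
    // a digit (\p{N}), a space, or one of , . ' -
    static String keepOnlyAllowed(String text) {
        return text.replaceAll("[^\\p{L}\\p{N} ,.'-]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepOnlyAllowed("NEW YORK\u008f, NY 10001")); // NEW YORK, NY 10001
        System.out.println(keepOnlyAllowed("Z\u00fcrich\u0000"));        // Zürich
    }
}
```

Unlike the [^a-z] example, this removes the offending characters instead of replacing them with an underscore, which seems the more likely choice for an address string.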

