Attribution: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8340719/

Time: 2020-10-26 03:06:53  Source: igfitidea

Regex to remove non-letter characters but keep accented letters

Tags: javascript, regex, string, diacritics

Asked by devjs11

I have strings in Spanish and other languages that may contain generic special characters like (), *, etc. that I need to remove. The problem is that they may also contain language-specific characters like ñ, á, ó, í, etc., and those need to remain. So I am trying to do it with a regexp in the following way:

var desired = stringToReplace.replace(/[^\w\s]/gi, '');

Unfortunately, it removes all special characters, including the language-specific ones. I am not sure how to avoid that. Maybe someone could suggest a solution?

Accepted answer by Tim Down

I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.

Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/

// Backslashes are doubled because the pattern is passed as a string literal.
var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?";
var replaced = XRegExp.replace(str, regex, "");
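
For comparison, here is a minimal sketch of the same idea without a library. It assumes a JavaScript engine that implements ES2018 Unicode property escapes (the `u` flag with `\p{...}`), which the answers in this thread did not have available. The function name `stripNonLatin` is only illustrative.

```javascript
// Assumption: the runtime supports ES2018 Unicode property escapes ("u" flag).
// Removes every run of characters that is neither whitespace nor in the Latin script.
function stripNonLatin(str) {
  return str.replace(/[^\s\p{Script=Latin}]+/gu, "");
}

console.log(stripNonLatin("¿Me puedes decir la contraseña de la Wi-Fi?"));
// prints "Me puedes decir la contraseña de la WiFi"
```

One caveat: combining marks have the script Inherited, so a decomposed string (a base letter followed by a combining accent) would lose its accents here. Calling `str.normalize("NFC")` first may help in that case.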

See also this answer by Steven Levithan himself:

Regular expression Spanish and Arabic words

Answer by nalply

Note! This works only for 16-bit code points. This answer is incomplete.

Short answer

The character class for all Arabic digits and Latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].

To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only Latin letters and digits, like "mérito" or "Schönheit".

To match non-digit, non-letter characters in order to remove them, write a ^ as the first character after the opening bracket [, then prepend / and append +/ (with the g flag if every occurrence should be replaced).
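
As a sketch of how the two variants could be assembled from that character class: the constant and variable names below are illustrative, and `String.raw` is used only so the `\uXXXX` escapes reach the `RegExp` constructor unchanged.

```javascript
// The body of the character class from the short answer, without the brackets.
const LATIN_AND_DIGITS = String.raw`0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af` +
  String.raw`\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff` +
  String.raw`\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c` +
  String.raw`\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06`;

// Matches strings made up entirely of Latin letters and digits.
const onlyLatin = new RegExp("^[" + LATIN_AND_DIGITS + "]+$");

// Matches runs of everything else, so they can be removed.
const nonLatin = new RegExp("[^" + LATIN_AND_DIGITS + "]+", "g");

console.log(onlyLatin.test("mérito"));                // true
console.log(onlyLatin.test("a*b"));                   // false
console.log("¡Hola, señor!".replace(nonLatin, ""));   // "Holaseñor"
```

Note that the class contains no whitespace, so spaces are removed as well; adding `\s` inside the brackets would keep them.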

How did I find that out? Continue reading.

Long answer: use metaprogramming!

Because JavaScript did not have Unicode-aware regexes when this was written, I wrote a Python program to iterate over the whole 16-bit range of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?

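# Python 3
import re
import sys
import unicodedata

def unicodeNameMatch(pattern, codepoint):
  # re.match anchors at the start, so this tests whether the Unicode name
  # of the code point begins with the pattern (case-insensitive).
  try:
    return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
  except ValueError:
    # unicodedata.name() raises ValueError for code points without a name
    return None

def regexChr(codepoint):
  # Printable ASCII is emitted literally, everything else as a \uXXXX escape
  return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # skip the script name itself
prev = None

js_regex = ""
for codepoint in range(pow(2, 16)):
  if any([unicodeNameMatch(name, codepoint) for name in names]):
    # First code point of a run: emit the start of the range
    if prev is None: js_regex += regexChr(codepoint)
    prev = codepoint
  else:
    # Run has ended: emit the end of the range
    if not prev is None: js_regex += "-" + regexChr(prev)
    prev = None

print("[" + js_regex + "]")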

Invoke it like this: python char_class.py latin digit and you get the character class mentioned above. It's an ugly character class, but you know for sure that you caught all characters whose names start with latin or digit.

Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it is the line

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Try python char_class.py "latin small" and you get a character class for all Latin small letters.

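Edit: There is a small misfeature (aka bug) in that \u271d-\u271d occurs in the regex. Perhaps this fix helps: replace

if not prev is None: js_regex += "-" + regexChr(prev)

by

if not prev is None and prev != codepoint: js_regex += "-" + regexChr(prev)

Note that this check runs in the else branch, where codepoint is the first non-matching code point and so never equals prev. To emit single-character runs without a range, the code point that started the run has to be remembered and compared with prev instead.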
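
A minimal sketch of range building that remembers where each run started, so a single code point is emitted once instead of as a range like \u271d-\u271d. It is written in JavaScript rather than Python, and the names `escapeCodePoint` and `toCharClass` are hypothetical. It assumes the input is a sorted array of distinct 16-bit code points.

```javascript
// Letters and digits are emitted literally, everything else as a \uXXXX escape.
function escapeCodePoint(cp) {
  const ch = String.fromCharCode(cp);
  return /[0-9A-Za-z]/.test(ch) ? ch : "\\u" + cp.toString(16).padStart(4, "0");
}

// Compresses sorted, distinct code points into a regex character class.
function toCharClass(sortedCodePoints) {
  let out = "";
  let start = null; // first code point of the current run
  let prev = null;  // last code point of the current run

  const flush = () => {
    if (start === null) return;
    out += escapeCodePoint(start);
    // Only write the "-end" part when the run holds more than one code point.
    if (prev !== start) out += "-" + escapeCodePoint(prev);
  };

  for (const cp of sortedCodePoints) {
    if (prev !== null && cp === prev + 1) {
      prev = cp; // still inside the same run
      continue;
    }
    flush();     // close the previous run, if any
    start = prev = cp;
  }
  flush();       // close the final run
  return "[" + out + "]";
}

console.log(toCharClass([0x41, 0x42, 0x43, 0xe9, 0x271d]));
// prints [A-C\u00e9\u271d]
```

Flushing once more after the loop also covers a run that extends to the last code point in the input.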

Answer by socha23

Instead of whitelisting characters you accept, you could try blacklisting illegal characters:

var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\/]/gi, '')

Answer by ???

If you insist on whitelisting, here is the rawest way of doing it:

Test if string contains only letters (a-z + é ü ö ê å ø etc..)

It works by keeping track of 'all' unicode letter chars.

Answer by Density 21.5

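// Remove ASCII characters that are neither word characters nor whitespace; non-ASCII characters are kept.
var desired = stringToReplace.replace(/(?![\w\s])[\u0000-\u007F]/g, '');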

might do the trick.

See also this question on JavaScript + Unicode regexes.

Answer by Martin Ender

Unfortunately, JavaScript does not support Unicode character properties (which would be just the right regex feature for you). If changing the language is an option for you, PHP (for example) can do this:

preg_replace("/[^\pL0-9_\s]/u", "", $str);

Here \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified), and the u modifier makes the pattern and subject be treated as UTF-8.
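
For readers on a newer runtime, a rough JavaScript counterpart of the PHP line can be sketched as follows. It assumes ES2018 Unicode property escapes, which this answer did not have available, and the function name `stripNonLetters` is illustrative.

```javascript
// Assumption: the runtime supports ES2018 Unicode property escapes ("u" flag).
// Keeps letters of any script, ASCII digits, underscores and whitespace.
function stripNonLetters(str) {
  return str.replace(/[^\p{L}0-9_\s]/gu, "");
}

console.log(stripNonLetters("¡Hola, señor! Привет 123"));
// prints "Hola señor Привет 123"
```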

If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the only options are probably either blacklisting or whitelisting. But your bounty mentions that blacklisting is not actually an option in your case. So you will probably simply have to include the special characters from your relevant language manually. So you could simply do this:

var desired = stringToReplace.replace(/[^\w\s\u00F1áóí]/gi, '');  // \u00F1 is ñ

Or use their corresponding Unicode sequences:

var desired = stringToReplace.replace(/[^\w\s\u00F1\u00C1\u00F3\u00ED]/gi, '');

Then simply add all the ones you want to take care of. Note that the case-insensitive modifier also works with Unicode sequences.
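
One way to keep such a whitelist manageable is to build the pattern from a string of extra letters. This is only a sketch: the helper name `stripExcept` and its parameters are made up for illustration, and it relies on the case-insensitive flag to cover the uppercase forms.

```javascript
// Removes every character that is not a word character, whitespace,
// or one of the letters listed in extraLetters (matched case-insensitively).
function stripExcept(str, extraLetters) {
  // Escape the characters that are special inside a character class.
  const escaped = extraLetters.replace(/[\\\]^-]/g, "\\$&");
  const pattern = new RegExp("[^\\w\\s" + escaped + "]", "gi");
  return str.replace(pattern, "");
}

console.log(stripExcept("¡El niño (José) comió!", "ñáóíé"));
// prints "El niño José comió"
```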
