How can I strip all punctuation from a string in JavaScript using regex?
Original source: http://stackoverflow.com/questions/4328500/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Asked by Quentin Fisk
If I have a string with any type of non-alphanumeric character in it:
"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"
How would I get a no-punctuation version of it in JavaScript:
"This is an example of a string with punctuation"
Answered by Mike Grace
If you want to remove specific punctuation from a string, it is probably best to explicitly remove exactly what you want, like this:
replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"")
Doing the above still doesn't return the string as you have specified it, because the spaces that surrounded the removed punctuation are left behind. If you want to remove any extra spaces that were left over from removing crazy punctuation, then you are going to want to do something like this:
replace(/\s{2,}/g," ");
My full example:
var s = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var punctuationless = s.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
var finalString = punctuationless.replace(/\s{2,}/g," ");
Running this code in the Firebug console returns the string from the question: "This is an example of a string with punctuation"
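The two steps above can be wrapped in a small helper. This is only a sketch: the name stripPunctuation is made up for illustration, and the trim() call is an addition that is not in the answer (it drops a leading or trailing space left behind when the input starts or ends with punctuation).

```javascript
// Hypothetical helper; the name is illustrative and not from the answer.
function stripPunctuation(s) {
  return s
    .replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "") // remove only the listed punctuation
    .replace(/\s{2,}/g, " ")                     // collapse runs of whitespace
    .trim();                                     // assumption: edge spaces are unwanted
}

console.log(stripPunctuation("This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"));
// prints "This is an example of a string with punctuation"
console.log(stripPunctuation("(hello), world!"));
// prints "hello world"
```

Note that characters missing from the list, such as ? or quotation marks, are left in place, which is the point of listing the punctuation explicitly.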
Answered by John Kugelman
str = str.replace(/[^\w\s]|_/g, "")
.replace(/\s+/g, " ");
Removes everything except alphanumeric characters and whitespace, then collapses multiple adjacent whitespace characters into single spaces.
Detailed explanation:
1. \w is any digit, letter, or underscore.
2. \s is any whitespace.
3. [^\w\s] is anything that's not a digit, letter, whitespace, or underscore.
4. [^\w\s]|_ is the same as #3 except with the underscores added back in.
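To see why the |_ alternative matters, here is a small comparison. The input string and variable names are made up for illustration.

```javascript
var input = "snake_case, kebab-case & more!";

// Without |_, underscores survive because \w includes them.
var keepUnderscore = input.replace(/[^\w\s]/g, "").replace(/\s+/g, " ");

// With |_, underscores are removed along with the other punctuation.
var dropUnderscore = input.replace(/[^\w\s]|_/g, "").replace(/\s+/g, " ");

console.log(keepUnderscore); // prints "snake_case kebabcase more"
console.log(dropUnderscore); // prints "snakecase kebabcase more"
```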
Answered by Joseph
Here are the standard punctuation characters for US-ASCII: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
For Unicode punctuation (such as curly quotes, em-dashes, etc.), you can easily match on specific block ranges. The General Punctuation block is \u2000-\u206F, and the Supplemental Punctuation block is \u2E00-\u2E7F.
Put together, and properly escaped (note the doubled backslash, which matches the literal \ from the ASCII list), you get the following RegExp:
/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/
That should match pretty much any punctuation you encounter. So, to answer the original question:
var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;
var str = "This, -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
str.replace(punctRE, '').replace(spaceRE, ' ');
>> "This is an example of a string with punctuation"
US-ASCII source: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#posix
Unicode source: http://kourge.net/projects/regexp-unicode-block
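As a rough check of the block ranges on non-ASCII input, the sketch below runs the pattern over curly quotes, an em dash, and an ellipsis. The sample string is made up, and the class is written with an escaped backslash (\\) so that a literal backslash is matched too.

```javascript
var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;

// Curly quotes (U+201C, U+201D), an em dash (U+2014) and an ellipsis (U+2026),
// written as escapes so the code points are unambiguous.
var fancy = "\u201CHello\u201D \u2014 world\u2026";
var cleaned = fancy.replace(punctRE, "").replace(spaceRE, " ");

console.log(cleaned); // prints "Hello world"
```

One thing to keep in mind: the General Punctuation block also contains Unicode space characters (U+2000 to U+200A), so those are deleted rather than collapsed into a single space.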
Answered by adnan2nd
/[^A-Za-z0-9\s]/g should match all punctuation but keep the spaces. So you can use .replace(/\s{2,}/g, " ") to replace extra spaces if you need to do so. You can test the regex at http://rubular.com/
.replace(/[^A-Za-z0-9\s]/g,"").replace(/\s{2,}/g, " ")
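Update: This will only work if the input is plain (ANSI) English text, because every character outside A-Z, a-z, 0-9, and whitespace is removed, including accented and non-Latin letters.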
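A quick illustration of that limitation, using made-up sample strings (the accented letters are written as escapes so the code points are unambiguous):

```javascript
var asciiOnly = /[^A-Za-z0-9\s]/g;

var plain = "Hello, world!".replace(asciiOnly, "");
// "Caf\u00E9, na\u00EFve!" is "Café, naïve!"
var accented = "Caf\u00E9, na\u00EFve!".replace(asciiOnly, "");

console.log(plain);    // prints "Hello world"
console.log(accented); // prints "Caf nave" (the accented letters are lost)
```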
Answered by jacobedawson
I ran across the same issue; this solution did the trick and was very readable:
var sentence = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newSen = sentence.match(/[^_\W]+/g).join(' ');
console.log(newSen);
Result:
"This is an example of a string with punctuation"
The trick was to create a negated set, which matches anything that is not within the set; for example, [^abc] matches any character that is not a, b, or c.
\W is any non-word character, so [^\W]+ matches one or more characters that are not non-word characters, that is, word characters.
By adding the _ (underscore) to the set, you exclude underscores as well.
Make it apply globally with the /g flag, and you can run any string through it and clear out the punctuation:
/[^_\W]+/g
Nice and clean ;)
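One caveat with the match-and-join approach: String.prototype.match returns null when nothing matches, so calling .join on the result throws for input that contains only punctuation. A minimal sketch of a guarded version follows; the function name wordsOnly is hypothetical.

```javascript
// Hypothetical wrapper around the match-and-join approach.
function wordsOnly(sentence) {
  // match() returns null when there is no match, so fall back to an empty array.
  var words = sentence.match(/[^_\W]+/g) || [];
  return words.join(" ");
}

console.log(wordsOnly("snake_case: it's 100% done")); // prints "snake case it s 100 done"
console.log(wordsOnly("?!..."));                      // prints "" (an empty string)
```

Unlike the replace-based answers, punctuation inside a word ends up as a space here, because each run of word characters is matched separately and then joined.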
Answered by Shimon Doodkin
I'll just put it here for others.
Match all punctuation characters for all languages:
Constructed from the Unicode punctuation category, with some common keyboard symbols added, such as $, brackets, and \-=_
http://www.fileformat.info/info/unicode/category/Po/list.htm
basic replace:
".test'da, te\"xt".replace(/[\-=_!"#%&'*{},.\/:;?\(\)\[\]@\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g,"")
"testda text"
With \s added, so that whitespace also acts as a separator when splitting:
".da'fla, te\"te".split(/[\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)
With ^ added to invert the pattern, so that it matches the words themselves rather than the punctuation:
".test';the, te\"xt".match(/[^\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)
For a language like Hebrew, you may want to take the single quote ' and the double quote " out of the character class; this needs more thought.
The character class was built using this script:
Step 1: In Firefox, hold Control and select the column of U+1234 numbers, then copy it. Do not copy five-digit values such as U+12456; they end up replacing English characters.
Step 2 (I did this in Chrome): find some textarea and paste the column into it, then right-click it and click Inspect. You can then access the selected element with $0.
var x = $0.value;
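// Parse the pasted "U+XXXX" lines into numeric code points.
var z = x.replace(/U\+/g, "").split(/[\r\n]+/).map(function (a) { return parseInt(a, 16); });
// Build the character-class body; runs of consecutive code points become ranges.
var ret = [];
z.forEach(function (a, k) {
  if (z[k - 1] === a - 1 && z[k + 1] === a + 1) {
    // Interior of a consecutive run: emit a single "-" once.
    if (ret[ret.length - 1] != "-") ret.push("-");
  } else {
    var c = a.toString(16);
    // Backslashes are doubled so the output contains literal \uXXXX text.
    var prefix = c.length < 5 ? "\\u0000" : "\\u000000";
    var uu = prefix.substring(0, prefix.length - c.length) + c;
    // Code points below 0x100 are emitted as the literal character.
    ret.push(c.length < 3 ? String.fromCharCode(a) : uu);
  }
});
ret.join("");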
Step 3: I copied the leading ASCII characters over as separate characters rather than ranges, because someone might want to add or remove individual characters.
Answered by tchrist
In a Unicode-aware language, the Unicode Punctuation character property is \p{P}, which you can usually abbreviate to \pP and sometimes expand to \p{Punctuation} for readability.
Are you using a Perl Compatible Regular Expression library?
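In JavaScript itself, ES2018 added Unicode property escapes, so \p{P} works in a regular expression literal as long as the u flag is set (the \pP shorthand is not accepted). The sketch below assumes an ES2018-capable engine; the function name is hypothetical. Because characters such as $, ^, =, ` and ~ are classified as symbols (S) rather than punctuation (P), the class includes \p{S} as well to reproduce the result asked for in the question.

```javascript
// Requires an ES2018+ engine: \p{...} is only recognized with the u flag.
var punctuationOnly = /\p{P}/gu;
var punctuationAndSymbols = /[\p{P}\p{S}]/gu;

// Hypothetical helper: strip punctuation and symbols, then tidy the whitespace.
function stripUnicodePunctuation(s) {
  return s.replace(punctuationAndSymbols, "").replace(/\s+/g, " ").trim();
}

var sample = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";

console.log(sample.replace(punctuationOnly, "").replace(/\s+/g, " "));
// prints "This is an $ ^ example of a = string with `~ punctuation"
console.log(stripUnicodePunctuation(sample));
// prints "This is an example of a string with punctuation"

// Guillemets and the inverted question mark are covered; accented letters are kept.
// The input below is "«Hola», ¿qué tal?" written with escapes.
console.log(stripUnicodePunctuation("\u00ABHola\u00BB, \u00BFqu\u00E9 tal?"));
```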
Answered by Salvatore
If you want to remove punctuation from any string, you should use the P Unicode class.
However, because Unicode classes were not accepted in JavaScript regular expressions when this answer was written (support arrived in ES2018 with the u flag), you could try this RegEx that should match all the punctuation. It matches the following categories: Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So GeneralPunctuation SupplementalPunctuation CJKSymbolsAndPunctuation CuneiformNumbersAndPunctuation.
I created it using this online tool that generates regular expressions specifically for JavaScript. Here is the code to reach your goal:
"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation".replace( /[^a-zA-Z ]/g, '').replace( /\s\s+/g, ' ' )
Answered by meder omuraliev
For en-US (American English) strings this should suffice:
_.words('This, is : my - test,line:').join(' ')
Be aware that if you need to support UTF-8 text with characters such as Chinese or Russian, this will strip them as well, so you really have to specify what you want.