How to remove invalid UTF-8 characters from a JavaScript string?
Notice: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and authors, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/2670037/
Asked by Matthew Sielski
I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried this:
strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");
It seems that the UTF-8 validation regex described here (link removed) is more complete, so I adapted it in the same way:
strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");
Both of these pieces of code seem to let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters in my test data: UTF-8 decoder capability and stress test. The bad characters either come through unchanged or seem to have some of their bytes removed, creating a new, invalid character.
I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure whether I'm failing to represent proper UTF-8 in the regex or applying that regex improperly in JavaScript.
Edit: added the global flag to my regex per Tomalak's comment - however, this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.
Answered by Ali
I use this simple and sturdy approach:
function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    // keep only code units in the ASCII range 0-127
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    }
  }
  return output;
}
Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. It's pretty robust, and if sanitation is your goal, it's fast enough (in fact it's really fast).
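The same filter can also be written as a single regular expression. This is only a minimal sketch, assuming you really do want to drop everything outside ASCII (which also discards valid characters such as "é"); the name `stripNonAscii` is hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: keep only code units 0-127 and drop everything else.
function stripNonAscii(input) {
  return input.replace(/[^\x00-\x7F]/g, "");
}

console.log(stripNonAscii("caf\u00e9 ok")); // prints "caf ok"
```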
Answered by bobince
JavaScript strings are natively Unicode. They hold character sequences*, not byte sequences, so it is impossible for one to contain an invalid byte sequence.
(*Technically, they actually contain UTF-16 code unit sequences, which is not quite the same thing, but this probably isn't anything you need to worry about right now.)
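Because a string is a sequence of UTF-16 code units, the ill-formed data it can actually hold is an unpaired surrogate rather than a bad UTF-8 byte. If that is what needs cleaning, something like the following sketch may help. The helper name `stripLoneSurrogates` is hypothetical, and the pattern assumes an engine that supports lookbehind (ES2018 or later).

```javascript
// Hypothetical helper: remove UTF-16 surrogates that are not part of a valid pair.
// First alternative: a high surrogate not followed by a low surrogate.
// Second alternative: a low surrogate not preceded by a high surrogate.
function stripLoneSurrogates(input) {
  return input.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    ""
  );
}

console.log(stripLoneSurrogates("a\uD800b") === "ab"); // prints true
```

Well-formed pairs, such as the two code units of an emoji, are left untouched.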
You can, if you need to for some reason, create a string holding characters used as placeholders for bytes, i.e. using the character U+0080 ('\x80') to stand for the byte 0x80. This is what you would get if you encoded characters to bytes using UTF-8, then decoded them back to characters using ISO-8859-1 by mistake. There is a special JavaScript idiom for this:
var bytelike= unescape(encodeURIComponent(characters));
and to get back from UTF-8 pseudobytes to characters again:
var characters= decodeURIComponent(escape(bytelike));
(This is, notably, pretty much the only time the escape/unescape functions should ever be used. Their existence in any other program is almost always a bug.)
decodeURIComponent(escape(bytes)), since it behaves like a UTF-8 decoder, will raise an error if the sequence of code units fed into it would not be acceptable as UTF-8 bytes.
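That error can be turned into a simple validity check. The following is a rough sketch rather than a full validator; the function name `looksLikeValidUtf8` is hypothetical, and it assumes the input is a "bytelike" string with one character per byte.

```javascript
// Hypothetical helper: true if a "bytelike" string (one character per byte,
// each in the range U+0000-U+00FF) is accepted by the decoding idiom above.
function looksLikeValidUtf8(bytelike) {
  try {
    decodeURIComponent(escape(bytelike));
    return true;
  } catch (e) {
    // decodeURIComponent throws a URIError on malformed sequences
    return false;
  }
}

console.log(looksLikeValidUtf8("\xC3\xA9")); // the two UTF-8 bytes of "é": prints true
console.log(looksLikeValidUtf8("\xC3"));     // truncated sequence: prints false
```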
It is very rare for you to need to work on byte strings like this in JavaScript. It is better to keep working natively in Unicode on the client side. The browser will take care of UTF-8-encoding the string on the wire (in a form submission or XMLHttpRequest).
Answered by Tomalak
Simple mistake, big effect:
strTest = strTest.replace(/your regex here/g, "");
// ----------------------------------------^
Without the "global" flag, the replace occurs for the first match only.
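A small illustration of the difference the flag makes. The helper names and the sample string are made up for this sketch.

```javascript
// Hypothetical helpers: the same pattern with and without the global flag.
function removeFirstNonAscii(text) {
  return text.replace(/[^\x00-\x7F]/, "");  // no "g": only the first match is replaced
}

function removeAllNonAscii(text) {
  return text.replace(/[^\x00-\x7F]/g, ""); // "g": every match is replaced
}

const sample = "a\u00e9b\u00e9c"; // "aébéc"
console.log(removeFirstNonAscii(sample)); // prints "abéc"
console.log(removeAllNonAscii(sample));   // prints "abc"
```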
Side note: To remove any character that does not fulfill some kind of complex condition, like falling into a set of certain Unicode character ranges, you can use negative lookahead:
var re = /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})./g;
strTest = strTest.replace(re, "")
where re reads as
(?!      # negative look-ahead: a position *not followed by*:
  […]    # any allowed character range from above
)        # end lookahead
.        # match this character (only if previous condition is met!)
Answered by Dan Mantyla
If you're trying to remove the "invalid character" (the replacement character, U+FFFD) from JavaScript strings, then you can get rid of it like this:
myString = myString.replace(/\uFFFD/g, '')
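For context, U+FFFD is what a lenient UTF-8 decoder substitutes for bytes it cannot decode. The sketch below assumes an environment that provides the standard `TextDecoder` API (modern browsers and recent Node.js); the function name `decodeAndStrip` is hypothetical.

```javascript
// Hypothetical helper: decode bytes leniently, then drop the U+FFFD
// replacement characters that the decoder inserted for invalid input.
function decodeAndStrip(bytes) {
  const decoded = new TextDecoder("utf-8").decode(bytes);
  return decoded.replace(/\uFFFD/g, "");
}

// Bytes for "a", an invalid lone 0xFF, then "b".
const bytes = new Uint8Array([0x61, 0xff, 0x62]);
console.log(decodeAndStrip(bytes)); // prints "ab"
```

Note that this also removes any U+FFFD that was genuinely present in the source text.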
Answered by O'Neill
Languages like Spanish and French have accented characters like "é", whose codes are in the range 160-255; see https://www.ascii.cl/htmlcodes.htm
function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    var code = input.charCodeAt(i);
    // keep ASCII (0-127) and the accented range (160-255)
    if (code <= 127 || (code >= 160 && code <= 255)) {
      output += input.charAt(i);
    }
  }
  return output;
}
Answered by Marcus Pope
I ran into this problem with a really weird result from the Date Taken data of a digital image. My scenario is admittedly unique: I was using Windows Scripting Host (WSH) and the Shell.Application ActiveX object, which allows getting the namespace object of a folder and calling the GetDetailsOf function to essentially return EXIF data after it has been parsed by the OS.
var app = new ActiveXObject("Shell.Application");
var info = app.Namespace("c:\\"); // the backslash must be escaped in a string literal
var date = info.GetDetailsOf(info.ParseName("testimg.jpg"), 12);
In Windows Vista and 7, the result looked like this:
?8/?27/?2011 ??11:45 PM
So my approach was as follows:
var chars = date.split(''); //split into characters
var clean = "";
for (var i = 0; i < chars.length; i++) {
  // keep only characters whose code is below 255
  if (chars[i].charCodeAt(0) < 255) clean += chars[i];
}
The result, of course, is a string that excludes those question mark characters.
I know you went with a different solution altogether, but I thought I'd post my solution in case anyone else is having trouble with this and cannot use a server-side approach.
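When stray characters display as question marks, it can help to find out which code points they really are before deciding what to strip. The sketch below is only illustrative: the helper `listNonAscii` is hypothetical, and the sample string assumes the stray characters are left-to-right marks (U+200E), which the answer above does not state.

```javascript
// Hypothetical helper: list the non-ASCII code units in a string as "U+XXXX".
function listNonAscii(text) {
  const found = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 127) {
      found.push("U+" + code.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return found;
}

// Assumed sample: a date with left-to-right marks (U+200E) mixed in.
const sample = "\u200E8/\u200E27/\u200E2011";
console.log(listNonAscii(sample).join(" "));        // prints "U+200E U+200E U+200E"
console.log(sample.replace(/[\u200E\u200F]/g, "")); // prints "8/27/2011"
```

Once the offending code points are known, a targeted character class like the one in the last line removes only those and leaves other non-ASCII text alone.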
Answered by loretoparisi
I have put together some of the solutions proposed above so that the result is error-safe:
var removeNonUtf8 = (characters) => {
  try {
    // round-trip through UTF-8 pseudobytes; errors on invalid input are ignored
    var bytelike = unescape(encodeURIComponent(characters));
    characters = decodeURIComponent(escape(bytelike));
  } catch (error) { }
  // remove U+FFFD replacement characters
  characters = characters.replace(/\uFFFD/g, '');
  return characters;
};
Answered by Doran
I used @Ali's solution to not only clean my string, but also replace the invalid chars with HTML numeric entities:
function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    } else {
      // replace each non-ASCII code unit with a numeric entity
      output += "&#" + input.charCodeAt(i) + ";";
    }
  }
  return output;
}
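One caveat with charCodeAt is that it works on UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for example) would be emitted as two surrogate entities. A possible variation, sketched here under the hypothetical name `toNumericEntities`, iterates by code point instead. It does not escape "&" or "<" and does not handle unpaired surrogates.

```javascript
// Hypothetical variation: emit one numeric entity per code point
// instead of one per UTF-16 code unit.
function toNumericEntities(input) {
  let output = "";
  for (const ch of input) {             // for...of iterates by code point
    const codePoint = ch.codePointAt(0);
    output += codePoint <= 127 ? ch : "&#" + codePoint + ";";
  }
  return output;
}

console.log(toNumericEntities("caf\u00e9"));        // prints "caf&#233;"
console.log(toNumericEntities("\uD83D\uDE00 ok"));  // prints "&#128512; ok"
```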

