C#: How do you remove invalid hexadecimal characters from an XML-based data source before constructing an XmlReader or XPathDocument that uses the data?
Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/20762/
How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?
Asked by Oppositional
Is there any easy/general way to clean an XML based data source prior to using it in an XmlReader so that I can gracefully consume XML data that is non-conformant to the hexadecimal character restrictions placed on XML?
Note:
- The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding at the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
- The removal of invalid hexadecimal characters should only remove hexadecimal encoded values, as you can often find href values in data that happen to contain a string that would be a string match for a hexadecimal character.
Background:
I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but want to be able to consume data sources that have been published which contain invalid hexadecimal characters per the XML specification.
In .NET if you have a Stream that represents the XML data source, and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised due to the inclusion of invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.
Accepted answer by Eugene Katz
It may not be perfect (emphasis added, since people keep missing this disclaimer), but what I've done in that case is below. You can adjust it to work with a stream.
/// <summary>
/// Removes control characters and other non-UTF-8 characters
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string with no control characters or entities above 0x00FD</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;
    StringBuilder newString = new StringBuilder();
    char ch;
    for (int i = 0; i < inString.Length; i++)
    {
        ch = inString[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        //if using .NET version prior to 4, use above logic
        if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();
}
Answer by dnewcome
I like Eugene's whitelist concept. I needed to do a similar thing as the original poster, but I needed to support all Unicode characters, not just up to 0x00FD. The XML spec is:
Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
In .NET, the internal representation of Unicode characters is only 16 bits, so we can't 'allow' 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However, it is possible that if we allowed these surrogate code points in our whitelist, UTF-8 encoding our string might produce valid XML in the end, as long as proper UTF-8 encoding was produced from the surrogate pairs of UTF-16 characters in the .NET string. I haven't explored this though, so I went with the safer bet and didn't allow the surrogates in my whitelist.
The comments in Eugene's solution are misleading, though: the characters we are excluding are perfectly valid Unicode code points; the problem is that they are not valid in XML. We are not removing 'non-UTF-8 characters'. We are removing UTF-8 characters that may not appear in well-formed XML documents.
public static string XmlCharacterWhitelist( string in_string ) {
    if( in_string == null ) return null;
    StringBuilder sbOutput = new StringBuilder();
    char ch;
    for( int i = 0; i < in_string.Length; i++ ) {
        ch = in_string[i];
        if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
            ( ch >= 0xE000 && ch <= 0xFFFD ) ||
            ch == 0x0009 ||
            ch == 0x000A ||
            ch == 0x000D ) {
            sbOutput.Append( ch );
        }
    }
    return sbOutput.ToString();
}
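The surrogate question raised above could be handled along the following lines. This is only a sketch, not something from the original answer: it keeps a well-formed UTF-16 surrogate pair (which encodes a code point in #x10000-#x10FFFF, allowed by the Char production quoted above) and drops lone surrogates. The names XmlCharFilter and KeepValidXmlChars are made up for illustration.

```csharp
using System;
using System.Text;

public static class XmlCharFilter
{
    // Hypothetical helper: same whitelist as above, plus well-formed surrogate pairs.
    public static string KeepValidXmlChars(string input)
    {
        if (input == null) return null;
        var output = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (i + 1 < input.Length && char.IsSurrogatePair(ch, input[i + 1]))
            {
                // A high surrogate followed by a low surrogate encodes one
                // code point in #x10000-#x10FFFF, so keep both halves.
                output.Append(ch).Append(input[i + 1]);
                i++; // skip the low surrogate we just copied
            }
            else if ((ch >= 0x0020 && ch <= 0xD7FF) ||
                     (ch >= 0xE000 && ch <= 0xFFFD) ||
                     ch == 0x0009 || ch == 0x000A || ch == 0x000D)
            {
                output.Append(ch);
            }
            // Anything else (control characters, lone surrogates, #xFFFE, #xFFFF) is dropped.
        }
        return output.ToString();
    }
}
```

Whether the kept pairs end up as valid XML still depends on the writer encoding them correctly, as noted above.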
Answer by savio
private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if (null == inString ) return null;
    // Note: getBytes() uses the platform default charset, and Java bytes are signed,
    // so every byte >= 0x80 is negative here and is also replaced by a space.
    byte[] byteArr = inString.getBytes();
    for ( int i=0; i < byteArr.length; i++ ) {
        byte ch= byteArr[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
            byteArr[i]=' ';
        }
    }
    return new String( byteArr );
}
Answer by Kesavan
Try this for PHP!
$goodUTF8 = iconv("utf-8", "utf-8//IGNORE", $badUTF8);
Answer by Nathan G
The above solutions seem to be for removing invalid characters prior to converting to XML.
Use this code to remove invalid XML character references from an XML string, e.g. &#x1A;
public static string CleanInvalidXmlChars( string Xml, string XMLVersion )
{
    string pattern = String.Empty;
    switch( XMLVersion )
    {
        case "1.0":
            pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F]);";
            break;
        case "1.1":
            pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
            break;
        default:
            throw new Exception( "Error: Invalid XML Version!" );
    }
    Regex regex = new Regex( pattern, RegexOptions.IgnoreCase );
    if( regex.IsMatch( Xml ) )
        Xml = regex.Replace( Xml, String.Empty );
    return Xml;
}
http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/
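A different way to get at the same thing, and at the second note in the question (only touch encoded values, not text that merely looks like hex), might be to match every numeric character reference and decide by its numeric value. This is a rough sketch, not the code from the linked post: it only checks against the XML 1.0 Char production quoted earlier, and the names CharRefCleaner, RemoveInvalidCharRefs and IsXml10Char are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CharRefCleaner
{
    // Matches &#xHEX; (group 1) or &#DECIMAL; (group 2).
    private static readonly Regex CharRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));", RegexOptions.Compiled);

    // Hypothetical helper: removes references whose value is not an XML 1.0 Char.
    public static string RemoveInvalidCharRefs(string xml)
    {
        if (xml == null) return null;
        return CharRef.Replace(xml, m =>
        {
            long codePoint;
            bool parsed = m.Groups[1].Success
                ? long.TryParse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier,
                                CultureInfo.InvariantCulture, out codePoint)
                : long.TryParse(m.Groups[2].Value, NumberStyles.None,
                                CultureInfo.InvariantCulture, out codePoint);
            // Keep the reference untouched when it is valid, drop it otherwise
            // (values too large to parse are dropped as well).
            return parsed && IsXml10Char(codePoint) ? m.Value : string.Empty;
        });
    }

    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static bool IsXml10Char(long cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD ||
               (cp >= 0x20 && cp <= 0xD7FF) ||
               (cp >= 0xE000 && cp <= 0xFFFD) ||
               (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Because only text of the form &#...; is matched, an href that happens to contain something like x1A is left alone.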
Answer by Murari Kumar
You can pass non-UTF characters with the following:
string sFinalString = "";
string hex = "";
foreach (char ch in UTFCHAR)
{
    int tmp = ch;
    if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
    {
        sFinalString += ch;
    }
    else
    {
        sFinalString += "&#" + tmp+";";
    }
}
Answer by Jodrell
Modernising dnewcome's answer, you could take a slightly simpler approach:
public static string RemoveInvalidXmlChars(string input)
{
    var isValid = new Predicate<char>(value =>
        (value >= 0x0020 && value <= 0xD7FF) ||
        (value >= 0xE000 && value <= 0xFFFD) ||
        value == 0x0009 ||
        value == 0x000A ||
        value == 0x000D);
    return new string(Array.FindAll(input.ToCharArray(), isValid));
}
or, with Linq
public static string RemoveInvalidXmlChars(string input)
{
    return new string(input.Where(value =>
        (value >= 0x0020 && value <= 0xD7FF) ||
        (value >= 0xE000 && value <= 0xFFFD) ||
        value == 0x0009 ||
        value == 0x000A ||
        value == 0x000D).ToArray());
}
I'd be interested to know how the performance of these methods compares, and how they all compare to a blacklist approach using Buffer.BlockCopy.
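On the performance question, one cheap variation that may be worth measuring is a fast path that returns the original string untouched when nothing needs removing, and only allocates once the first invalid character is found. This is a sketch with no benchmark behind it; the names XmlCleaner and RemoveInvalidXmlCharsFast are made up for illustration.

```csharp
using System;
using System.Text;

public static class XmlCleaner
{
    private static bool IsValid(char ch)
    {
        return (ch >= 0x0020 && ch <= 0xD7FF) ||
               (ch >= 0xE000 && ch <= 0xFFFD) ||
               ch == 0x0009 || ch == 0x000A || ch == 0x000D;
    }

    // Hypothetical helper: same whitelist as above, with a no-allocation fast path.
    public static string RemoveInvalidXmlCharsFast(string input)
    {
        if (input == null) return null;

        // Scan for the first invalid character.
        int firstInvalid = 0;
        while (firstInvalid < input.Length && IsValid(input[firstInvalid]))
        {
            firstInvalid++;
        }
        if (firstInvalid == input.Length) return input; // already clean, nothing copied

        // Copy the clean prefix in one call, then filter the remainder.
        var output = new StringBuilder(input.Length);
        output.Append(input, 0, firstInvalid);
        for (int i = firstInvalid + 1; i < input.Length; i++)
        {
            if (IsValid(input[i])) output.Append(input[i]);
        }
        return output.ToString();
    }
}
```

If most inputs are already clean, this avoids the char array and the new string entirely; if most inputs are dirty, it should behave much like the StringBuilder loop in the earlier answers.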
Answer by Igor Kustov
To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is available in Silverlight too. Here is a small sample:
void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False
    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}
static string RemoveInvalidXmlChars(string text) {
    char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
Answer by mnaoumov
Regex based approach
public static string StripInvalidXmlCharacters(string str)
{
    var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
    return invalidXmlCharactersRegex.Replace(str, "");
}
See my blog post for more details
Answer by Ryan Adams
Here is dnewcome's answer in a custom StreamReader. It simply wraps a real stream reader and replaces the characters as they are read.
I only implemented a few methods to save myself time. I used this in conjunction with XDocument.Load and a file stream, and only the Read(char[] buffer, int index, int count) method was called, so it worked like this. You may need to implement additional methods to get this to work for your application. I used this approach because it seems more efficient than the other answers. I also only implemented one of the constructors; you could obviously implement any of the StreamReader constructors that you need, since it is just a pass-through.
I chose to replace the characters rather than removing them because it greatly simplifies the solution. In this way the length of the text stays the same, so there is no need to keep track of a separate index.
using System;
using System.IO;
using System.Runtime.Remoting;
using System.Threading.Tasks;
public class InvalidXmlCharacterReplacingStreamReader : TextReader
{
    private StreamReader implementingStreamReader;
    private char replacementCharacter;
    public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter)
    {
        implementingStreamReader = new StreamReader(stream);
        this.replacementCharacter = replacementCharacter;
    }
    public override void Close()
    {
        implementingStreamReader.Close();
    }
    public override ObjRef CreateObjRef(Type requestedType)
    {
        return implementingStreamReader.CreateObjRef(requestedType);
    }
    public void Dispose()
    {
        implementingStreamReader.Dispose();
    }
    public override bool Equals(object obj)
    {
        return implementingStreamReader.Equals(obj);
    }
    public override int GetHashCode()
    {
        return implementingStreamReader.GetHashCode();
    }
    public override object InitializeLifetimeService()
    {
        return implementingStreamReader.InitializeLifetimeService();
    }
    public override int Peek()
    {
        int ch = implementingStreamReader.Peek();
        if (ch != -1)
        {
            if (
                (ch < 0x0020 || ch > 0xD7FF) &&
                (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D
                )
            {
                return replacementCharacter;
            }
        }
        return ch;
    }
    public override int Read()
    {
        int ch = implementingStreamReader.Read();
        if (ch != -1)
        {
            if (
                (ch < 0x0020 || ch > 0xD7FF) &&
                (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D
                )
            {
                return replacementCharacter;
            }
        }
        return ch;
    }
    public override int Read(char[] buffer, int index, int count)
    {
        int readCount = implementingStreamReader.Read(buffer, index, count);
        for (int i = index; i < readCount+index; i++)
        {
            char ch = buffer[i];
            if (
                (ch < 0x0020 || ch > 0xD7FF) &&
                (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D
                )
            {
                buffer[i] = replacementCharacter;
            }
        }
        return readCount;
    }
    public override Task<int> ReadAsync(char[] buffer, int index, int count)
    {
        throw new NotImplementedException();
    }
    public override int ReadBlock(char[] buffer, int index, int count)
    {
        throw new NotImplementedException();
    }
    public override Task<int> ReadBlockAsync(char[] buffer, int index, int count)
    {
        throw new NotImplementedException();
    }
    public override string ReadLine()
    {
        throw new NotImplementedException();
    }
    public override Task<string> ReadLineAsync()
    {
        throw new NotImplementedException();
    }
    public override string ReadToEnd()
    {
        throw new NotImplementedException();
    }
    public override Task<string> ReadToEndAsync()
    {
        throw new NotImplementedException();
    }
    public override string ToString()
    {
        return implementingStreamReader.ToString();
    }
}
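For anyone who wants to see how a reader like this plugs into XDocument.Load, here is a condensed, self-contained sketch of the same replace-while-reading idea. It is not the class from this answer: it wraps any TextReader (so the caller picks the encoding), overrides only Peek and the two Read methods, and the names XmlSanitizingReader, Demo and LoadRootValue are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

public sealed class XmlSanitizingReader : TextReader
{
    private readonly TextReader inner;
    private readonly char replacement;

    public XmlSanitizingReader(TextReader inner, char replacement)
    {
        this.inner = inner;
        this.replacement = replacement;
    }

    // Same 16-bit whitelist as dnewcome's answer (surrogates are replaced too).
    private static bool IsValid(int ch)
    {
        return (ch >= 0x0020 && ch <= 0xD7FF) ||
               (ch >= 0xE000 && ch <= 0xFFFD) ||
               ch == 0x0009 || ch == 0x000A || ch == 0x000D;
    }

    public override int Peek()
    {
        int ch = inner.Peek();
        if (ch == -1 || IsValid(ch)) return ch;
        return replacement;
    }

    public override int Read()
    {
        int ch = inner.Read();
        if (ch == -1 || IsValid(ch)) return ch;
        return replacement;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int read = inner.Read(buffer, index, count);
        // Replace in place, so the length of the text never changes.
        for (int i = index; i < index + read; i++)
        {
            if (!IsValid(buffer[i])) buffer[i] = replacement;
        }
        return read;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Demo
{
    // Hypothetical usage: parse XML text that contains an invalid control character.
    public static string LoadRootValue(string xml)
    {
        using (var reader = new XmlSanitizingReader(new StringReader(xml), ' '))
        {
            XDocument doc = XDocument.Load(reader);
            return doc.Root.Value;
        }
    }

    public static void Main()
    {
        // \u000B (vertical tab) is not allowed in XML 1.0 and is replaced by a space.
        Console.WriteLine(LoadRootValue("<root>bad\u000Bchar</root>")); // bad char
    }
}
```

One thing to watch with a file or network stream: new StreamReader(stream) decodes as UTF-8 unless a byte order mark says otherwise, so for a source that declares another encoding you would pass a StreamReader constructed with that Encoding.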