Original source: http://stackoverflow.com/questions/3466035/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to skip invalid characters in XML file using PHP
Asked by user315396
I'm trying to parse an XML file using PHP, but I get an error message:
parser error : Char 0x0 out of allowed range in
I think it's because of the content of the XML: I think there is a special symbol "☆" in it. Any ideas what I can do to fix it?
I also get:
parser error : Premature end of data in tag item line
What might be causing that error?
I'm using simplexml_load_file.
Update:
I tried finding the line reported in the error and pasting its content into a separate XML file, and that file parses fine! So I still cannot figure out what makes parsing the full file fail. PS: it's a huge XML file, over 100 MB. Could the size cause the parse error?
Answered by Jhong
Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ ... ]]> blocks.
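As a rough illustration of the CDATA advice, the sketch below writes a value into a CDATA section with DOMDocument and reads it back. The sample text and the element name `item` are assumptions made up for the example. Note that CDATA only removes the need to escape markup characters such as "&" and "<"; a byte like 0x0 is still not allowed inside it, which is why the cleanup step is still needed.

```php
<?php
// Hypothetical sample value: "&" and "<" would normally need escaping.
$text = "Stars & <angles> \u{2606}";

// Build a small document whose <item> content is a CDATA section.
$doc = new DOMDocument('1.0', 'UTF-8');
$item = $doc->createElement('item');
$item->appendChild($doc->createCDATASection($text));
$doc->appendChild($item);

$xml = $doc->saveXML();

// Parse it back; LIBXML_NOCDATA merges CDATA sections into plain text nodes.
$parsed = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
$roundTripped = (string) $parsed;
```

Here `$roundTripped` should equal the original `$text`, including the "☆" character, without any manual escaping.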
And you also need to clear the invalid characters:
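/**
 * Replaces bytes that are not allowed in XML 1.0 with a space.
 *
 * The string is processed byte by byte, so in practice only control
 * bytes below 0x20 (other than tab, line feed and carriage return)
 * are replaced. Multi-byte UTF-8 sequences pass through unvalidated.
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    $value = (string) $value;
    // Compare against "" rather than using empty(), which would also
    // discard the valid input "0".
    if ($value === "")
    {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // Use [] for string offsets; the {} form was removed in PHP 8.
        // ord() returns one byte (0-255), so the ranges above 0xFF below
        // never match here; they mirror the XML 1.0 Char production.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}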
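Because the function above works on single bytes, its effect can also be sketched as one byte-level regular expression. This is an alternative written for illustration, not part of the original answer; the function name `replaceControlBytes` and the sample input are hypothetical. The pattern deliberately omits the /u flag so that it operates on raw bytes and does not fail on malformed UTF-8.

```php
<?php
// Replace the control bytes XML 1.0 forbids (everything below 0x20 except
// tab, line feed and carriage return) with a space.
function replaceControlBytes(string $xml): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $xml);
}

// Hypothetical input containing a NUL byte inside an element.
$raw = "<items><item>a\x00b</item></items>";

$clean = replaceControlBytes($raw);
$items = simplexml_load_string($clean);
$first = (string) $items->item[0];
```

With the NUL byte replaced by a space, the document parses and `$first` holds "a b". For a file on disk, the same idea would mean reading the content, cleaning it, and passing it to simplexml_load_string instead of calling simplexml_load_file directly.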
Answered by mikeytown2
I decided to test all UTF-8 values (0-1114111) to make sure things work as they should. Using preg_replace() alone causes NULL to be returned, due to errors, when testing all UTF-8 values. This is the solution I've come up with.
$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);
/**
 * Removes characters that are not valid in XML.
 *
 * @access public
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input) {
  // Convert input to UTF-8, dropping characters that cannot be converted.
  $old_setting = mb_substitute_character();
  mb_substitute_character('none');
  $input = mb_convert_encoding($input, 'UTF-8', 'auto');
  mb_substitute_character($old_setting);
  // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
  // The last range keeps characters above U+FFFF, which XML 1.0 allows.
  $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
  if (is_null($output)) {
    // Convert to ints.
    // Convert ints back into a string.
    $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
  }
  return $output;
}
/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8'){
  // Turn a string of unicode characters into UCS-4BE, which is a Unicode
  // encoding that stores each character as a 4 byte integer. This accounts for
  // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
  // big-endian order. The reason for this encoding is that each character is a
  // fixed size, making iterating over the string simpler.
  $input = mb_convert_encoding($input, "UCS-4BE", $encoding);
  // Visit each unicode character.
  $ords = array();
  // Compute the length once instead of on every loop iteration.
  $length = mb_strlen($input, "UCS-4BE");
  for ($i = 0; $i < $length; $i++) {
    // Now we have 4 bytes. Find their total numeric value.
    $s2 = mb_substr($input, $i, 1, "UCS-4BE");
    $val = unpack("N", $s2);
    $ords[] = $val[1];
  }
  return $ords;
}
/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE) {
  $output = '';
  foreach ($ords as $ord) {
    // 0: Negative numbers.
    // 55296 - 57343: Surrogate Range.
    // 65279: BOM (byte order mark).
    // 1114111: Out of range.
    if ( $ord < 0
        || ($ord >= 0xD800 && $ord <= 0xDFFF)
        || $ord == 0xFEFF
        || $ord > 0x10ffff) {
      // Skip non valid UTF-8 values.
      continue;
    }
    // 9: Anything Below 9.
    // 11: Vertical Tab.
    // 12: Form Feed.
    // 14-31: Unprintable control codes.
    // 65534, 65535: Unicode noncharacters.
    elseif ($scrub_XML && (
        $ord < 0x9
        || $ord == 0xB
        || $ord == 0xC
        || ($ord > 0xD && $ord < 0x20)
        || $ord == 0xFFFE
        || $ord == 0xFFFF
        )) {
      // Skip non valid XML values.
      continue;
    }
    // 127: 1 Byte char.
    elseif ( $ord <= 0x007f) {
      $output .= chr($ord);
      continue;
    }
    // 2047: 2 Byte char.
    elseif ($ord <= 0x07ff) {
      $output .= chr(0xc0 | ($ord >> 6));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 65535: 3 Byte char.
    elseif ($ord <= 0xffff) {
      $output .= chr(0xe0 | ($ord >> 12));
      $output .= chr(0x80 | (($ord >> 6) & 0x003f));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 1114111: 4 Byte char.
    elseif ($ord <= 0x10ffff) {
      $output .= chr(0xf0 | ($ord >> 18));
      $output .= chr(0x80 | (($ord >> 12) & 0x3f));
      $output .= chr(0x80 | (($ord >> 6) & 0x3f));
      $output .= chr(0x80 | ($ord & 0x3f));
      continue;
    }
  }
  return $output;
}
And to do this on a simple object or array:
// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input){
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}
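One common reason for preg_replace() returning NULL with the /u modifier is that the subject string is not valid UTF-8. The sketch below shows how that failure can be detected with preg_last_error(); the malformed sample string is an assumption chosen for the demonstration and the pattern is a shortened version of the one used in this answer.

```php
<?php
// "\xC3\x28" is a deliberately malformed UTF-8 sequence (hypothetical sample).
$bad = "abc\xC3\x28def";

// With /u, PCRE validates the subject and refuses to process bad UTF-8.
$result = preg_replace('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}]+/u', '', $bad);
$error = preg_last_error();

// NULL plus PREG_BAD_UTF8_ERROR means the input itself was malformed.
$isBadUtf8 = ($result === null && $error === PREG_BAD_UTF8_ERROR);
```

In this case `$isBadUtf8` is true, which is the situation where a slower fallback such as the integer conversion above becomes necessary.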
Answered by Dominic Rodger
If you have control over the data, ensure that it is encoded correctly, i.e. that it is in the encoding you promised in the XML declaration. For example, if you have:
<?xml version="1.0" encoding="UTF-8"?>
then you'll need to ensure your data is in UTF-8.
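As a small sketch of how that check could look in PHP, mb_check_encoding() reports whether a string is valid in a given encoding. The two sample strings are assumptions for illustration, and converting from ISO-8859-1 is only appropriate when that really is the source encoding.

```php
<?php
// Hypothetical samples: the same word encoded as UTF-8 and as ISO-8859-1.
$utf8 = "caf\xC3\xA9";   // "café" in UTF-8
$latin1 = "caf\xE9";     // "café" in ISO-8859-1

$utf8Ok = mb_check_encoding($utf8, 'UTF-8');
$latin1Ok = mb_check_encoding($latin1, 'UTF-8');

// If the real source encoding is known, convert rather than strip.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
```

The UTF-8 sample passes the check and the ISO-8859-1 sample does not; after conversion, `$converted` is byte-for-byte identical to `$utf8`.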
If you don't have control over the data, yell at those who do.
You can use a tool like xmllint to check which part(s) of the data are not valid.
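A similar check can be sketched from within PHP, since libxml can collect parser errors with their line and column numbers instead of emitting warnings. The malformed sample document below is made up for the example.

```php
<?php
// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical malformed input: the <item> tag on line 2 is never closed.
$xml = "<items>\n<item>unclosed\n</items>";

$doc = simplexml_load_string($xml);
$messages = [];
if ($doc === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError carries the position and the parser message.
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    libxml_clear_errors();
}
```

Printing `$messages` then gives one entry per parser error, which helps to locate the offending part of a large file.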
Answered by Martin
My problem was the "&" character (hex 0x26), so I changed the lower bound of the main range from 0x20 to 0x28:
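function stripInvalidXml($value)
{
    $ret = "";
    $value = (string) $value;
    // Compare against "" rather than using empty(), which would also
    // discard the valid input "0".
    if ($value === "")
    {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // Use [] for string offsets; the {} form was removed in PHP 8.
        $current = ord($value[$i]);
        // Starting the range at 0x28 replaces "&" (0x26) with a space,
        // but also the other characters from 0x21 to 0x27: ! " # $ % '
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x28) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}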
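If the goal is to keep the ampersands rather than blank them out, a less destructive option is to escape only the bare ones. The following is a sketch of that idea and not part of the original answer; the function name `escapeBareAmpersands` and the sample string are hypothetical. It leaves anything that already looks like an entity or character reference alone, so undefined named entities would still need separate handling.

```php
<?php
// Escape "&" only when it does not already start an entity or character
// reference such as &amp;, &#38; or &#x26;.
function escapeBareAmpersands(string $xml): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

// Hypothetical input mixing a bare "&" with existing references.
$raw = '<item>Tom & Jerry &amp; friends &#38; co</item>';
$fixed = escapeBareAmpersands($raw);
$parsed = simplexml_load_string($fixed);
```

Only the bare "&" between "Tom" and "Jerry" is rewritten to "&amp;"; the existing references are untouched, and the result parses without the characters "!" to "'" being lost.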
Answered by bcosca
Make sure your XML source is valid. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Answered by Mike Venzke
For a non-destructive method of loading this type of input into a SimpleXMLElement, see my answer on "How to handle invalid unicode with simplexml".
Answered by user1593640
Not a PHP solution, but it works:
Download Notepad++: https://notepad-plus-plus.org/
Open your .xml file in Notepad++.
From the main menu: Search -> Search Mode, set this to: Extended
Then, Replace -> Find what: \x00; Replace with: {leave empty}
Then, Replace All
Rob
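The same NUL-byte removal can be sketched in PHP for files too large to open comfortably in an editor. The function name `stripNulBytes`, the chunk size and the temporary demo files are assumptions for illustration. The sketch also assumes the file is UTF-8 or another ASCII-compatible encoding; in UTF-16 or UTF-32, zero bytes are a normal part of the encoding and must not be removed.

```php
<?php
// Copy $source to $target in chunks, dropping NUL bytes along the way.
// A NUL is a single byte and never occurs inside a multi-byte UTF-8
// sequence, so chunk boundaries cannot split anything that matters here.
function stripNulBytes(string $source, string $target, int $chunkSize = 1048576): int
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file.');
    }
    $removed = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // str_replace reports how many NUL bytes it removed via $count.
        $clean = str_replace("\0", '', $chunk, $count);
        $removed += $count;
        fwrite($out, $clean);
    }
    fclose($in);
    fclose($out);
    return $removed;
}

// Hypothetical demo using temporary files.
$src = tempnam(sys_get_temp_dir(), 'xml');
$dst = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($src, "<a>x\0y\0</a>");
$removedCount = stripNulBytes($src, $dst);
$result = file_get_contents($dst);
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, and the cleaned copy can then be passed to simplexml_load_file.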