How to skip invalid characters in an XML file using PHP

Question

Asked by user315396

I'm trying to parse an XML file using PHP, but I get this error message:

parser error : Char 0x0 out of allowed range in

I think it's caused by the content of the XML. I suspect a special symbol, "☆", is the culprit. Any ideas how I can fix it?

I also get:

parser error : Premature end of data in tag item line

What might be causing that error?

I'm using simplexml_load_file.

Update:

I found the line the error points to and pasted its content into a separate XML file, and that file parses fine! So I still can't figure out what makes the full file fail. PS: it's a huge XML file, over 100 MB. Could the size cause the parse error?

Answer 1

Answered by Jhong

Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ ... ]]> blocks.
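As a minimal sketch of the CDATA suggestion, assuming you generate the XML yourself with PHP's DOM extension (the element names `items` and `item` and the sample text are made up for illustration). Keep in mind that CDATA only protects markup characters such as `<` and `&`; a character like 0x0 is still illegal inside a CDATA section, so invalid characters have to be removed separately.

```php
<?php
// Hypothetical example: element names and text are illustrative only.
$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->createElement('items');
$doc->appendChild($root);

$item = $doc->createElement('item');
// createCDATASection() wraps the text in <![CDATA[ ... ]]> on output,
// so "<" and "&" do not need to be escaped by hand.
$item->appendChild($doc->createCDATASection('5 < 6 & "☆" is fine'));
$root->appendChild($item);

$xml = $doc->saveXML();
echo $xml;
```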

You also need to remove the invalid characters:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
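function stripInvalidXml($value)
{
    $ret = "";
    // Compare explicitly: empty("0") is true and would drop the string "0".
    if ($value === null || $value === "")
    {
        return $ret;
    }

    // Walk the string byte by byte. ord() only ever returns 0-255, so the
    // ranges above 0xFF never match; bytes of multi-byte UTF-8 sequences
    // (0x80-0xFF) pass through via the 0x20-0xD7FF check.
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // Square brackets: the $value{$i} syntax was removed in PHP 8.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            // Replace disallowed control bytes with a space.
            $ret .= " ";
        }
    }
    return $ret;
}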
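A minimal usage sketch follows. It assumes the whole document fits in memory, and it swaps the loop above for a byte-wise `preg_replace()` with the same effect (control bytes other than tab, LF and CR become spaces). The helper name `stripControlBytes` and the sample XML are hypothetical.

```php
<?php
/**
 * Hypothetical helper: replaces control bytes that XML 1.0 forbids
 * (everything below 0x20 except tab, LF and CR) with a space.
 * Works byte-wise, so multi-byte UTF-8 sequences are left untouched.
 */
function stripControlBytes(string $value): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $value);
}

// Sample input containing a NUL byte, the kind of data that triggers
// "Char 0x0 out of allowed range" in libxml.
$raw = "<items><item>star \x00 ☆</item></items>";

$clean = stripControlBytes($raw);
$xml   = simplexml_load_string($clean);

echo (string) $xml->item, "\n";
```

For a file on disk, the same idea would be `simplexml_load_string(stripControlBytes(file_get_contents($path)))`, at the cost of holding the whole file in memory.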

Answer 2

Answered by mikeytown2

I decided to test all UTF-8 values (0-1114111) to make sure things work as they should. Using preg_replace() on its own causes NULL to be returned due to errors when testing all UTF-8 values. This is the solution I came up with.

$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);


/**
 * Removes characters that are not valid in XML.
 *
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input) {
  // Convert input to UTF-8, dropping characters that cannot be converted.
  $old_setting = ini_set('mbstring.substitute_character', 'none');
  $input = mb_convert_encoding($input, 'UTF-8', 'auto');
  ini_set('mbstring.substitute_character', $old_setting);

  // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
  // The character class mirrors the XML 1.0 Char production, including the
  // supplementary planes (U+10000 to U+10FFFF).
  $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
  if (is_null($output)) {
    // Convert to ints.
    // Convert ints back into a string.
    $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
  }
  return $output;
}

/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8'){
  // Turn a string of unicode characters into UCS-4BE, which is a Unicode
  // encoding that stores each character as a 4 byte integer. This accounts for
  // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
  // big-endian order. The reason for this encoding is that each character is a
  // fixed size, making iterating over the string simpler.
  $input = mb_convert_encoding($input, "UCS-4BE", $encoding);

  // Visit each unicode character.
  $ords = array();
  for ($i = 0; $i < mb_strlen($input, "UCS-4BE"); $i++) {
    // Now we have 4 bytes. Find their total numeric value.
    $s2 = mb_substr($input, $i, 1, "UCS-4BE");
    $val = unpack("N", $s2);
    $ords[] = $val[1];
  }
  return $ords;
}

/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE) {
  $output = '';
  foreach ($ords as $ord) {
    // 0: Negative numbers.
    // 55296 - 57343: Surrogate Range.
    // 65279: BOM (byte order mark).
    // 1114111: Out of range.
    if (   $ord < 0
        || ($ord >= 0xD800 && $ord <= 0xDFFF)
        || $ord == 0xFEFF
        || $ord > 0x10ffff) {
      // Skip non valid UTF-8 values.
      continue;
    }
    // 9: Anything Below 9.
    // 11: Vertical Tab.
    // 12: Form Feed.
    // 14-31: Unprintable control codes.
    // 65534, 65535: Unicode noncharacters.
    elseif ($scrub_XML && (
               $ord < 0x9
            || $ord == 0xB
            || $ord == 0xC
            || ($ord > 0xD && $ord < 0x20)
            || $ord == 0xFFFE
            || $ord == 0xFFFF
            )) {
      // Skip non valid XML values.
      continue;
    }
    // 127: 1 Byte char.
    elseif ( $ord <= 0x007f) {
      $output .= chr($ord);
      continue;
    }
    // 2047: 2 Byte char.
    elseif ($ord <= 0x07ff) {
      $output .= chr(0xc0 | ($ord >> 6));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 65535: 3 Byte char.
    elseif ($ord <= 0xffff) {
      $output .= chr(0xe0 | ($ord >> 12));
      $output .= chr(0x80 | (($ord >> 6) & 0x003f));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 1114111: 4 Byte char.
    elseif ($ord <= 0x10ffff) {
      $output .= chr(0xf0 | ($ord >> 18));
      $output .= chr(0x80 | (($ord >> 12) & 0x3f));
      $output .= chr(0x80 | (($ord >> 6) & 0x3f));
      $output .= chr(0x80 | ($ord & 0x3f));
      continue;
    }
  }
  return $output;
}

To do this on a simple object or array:

// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input){
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}
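The NULL return mentioned at the start of this answer can be reproduced in isolation. The sketch below assumes one common cause, input that is not valid UTF-8, which makes any pattern with the `/u` modifier fail. The sample bytes are made up.

```php
<?php
// "\xC3\x28" is an invalid UTF-8 sequence (a lead byte followed by "(").
$bad = "abc\xC3\x28def";

// With the /u modifier, PCRE validates the subject as UTF-8 first and
// returns NULL instead of a string when validation fails.
$result = preg_replace('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}]+/u', '', $bad);

var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// mb_check_encoding() lets you detect this case before calling preg_replace().
$isValid = mb_check_encoding($bad, 'UTF-8');
var_dump($isValid);                                  // bool(false)
```

This is why the function above checks `is_null($output)` and falls back to the slower conversion.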

Answer 3

Answered by Dominic Rodger

If you have control over the data, ensure that it is encoded correctly, i.e. that it is in the encoding you promised in the XML declaration. For example, if you have:

<?xml version="1.0" encoding="UTF-8"?>

then you'll need to ensure your data is in UTF-8.
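A minimal sketch of such a check, assuming the `mbstring` extension is available. The helper name `ensureUtf8` is hypothetical, and the fallback source encoding (`ISO-8859-1`) is purely an assumption for illustration; substitute whatever encoding your data really uses.

```php
<?php
/**
 * Hypothetical helper: returns $data as UTF-8.
 * $assumedSource is a guess about the real encoding, not something
 * PHP can detect reliably.
 */
function ensureUtf8(string $data, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // Already valid UTF-8, leave it alone.
    }
    return mb_convert_encoding($data, 'UTF-8', $assumedSource);
}

$latin1 = "caf\xE9";            // "café" encoded as ISO-8859-1
$utf8   = ensureUtf8($latin1);

var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
```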

If you don't have control over the data, yell at those who do.

You can use a tool like xmllint to check which part(s) of the data are not valid.
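If you would rather stay in PHP, libxml can report similar information. A minimal sketch, using a made-up in-memory document in place of your file:

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Made-up sample: the <item> tag on line 2 is never closed.
$source = "<items>\n<item>broken\n</items>";

$xml    = simplexml_load_string($source);
$errors = libxml_get_errors();
libxml_clear_errors();

if ($xml === false) {
    foreach ($errors as $error) {
        // Each LibXMLError carries the line, column and message.
        printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
    }
}
```

The same pattern works with `simplexml_load_file()`; the reported line numbers then refer to lines in the file.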

Answer 4

Answered by Martin

My problem was the "&" character (hex 0x26), so I raised the lower bound of the allowed range from 0x20 to 0x28. Note that this also replaces ! " # $ % and ' with spaces:

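function stripInvalidXml($value)
{
    $ret = "";
    // Compare explicitly: empty("0") is true and would drop the string "0".
    if ($value === null || $value === "")
    {
        return $ret;
    }

    // Byte-wise loop: ord() returns 0-255, so only the first four checks
    // can ever match.
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // Square brackets: the $value{$i} syntax was removed in PHP 8.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            // Lower bound is 0x28 here, so "&" (0x26) is filtered out too.
            (($current >= 0x28) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}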
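If the real issue is a bare `&`, a narrower alternative (not what this answer does) is to escape only the ampersands that do not already start an entity or character reference. A rough sketch: the helper name is hypothetical, the regex assumes that anything shaped like `&name;` is a legitimate reference, and it will also rewrite ampersands inside CDATA sections.

```php
<?php
function escapeBareAmpersands(string $xml): string
{
    // Negative lookahead: leave "&name;", "&#123;" and "&#x1F;" alone.
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

$fixed = escapeBareAmpersands('<item>Tom & Jerry &amp; friends &#38; co</item>');
echo $fixed, "\n";
```

Unlike the function above, this keeps spaces, quotes and the other characters between 0x20 and 0x27 intact.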

Answer 5

Answered by bcosca

Make sure your XML source is valid. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
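When you produce the XML yourself, one way to keep the source valid is to escape text before inserting it. A minimal sketch using `htmlspecialchars()` with the `ENT_XML1` flag (the sample string is made up). It handles markup characters only and does not remove control characters such as 0x0.

```php
<?php
$text = 'Tom & "Jerry" <3';

// ENT_XML1 selects XML 1.0 escaping rules; ENT_QUOTES also escapes quotes.
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo "<item>{$escaped}</item>\n";
```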

Answer 6

Answered by Mike Venzke

For a non-destructive method of loading this type of input into a SimpleXMLElement, see my answer on "How to handle invalid unicode with simplexml".

Answer 7

Answered by user1593640

Not a PHP solution, but it works:

Download Notepad++ https://notepad-plus-plus.org/

Open your .xml file in Notepad++

From the main menu: Search -> Search Mode, set this to: Extended

Then, Replace -> Find what: \x00; Replace with: {leave empty}

Then, Replace All

Rob
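The same replacement can be done in PHP without opening the file in an editor. A minimal sketch, assuming the only problem is NUL bytes and that the file is too large to load at once. The function name, the temporary paths and the chunk size are hypothetical.

```php
<?php
function stripNulBytes(string $inPath, string $outPath, int $chunkSize = 1048576): void
{
    $in  = fopen($inPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file.');
    }
    // NUL is a single byte and never part of a multi-byte UTF-8 sequence,
    // so removing it chunk by chunk cannot split a character.
    // Assumption: the file is UTF-8 or another ASCII-compatible encoding;
    // in UTF-16 files NUL bytes are part of ordinary characters.
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        fwrite($out, str_replace("\0", '', $chunk));
    }
    fclose($in);
    fclose($out);
}

// Demo with a temporary file standing in for the real XML.
$src = tempnam(sys_get_temp_dir(), 'xml');
$dst = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($src, "<items><item>a\0b</item></items>");
stripNulBytes($src, $dst);
$cleaned = file_get_contents($dst);
echo $cleaned, "\n";
```

Writing to a second file keeps the original intact, so you can compare the two if the parser still complains.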
