Java org.xml.sax.SAXParseException: The reference to entity "T" must end with the ';' delimiter

Question

Asked by vasumathi

I am trying to parse an XML file which contains some special characters like "&" using a DOM parser. I am getting the SAXParseException "The reference to entity must end with the ';' delimiter". Is there any way to overcome this exception? I cannot modify the XML file to remove the special characters, since it comes from a different application. Please suggest a way to parse this XML file and get the root element.

Thanks in advance

This is the part of the XML which I am parsing:

<P>EDTA/THAM WASH 
</P>

<P>jhc ^ 72. METER SOLVENT: Meter 21 LITERS of R. O. WATER through the add line into 
FT-250. Start agitator. 
</P>

<P>R. O. WATER &lt;ZLl LITERS </P>

<P>?     NOTE: The following is a tool control operation. The area within 10 feet of any open vessel or container is under tool control. </P>

<P>-af . 73. CHARGE SOLIDS: Remove any unnecessary items from the tool controlled area. Indicate the numbers of each item that will remain in the tool controlled area during the operation in the IN box of the Tool Control Log. </P>

<P>^___y_ a. To minimize the potential for cross contamination, confirm that no other solids are being charged or packaged in adjacent equipment. </P>

<P>kk k WARNING: Wear protective gloves, air Hymanet and use local exhaust when handling TROMETHAMINE USP (189400) (THAM) (K-l--Irritant!). The THAM may be dusty. </P>

<P>-&lt;&amp;^b .   Charge 2.1 KG of TROMETHAMINE USP (189400) (THAM) into FT-250 through the top. </P>

<P>TROMETHAMINE USP (189400) (THAM) </P>

<P>Scale ID:     / / 7S </P>

<P>LotNo.:   qy/o^yo^ </P>

<P>Gross:    ^ . S </P>

<P>Tare: 10 ,1 </P>

<P>Net:     J^l </P>

<P>Total:   JL'J </P>

<P><Figure ActualText="&T ">

<ImageData src="images/17PT 07009K_img_1.jpg"/>
&amp;T </Figure>
Checked by </P>
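
For reference, the failure can be reproduced with a much smaller input than the fragment above. The following is a minimal sketch and not part of the original post; the class name `BareAmpersandRepro` and the method `isRejected` are made up for illustration, and the exact exception message depends on the JDK and locale.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class BareAmpersandRepro {

    /** Returns true if the DOM parser rejects the input with a SAXParseException. */
    static boolean isRejected(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXParseException e) {
            // The message wording varies between JDK versions and locales.
            System.out.println("Rejected at line " + e.getLineNumber() + ": " + e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Bare ampersand in an attribute value, as in the Figure element above.
        System.out.println(isRejected("<P><Figure ActualText=\"&T \"/></P>"));
        // The same attribute with the ampersand escaped.
        System.out.println(isRejected("<P><Figure ActualText=\"&amp;T \"/></P>"));
    }
}
```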

Answer 1

Answered by paxdiablo

I'm not sure I understand the question. As far as I'm aware, unless you're inside a CDATA section, naked & characters without a closing ; are invalid.

If that's not the case for your XML file, then it's invalid, and you'll need to find another way of parsing it, or fixing it before SAX gets a hold of it.

If I'm misunderstanding something here, you should probably post a sample of the actual XML so we can help further.

Update:

It looks like:

Figure ActualText="&T "

is the offending line. Is this section within a CDATA section or not? If not, this is not valid XML and you should not expect SAX to be able to handle it.

You'll need to either:

change the application that created it; or
fix it before it's loaded by SAX (if you can't change that application) to something like "Figure ActualText="&amp;T ""; or
find a non-SAX method for parsing.
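
To illustrate the CDATA point, here is a small sketch (the class and method names are hypothetical) showing that a bare ampersand inside a CDATA section in element content parses without error. Note that CDATA sections cannot appear inside attribute values, so this does not rescue the ActualText attribute itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataAmpersandDemo {

    /** Parses the XML and returns the text content of the root element. */
    static String textOfRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Inside CDATA the ampersand is plain character data, not the start of a reference.
        System.out.println(textOfRoot("<P><![CDATA[&T ]]></P>"));
    }
}
```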

Answer 2

Answered by Eli Acherkan

As a workaround, you can:

Replace all the occurrences of & with &amp; in the original input;
Parse it;
In your code that handles the result, handle the case where you now get escaped characters (e.g. &lt; instead of <).

Depending on the parser you're using, you can also try to find the class responsible for parsing and unescaping &-strings, and see if you can extend it/supply your own resolver. (What I'm saying is very vague, but the specifics depend on the tools you're using.)
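
The three-step workaround above could look roughly like this. It is only a sketch: the class name `EscapeEverythingWorkaround` and the method `parseText` are invented for the example, and the post-processing handles just the five predefined XML entities.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapeEverythingWorkaround {

    /** Escapes every ampersand, parses, then undoes the double escaping in the text. */
    static String parseText(String rawXml) throws Exception {
        // Step 1: escape every ampersand, including those that start valid references.
        String escaped = rawXml.replace("&", "&amp;");

        // Step 2: parse the now well-formed input.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(escaped)));
        String text = doc.getDocumentElement().getTextContent();

        // Step 3: references that were valid in the input now arrive as literal
        // text such as "&lt;", so resolve them by hand ("&amp;" must come last).
        return text.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseText("<P>R. O. WATER &lt;ZLl &T</P>"));
    }
}
```

The same unescaping would have to be applied to attribute values as well, which is the main cost of this approach.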

Answer 3

Answered by Stephen C

Your input is invalid XML. Specifically, you cannot have an '&' character in an attribute value unless it is part of a well-formed character entity reference.

AFAIK, you have two choices:

Write a "not exactly XML" parser yourself. I seriously doubt that you will find an existing one. Any self-respecting XML parser will reject invalid input.
Fix whatever is creating this (so-called) XML so that it doesn't put random '&' characters in places where they are not allowed. It's quite simple really. As you are building the XML, replace each '&' character that is not already part of a character reference with '&amp;'.
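
On the producing side, one way to avoid hand-escaping altogether is to build the document through the DOM API and let the serializer do the escaping. This is a sketch under the assumption that the generating application is written in Java and can be changed; the class name `ProducerSideEscaping` and the method `buildFigureXml` are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ProducerSideEscaping {

    /** Builds <P><Figure ActualText="..."/></P> and serializes it to a string. */
    static String buildFigureXml(String actualText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element p = doc.createElement("P");
        Element figure = doc.createElement("Figure");
        // Pass the raw value; the serializer escapes '&' when writing it out.
        figure.setAttribute("ActualText", actualText);
        p.appendChild(figure);
        doc.appendChild(p);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildFigureXml("&T "));
    }
}
```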

Answer 4

Answered by PSpeed

As others have stated, your XML is definitely invalid. However, if you can't change the generating application and can add a cleaning step then the following should clean up the XML:

String clean = xml.replaceAll( "&([^;]+(?!(?:\\w|;)))", "&amp;$1" );

What that regex is doing is looking for any badly formed entity references and escaping the ampersand.

Specifically, (?!(?:\\w|;)) is a negative look-ahead that only lets the match end where the next character is neither a word character (a-z, 0-9) nor a semi-colon. So the whole regex grabs everything after the & that is not a ; up to a point where a non-word, non-semi-colon character follows.

It puts everything except the ampersand in the first capture group so that it can be referred to in the replace string. That's the $1.

Note that this won't fix references that look like they are valid but aren't. For example, if you had &T; that would throw a different kind of error altogether unless the XML actually defines the entity.
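
Because `[^;]+` can also consume later ampersands (for example, in "&T &U ;" only the first one gets escaped), a stricter variant may be worth considering. The sketch below is an alternative, not the pattern from this answer: it escapes every ampersand that is not followed by something shaped like a complete reference. The class name is made up, and the name pattern is an ASCII-only approximation of XML names.

```java
import java.util.regex.Pattern;

class StrictAmpersandCleaner {

    // An ampersand NOT followed by "name;", "#digits;" or "#xhex;".
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes ampersands that do not start a syntactically complete reference. */
    static String clean(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(clean("<Figure ActualText=\"&T \">&amp;T &lt; &#38; &T &U ;</Figure>"));
    }
}
```

As with the original pattern, something like &T; still passes through untouched, because it looks like a reference even though the entity is undefined.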

Answer 5

Answered by Ranadheer Reddy

Some of you might be familiar with the error "The reference to entity XX must end with the ';' delimiter" that appears when adding or altering code in your XML templates. I sometimes get it too when I alter or add code in my Blogger blog's templates (XML).

Mostly these kinds of errors occur when we add a third-party banner or widget to our XML templates. We can easily rectify the error by making a slight alteration to the piece of code we add!

Just replace “&” with “&amp;” in your HTML/Javascript code!

EXAMPLE

Original Code:

<!-- Begin Code -->
<script src="http://XXXXXX.com/XXX.php?sid=XXX&br=XXX&dk=XXXXXXXXXXXX" type="text/javascript"/>
<!-- End Code -->

Altered Code:

<!-- Begin Code -->
<script src="http://XXXXXX.com/XXX.php?sid=XXX&amp;br=XXX&amp;dk=XXXXXXXXXXXX" type="text/javascript"/>
<!-- End Code -->

Answer 6

Answered by Joe Goodfellow

Building on an answer above from PSpeed the following replaceAll regex and replacement text will replace all unescaped ampersands with escaped ampersands.

String clean = xml.replaceAll("(&(?!amp;))", "&amp;");

The pattern is a negative lookahead to match on any ampersands that have not yet been escaped and the replacement string is simply an escaped ampersand. This can be optimized further for performance by using a statically compiled Pattern.

private final static Pattern unescapedAmpersands = Pattern.compile("(&(?!amp;))");

...

Matcher m = unescapedAmpersands.matcher(xml);
String xmlWithAmpersandsEscaped = m.replaceAll("&amp;");
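
One thing to watch with a look-ahead for "amp;" alone is that other valid references such as &lt; also get their ampersand escaped. The sketch below contrasts the pattern above with a hypothetical variant that skips all five predefined XML entities; the class and method names are made up, and numeric references like &#38; would still be escaped by both.

```java
import java.util.regex.Pattern;

class PredefinedEntityAwareEscaper {

    // The pattern from this answer: skips only "&amp;".
    private static final Pattern AMP_ONLY = Pattern.compile("(&(?!amp;))");

    // Assumed variant: skips the five predefined XML entities.
    private static final Pattern KEEP_PREDEFINED =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)");

    static String escapeAmpOnly(String xml) {
        return AMP_ONLY.matcher(xml).replaceAll("&amp;");
    }

    static String escapeKeepingPredefined(String xml) {
        return KEEP_PREDEFINED.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String sample = "&lt;ZLl &T";
        System.out.println(escapeAmpOnly(sample));           // the "&lt;" is escaped a second time
        System.out.println(escapeKeepingPredefined(sample)); // the "&lt;" is left as it was
    }
}
```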

Answer 7

Answered by L01c

Simply replace your & with &amp; and it will work.

Answer 8

Answered by Anurag Arya

It will work if you run the command below before publishing.

Put your XML file name in the command below:

sed -i "s/&/\&amp;/g" *.xml

Answer 9

Answered by Emerica

To complement @PSpeed's answer, here is a complete solution (SAX parser):

    try {

        InputStream xmlStreamToParse = blob.getBinaryStream();

        // Clean
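        // Assumes the blob is UTF-8 encoded, matching the encoding used below
        BufferedReader br = new BufferedReader(new InputStreamReader(xmlStreamToParse, "UTF-8"));

        StringBuilder sb = new StringBuilder();

        String line;
        while ((line = br.readLine()) != null) {
            // Escape bare ampersands; readLine() drops the line terminator, so put it back
            sb.append(line.replaceAll("&([^;]+(?!(?:\\w|;)))", "&amp;$1")).append('\n'); // or whatever you want to clean
        }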

        InputStream stream = org.apache.commons.io.IOUtils.toInputStream(sb.toString(), "UTF-8");

        // Parsing
        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
        saxFactory.setNamespaceAware(true);
        SAXParser theParser = saxFactory.newSAXParser();
        XMLReader xmlReader = theParser.getXMLReader();
        LicenceXMLHandler licence = new LicenceXMLHandler();
        xmlReader.setContentHandler(licence);
        xmlReader.parse(new InputSource(stream));

    } catch (SQLException | SAXException | IOException | ParserConfigurationException e) {
        log.error("Error: " + e);
    }

Explanations:

Transform the Blob into an InputStream
Clean the Blob
Parse the file (LicenceXMLHandler is the parser class)
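
Since the question asks for a DOM parser and the root element, the same clean-then-parse idea can be sketched with `DocumentBuilder` and no third-party dependency. This is an illustration only: the class name `CleanThenParseDom` and the method `parseRoot` are hypothetical, and the input is assumed to be available as a String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CleanThenParseDom {

    /** Escapes bare ampersands with PSpeed's pattern, parses, and returns the root element. */
    static Element parseRoot(String rawXml) throws Exception {
        String cleaned = rawXml.replaceAll("&([^;]+(?!(?:\\w|;)))", "&amp;$1");
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(cleaned.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<P><Figure ActualText=\"&T \">\n&amp;T </Figure>\nChecked by </P>";
        Element root = parseRoot(raw);
        Element figure = (Element) root.getElementsByTagName("Figure").item(0);
        System.out.println(root.getNodeName());
        System.out.println(figure.getAttribute("ActualText"));
    }
}
```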
