Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/229310/

Time: 2020-08-11 11:46:02  Source: igfitidea

How to ignore whitespace while reading a file to produce an XML DOM

Tags: java, xml, whitespace

Asked by Telcontar

I'm trying to read a file to produce a DOM Document, but the file contains whitespace and newlines that I want to ignore, and I haven't been able to:

DocumentBuilderFactory docfactory=DocumentBuilderFactory.newInstance();
docfactory.setIgnoringElementContentWhitespace(true);

I see in the Javadoc that the setIgnoringElementContentWhitespace method only takes effect when the validating flag is enabled, but I don't have a DTD or XML Schema for the document.

What can I do?

Update

I don't like the idea of introducing <!ELEMENT ...> declarations myself, and I have tried the solution proposed in the forum pointed to by Tomalak, but it doesn't work. I am using Java 1.6 in a Linux environment. If nothing else is proposed, I will write a few methods to ignore whitespace text nodes.

Accepted answer by bobince

'IgnoringElementContentWhitespace' is not about removing all pure-whitespace text nodes, only whitespace nodes whose parents are described in the schema as having ELEMENT content; that is to say, they only contain other elements and never text.

If you don't have a schema (DTD or XSD) in use, element content defaults to MIXED, so this parameter will never have any effect. (Unless the parser provides a non-standard DOM extension to treat all unknown elements as containing ELEMENT content, which as far as I know the ones available for Java do not.)
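
A minimal sketch of that behavior, assuming the JDK's default parser. The class name `NoSchemaDemo` and the sample XML are made up for illustration and are not from the question. With no DTD or schema, the flag is set but the whitespace text nodes are still in the tree:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample XML below is illustrative only.
class NoSchemaDemo {

    // Parses the XML with the flag enabled and returns how many child nodes <root> has.
    static int countRootChildren(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // No schema is in use, so the parser has no element-only content to recognize.
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>";
        // Two elements plus three whitespace-only text nodes.
        System.out.println(countRootChildren(xml)); // prints 5
    }
}
```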

You could hack the document on the way into the parser to include the schema information, for example by adding an internal subset to the <!DOCTYPE ... [...]> declaration containing <!ELEMENT ...> declarations, then use the IgnoringElementContentWhitespace parameter.
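
A rough sketch of the internal-subset idea, under some assumptions. The `DOCTYPE` string, the element names, and the class name `InternalSubsetDemo` are invented for a toy document, and a real file would need declarations that match its own structure. The sketch also turns validation on, since the Javadoc ties the whitespace flag to validating mode. If the input starts with an XML declaration, the DOCTYPE would have to go after it rather than at the very front.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original answer.
class InternalSubsetDemo {

    // Made-up declarations describing the toy document used in main().
    static final String DOCTYPE =
          "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (a, b)>\n"   // element-only content: whitespace here is ignorable
        + "  <!ELEMENT a (#PCDATA)>\n"
        + "  <!ELEMENT b (#PCDATA)>\n"
        + "]>\n";

    // Prepends the DOCTYPE, parses, and returns how many child nodes <root> has.
    static int countRootChildren(String xmlWithoutDoctype) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new InputSource(new StringReader(DOCTYPE + xmlWithoutDoctype)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>";
        // Prints the number of child nodes that remain under <root>.
        System.out.println(countRootChildren(xml));
    }
}
```

The obvious cost is that every element in the document has to be declared, which is why the post-processing route may be less work when no DTD exists.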

Or, possibly easier, you could just strip out the whitespace nodes, either in a post-process, or as they come in using an LSParserFilter.
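
As an illustration of the post-process option, here is a minimal sketch that walks the tree after parsing and removes every text node that is empty once trimmed. The class and method names (`WhitespaceStripper`, `stripWhitespace`, `parseAndStrip`) are hypothetical. Note that it also removes whitespace-only text inside mixed content, which may or may not be what you want.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; a sketch rather than a definitive implementation.
class WhitespaceStripper {

    // Recursively removes text nodes whose content is only whitespace.
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    // Parses the XML string and strips whitespace-only text nodes from the result.
    static Document parseAndStrip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        stripWhitespace(doc.getDocumentElement());
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseAndStrip("<root>\n  <a>1</a>\n  <b>2</b>\n</root>");
        // Only the two elements remain under <root>.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength()); // prints 2
    }
}
```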

Answer by jjnguy

This is a (really) late answer, but here is how I solved it. I wrote my own implementation of the NodeList interface. It simply ignores text nodes that are empty. Code follows:

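import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Nested helper class: declare it inside the class that uses it.
private static class NdLst implements NodeList, Iterable<Node> {

    private final List<Node> nodes;

    // Copies every node from the wrapped list except whitespace-only text nodes.
    public NdLst(NodeList list) {
        nodes = new ArrayList<Node>();
        for (int i = 0; i < list.getLength(); i++) {
            if (!isWhitespaceNode(list.item(i))) {
                nodes.add(list.item(i));
            }
        }
    }

    @Override
    public Node item(int index) {
        // NodeList.item returns null for an out-of-range index instead of throwing.
        return (index >= 0 && index < nodes.size()) ? nodes.get(index) : null;
    }

    @Override
    public int getLength() {
        return nodes.size();
    }

    // A whitespace node is a text node whose trimmed value is empty.
    private static boolean isWhitespaceNode(Node n) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            String val = n.getNodeValue();
            return val.trim().length() == 0;
        } else {
            return false;
        }
    }

    @Override
    public Iterator<Node> iterator() {
        return nodes.iterator();
    }
}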

You then wrap all of your NodeLists in this class and it will effectively ignore all whitespace nodes. (Which I define as Text Nodes with 0-length trimmed text.)

It also has the added benefit of being able to be used in a for-each loop.

Answer by huppyuy

I made it work by doing this:

// `schema` (javax.xml.validation.Schema) and `element` (org.w3c.dom.Element) are defined elsewhere.
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setIgnoringElementContentWhitespace(true);
dbFactory.setSchema(schema);
dbFactory.setNamespaceAware(true);
NodeList nodeList = element.getElementsByTagNameNS("*", "associate");

Answer by ImGroot

Try this:

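import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

private static Document prepareXML(String param) throws ParserConfigurationException, SAXException, IOException {
    // Remove whitespace between a '>' and the next '<' before parsing.
    param = param.replaceAll(">\\s+<", "><").trim();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setIgnoringElementContentWhitespace(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource in = new InputSource(new StringReader(param));
    return builder.parse(in);
}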
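
To make the effect of that regular expression concrete, here is a small sketch; the class name `RegexStripDemo` and the sample strings are illustrative only. It also shows one caveat worth knowing: the replacement works on raw text, so it cannot tell markup from character data and will also rewrite matching whitespace inside a CDATA section or a comment.

```java
// Hypothetical demo class showing what the string replacement does on its own.
class RegexStripDemo {

    // Drops any run of whitespace that sits between a '>' and the next '<'.
    static String collapse(String xml) {
        return xml.replaceAll(">\\s+<", "><").trim();
    }

    public static void main(String[] args) {
        // Indentation between tags is removed.
        System.out.println(collapse("<a>\n  <b>x</b>\n</a>"));
        // prints <a><b>x</b></a>

        // Caveat: the same pattern also matches inside CDATA, changing the data.
        System.out.println(collapse("<a><![CDATA[<x>  <y>]]></a>"));
        // prints <a><![CDATA[<x><y>]]></a>
    }
}
```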

Answer by Tamias

I ended up following @bobince's idea of using an LSParserFilter. Yes, the interface is documented at https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/ls/LSParserFilter.html but it's very hard to find good example/explanation material. After considerable searching I located the DOM Level 3 Load and Save XML Reference Guide at http://www.informit.com/articles/article.aspx?p=31297&seqNum=29 (Nicholas Chase, Mar 14, 2003). That helped me considerably. Here are portions of my code, which does an XML diff with org.custommonkey.xmlunit. (This is a tool written on my own time to help me with paid work, so I have left a lot of things, like better exception handling, for when things are slow.)

I especially like the use of an LSParserFilter because, for my purpose, I will likely add an option in the future to ignore id attributes too, which should be an easy enhancement with this framework.

// A small portion of my main class.
// Other imports may be necessary...
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;

Document controlDoc = null;
Document testDoc = null;
try {
    System.setProperty(DOMImplementationRegistry.PROPERTY, "org.apache.xerces.dom.DOMImplementationSourceImpl");
    DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
    DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
    LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
    LSParserFilter filter = new InputFilter();
    builder.setFilter(filter);
    controlDoc = builder.parseURI(files[0].getPath());
    testDoc = builder.parseURI(files[1].getPath());
} catch (Exception exc) {
    System.out.println(exc.getMessage());
}

//--------------------------------------

import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;

public class InputFilter implements LSParserFilter {

    public short acceptNode(Node node) {
        if (Utils.isNewline(node)) {
            return NodeFilter.FILTER_REJECT;
        }
        return NodeFilter.FILTER_ACCEPT;
    }

    public int getWhatToShow() {
        return NodeFilter.SHOW_ALL;
    }

    public short startElement(Element elem) {
        return LSParserFilter.FILTER_ACCEPT;
    }

}

//-------------------------------------
// From my Utils.java:

    public static boolean isNewline(Node node) {
        return (node.getNodeType() == Node.TEXT_NODE) && node.getTextContent().equals("\n");
    }