Java: How to strip whitespace-only text nodes from a DOM before serialization?
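Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/978810/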
Asked by Marc Novakowski
I have some Java (5.0) code that constructs a DOM from various (cached) data sources, then removes certain element nodes that are not required, then serializes the result into an XML string using:
// Serialize DOM back into a string
Writer out = new StringWriter();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tf.setOutputProperty(OutputKeys.INDENT, "no");
tf.transform(new DOMSource(doc), new StreamResult(out));
return out.toString();
However, since I'm removing several element nodes, I end up with a lot of extra whitespace in the final serialized document.
Is there a simple way to remove/collapse the extraneous whitespace from the DOM before (or while) it's serialized into a String?
Accepted answer by James Murty
You can find empty text nodes using XPath, then remove them programmatically like so:
XPathFactory xpathFactory = XPathFactory.newInstance();
// XPath to find empty text nodes.
XPathExpression xpathExp = xpathFactory.newXPath().compile(
        "//text()[normalize-space(.) = '']");
NodeList emptyTextNodes = (NodeList)
        xpathExp.evaluate(doc, XPathConstants.NODESET);
// Remove each empty text node from document.
for (int i = 0; i < emptyTextNodes.getLength(); i++) {
    Node emptyTextNode = emptyTextNodes.item(i);
    emptyTextNode.getParentNode().removeChild(emptyTextNode);
}
This approach might be useful if you want more control over node removal than is easily achieved with an XSL template.
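For readers who want to try this end to end, here is a minimal sketch that combines the XPath removal with the serialization code from the question. The class name `XPathWhitespaceDemo`, the method name `stripAndSerialize`, and the sample XML are illustrative assumptions, not part of the original answer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names here are assumptions for illustration.
public class XPathWhitespaceDemo {

    static String stripAndSerialize(String xml) throws Exception {
        // Parse the input string into a DOM.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select every text node that is empty after whitespace normalization.
        XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("//text()[normalize-space(.) = '']");
        NodeList emptyTextNodes =
                (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

        // Detach each selected node from its parent.
        for (int i = 0; i < emptyTextNodes.getLength(); i++) {
            Node n = emptyTextNodes.item(i);
            n.getParentNode().removeChild(n);
        }

        // Serialize the cleaned DOM, as in the question.
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        tf.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints <a><b>x</b><c>y</c></a>
        System.out.println(stripAndSerialize("<a>\n  <b>x</b>\n  <c>y</c>\n</a>"));
    }
}
```

Note that the XPath selects only whitespace-only text nodes, so text such as `x` and `y` above is left untouched, including any leading or trailing spaces it may carry.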
Answer by objects
Try using the following XSL, with its strip-space element, to serialize your DOM:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
http://helpdesk.objects.com.au/java/how-do-i-remove-whitespace-from-an-xml-document
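The answer does not show how to apply the stylesheet from Java, so here is a rough sketch of one way to do it with the standard `javax.xml.transform` API. The class name `StripSpaceDemo`, the method name `stripWithXslt`, and the inlined stylesheet string are assumptions for illustration. It also assumes the XSLT processor in use honors `xsl:strip-space` for a `DOMSource`, which is worth verifying on your own runtime.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names here are assumptions for illustration.
public class StripSpaceDemo {

    // The identity stylesheet from the answer, inlined as a string.
    private static final String XSL =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
          + "<xsl:strip-space elements=\"*\"/>"
          + "<xsl:template match=\"@*|node()\">"
          + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String stripWithXslt(String xml) throws Exception {
        // Build a namespace-aware DOM, which XSLT processors generally expect.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Compile the stylesheet and run the DOM through it.
        Transformer tf = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stripWithXslt("<a>\n  <b>x</b>\n  <c>y</c>\n</a>"));
    }
}
```

With this approach the original DOM is left unchanged, and the whitespace is dropped only in the serialized output.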
Answer by Swapna Kasula
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
This will retain the XML indentation.
Answer by Venkata Raju
The code below deletes comment nodes and whitespace-only text nodes. If a text node has other content, its value is trimmed.
public static void clean(Node node)
{
    NodeList childNodes = node.getChildNodes();
    // Iterate backwards so that removals do not shift unvisited indices.
    for (int n = childNodes.getLength() - 1; n >= 0; n--)
    {
        Node child = childNodes.item(n);
        short nodeType = child.getNodeType();
        if (nodeType == Node.ELEMENT_NODE)
            clean(child);
        else if (nodeType == Node.TEXT_NODE)
        {
            String trimmedNodeVal = child.getNodeValue().trim();
            if (trimmedNodeVal.length() == 0)
                node.removeChild(child);
            else
                child.setNodeValue(trimmedNodeVal);
        }
        else if (nodeType == Node.COMMENT_NODE)
            node.removeChild(child);
    }
}
Ref: http://www.sitepoint.com/removing-useless-nodes-from-the-dom/
Answer by pimlottc
Another possible approach is to remove neighboring whitespace at the same time as you're removing the target nodes:
private void removeNodeAndTrailingWhitespace(Node node) {
    // Collect first, remove afterwards, so sibling traversal is not disturbed.
    List<Node> exiles = new ArrayList<Node>();
    exiles.add(node);
    for (Node whitespace = node.getNextSibling();
            whitespace != null
                && whitespace.getNodeType() == Node.TEXT_NODE
                && whitespace.getTextContent().matches("\\s*");
            whitespace = whitespace.getNextSibling()) {
        exiles.add(whitespace);
    }
    for (Node exile : exiles) {
        exile.getParentNode().removeChild(exile);
    }
}
This has the benefit of keeping the rest of the existing formatting intact.
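To illustrate what "keeping the rest of the formatting intact" means in practice, here is a small usage sketch. The class name `TrailingWhitespaceDemo`, the `parse` helper, and the sample XML are assumptions for illustration; the removal method is the one from this answer, made static so it can be called from `main`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names here are assumptions for illustration.
public class TrailingWhitespaceDemo {

    // Removes the node plus any whitespace-only text nodes directly after it.
    static void removeNodeAndTrailingWhitespace(Node node) {
        List<Node> exiles = new ArrayList<Node>();
        exiles.add(node);
        for (Node ws = node.getNextSibling();
                ws != null
                    && ws.getNodeType() == Node.TEXT_NODE
                    && ws.getTextContent().matches("\\s*");
                ws = ws.getNextSibling()) {
            exiles.add(ws);
        }
        for (Node exile : exiles) {
            exile.getParentNode().removeChild(exile);
        }
    }

    // Helper (assumed) that parses an XML string into a DOM.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a>\n  <b>x</b>\n  <c>y</c>\n</a>");
        removeNodeAndTrailingWhitespace(doc.getElementsByTagName("b").item(0));
        // <a> keeps three children: the leading indent, <c>, and the final newline.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Only the removed element and the whitespace that followed it disappear, so the indentation in front of `<c>` survives.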
Answer by user6615071
The following code works:
public String getSoapXmlFormatted(String pXml) {
    try {
        if (pXml != null) {
            DocumentBuilderFactory tDbFactory = DocumentBuilderFactory
                    .newInstance();
            DocumentBuilder tDBuilder;
            tDBuilder = tDbFactory.newDocumentBuilder();
            Document tDoc = tDBuilder.parse(new InputSource(
                    new StringReader(pXml)));
            removeWhitespaces(tDoc);
            final DOMImplementationRegistry tRegistry = DOMImplementationRegistry
                    .newInstance();
            final DOMImplementationLS tImpl = (DOMImplementationLS) tRegistry
                    .getDOMImplementation("LS");
            final LSSerializer tWriter = tImpl.createLSSerializer();
            tWriter.getDomConfig().setParameter("format-pretty-print",
                    Boolean.FALSE);
            tWriter.getDomConfig().setParameter(
                    "element-content-whitespace", Boolean.TRUE);
            pXml = tWriter.writeToString(tDoc);
        }
    } catch (RuntimeException | ParserConfigurationException | SAXException
            | IOException | ClassNotFoundException | InstantiationException
            | IllegalAccessException tE) {
        tE.printStackTrace();
    }
    return pXml;
}

public void removeWhitespaces(Node pRootNode) {
    if (pRootNode != null) {
        NodeList tList = pRootNode.getChildNodes();
        if (tList != null && tList.getLength() > 0) {
            // Collect whitespace-only text nodes first, then remove them.
            ArrayList<Node> tRemoveNodeList = new ArrayList<Node>();
            for (int i = 0; i < tList.getLength(); i++) {
                Node tChildNode = tList.item(i);
                if (tChildNode.getNodeType() == Node.TEXT_NODE) {
                    if (tChildNode.getTextContent() == null
                            || "".equals(tChildNode.getTextContent().trim()))
                        tRemoveNodeList.add(tChildNode);
                } else
                    removeWhitespaces(tChildNode);
            }
            for (Node tRemoveNode : tRemoveNodeList) {
                pRootNode.removeChild(tRemoveNode);
            }
        }
    }
}