How to parse within CDATA in XML using Java
Original question: http://stackoverflow.com/questions/25275248/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by stitch70
After searching through existing CDATA discussions, I found none that achieve what I'm attempting.
Is it possible to parse within CDATA where the tag is not unique?
Below is the XML document. I'm attempting to retrieve each field within the CDATA block on line 5 below, which has multiple fields of interest (Data Loaded, Quality, Status, Index). Each field is marked with the "li" tag within the CDATA block (even though it's a character data space):
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
<name>area Area Date: 2014-07-31</name>
<Placemark><name>P07L327</name><Point><coordinates>-96.26879,85.19125</coordinates></Point><description><![CDATA[<ol><li> Data Loaded: NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>]]></description><Style> id = "colorIcon"</Style></Placemark>
<coordinates>-96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,45.14698,0 </coordinates>
</Document>
</kml>
Currently the output looks like this:
Name: <ol><li> Data Loaded: NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>
From within the CDATA block, my intention is to output a new line for each field along with its appropriate result.
Below is the code written so far, which gives the current output listed above:
package com.lucy.seo;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.CharacterData;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Comment;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;
public class ReadXMLFile {
    public static void main(String[] args) throws Exception {
        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/Oracle_Java_Project/Test_Doc.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);
        doc.getDocumentElement().normalize();
        System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
        NodeList nList = doc.getElementsByTagName("Placemark");
        System.out.println("----------------------------");
        for (int temp = 0; temp < nList.getLength(); temp++) {
            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("Name: " + getCharacterDataFromElement(line));
        }
    }

    public static String getCharacterDataFromElement(Element f) {
        NodeList list = f.getChildNodes();
        String data;
        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index) instanceof CharacterData) {
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();
                if (data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}
I appreciate any help with this. Thanks!
Sep 2, 2014 update
Updated with the final solution. Thank you to everyone here who posted solutions and helped. The solution was split into two pieces of code (two files) due to library conflicts:
//First file which is input to the second file followed afterwards
import java.io.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class ReadXMLFile {
    public static void main(String[] args) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html"));
        System.setOut(out);
        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/raw_input.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);
        //optional, but recommended
        //read this - http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
        doc.getDocumentElement().normalize();
        NodeList nList = doc.getElementsByTagName("Placemark");
        //create a buffered reader that connects to the console, we use it so we can read lines
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("<html xmlns=http://www.w3.org/1999/xhtml>");
        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            Element eElement = (Element) nNode;
            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("<bracket><li>Name: " + eElement.getElementsByTagName("name").item(0).getTextContent() + "</li>");
            System.out.println("<description>Description: " + getCharacterDataFromElement(line) + "</description></bracket>");
        }
        System.out.println("</html>");
        //read a line from the console
        String lineFromInput = in.readLine();
        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }

    public static String getCharacterDataFromElement(Element f) {
        NodeList list = f.getChildNodes();
        String data;
        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index) instanceof CharacterData) {
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();
                if (data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}
//Second File
package ReadXMLFile_part2;
import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.logging.Level;
import java.util.logging.Logger;
public class ReadXMLFile_part2 {
    public static void main(String[] args) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/PA-PTH013_Output_Meters.xml"));
        System.setOut(out);
        System.out.println("*** JSOUP ***");
        File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.w3.org/1999/xhtml");
        } catch (IOException ex) {
            Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        Elements brackets = doc.getElementsByTag("bracket");
        for (Element bracket : brackets) {
            Elements lis = bracket.select("li");
            for (Element li : lis) {
                System.out.println(li.text());
            }
            break;
        }
        System.out.println();
        //read a line from the console
        String lineFromInput = in.readLine();
        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }
}
Answer by GPI
CDATA is a marker to XML interpreting engines that whatever they encounter between its start and end should be treated as "pure" (raw) character data.
So, in a way, it's like an escape character for the parser (one that can encompass many characters).
Therefore, you won't find an XML parser that reports whatever is inside a CDATA section as XML, because the specification says that it MUST report it as character data. (As a consequence, it MUST NOT interpret it as an XML stream, which is actually good because nothing requires the content to be XML.)
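To make this concrete, here is a minimal sketch of how the JDK's DOM parser reports a CDATA section. The sample string and the class and helper names (CdataIsText, firstChildIsCdata, countElements, textOf) are made up for illustration and are not from the question. It assumes the default DocumentBuilderFactory settings, where coalescing is off and CDATA sections are kept as their own nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class CdataIsText {
    // Hypothetical sample, shortened from the question's <description> element.
    static final String XML =
        "<description><![CDATA[<ol><li>Quality: 5</li></ol>]]></description>";

    // Parses the string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    // True if the root's first child is a CDATA node rather than an element.
    static boolean firstChildIsCdata(String xml) throws Exception {
        return parseRoot(xml).getFirstChild() instanceof CDATASection;
    }

    // Number of descendant elements with the given tag name.
    static int countElements(String xml, String tag) throws Exception {
        return parseRoot(xml).getElementsByTagName(tag).getLength();
    }

    // Text content of the root element, markup inside CDATA included verbatim.
    static String textOf(String xml) throws Exception {
        return parseRoot(xml).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstChildIsCdata(XML));   // the CDATA block is a single character-data node
        System.out.println(countElements(XML, "li")); // no <li> elements exist in the DOM
        System.out.println(textOf(XML));              // the markup comes back as plain text
    }
}
```

The li "tags" only exist as characters inside one text node, which is why getElementsByTagName cannot find them.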
Anyway, your parser and your code are working as expected.
But if, as in your case, you happen to know that the content of a certain CDATA instance is indeed a valid XML instance, then you can open a new parser for this precise content and deal with it appropriately.
So you can take the output of your getCharacterDataFromElement(line) call, feed it to your documentBuilder, and use the resulting new Document instance to parse the content of your li elements.
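The second parse described above might look like the following sketch. The class and method names (CdataSecondParse, listItems) are hypothetical. One important assumption: the CDATA text in the question ends with </eol>, which does not match the opening <ol>, so a strict XML parser would reject it as posted. The sample below uses </ol> instead so that the text is well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CdataSecondParse {
    // Parses the text taken out of the CDATA block as its own XML document
    // and returns the trimmed text of every <li> element, in document order.
    static List<String> listItems(String cdataText) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document inner = builder.parse(new InputSource(new StringReader(cdataText)));
        NodeList items = inner.getElementsByTagName("li");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            result.add(items.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the value returned by getCharacterDataFromElement(line),
        // with the closing tag corrected to </ol> (an assumption, see above).
        String cdataText = "<ol><li> Data Loaded: NO</li><li>Quality: 5</li>"
            + "<li>Status: UP</li><li>Index: 72</li></ol>";
        for (String item : listItems(cdataText)) {
            System.out.println(item); // one field per line
        }
    }
}
```

If the embedded markup cannot be guaranteed to be well-formed, a lenient HTML parser is the safer choice, which is the route the asker's final solution takes with jsoup.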
Answer by Michael Kay
Your question is something of a contradiction, since CDATA is an explicit instruction to the parser NOT to parse what it sees inside the CDATA. So the simplest way to get the content parsed is not to include the CDATA tags in the first place.
However, having told the parser not to parse the CDATA content, what you can do is extract the content as text, and then submit the text to the parser as a second parse operation.