如何在 Java 中解析 HTML 字符串?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1497946/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How can I parse a HTML string in Java?
提问by IttayD
采纳答案by IttayD
I found this somewhere (don't remember where):
我在某处找到了这个(不记得在哪里):
public static DocumentFragment parseXml(Document doc, String fragment)
{
// Wrap the fragment in an arbitrary element.
fragment = "<fragment>"+fragment+"</fragment>";
try
{
// Create a DOM builder and parse the fragment.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document d = factory.newDocumentBuilder().parse(
new InputSource(new StringReader(fragment)));
// Import the nodes of the new document into doc so that they
// will be compatible with doc.
Node node = doc.importNode(d.getDocumentElement(), true);
// Create the document fragment node to hold the new nodes.
DocumentFragment docfrag = doc.createDocumentFragment();
// Move the nodes into the fragment.
while (node.hasChildNodes())
{
docfrag.appendChild(node.removeChild(node.getFirstChild()));
}
// Return the fragment.
return docfrag;
}
catch (SAXException e)
{
// A parsing error occurred; the XML input is not valid.
}
catch (ParserConfigurationException e)
{
}
catch (IOException e)
{
}
return null;
}
回答by Andrew Hare
How do you make use of the HTML-processing capabilities that are built into Java? You may not know that Swing contains all the classes necessary to parse HTML. Jeff Heaton shows you how.
您如何利用内置于 Java 中的 HTML 处理功能?您可能不知道 Swing 包含解析 HTML 所需的所有类。杰夫·希顿(Jeff Heaton)向您展示了方法。
回答by nkr1pt
you could use HTML Parser, which a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge
您可以使用 HTML Parser,它是一个 Java 库,用于以线性或嵌套方式解析 HTML。它是一个开源工具,可以在 SourceForge 上找到
回答by non sequitor
I've used Jericho HTML Parserit's OSS, detects(forgives) badly formatted tags and is lightweight
我使用过Jericho HTML Parser它是 OSS,检测(原谅)格式错误的标签并且是轻量级的
回答by Bart Kiers
Here's a way:
这里有一个方法:
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
public static void main(String [] args) throws Exception {
Reader reader = new StringReader("<table><tr><td>Hello</td><td>World!</td></tr></table>");
HTMLEditorKit.Parser parser = new ParserDelegator();
parser.parse(reader, new HTMLTableParser(), true);
reader.close();
}
}
class HTMLTableParser extends HTMLEditorKit.ParserCallback {
private boolean encounteredATableRow = false;
public void handleText(char[] data, int pos) {
if(encounteredATableRow) System.out.println(new String(data));
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if(t == HTML.Tag.TR) encounteredATableRow = true;
}
public void handleEndTag(HTML.Tag t, int pos) {
if(t == HTML.Tag.TR) encounteredATableRow = false;
}
}
回答by zygimantus
If you have a string which contains HTML you can use Jsouplibrary like this to get HTML elements:
如果你有一个包含 HTML 的字符串,你可以像这样使用Jsoup库来获取 HTML 元素:
String htmlTable= "<table><tr><td>Hello World!</td></tr></table>";
Document doc = Jsoup.parse(htmlTable);
// then use something like this to get your element:
Elements tds = doc.getElementsByTag("td");
// tds will contain this one element: <td>Hello World!</td>
Good luck!
祝你好运!