How to use HTML Parser to get complete information about all tags in the HTML page

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2287872/

Time: 2020-08-13 05:41:33  Source: igfitidea

java screen-scraping

Question

I am using HTML Parser to develop an application. The code below does not return the complete set of tags in the page. Some tags are missed, and their attributes and text bodies are missed as well. Please help me understand why this is happening, or suggest another way.

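 import java.io.BufferedReader;
 import java.io.FileWriter;
 import java.io.InputStreamReader;
 import java.io.PrintWriter;
 import java.net.URL;
 import java.net.URLConnection;
 import java.util.Enumeration;
 import javax.swing.text.AttributeSet;
 import javax.swing.text.Element;
 import javax.swing.text.ElementIterator;
 import javax.swing.text.html.HTMLDocument;
 import javax.swing.text.html.HTMLEditorKit;
 import javax.swing.text.html.parser.ParserDelegator;

 // The enclosing class and method were cut off in the original post; the class name is assumed.
 public class HTMLElements {
   public static void main(String[] args) throws Exception {
     URL url = new URL("...");
     URLConnection connection = url.openConnection();

     // try-with-resources closes the writer and reader, so buffered output reaches the file
     try (PrintWriter pw = new PrintWriter(new FileWriter("HTMLElements.txt"));
          BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {

       // Parse the page into a Swing HTMLDocument
       HTMLEditorKit htmlKit = new HTMLEditorKit();
       HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
       HTMLEditorKit.Parser parser = new ParserDelegator();
       HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
       parser.parse(br, callback, true);

       // Walk every element and print its name, then each attribute together with the element text
       ElementIterator iterator = new ElementIterator(htmlDoc);
       Element element;
       while ((element = iterator.next()) != null) {
         AttributeSet attributes = element.getAttributes();
         Enumeration<?> e = attributes.getAttributeNames();

         pw.println("Element Name :" + element.getName());
         while (e.hasMoreElements()) {
           Object key = e.nextElement();
           Object val = attributes.getAttribute(key);
           int startOffset = element.getStartOffset();
           int endOffset = element.getEndOffset();
           int length = endOffset - startOffset;
           String text = htmlDoc.getText(startOffset, length);

           pw.println("Key :" + key + " Value :" + val + "\r\n" + "Text :" + text + "\r\n");
         }
       }
     }
   }
 }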

Answer by Riduidel

You seem to be using the Swing HTMLDocument. That may not be the best choice. I believe you would get better results with, for example, NekoHTML.

Answer by gicappa

Another simple library you can use is JTidy, which can clean up your HTML before you parse it. Hope this helps.

http://sourceforge.net/projects/jtidy/

Ciao!

Answer by BalusC

As per the comments:

"Actually I want to extract information such as the product name, price, etc. of all products listed on an online shopping site such as amazon.com. How should I go about it?"

Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow on a User-Agent of *, then stop here. Contact them, explain in detail what you're trying to do, and ask for ways/alternatives/webservices that can provide the information you need. Otherwise you may be breaking the law and you risk getting blacklisted by the site and/or by your ISP, or worse. If the URL is not covered, proceed to step 2.

Step 2: check whether the site in question already offers a public webservice, which is much easier to use than parsing a whole HTML page. Using a webservice, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any webservices. If there's no such option, proceed to step 3.

Step 3: learn how HTML/CSS/JS work, learn how to work with web developer tools like Firebug, and learn how to interpret the HTML/CSS/JS source you see via right-click > View Page Source. My bet is that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need an HTML parser that can parse and execute JS as well (the one you're using doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve, whether it is allowed, and whether there really are no easier-to-use webservices available.
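
The robots check in step 1 can be roughly sketched in code. This is a minimal sketch under simplifying assumptions: it only collects the Disallow prefixes in the User-agent: * group and ignores Allow rules and wildcard patterns, so it may be stricter or looser than a full robots.txt parser. The class name, method names, and the sample rules are made up for illustration and are not any real site's file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsCheckSketch {

    // Collects the Disallow path prefixes that apply to "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group; otherwise a new group starts.
                if (!lastWasUserAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True when the path starts with one of the Disallow prefixes for "User-agent: *".
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules, not copied from any real site.
        String robots = "User-agent: *\nDisallow: /gp/cart\nDisallow: /private/ # internal\n\n"
                + "User-agent: SomeBot\nDisallow: /\n";
        System.out.println(isDisallowed(robots, "/gp/cart/view"));
        System.out.println(isDisallowed(robots, "/books"));
    }
}
```

If isDisallowed returns true for the page you want, that is the "stop here" case from step 1.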

Answer by bakkal

I am doing this fairly reliably with HTML Parser (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.

General idea:

You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example:

 <span class="price"> .95</span>

If you are looking for this "price", then you are interested in span tags with class "price".

HTML Parser can filter nodes by attribute.

filter = new HasAttributeFilter("class", "price");

When you parse using a filter, you get a list of Nodes. You can do an instanceof check on each of them to determine whether it is of the type you are interested in; for span you'd do something like

if (node instanceof Span) // or any other supported element.

See list of supported tags here.
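
The two-step idea above (filter by attribute, then check the tag type) can also be illustrated with the standard library alone. This is a rough sketch, not HTML Parser's API and not a real HTML parser: it assumes double-quoted attribute values that contain no ">" character and target tags that are not nested inside each other. The class name, method names, and sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeFilterSketch {

    // Group 1 = tag name, group 2 = raw attribute text of an opening tag.
    private static final Pattern OPEN_TAG = Pattern.compile("<(\\w+)\\b([^>]*)>");
    // Group 1 = attribute name, group 2 = double-quoted attribute value.
    private static final Pattern ATTRIBUTE = Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    // Parses the raw attribute text of one opening tag into a name-to-value map.
    static Map<String, String> attributesOf(String rawAttributes) {
        Map<String, String> result = new HashMap<>();
        Matcher m = ATTRIBUTE.matcher(rawAttributes);
        while (m.find()) {
            result.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return result;
    }

    // Returns the trimmed text of every tagName element whose attribute equals value.
    static List<String> textOf(String html, String tagName, String attribute, String value) {
        List<String> texts = new ArrayList<>();
        Matcher open = OPEN_TAG.matcher(html);
        while (open.find()) {
            // Step 1: the attribute filter.
            if (!value.equals(attributesOf(open.group(2)).get(attribute))) {
                continue;
            }
            // Step 2: the tag type check (the role "instanceof Span" plays in HTML Parser).
            if (!open.group(1).equalsIgnoreCase(tagName)) {
                continue;
            }
            int close = html.indexOf("</" + open.group(1), open.end());
            if (close >= 0) {
                texts.add(html.substring(open.end(), close).trim());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup with made-up prices.
        String html = "<div class=\"item\"><b>Widget</b> <span class=\"price\">$9.99</span></div>"
                + "<p class=\"price\">not a span</p>"
                + "<span class=\"price\"> $12.50 </span>";
        System.out.println(textOf(html, "span", "class", "price"));
    }
}
```

The p element carries the same class but fails the tag type check, which is why both steps are needed. For real pages, a proper parser such as the ones named in this thread is the safer choice.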

An example that uses HTML Parser to grab the meta tag holding a site's description:

Tag sample:

<meta name="description" content="Amazon.com: frankenstein: Books"/> 

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        //<meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}