Java: get all images from a web page

Question

Asked by Phil

Currently I need a program that, given a URL, returns a list of all the images on the web page.

For example:

logo.png gallery1.jpg test.gif

Is there any open source software available before I try to code something myself?

The language should be Java. Thanks, Philip

Answer 1

Answered by BalusC

Just use a simple HTML parser, like jTidy, then get all elements by tag name img and collect the src attribute of each in a List<String> or maybe a List<URI>.

You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:

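import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();

for (int i = 0; i < imgs.getLength(); i++) {
    // An <img> without a src attribute would otherwise cause a NullPointerException.
    Node src = imgs.item(i).getAttributes().getNamedItem("src");
    if (src != null) {
        srcs.add(src.getNodeValue());
    }
}
for (String src : srcs) {
    System.out.println(src);
}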
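The src values collected above are often relative (logo.png, /img/test.gif), so they usually have to be resolved against the page URL before they can be fetched. Here is a minimal sketch of the List<URI> variant using only java.net.URI. The class name SrcResolver and the example.com URLs are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns raw src attribute values into absolute URIs.
public class SrcResolver {

    // Resolves each src against the page URI; values that are not valid URIs are skipped.
    public static List<URI> resolveAll(URI page, List<String> srcs) {
        List<URI> result = new ArrayList<>();
        for (String src : srcs) {
            try {
                result.add(page.resolve(new URI(src.trim())));
            } catch (URISyntaxException e) {
                // For example a src containing unescaped spaces; ignored in this sketch.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed page URL and src values, for illustration only.
        URI page = URI.create("http://example.com/gallery/index.html");
        List<String> srcs = List.of("logo.png", "/img/test.gif", "http://cdn.example.com/gallery1.jpg");
        for (URI uri : resolveAll(page, srcs)) {
            System.out.println(uri);
        }
    }
}
```

Skipping invalid values silently keeps the sketch short; real code would probably log them or percent-encode the offending characters instead.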

I must admit, however, that HtmlUnit, as suggested by Bozho, does look better.

Answer 2

Answered by Bozho

HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.

(Read the short Get started guide to see how to obtain the correct HtmlPage object.)

Answer 3

Answered by Pascal Thivent

This is dead simple with HTML Parser (and any other decent HTML parser):

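import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));

for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}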

Answer 4

Answered by Bruno Carrier

With Open Graph tags and HTML Parser (the code below uses the same Parser API as the previous answer), you can extract your data really easily (PageMeta is a simple POJO holding the results):

    Parser parser = new Parser(url);

    PageMeta pageMeta = new PageMeta();
    pageMeta.setUrl(url);

    NodeList meta = parser.parse(new TagNameFilter("meta"));
    for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
        Tag tag = (Tag) iterator.nextNode();

        if ("og:image".equals(tag.getAttribute("property"))) {
            pageMeta.setImageUrl(tag.getAttribute("content"));
        }

        if ("og:title".equals(tag.getAttribute("property"))) {
            pageMeta.setTitle(tag.getAttribute("content"));
        }

        if ("og:description".equals(tag.getAttribute("property"))) {
            pageMeta.setDescription(tag.getAttribute("content"));
        }
    }

Answer 5

Answered by pravenndra thakur

You can simply use a regular expression in Java. Given HTML like this, the code that follows prints each src value:

<html>
<body>
<p>
<img src="38220.png" alt="test" title="test" /> 
<img src="32222.png" alt="test" title="test" />
</p>
</body>
</html>

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "html";  // replace with the HTML content above
    Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
    Matcher m = p.matcher(s);
    while (m.find()) {
        // Group 1 captures the text between the quotes of the src attribute.
        String src = m.group(1);
        System.out.println(src);
    }
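The regular-expression approach can be packaged as a small self-contained class. The sketch below is one possible variant, not the answerer's exact pattern: it accepts single or double quotes and any attribute order, and the class name ImgSrcRegex is made up for illustration. It still assumes quoted attribute values and will be fooled by markup inside comments or scripts, so a real parser remains the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts src values of <img> tags with a regular expression.
public class ImgSrcRegex {

    // "<img", any attributes, whitespace, then src= and a quoted value.
    // Group 1 is the quote character, group 2 the attribute value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> srcs = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            srcs.add(m.group(2));
        }
        return srcs;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>"
                + "<img src=\"38220.png\" alt=\"test\" title=\"test\" />"
                + "<img alt='test' src='32222.png' />"
                + "</p></body></html>";
        System.out.println(extract(html)); // prints [38220.png, 32222.png]
    }
}
```

Requiring whitespace before src keeps attributes such as data-src from being picked up by mistake.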

Answer 6

Answered by PeterMmm

You can use wget, which has a lot of options available.

Or search Google for "java wget"...

Answer 7

Answered by craftsman

You can parse the HTML and collect the src attributes of all img elements in a Collection. Then download the resource at each URL and write it to a file. Several HTML parsers are available for the parsing step; Cobra is one of them.
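The download step described above can be sketched with the standard library alone. This assumes the image URLs have already been collected and made absolute. The class name ImageDownloader, the images target directory and the example.com URL are placeholders for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper: saves each image URL to a local directory.
public class ImageDownloader {

    // Derives a local file name from the last path segment of the URI, e.g. "logo.png".
    public static String fileName(URI uri) {
        String path = uri.getPath();
        String name = (path == null) ? "" : path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "image" : name;
    }

    // Downloads one resource into targetDir and returns the written file.
    public static Path download(URI uri, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName(uri));
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace with the absolute image URLs collected from the page.
        List<URI> images = List.of(URI.create("http://example.com/logo.png"));
        for (URI image : images) {
            System.out.println("Saved " + download(image, Paths.get("images")));
        }
    }
}
```

Two images with the same file name overwrite each other in this sketch; prefixing a counter or a hash of the URL would avoid that.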
