Get all Images from WebPage Program | Java
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original address: http://stackoverflow.com/questions/2172733/
Asked by Phil
Currently I need a program that, given a URL, returns a list of all the images on the webpage.
For example:
logo.png gallery1.jpg test.gif
Is there any open source software available before I try to code something?
The language should be Java. Thanks, Philip
Answered by BalusC
Just use a simple HTML parser, like jTidy, then get all elements by tag name img and collect the src attribute of each in a List<String> or maybe a List<URI>.
You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:
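import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
    Node srcAttr = imgs.item(i).getAttributes().getNamedItem("src");
    if (srcAttr != null) { srcs.add(srcAttr.getNodeValue()); } // skip <img> tags without a src
}
for (String src : srcs) {
    System.out.println(src);
}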
I must however admit that HtmlUnit as suggested by Bozho indeed looks better.
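The src values collected this way are often relative (logo.png, /img/gallery1.jpg), so the List<URI> variant mentioned above needs them resolved against the page address. A minimal sketch using only java.net.URI follows; the class name SrcResolver, the method resolveAll, and the example.com addresses are made up for illustration, and the sketch assumes Java 9+ (for List.of) and ignores any <base href> element the page might declare.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class SrcResolver {
    // Resolves each src value against the page address.
    // Values that are not syntactically valid URIs are skipped.
    static List<URI> resolveAll(String pageUrl, List<String> srcs) {
        URI base = URI.create(pageUrl);
        List<URI> resolved = new ArrayList<>();
        for (String src : srcs) {
            try {
                resolved.add(base.resolve(src.trim()));
            } catch (IllegalArgumentException e) {
                // e.g. a src containing an unescaped space; ignored in this sketch
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> srcs = List.of("logo.png", "/img/gallery1.jpg", "http://example.com/test.gif");
        for (URI uri : resolveAll("http://www.example.com/gallery/index.html", srcs)) {
            System.out.println(uri);
        }
    }
}
```

Relative paths are resolved against the directory of the page, paths starting with a slash against the host, and absolute URLs are kept unchanged.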
Answered by Bozho
HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.
(Read the short Get started guide to see how to obtain the correct HtmlPage object.)
Answered by Pascal Thivent
This is dead simple with HTML Parser (and any other decent HTML parser):
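import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));
for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}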
Answered by Bruno Carrier
With Open Graph tags, you can extract the page's image, title, and description really easily. Note that the snippet uses the HTML Parser classes (Parser, TagNameFilter) rather than HtmlUnit; PageMeta is a simple POJO holding the results:
Parser parser = new Parser(url);
PageMeta pageMeta = new PageMeta();
pageMeta.setUrl(url);
NodeList meta = parser.parse(new TagNameFilter("meta"));
for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    if ("og:image".equals(tag.getAttribute("property"))) {
        pageMeta.setImageUrl(tag.getAttribute("content"));
    }
    if ("og:title".equals(tag.getAttribute("property"))) {
        pageMeta.setTitle(tag.getAttribute("content"));
    }
    if ("og:description".equals(tag.getAttribute("property"))) {
        pageMeta.setDescription(tag.getAttribute("content"));
    }
}
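The answer does not show PageMeta itself. As a rough sketch of what such a POJO might look like: the four setters are taken from the calls in the snippet above, while the field names and the getters are assumptions added for illustration.

```java
// Hypothetical holder for the Open Graph values collected above.
class PageMeta {
    private String url;
    private String imageUrl;
    private String title;
    private String description;

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public String getImageUrl() { return imageUrl; }
    public void setImageUrl(String imageUrl) { this.imageUrl = imageUrl; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getDescription() { return description; }
    public void setDescription(String description) { this.description = description; }
}
```

Any field whose og: tag is missing from the page simply stays null, so callers should check for that before using a value.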
Answered by pravenndra thakur
You can simply use a regular expression in Java, for example on HTML like this:
<html>
<body>
<p>
<img src="38220.png" alt="test" title="test" />
<img src="32222.png" alt="test" title="test" />
</p>
</body>
</html>
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "html"; // replace with the HTML content above
// Capture everything between the opening quote after src= and the next quote
Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
Matcher m = p.matcher(s);
while (m.find()) {
    String src = m.group(1); // group 1 is the src value itself
    System.out.println(src);
}
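For experimenting with the regular-expression approach, here is a self-contained sketch that runs against the sample HTML from this answer. The class name ImgSrcRegex and the method extract are made up for illustration, and the pattern is a slightly stricter variant of the one above rather than the answerer's exact expression: it is case-insensitive, allows whitespace around the equals sign, and requires the closing quote to match the opening one. It still does not handle unquoted attribute values, and regular expressions in general are fragile on real-world HTML, so a proper parser is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcRegex {
    // "<img", any attributes, whitespace, then src="..." or src='...'.
    // Group 1 is the quote character, group 2 the src value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> srcs = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            srcs.add(m.group(2));
        }
        return srcs;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>"
                + "<img src=\"38220.png\" alt=\"test\" title=\"test\" />"
                + "<img src=\"32222.png\" alt=\"test\" title=\"test\" />"
                + "</p></body></html>";
        for (String src : extract(html)) {
            System.out.println(src);
        }
    }
}
```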
Answered by PeterMmm
Answered by craftsman
You can parse the HTML and collect the SRC attributes of all IMG elements in a Collection. Then download each resource from its URL and write it to a file. For parsing there are several HTML parsers available; Cobra is one of them.
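The download step described above can be sketched with the standard library alone. This is only an illustration under some assumptions: the class name ImageDownloader and its methods are hypothetical, the file name is taken from the last path segment of the URL (so two images with the same name overwrite each other), and there is no timeout, retry, or content-type handling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class ImageDownloader {
    // Derives a local file name from the last path segment of the URL.
    static String fileNameFor(URL url) {
        String path = url.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        // Fall back to a fixed name when the segment is empty or a dot segment
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return "index";
        }
        return name;
    }

    // Copies the resource behind the URL into targetDir and returns the written file.
    static Path download(URL url, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameFor(url));
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // Usage: java ImageDownloader <targetDir> <imageUrl>...
    public static void main(String[] args) throws IOException {
        Path targetDir = Paths.get(args[0]);
        for (int i = 1; i < args.length; i++) {
            System.out.println(download(URI.create(args[i]).toURL(), targetDir));
        }
    }
}
```

The URLs passed in would typically be the absolute addresses obtained from the collected SRC attributes.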

