Java: how do I get the source of a given URL from a servlet?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and cite the original: http://stackoverflow.com/questions/7138296/

Date: 2020-10-30 18:47:41  Source: igfitidea

How do I get the source of a given URL from a servlet?

java, html, jsp, servlets, web-scraping

Asked by Débora

I want to read a source code (HTML tags) of a given URL from my servlet.


For example, the URL is http://www.google.com and my servlet needs to read the HTML source code. The reason I need this is that my web application is going to read other web pages, extract useful content, and do something with it.


Let's say my application shows a list of shops in one category in a city. The list is generated like this: my web application (servlet) goes through a given web page that displays various shops and reads its content. From the source code, my servlet filters out the useful details and finally builds the list (because my servlet has no access to the given URL's web application database).


Does anyone know a solution? (In particular, I need to do this in a servlet.) If you think there is a better way to get details from another site, please let me know.


Thank you


Accepted answer by Srinivas

What you are trying to do is called web scraping. Kayak and similar websites do it; do search for it on the web ;) In Java you can do this:


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://www.google.com"); // or any other URL

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
StringBuilder response = new StringBuilder();

while ((inputLine = in.readLine()) != null) {
    response.append(inputLine).append("\n");
}

in.close();

response will then contain the complete HTML content returned by that URL.


Answered by Andrey Adamovich

You don't need a servlet to read data from a remote server. You can just use the java.net.URL or java.net.URLConnection classes to read remote content from an HTTP server. For example,


InputStream input = (InputStream) new URL("http://www.google.com").getContent();
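As a self-contained sketch of that approach, the stream can be read into a String as below. The file: URL in the demo is only so the example runs without network access; any http:// URL works the same way (the class and method names here are illustrative, not from the original answer):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UrlSource {

    // Reads the entire content behind a URL into a String.
    static String fetch(String address) throws Exception {
        URL url = new URL(address);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Demo with a file: URL so the example needs no network;
        // fetch("http://www.google.com") behaves the same way.
        File tmp = File.createTempFile("page", ".html");
        try (FileWriter w = new FileWriter(tmp)) {
            w.write("<html><body>hello</body></html>");
        }
        String html = fetch(tmp.toURI().toURL().toString());
        System.out.println(html.contains("hello")); // prints true
    }
}
```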

Answered by Jeremy

Take a look at jsoup for fetching and parsing the HTML.


Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Answered by hsestupin

As written above, you don't need a servlet for this purpose. The Servlet API is used for responding to requests; a servlet container runs on the server side. If I understand you right, you don't need any server for this purpose. You just need a simple HTTP client. I hope the following example helps:


import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class SimpleHttpClient {

    public String execute() {

        HttpClient httpClient = new DefaultHttpClient();
        // The URI must include the scheme, not just "google.com"
        HttpGet httpGet = new HttpGet("http://www.google.com");
        StringBuilder content = new StringBuilder();

        try {
            HttpResponse response = httpClient.execute(httpGet);

            byte[] buffer = new byte[1024];
            InputStream is = response.getEntity().getContent();

            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                // Append only the bytes actually read, not the whole buffer
                content.append(new String(buffer, 0, bytesRead, "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return content.toString();
    }
}

Answered by AlexR

There are several solutions.

有几种解决方案。

The simplest one is using regular expressions. If you just want to extract links from tags like <a href="THE URL">, use a regular expression like <a\s+href\s*=\s*["']?(.*?)["']\s*/>. Group 1 contains the URL. Now just create a Matcher and iterate over your document while matcher.find() is true.

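A short sketch of the regex approach (the class name, sample HTML, and the slightly relaxed pattern are illustrative assumptions; the pattern here does not require a self-closing tag, since ordinary anchors rarely self-close):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Extracts href values from anchor tags with a regular expression.
    // Note: regexes only cope with simple, well-behaved HTML.
    static List<String> extractLinks(String html) {
        Pattern p = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"']?([^\"'>\\s]+)[\"']?",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        List<String> links = new ArrayList<>();
        while (m.find()) {
            links.add(m.group(1)); // group(1) holds the URL
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com\">one</a>"
                    + " <a href='http://example.org/x'>two</a></p>";
        System.out.println(extractLinks(html));
        // prints [http://example.com, http://example.org/x]
    }
}
```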

The next solution is using an XML parser to parse the HTML. This will work fine if your sites are written in well-formed HTML (XHTML). Since that is not always the case, this solution is applicable to selected sites only.

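For the XML-parser route, the JDK's own DOM parser is enough, as in this sketch (class name and sample markup are illustrative; it works only when the input really is well-formed XHTML):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlLinks {

    // Parses a well-formed XHTML string with the JDK's XML parser
    // and returns every href attribute found on <a> tags.
    static List<String> extractHrefs(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<html><body><a href=\"http://example.com\">one</a></body></html>";
        System.out.println(extractHrefs(xhtml)); // prints [http://example.com]
    }
}
```

A malformed page (unclosed tags, bare ampersands) will make `builder.parse` throw, which is exactly why this route only suits sites known to serve XHTML.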

The next solution is using the Java built-in HTML parser: http://java.sun.com/products/jfc/tsc/articles/bookmarks/


The next, most flexible way is using a "real" HTML parser, or even better a Java-based HTML browser: Java HTML Parsing


Now it depends on the details of your task. If parsing static anchor tags is enough, use regular expressions. If not, choose one of the other approaches.


Answered by umbr

As people said, you may use the core classes java.net.URL and java.net.URLConnection to fetch web pages. But more useful for that purpose is Apache HttpClient. Look for docs & examples here: http://hc.apache.org/
