Get an HTML file in Java

Question

Asked by rpf

Duplicate:

How do you Programmatically Download a Webpage in Java?
How to fetch html in Java

I'm developing an application in which the user enters the URL of a website, and the application then has to analyze that URL.

How can I access the HTML file using Java? Do I need to use HttpRequest? How does that work?

Thanks.

Answer 1

Accepted answer by kgiannakakis

URLConnection is fine for simple cases. When things like redirects are involved, you are better off using Apache's HTTPClient.
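
As background to the redirect remark: HttpURLConnection follows redirects by itself by default, but it does not switch protocols (for example from http to https). The following is only a minimal sketch of handling that case by hand with the JDK alone. The class name RedirectFollower, its helper methods, and the hop limit of 5 are assumptions made for illustration; a library such as Apache's HTTPClient takes care of this for you.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RedirectFollower {
    // Assumed upper bound on redirect hops, to avoid redirect loops.
    private static final int MAX_HOPS = 5;

    /** True for the HTTP status codes that carry a Location header to follow. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    /** Resolves a (possibly relative) Location value against the current address. */
    static URI resolve(URI current, String location) {
        return current.resolve(location);
    }

    /** Opens an http/https address, following redirects manually, protocol changes included. */
    static HttpURLConnection open(URI start) throws IOException {
        URI current = start;
        for (int hop = 0; hop <= MAX_HOPS; hop++) {
            // The cast assumes an http or https address.
            HttpURLConnection conn =
                (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we follow redirects ourselves
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn; // caller reads the body and disconnects
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects starting at " + start);
    }
}
```

A caller would use something like RedirectFollower.open(URI.create("http://stackoverflow.com")) and then read from getInputStream() on the returned connection.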

Answer 2

Answered by willcodejavaforfood

You could just use a URLConnection. See this Java Tutorial from Sun.

Answer 3

Answered by Mark

You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.
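
Since the linked example is not reproduced here, the idea can be sketched roughly as follows. The class name HtmlFetcher and its method names are made up for illustration, and decoding as UTF-8 is an assumption: the page may use a different encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HtmlFetcher {

    /** Reads the stream to its end and decodes the bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Opens an input stream on the URL and returns the response body as text. */
    static String fetch(String address) throws IOException {
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = new URL(address).openStream()) {
            // Assumption: the page is UTF-8 encoded.
            return readAll(in, StandardCharsets.UTF_8);
        }
    }
}
```

Calling HtmlFetcher.fetch("http://stackoverflow.com") would then return the page's HTML as a String, ready to be handed to whatever analysis the application performs.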

Answer 4

Answered by McDowell

This code downloads data from a URL, treating it as binary content:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class Download {

  // Copies the resource at the given URL into the output file, byte for byte.
  private static void download(URL input, File output)
      throws IOException {
    // try-with-resources closes both streams, even if copy() throws.
    try (InputStream in = input.openStream();
         OutputStream out = new FileOutputStream(output)) {
      copy(in, out);
    }
  }

  // Reads the input in chunks of up to 1 KB until end of stream (-1).
  private static void copy(InputStream in, OutputStream out)
      throws IOException {
    byte[] buffer = new byte[1024];
    while (true) {
      int readCount = in.read(buffer);
      if (readCount == -1) {
        break;
      }
      out.write(buffer, 0, readCount);
    }
  }

  public static void main(String[] args) {
    try {
      URL url = new URL("http://stackoverflow.com");
      File file = new File("data");
      download(url, file);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

}

The downside of this approach is that it ignores any metadata, such as the Content-Type, which you would get by using HttpURLConnection (or a more sophisticated API, like the Apache one).
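
To illustrate the kind of metadata meant here, this is a rough sketch that reads the Content-Type header through HttpURLConnection and uses its charset parameter to decode the body. The class name ContentTypeAwareFetch, the helper charsetFrom, and the UTF-8 fallback are assumptions for illustration, and readAllBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeAwareFetch {

    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Fetches the page and decodes it using the charset the server declared. */
    static String fetch(String address) throws IOException {
        // The cast assumes an http or https address.
        HttpURLConnection conn =
            (HttpURLConnection) new URL(address).openConnection();
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Server returned " + status + " for " + address);
            }
            // Assumption: fall back to UTF-8 when no charset is declared.
            Charset charset = charsetFrom(conn.getContentType(), StandardCharsets.UTF_8);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), charset);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Note that an HTML page can also declare its encoding in a meta tag, which this sketch does not look at.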

In order to parse the HTML data, you'll either need a specialized HTML parser that can handle poorly formed markup, or you'll need to tidy it first before parsing it with an XML parser.
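
To show why the tidy step matters, here is a small sketch that feeds markup to the JDK's strict XML parser: well-formed (already tidied) markup parses and can be queried, while markup with an unclosed tag is rejected. The class name XhtmlLinks and its methods are hypothetical, and the tidy step itself is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XhtmlLinks {

    /** Parses well-formed markup and returns the href of every <a> element that has one. */
    static List<String> hrefs(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Throws SAXException if the markup is not well-formed XML.
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                result.add(anchor.getAttribute("href"));
            }
        }
        return result;
    }

    /** True if the XML parser accepts the markup, false if it rejects it as malformed. */
    static boolean isWellFormed(String markup) {
        try {
            hrefs(markup);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Real-world pages frequently fail this check, which is why the answer suggests either a lenient HTML parser or tidying the markup first.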

Answer 5

Answered by Kris

Funnily enough, I wrote a utility method that does just that the other week:

/**
 * Retrieves the file specified by <code>fileURL</code> and writes it to
 * <code>out</code>.
 * <p>
 * Does not close <code>out</code>, but does flush.
 * @param fileURL The URL of the file.
 * @param out An output stream to capture the contents of the file
 * @param batchWriteSize The number of bytes to write to <code>out</code>
 *                       at once (larger files than this will be written
 *                       in several batches)
 * @throws IOException If call to web server fails
 * @throws FileNotFoundException If the call to the web server does not
 *                               return status code 200. 
 */
public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
                            throws IOException{
    GetMethod get = new GetMethod(fileURL);
    HttpClient client = new HttpClient();
    HttpClientParams params = client.getParams();
    params.setSoTimeout(2000);
    client.setParams(params);
    try {
        client.executeMethod(get);
    } catch(ConnectException e){
        // Add some context to the exception and rethrow
        throw new IOException("ConnectionException trying to GET " + 
                fileURL,e);
    }

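    int status = get.getStatusCode();
    if(status!=200){
        // Release the connection before bailing out, since the body is never read
        get.releaseConnection();
        throw new FileNotFoundException(
                "Server returned " + status);
    }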

    // Get the input stream
    BufferedInputStream bis = 
        new BufferedInputStream(get.getResponseBodyAsStream());

    // Read the file and stream it out
    byte[] b = new byte[batchWriteSize];
    int bytesRead = bis.read(b,0,batchWriteSize);
    long bytesTotal = 0;
    while(bytesRead!=-1) {
        bytesTotal += bytesRead;
        out.write(b, 0, bytesRead);
        bytesRead = bis.read(b,0,batchWriteSize);
    } 
    bis.close(); // Release the input stream.
    out.flush();        
}

This uses the Apache Commons HttpClient library, i.e.:

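import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.ConnectException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;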
