Get html file Java

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/704821/

Time: 2020-08-11 18:33:07  Source: igfitidea

java

Asked by rpf

Duplicate:

How do you Programmatically Download a Webpage in Java?

How to fetch html in Java

I'm developing an application in which the user enters the URL of some website, and the application then has to analyze that URL.

How can I get access to the HTML file using Java? Do I need to use HttpRequest? How does that work?

Thanks.

Accepted answer by kgiannakakis

URLConnection is fine for simple cases. When there are things like redirections involved, you are better off using Apache's HTTPClient.

Answer by willcodejavaforfood

You could just use a URLConnection. See this Java Tutorial from Sun.

Answer by Mark

You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.
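Since the linked example is not reproduced here, the following is a minimal sketch of that approach. The class name `FetchHtml`, the placeholder URL, and the assumption that the page is UTF-8 encoded are all choices made for this illustration, not part of the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchHtml {

  // Reads everything the URL returns and decodes it as UTF-8.
  // Assumption: the page really is UTF-8; real pages may declare another charset.
  public static String fetch(URL url) throws IOException {
    try (InputStream in = url.openStream()) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    // Placeholder URL; pass a different one as the first argument.
    String address = args.length > 0 ? args[0] : "http://stackoverflow.com";
    System.out.println(fetch(URI.create(address).toURL()));
  }
}
```

Because `openStream()` works for any URL scheme the JVM supports, the same method can also be tried against a local `file:` URL before pointing it at a real site.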

Answer by McDowell

This code downloads data from a URL, treating it as binary content:

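import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class Download {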

  private static void download(URL input, File output)
      throws IOException {
    InputStream in = input.openStream();
    try {
      OutputStream out = new FileOutputStream(output);
      try {
        copy(in, out);
      } finally {
        out.close();
      }
    } finally {
      in.close();
    }
  }

  private static void copy(InputStream in, OutputStream out)
      throws IOException {
    byte[] buffer = new byte[1024];
    while (true) {
      int readCount = in.read(buffer);
      if (readCount == -1) {
        break;
      }
      out.write(buffer, 0, readCount);
    }
  }

  public static void main(String[] args) {
    try {
      URL url = new URL("http://stackoverflow.com");
      File file = new File("data");
      download(url, file);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

}

The downside of this approach is that it ignores any meta-data, like the Content-Type, which you would get from using HttpURLConnection (or a more sophisticated API, like the Apache one).
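To illustrate what that meta-data looks like, here is a rough sketch that asks a `URLConnection` for the Content-Type and pulls the charset out of it. The class name `ContentTypeInfo`, the helper `charsetOf`, the placeholder URL, and the ISO-8859-1 fallback are assumptions of this example, not something the answer prescribes.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeInfo {

  // Extracts the charset parameter from a Content-Type value such as
  // "text/html; charset=ISO-8859-1". Returns the fallback when the header
  // is missing, has no charset parameter, or names an unknown charset.
  public static Charset charsetOf(String contentType, Charset fallback) {
    if (contentType == null) {
      return fallback;
    }
    for (String part : contentType.split(";")) {
      String p = part.trim();
      if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
        String name = p.substring("charset=".length()).trim().replace("\"", "");
        try {
          return Charset.forName(name);
        } catch (IllegalArgumentException e) {
          return fallback; // illegal or unsupported charset name
        }
      }
    }
    return fallback;
  }

  public static void main(String[] args) throws IOException {
    // Placeholder URL; pass a different one as the first argument.
    String address = args.length > 0 ? args[0] : "http://stackoverflow.com";
    URLConnection connection = URI.create(address).toURL().openConnection();
    String contentType = connection.getContentType(); // may be null
    System.out.println("Content-Type: " + contentType);
    System.out.println("Charset: " + charsetOf(contentType, StandardCharsets.ISO_8859_1));
  }
}
```

Knowing the charset matters as soon as the bytes are turned into text; the binary download above sidesteps the question by never decoding them.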

In order to parse the HTML data, you'll either need a specialized HTML parser that can handle poorly formed markup, or tidy it first before parsing with an XML parser.
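A small sketch of why a plain XML parser is not enough for real-world HTML: the JDK's own DOM parser accepts well-formed markup but rejects a fragment with an unclosed tag. The class name `XmlStrictness` and the sample strings are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlStrictness {

  // Returns true if the markup is well-formed enough for a plain XML parser.
  public static boolean parsesAsXml(String markup) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors without printing them to stderr.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader(markup)));
      return true;
    } catch (SAXException e) {
      return false; // the markup is not well-formed XML
    } catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Well-formed: every tag is closed.
    System.out.println(parsesAsXml("<html><body><p>one<br/>two</p></body></html>")); // prints "true"
    // Typical sloppy HTML: <br> and <p> are never closed.
    System.out.println(parsesAsXml("<html><body><p>one<br>two</body></html>")); // prints "false"
  }
}
```

Browsers render the second string without complaint, which is why pages like it are common and why the tidy-first or lenient-parser route is needed.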

Answer by Kris

Funnily enough, I wrote a utility method that does just that the other week:

/**
 * Retrieves the file specified by <code>fileURL</code> and writes it to 
 * <code>out</code>.
 * <p>
 * Does not close <code>out</code>, but does flush.
 * @param fileURL The URL of the file.
 * @param out An output stream to capture the contents of the file
 * @param batchWriteSize The number of bytes to write to <code>out</code>
 *                       at once (larger files than this will be written
 *                       in several batches)
 * @throws IOException If call to web server fails
 * @throws FileNotFoundException If the call to the web server does not
 *                               return status code 200. 
 */
public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
                            throws IOException{
    GetMethod get = new GetMethod(fileURL);
    HttpClient client = new HttpClient();
    HttpClientParams params = client.getParams();
    params.setSoTimeout(2000);
    client.setParams(params);
    try {
        client.executeMethod(get);
    } catch(ConnectException e){
        // Add some context to the exception and rethrow
        throw new IOException("ConnectionException trying to GET " + 
                fileURL,e);
    }

    if(get.getStatusCode()!=200){
        throw new FileNotFoundException(
                "Server returned " + get.getStatusCode());
    }

    // Get the input stream
    BufferedInputStream bis = 
        new BufferedInputStream(get.getResponseBodyAsStream());

    // Read the file and stream it out
    byte[] b = new byte[batchWriteSize];
    int bytesRead = bis.read(b,0,batchWriteSize);
    long bytesTotal = 0;
    while(bytesRead!=-1) {
        bytesTotal += bytesRead;
        out.write(b, 0, bytesRead);
        bytesRead = bis.read(b,0,batchWriteSize);
    } 
    bis.close(); // Release the input stream.
    out.flush();        
}

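Uses the Apache Commons HttpClient library, i.e.

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;

// Standard-library classes the method also needs
import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.ConnectException;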