How do you Programmatically Download a Webpage in Java
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/238547/
Asked by jjnguy
I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how would I handle various types of compression?
How would I go about doing that using Java?
Accepted answer by Bill the Lizard
Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions, or passing them up the call stack, though.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;
    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream(); // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}
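Since the question asks for the page in a String, a minimal variation of the same idea collects the lines into a StringBuilder instead of printing them. The fetchUrl helper name is mine, not from the original answer; note that InputStreamReader without an explicit charset falls back to the platform default.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static String fetchUrl(String address) throws IOException {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the stream even if readLine() throws
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new URL(address).openStream()))) {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    return sb.toString();
}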
Answered by Jon Skeet
Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control.
Personally I'd go with the Apache HTTPClient library.
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components.
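For reference, a minimal sketch of the same fetch using Apache HttpClient 4.x from the HttpComponents project (the packages and the createDefault() entry point assume the 4.3+ API; check the current HttpComponents docs):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        // the default client follows redirects and pools connections
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet("http://stackoverflow.com/"))) {
            String html = EntityUtils.toString(response.getEntity());
            System.out.println(html);
        }
    }
}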
Answered by Timo Geusch
On a Unix/Linux box you could just run 'wget' but this is not really an option if you're writing a cross-platform client. Of course this assumes that you don't really want to do much with the data you download between the point of downloading it and it hitting the disk.
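If shelling out to wget fits your target platforms, a minimal sketch of driving it from Java with ProcessBuilder might look like this (it assumes wget is on the PATH; the output file name is illustrative):

import java.io.IOException;

public class WgetRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // inheritIO() forwards wget's progress output to this process's console
        Process p = new ProcessBuilder("wget", "-O", "page.html", "http://stackoverflow.com/")
                .inheritIO()
                .start();
        System.out.println("wget exited with code " + p.waitFor());
    }
}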
Answered by jjnguy
Bill's answer is very good, but you may want to do some things with the request, like compression or user-agents. The following code shows how you can add support for various types of compression in your requests.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // cast shouldn't fail for http(s) URLs
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");

String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(), new Inflater(true));
} else {
    inStr = conn.getInputStream();
}
To also set the user-agent, add the following code:
conn.setRequestProperty("User-agent", "my agent name");
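To actually get the page into a String from here, wrap the (possibly decompressed) stream in a reader; a minimal sketch, assuming the inStr variable from the snippet above and a UTF-8 response:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

StringBuilder body = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(inStr, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        body.append(line).append('\n');
    }
}
String html = body.toString();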
Answered by BalusC
I'd use a decent HTML parser like Jsoup. It's then as easy as:
String html = Jsoup.connect("http://stackoverflow.com").get().html();
It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversal and manipulation by CSS selectors, much as jQuery can do. You only have to grab it as a Document, not as a String.
Document document = Jsoup.connect("http://google.com").get();
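As a small illustration of the jQuery-like traversal mentioned above, here is a sketch that lists every link on the page (the selector is just an example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document document = Jsoup.connect("http://stackoverflow.com").get();
// select every anchor that has an href attribute, jQuery-style
Elements links = document.select("a[href]");
for (Element link : links) {
    // "abs:href" resolves relative URLs against the document's base URI
    System.out.println(link.text() + " -> " + link.attr("abs:href"));
}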
You really don't want to run basic String methods or even regex on HTML to process it.
Answered by user3690910
All the above-mentioned approaches do not download the web page text as it looks in the browser. These days a lot of data is loaded into browsers through scripts in HTML pages. None of the techniques mentioned above supports scripts; they just download the HTML text. HtmlUnit supports JavaScript, so if you are looking to download the web page text as it looks in the browser then you should use HtmlUnit.
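A minimal sketch with HtmlUnit, assuming the 2.x API (in recent 2.x releases WebClient is AutoCloseable; package names may differ in other versions):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // let the page's JavaScript run before reading the DOM
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage("http://stackoverflow.com/");
            System.out.println(page.asXml()); // the rendered DOM as HTML
        }
    }
}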
Answered by Jan Bodnar
Jetty has an HTTP client which can be used to download a web page.
package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {
            client = new HttpClient();
            client.start();

            String url = "http://www.something.com";
            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());
        } finally {
            if (client != null) {
                client.stop();
            }
        }
    }
}
The example prints the contents of a simple web page.
In a Reading a web page in Java tutorial I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
Answered by A_01
I used the actual answer to this post (url) and wrote the output into a file.
package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        try {
            URL oracle = new URL("http://www.fetagracollege.org");
            BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));

            String fileName = "D:\\a_01\\output.txt"; // backslashes must be escaped in Java string literals
            PrintWriter writer = new PrintWriter(fileName, "UTF-8");

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
            in.close();
            writer.close(); // flush and release the output file
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow failures silently
        }
    }
}
Answered by Sohaib Aslam
This Android Activity may help: it fetches a page's source code and filters a piece of information out of it.
public class MainActivity extends AppCompatActivity {

    EditText url;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        url = (EditText) findViewById(R.id.editText);
        DownloadCode obj = new DownloadCode();
        try {
            String tag1 = "<div class=\"description\">";
            String l = obj.execute("http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty").get();
            // take the text between the opening description tag and the next </div>
            String[] t1 = l.split(tag1);
            String[] t2 = t1[1].split("</div>");
            url.setText(t2[0]);
        } catch (Exception e) {
            Toast.makeText(this, e.toString(), Toast.LENGTH_SHORT).show();
        }
    }

    // input, background work, output
    class DownloadCode extends AsyncTask<String, Void, String> {
        @Override
        protected String doInBackground(String... WebAddress) { // web addresses separated by ','
            String htmlcontent = " ";
            try {
                URL url = new URL(WebAddress[0]);
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                InputStreamReader reader = new InputStreamReader(input);
                // read the response one character at a time
                int data = reader.read();
                while (data != -1) {
                    htmlcontent += (char) data;
                    data = reader.read();
                }
                reader.close();
            } catch (Exception e) {
                Log.i("Status : ", e.toString());
            }
            return htmlcontent;
        }
    }
}
Answered by QA Specialist
You'd most likely need to extract code from a secure web page (https protocol). In the following example, the html file is being saved into c:\temp\filename.html. Enjoy!
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url</b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html"; // backslashes escaped for the Java compiler
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0");
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);

        String inputLine;
        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips line terminators, so restore them
        }
        in.close();
        bw.close();
    }
}