How do you Programmatically Download a Webpage in Java
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/238547/
Asked by jjnguy
I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how would I handle various types of compression?
How would I go about doing that using Java?
Accepted answer by Bill the Lizard
Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions, or passing them up the call stack, though.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;
    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream(); // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}
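Since the question asks for the page in a String, a minimal variation of the same idea collects the lines into a StringBuilder instead of printing them. The fetchUrl helper name is mine, not from the original answer; note that InputStreamReader without an explicit charset falls back to the platform default.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static String fetchUrl(String address) throws IOException {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the stream even if readLine() throws
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new URL(address).openStream()))) {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    return sb.toString();
}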
Answered by Jon Skeet
Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control.
Personally I'd go with the Apache HTTPClient library.
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components.
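For reference, a minimal sketch of the same fetch using Apache HttpClient 4.x from the HttpComponents project (the packages and the createDefault() entry point assume the 4.3+ API; check the current HttpComponents docs):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        // the default client follows redirects and pools connections
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet("http://stackoverflow.com/"))) {
            String html = EntityUtils.toString(response.getEntity());
            System.out.println(html);
        }
    }
}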
Answered by Timo Geusch
On a Unix/Linux box you could just run 'wget' but this is not really an option if you're writing a cross-platform client. Of course this assumes that you don't really want to do much with the data you download between the point of downloading it and it hitting the disk.
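If shelling out to wget fits your target platforms, a minimal sketch of driving it from Java with ProcessBuilder might look like this (it assumes wget is on the PATH; the output file name is illustrative):

import java.io.IOException;

public class WgetRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // inheritIO() forwards wget's progress output to this process's console
        Process p = new ProcessBuilder("wget", "-O", "page.html", "http://stackoverflow.com/")
                .inheritIO()
                .start();
        System.out.println("wget exited with code " + p.waitFor());
    }
}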
Answered by jjnguy
Bill's answer is very good, but you may want to do some things with the request, like compression or user-agents. The following code shows how you can add support for various types of compression in your requests.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // cast shouldn't fail for http(s) URLs
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");

String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(), new Inflater(true));
} else {
    inStr = conn.getInputStream();
}
To also set the user-agent, add the following code:
conn.setRequestProperty("User-agent", "my agent name");
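To actually get the page into a String from here, wrap the (possibly decompressed) stream in a reader; a minimal sketch, assuming the inStr variable from the snippet above and a UTF-8 response:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

StringBuilder body = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(inStr, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        body.append(line).append('\n');
    }
}
String html = body.toString();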
Answered by BalusC
I'd use a decent HTML parser like Jsoup. It's then as easy as:
String html = Jsoup.connect("http://stackoverflow.com").get().html();
It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversal and manipulation by CSS selectors, much as jQuery can do. You only have to grab it as a Document, not as a String.
Document document = Jsoup.connect("http://google.com").get();
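As a small illustration of the jQuery-like traversal mentioned above, here is a sketch that lists every link on the page (the selector is just an example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document document = Jsoup.connect("http://stackoverflow.com").get();
// select every anchor that has an href attribute, jQuery-style
Elements links = document.select("a[href]");
for (Element link : links) {
    // "abs:href" resolves relative URLs against the document's base URI
    System.out.println(link.text() + " -> " + link.attr("abs:href"));
}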
You really don't want to run basic String methods or even regex on HTML to process it.
Answered by user3690910
All the above-mentioned approaches do not download the web page text as it looks in the browser. These days a lot of data is loaded into browsers through scripts in HTML pages. None of the techniques mentioned above supports scripts; they just download the HTML text. HtmlUnit supports JavaScript, so if you are looking to download the web page text as it looks in the browser then you should use HtmlUnit.
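A minimal sketch with HtmlUnit, assuming the 2.x API (in recent 2.x releases WebClient is AutoCloseable; package names may differ in other versions):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // let the page's JavaScript run before reading the DOM
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage("http://stackoverflow.com/");
            System.out.println(page.asXml()); // the rendered DOM as HTML
        }
    }
}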
Answered by Jan Bodnar
Jetty has an HTTP client which can be used to download a web page.
package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {
            client = new HttpClient();
            client.start();

            String url = "http://www.something.com";
            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());
        } finally {
            if (client != null) {
                client.stop();
            }
        }
    }
}
The example prints the contents of a simple web page.
In a Reading a web page in Java tutorial I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
Answered by A_01
I used the actual answer to this post (url) and wrote the output into a file.
package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        try {
            URL oracle = new URL("http://www.fetagracollege.org");
            BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));

            String fileName = "D:\\a_01\\output.txt"; // backslashes must be escaped in Java string literals
            PrintWriter writer = new PrintWriter(fileName, "UTF-8");

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
            in.close();
            writer.close(); // flush and release the output file
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow failures silently
        }
    }
}
Answered by Sohaib Aslam
This Android Activity may help: it fetches a page's source code and filters a piece of information out of it.
public class MainActivity extends AppCompatActivity {

    EditText url;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        url = (EditText) findViewById(R.id.editText);
        DownloadCode obj = new DownloadCode();
        try {
            String tag1 = "<div class=\"description\">";
            String l = obj.execute("http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty").get();
            // take the text between the opening description tag and the next </div>
            String[] t1 = l.split(tag1);
            String[] t2 = t1[1].split("</div>");
            url.setText(t2[0]);
        } catch (Exception e) {
            Toast.makeText(this, e.toString(), Toast.LENGTH_SHORT).show();
        }
    }

    // input, background work, output
    class DownloadCode extends AsyncTask<String, Void, String> {
        @Override
        protected String doInBackground(String... WebAddress) { // web addresses separated by ','
            String htmlcontent = " ";
            try {
                URL url = new URL(WebAddress[0]);
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                InputStreamReader reader = new InputStreamReader(input);
                // read the response one character at a time
                int data = reader.read();
                while (data != -1) {
                    htmlcontent += (char) data;
                    data = reader.read();
                }
                reader.close();
            } catch (Exception e) {
                Log.i("Status : ", e.toString());
            }
            return htmlcontent;
        }
    }
}
Answered by QA Specialist
You'd most likely need to extract code from a secure web page (https protocol). In the following example, the html file is being saved into c:\temp\filename.html. Enjoy!
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url</b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html"; // backslashes escaped for the Java compiler
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0");
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);

        String inputLine;
        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips line terminators, so restore them
        }
        in.close();
        bw.close();
    }
}