Get html file Java
Original source: http://stackoverflow.com/questions/704821/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by rpf
Duplicate:
I'm developing an application that works as follows: the user inputs the URL of some website, and then the application has to analyze that URL.
How can I get access to the HTML file using Java? Do I need to use HttpRequest? How does that work?
Thanks.
Accepted answer by kgiannakakis
URLConnection is fine for simple cases. When there are things like redirections involved, you are better off using Apache's HttpClient.
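To make the redirect point concrete: HttpURLConnection follows redirects automatically only when the protocol stays the same, so a hop from http to https has to be handled by hand (or left to a library such as HttpClient). Below is a minimal sketch of the manual approach. The class name RedirectFollower, the method names, the timeouts and the hop limit are all assumptions made for illustration, not something from the original answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

class RedirectFollower {

    // True for the status codes that normally carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a (possibly relative) Location value against the current URL.
    static String resolveLocation(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    // Opens an http/https URL, following at most maxHops redirects by hand.
    static InputStream open(String url, int maxHops) throws IOException {
        String current = url;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // redirects are handled below
            conn.setConnectTimeout(5000);           // assumed timeout values
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn.getInputStream();       // throws IOException on 4xx/5xx
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects starting at " + url);
    }
}
```

Something like RedirectFollower.open("http://stackoverflow.com", 5) would then return the body stream of the final page; the caller is responsible for closing it.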
Answer by willcodejavaforfood
You could just use a URLConnection. See this Java Tutorial from Sun.
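For the simple case, a rough sketch of reading a page into a String with URLConnection might look like this. The class name PageFetcher, the timeouts, and the choice to decode as UTF-8 are assumptions for illustration; a real page may declare a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    // Reads the whole stream into a String using the given charset, then closes it.
    static String readAll(InputStream in, Charset charset) {
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches the page at the given address; UTF-8 is assumed, not detected.
    static String fetch(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(5000); // assumed timeout values
        conn.setReadTimeout(5000);
        return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
    }
}
```

Calling PageFetcher.fetch("http://stackoverflow.com") would return the page source as text, ready to be handed to whatever does the analysis.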
Answer by Mark
Answer by McDowell
This code downloads data from a URL, treating it as binary content:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class Download {

    // Streams the resource at the URL into the given file; both streams
    // are closed even if the copy fails.
    private static void download(URL input, File output)
            throws IOException {
        try (InputStream in = input.openStream();
             OutputStream out = new FileOutputStream(output)) {
            copy(in, out);
        }
    }

    // Copies bytes from in to out in 1 KB chunks until end of stream.
    private static void copy(InputStream in, OutputStream out)
            throws IOException {
        byte[] buffer = new byte[1024];
        while (true) {
            int readCount = in.read(buffer);
            if (readCount == -1) {
                break;
            }
            out.write(buffer, 0, readCount);
        }
    }

    public static void main(String[] args) {
        try {
            URL url = new URL("http://stackoverflow.com");
            File file = new File("data");
            download(url, file);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The downside of this approach is that it ignores any metadata, like the Content-Type, which you would get from using HttpURLConnection (or a more sophisticated API, like the Apache one).
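As an illustration of why that metadata matters, the Content-Type header often names the character encoding needed to turn the downloaded bytes into text. The sketch below shows one way to read it through HttpURLConnection. The class name ContentTypeInfo, its method names, and the UTF-8 fallback are assumptions for illustration only.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeInfo {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length())
                            .replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Connects to an http/https address and reports the charset the server declares.
    static Charset declaredCharset(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            return charsetOf(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned Charset can then be passed to an InputStreamReader instead of hard-coding an encoding.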
In order to parse the HTML data, you'll either need a specialized HTML parser that can handle poorly formed markup, or you'll need to tidy it first before parsing with an XML parser.
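As one possible illustration of the first option, the JDK itself ships a lenient HTML parser in the Swing packages. It is dated and limited, so a dedicated third-party parser is probably a better fit for real work, but it is enough for a sketch that pulls link targets out of sloppy markup. The class name LinkExtractor and the method extractLinks are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkExtractor {

    // Returns the href value of every <a> start tag, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations,
            // since the input has already been decoded into a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

The input does not need to be well formed: unclosed tags such as a missing closing body or html element are tolerated by the parser.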
Answer by Kris
Funnily enough, I wrote a utility method that does just that the other week:
/**
 * Retrieves the file specified by <code>fileURL</code> and writes it to
 * <code>out</code>.
 * <p>
 * Does not close <code>out</code>, but does flush.
 * @param fileURL The URL of the file.
 * @param out An output stream to capture the contents of the file
 * @param batchWriteSize The number of bytes to write to <code>out</code>
 *                       at once (larger files than this will be written
 *                       in several batches)
 * @throws IOException If call to web server fails
 * @throws FileNotFoundException If the call to the web server does not
 *                               return status code 200.
 */
public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
        throws IOException {
    GetMethod get = new GetMethod(fileURL);
    HttpClient client = new HttpClient();
    HttpClientParams params = client.getParams();
    params.setSoTimeout(2000); // socket read timeout in milliseconds
    client.setParams(params);
    try {
        try {
            client.executeMethod(get);
        } catch (ConnectException e) {
            // Add some context to the exception and rethrow
            throw new IOException("ConnectionException trying to GET " +
                    fileURL, e);
        }
        if (get.getStatusCode() != 200) {
            throw new FileNotFoundException(
                    "Server returned " + get.getStatusCode());
        }
        // Get the input stream
        BufferedInputStream bis =
                new BufferedInputStream(get.getResponseBodyAsStream());
        // Read the file and stream it out
        byte[] b = new byte[batchWriteSize];
        int bytesRead = bis.read(b, 0, batchWriteSize);
        long bytesTotal = 0;
        while (bytesRead != -1) {
            bytesTotal += bytesRead;
            out.write(b, 0, bytesRead);
            bytesRead = bis.read(b, 0, batchWriteSize);
        }
        bis.close(); // Release the input stream.
        out.flush();
    } finally {
        // Return the connection even when an exception is thrown above.
        get.releaseConnection();
    }
}
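It uses the Apache Commons HttpClient library, i.e.
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;
plus these standard JDK classes:
import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.ConnectException;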