Simplest way to correctly load html from web page into a string in Java
Original question: http://stackoverflow.com/questions/1381617/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Mark
Just what the title says.
Help greatly appreciated!
Accepted answer by erickson
An extremely common error is the failure to correctly convert an HTTP response from bytes to characters. To do this, you have to know the character encoding of the response. Hopefully, this is specified as a "charset" parameter in the "Content-Type" header. But putting it in the body itself, as an "http-equiv" attribute in a meta tag, is also an option.
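To illustrate the second case, here is a minimal sketch of looking for a charset declared in a meta tag. The class name MetaCharsetSniffer, the method sniff, and the 4096-byte limit are made up for this example, and it assumes the page uses an ASCII-compatible encoding (it would not work for UTF-16, for instance). A real HTML parser is more reliable than a regular expression for this.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class MetaCharsetSniffer {
    // Matches charset=... inside a <meta> tag, covering both
    // <meta charset="utf-8"> and the http-equiv form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a meta tag, or null if none is found. */
    static String sniff(byte[] body) {
        // Decode only the start of the document. ISO-8859-1 maps every byte
        // to a char, so ASCII markup survives whatever the real encoding is.
        int len = Math.min(body.length, 4096); // arbitrary limit for the sketch
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page));
    }
}
```

The caller would buffer the response bytes first, try the Content-Type header, and fall back to sniff only when the header names no charset.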
So, it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.
Here's a simple implementation that will handle the most common case:
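import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Backslashes must be doubled inside a Java string literal.
Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
// getContentType() returns null when the header is absent.
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
StringBuilder buf = new StringBuilder();
// try-with-resources closes the reader (and the underlying stream).
try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
    while (true) {
        int ch = r.read();
        if (ch < 0)
            break;
        buf.append((char) ch);
    }
}
String str = buf.toString();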
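The pattern above only accepts "text/html" followed by whitespace and an unquoted charset. As a rough sketch of a more tolerant parse, the helper below looks for a charset parameter anywhere in the header value and falls back when the name is missing or unknown to the JVM. The class name ContentTypeCharset and the method parse are invented for this example and are not part of the original answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ContentTypeCharset {
    // Finds a charset parameter anywhere in the header value, with or
    // without whitespace after the semicolon and with optional quotes.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\\s\";]+)\"?",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, or the fallback if absent or unsupported. */
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header missing entirely
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback; // no charset parameter present
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html;charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(parse("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With a helper like this, the result of con.getContentType() can be passed in directly, and the returned Charset can be given to the InputStreamReader constructor in place of the charset name.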
Answer by OscarRyz
I use this:
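import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

StringBuilder sb = new StringBuilder();
// No charset is given, so the JVM's default charset is used to decode.
// try-with-resources closes the reader, replacing the explicit finally block.
try (BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(
            new URL(urlToSearch)
                .openConnection()
                .getInputStream()))) {
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        sb.append(line);
        sb.append("\n");
    }
}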
It works most of the time.
Answer by altumano
You can still simplify it a bit using org.apache.commons.io.IOUtils:
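import org.apache.commons.io.IOUtils;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Backslashes must be doubled inside a Java string literal.
Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
// getContentType() returns null when the header is absent.
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
String str = IOUtils.toString(con.getInputStream(), charset);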
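If adding Commons IO is not an option, the same read can probably be done with the JDK alone. This is a sketch, not part of the original answer: it assumes Java 9 or later (for InputStream.readAllBytes), and the class name StreamToString and method readFully are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class StreamToString {
    /** Reads the whole stream, decodes it with the given charset, and closes it. */
    static String readFully(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            // readAllBytes() is available from Java 9 onwards.
            return new String(stream.readAllBytes(), charset);
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's call sites short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for con.getInputStream().
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFully(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

The whole response is held in memory twice for a moment (once as bytes, once as a String), which is fine for ordinary pages but worth keeping in mind for very large ones.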