Notice: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1381617/

Time: 2020-08-12 11:24:56  Source: igfitidea

Simplest way to correctly load html from web page into a string in Java

java html parsing

Asked by Mark

Just what the title says.

Help greatly appreciated!

Accepted answer by erickson

An extremely common error is failing to convert an HTTP response from bytes to characters correctly. To do the conversion, you have to know the character encoding of the response. Ideally, it is specified as the charset parameter of the "Content-Type" header. But it can also be declared in the body itself, as an "http-equiv" attribute in a meta tag.
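
As an illustration of the header case, here is a minimal sketch of pulling the charset parameter out of a Content-Type value. The class name ContentTypeCharset and the method charsetOf are hypothetical names chosen for this example, not part of any library. The sketch assumes a simple "type; param=value" layout and falls back to a caller-supplied default when the parameter is missing or names an unknown charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only.
class ContentTypeCharset {
    /** Returns the charset named in a Content-Type value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header absent
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

Returning a Charset rather than a name means an unsupported encoding is detected before any bytes are decoded.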

So, it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.

Here's a simple implementation that will handle the most common case:

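import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Backslashes must be doubled inside a Java string literal.
Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
// getContentType() returns null when the header is absent.
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
StringBuilder buf = new StringBuilder();
// try-with-resources closes the reader and the underlying stream.
try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
  while (true) {
    int ch = r.read();
    if (ch < 0)
      break;
    buf.append((char) ch);
  }
}
String str = buf.toString();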
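
The implementation above only consults the Content-Type header. For the other case mentioned earlier, where the encoding is declared in a meta tag in the body, one possible approach is to pre-scan the first bytes of the page before decoding it. The following is a rough sketch under several assumptions: the class name MetaCharsetSniffer, the 1024-byte scan window, and the regular expression are all choices made for this example, and the pre-scan only works for ASCII-compatible encodings (it would miss a UTF-16 page, for instance).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class MetaCharsetSniffer {
    // Matches charset=... inside a <meta> tag, for example
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Decodes raw page bytes, preferring a charset declared in a meta tag. */
    static String decode(byte[] body, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this scan never fails.
        int scanLength = Math.min(body.length, 1024);
        String head = new String(body, 0, scanLength, StandardCharsets.ISO_8859_1);
        Charset charset = fallback;
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                charset = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported name: keep the fallback.
            }
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\"></head><body>caf\u00e9</body></html>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

In practice a charset found in the Content-Type header would normally be tried first, with this scan used only when the header gives no answer.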

Answer by OscarRyz

I use this:

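        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.net.URL;

        // urlToSearch is the page address as a String.
        StringBuilder sb = new StringBuilder();
        // No charset is passed, so the JVM's default charset decodes the bytes.
        // try-with-resources closes the reader, replacing the finally block.
        try (BufferedReader bufferedReader = new BufferedReader(
                                     new InputStreamReader(
                                          new URL(urlToSearch)
                                              .openConnection()
                                              .getInputStream()))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                sb.append(line);
                sb.append("\n");
            }
        }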

It works most of the time.
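
One likely reason it fails some of the time is that no charset is passed to InputStreamReader, so the bytes are decoded with the JVM's default charset, whatever the page actually uses. The small demo below, with the hypothetical class name CharsetMismatchDemo, shows what such a mismatch does to a non-ASCII character: text encoded as UTF-8 is decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class CharsetMismatchDemo {
    /** Encodes text as UTF-8, then decodes the bytes with the wrong charset. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", 4 chars
        String garbled = misdecode(original);
        // The two UTF-8 bytes of U+00E9 are read as two separate characters.
        System.out.println(original.length() + " -> " + garbled.length()); // prints "4 -> 5"
    }
}
```

Pure ASCII pages survive this mismatch unchanged, which is why the problem often goes unnoticed.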

Answer by altumano

You can still simplify it a bit using org.apache.commons.io.IOUtils:

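URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Backslashes must be doubled inside a Java string literal.
Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
// getContentType() returns null when the header is absent.
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
String str;
// IOUtils.toString reads the whole stream but leaves closing it to the caller.
try (InputStream in = con.getInputStream()) {
  str = IOUtils.toString(in, charset);
}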
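
If adding Commons IO is not an option, a similar one-call read is available in the standard library from Java 9 onward through InputStream.readAllBytes(). The sketch below assumes Java 9 or later. The class name StreamToString and the method readFully are hypothetical names for this example, and the in-memory stream in main stands in for con.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class StreamToString {
    /** Reads the whole stream, closes it, and decodes the bytes. Requires Java 9+. */
    static String readFully(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), charset);
        } catch (IOException e) {
            // Rethrown unchecked to keep the example's call sites short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for con.getInputStream().
        InputStream in = new ByteArrayInputStream(
            "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFully(in, StandardCharsets.UTF_8));
    }
}
```

As with IOUtils.toString, the whole page is held in memory, which is fine for ordinary pages but worth keeping in mind for very large responses.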