How to fetch HTML in Java

Notice: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/31462/

Time: 2020-08-11 07:23:02  Source: igfitidea

Tags: java, html, screen-scraping

Asked by pek

Without the use of any external library, what is the simplest way to fetch a website's HTML content into a String?

Accepted answer by pek

I'm currently using this:

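import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

String content = null;
try {
  URLConnection connection = new URL("http://www.google.com").openConnection();
  // Note: this Scanner decodes the bytes with the platform default charset.
  try (Scanner scanner = new Scanner(connection.getInputStream())) {
    // "\\Z" matches the end of input, so next() returns the whole response as one token.
    scanner.useDelimiter("\\Z");
    content = scanner.next();
  }
} catch (Exception ex) {
  ex.printStackTrace();
}
System.out.println(content);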

But I'm not sure if there's a better way.
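One possible refinement, staying within the JDK, is to wrap the same Scanner idea in a method that also sets timeouts and honors the charset announced in the Content-Type header. The sketch below is only an illustration: the names HtmlFetcher, charsetFrom and fetch, the 10 second timeouts, and the UTF-8 fallback are all assumptions, not part of the original answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical helper; the class and method names are illustrative only. */
final class HtmlFetcher {

    /** Extracts the charset from a Content-Type value, falling back to UTF-8 (assumed default). */
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Fetches the resource at the given URL and returns its body as a String. */
    static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setConnectTimeout(10_000); // assumed timeouts, in milliseconds
        connection.setReadTimeout(10_000);
        Charset charset = charsetFrom(connection.getContentType());
        try (InputStream in = connection.getInputStream();
             Scanner scanner = new Scanner(in, charset.name())) {
            scanner.useDelimiter("\\Z"); // end of input: the whole body is one token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}
```

It would be called as String content = HtmlFetcher.fetch("http://www.google.com");. Because \Z also matches just before a final line terminator, a trailing newline in the response is dropped, which is usually harmless for HTML.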

Answer by Justin Bennett

I just left this post in your other thread, though what you have above might work as well. I don't think either would be any easier than the other. With the Apache Commons HttpClient jar on your classpath, the Apache classes can be accessed by adding import org.apache.commons.httpclient.HttpClient; at the top of your code.

Edit: Forgot the link ;)

Answer by Scott Bennett-McLeish

This has worked well for me:

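import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL(theURL);
StringBuilder buffer = new StringBuilder();
// Decode the bytes as characters (UTF-8 assumed) instead of casting each byte to char.
try (Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
    int ptr;
    while ((ptr = reader.read()) != -1) {
        buffer.append((char) ptr);
    }
}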

Not sure as to whether the other solution(s) provided are any more efficient or not.
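On the efficiency question: the loop above calls read() once per character, and a common alternative is to read into a char array in chunks. The following is a minimal sketch of that idea, not a benchmark-backed recommendation; the class name StreamText, the method name readAll, and the 8192 buffer size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

/** Hypothetical helper; the names here are illustrative only. */
final class StreamText {

    /** Reads the whole stream into a String, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] chunk = new char[8192]; // assumed buffer size
        // try-with-resources closes the reader, and with it the underlying stream
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                out.append(chunk, 0, n); // append only the characters actually read
            }
        }
        return out.toString();
    }
}
```

With the snippet above it could be used as StreamText.readAll(new URL(theURL).openStream(), StandardCharsets.UTF_8). On Java 9 or later, InputStream.readAllBytes() combined with new String(bytes, charset) is another JDK-only option.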

Answer by Scott Bennett-McLeish

Whilst not vanilla-Java, I'll offer up a simpler solution. Use Groovy ;-)

String siteContent = new URL("http://www.google.com").text

Answer by dinesh kandpal

It's not a library but a tool named curl, which is generally installed on most servers. On Ubuntu you can easily install it with:

sudo apt install curl

Then fetch any HTML page and store it in a local file, for example:

curl https://www.facebook.com/ > fb.html

You will get the home page's HTML. You can open the saved file in your browser as well.
