Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/9825798/

Time: 2020-08-16 08:11:25  Source: igfitidea

How to read a text from a web page with Java?

java

Asked by Rigor Mortis

I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:

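    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");

        // Read all the text returned by the server; try-with-resources closes the reader
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String str;
            // Call readLine() once per iteration; a second call would skip every other line
            while ((str = in.readLine()) != null) {
                // str is one line of text; readLine() strips the newline character(s)
                System.out.println(str);
            }
        }
    } catch (MalformedURLException e) {
        System.err.println("Invalid URL: " + e.getMessage());
    } catch (IOException e) {
        System.err.println("Could not read the page: " + e.getMessage());
    }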

But this code gives me the HTML code of the web page. I want to get all of the text inside this page. How can I do this with Java?

Accepted answer by Fabian Barney

You may want to have a look at jsoup for this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is an extract from one on their site.

Answer by Paaske

You would have to take the content you get with your current code, then parse it and look for the tags that contain the text you want. A SAX parser would be well suited for this job.
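
As a rough sketch of the SAX idea, the handler below collects only the character data that appears between tags. It uses the JDK's built-in SAX parser, which accepts well-formed XML only, so the sketch assumes the page is well-formed XHTML; ordinary real-world HTML will often make it throw a parse exception. The class name SaxTextExtractor and the method extractText are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original answer
class SaxTextExtractor {

    // Returns the text between tags; assumes the markup is well-formed XHTML
    static String extractText(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();

        // characters() is called for each run of text between tags
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(extractText(xhtml)); // prints "An example link."
    }
}
```

To keep only a particular piece of text, the handler could also override startElement and endElement and append characters only while it is inside the tag of interest.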

Or if it is not a particular piece of text you want, simply remove all tags so that you're left with only the text. I guess you could use regexp for that.
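
A minimal sketch of the regexp approach, using only java.util.regex, might look like the following. The class name RegexTagStripper and the method stripTags are hypothetical. Each tag is replaced with a space so that text from neighbouring elements does not run together, and runs of whitespace are then collapsed. This is a crude approximation: it does not decode entities such as &amp;, it keeps the contents of script and style elements, and it can be confused by a ">" inside an attribute value or comment.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer
class RegexTagStripper {

    // Matches anything that looks like a tag, e.g. <p>, </a>, <br/>
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space, then collapse whitespace runs and trim
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html)); // prints "An example link."
    }
}
```

For anything beyond simple pages, an HTML parser is likely to give more reliable results than a regular expression.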

Answer by Nitzan Volman

Use JSoup.

You will be able to parse the content using CSS-style selectors.

In this example you can try:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    String textContents = doc.select(".newsText").first().text();

Answer by Prabuddha

You can also use the HtmlCleaner jar. The code is below.

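    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    // url is assumed to be a java.net.URL, as in the question; clean(URL) can throw IOException
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean(url);

    System.out.println(node.getText().toString());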