How to read text from a web page with Java?

Question

Asked by Rigor Mortis

I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s).
            // Calling readLine() a second time here would skip every other line.
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
        e.printStackTrace(); // the URL string is not valid
    } catch (IOException e) {
        e.printStackTrace(); // the connection or the read failed
    }

but this code gives me the HTML code of the web page. I want to get the whole text inside this page. How can I do this with Java?

Answer 1

Accepted answer by Fabian Barney

You may want to have a look at jsoup for this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is an extract from one on their site.

Answer 2

Answered by Paaske

You would have to take the content you get with your current code, then parse it and look for the tags that contain the text you want. A SAX parser would be well suited for this job.
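
If you'd rather stay inside the JDK, here is a minimal sketch of that event-driven idea. Note that it does not use SAX itself (SAX expects well-formed XML, which most web pages are not). It swaps in the callback-based HTML parser that ships with Swing, which reports start tags, text, and end tags in a similar way. The class name ParagraphTextCollector, the method paragraphsOf, and the choice to collect only <p> elements are assumptions made up for illustration, and the exact whitespace you get back depends on the parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found inside <p> elements.
final class ParagraphTextCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.P) {
            current = new StringBuilder(); // a paragraph begins
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (current != null) {
            current.append(data); // keep text only while inside a <p>
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && current != null) {
            paragraphs.add(current.toString().trim());
            current = null; // the paragraph is finished
        }
    }

    // Parses an HTML string and returns the text of each <p> element.
    static List<String> paragraphsOf(String html) {
        ParagraphTextCollector collector = new ParagraphTextCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>An <b>example</b> link.</p></body></html>";
        System.out.println(paragraphsOf(html));
    }
}
```

To use it on the page from the question, you could pass the HTML you already read with your BufferedReader to paragraphsOf, and change the tag check to whichever element wraps the text you want.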

Or, if you are not after a particular piece of text, simply remove all tags so that you're left with only the text. I guess you could use a regexp for that.
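
A minimal sketch of the remove-all-tags idea, using only java.util.regex. The names TagStripper and stripTags are made up for illustration. Treat it as a rough approximation: regular expressions do not understand HTML, so this can go wrong on comments or on a '>' inside an attribute value, and it does not decode entities such as &amp;. Each tag is replaced with a space so that neighbouring blocks do not run together, which also means a word split by inline tags ends up with a space in it.

```java
// Hypothetical helper: strips tags from an HTML string with regular expressions.
final class TagStripper {

    static String stripTags(String html) {
        // 1. Drop <script> and <style> elements together with their contents.
        String withoutCode = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        // 2. Replace every remaining tag with a single space.
        String withoutTags = withoutCode.replaceAll("(?s)<[^>]*>", " ");
        // 3. Collapse runs of whitespace and trim the ends.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html)); // prints: An example link.
    }
}
```

For anything beyond a quick experiment, a real HTML parser is likely the safer choice.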

Answer 3

Answered by Nitzan Volman

Use JSoup.

You will be able to parse the content using CSS-style selectors.

In this example you can try:

    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    // first() returns null if nothing matches ".newsText"
    String textContents = doc.select(".newsText").first().text();

Answer 4

Answered by Prabuddha

You can also use the HtmlCleaner jar. The code is below.

    // url is assumed to be a java.net.URL pointing at the page
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean(url);

    System.out.println(node.getText().toString());
