How to extract web page text content in Java?

Question

Asked by Radi

I am looking for a way to extract text from a web page (initially HTML) using the JDK or another library. Please help.

Thanks.

Answer 1

Accepted answer by polygenelubricants

Use an HTML parser if at all possible; there are many available for Java.

Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
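
To make the parser suggestion concrete, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser), so no extra library is needed. The class name HtmlTextExtractor, the method name extractText, and the sample markup are assumptions made for illustration; they are not part of the original answer, and the bundled parser is old and may handle modern markup less gracefully than a dedicated library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for illustration only.
public class HtmlTextExtractor {

    /** Collects the text chunks reported by the parser, separated by single spaces. */
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once for each run of text found between tags
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // The third argument (ignoreCharSet = true) tells the parser not to
        // fail when the document declares a different character set
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Inline sample markup, so the sketch runs without network access
        String html = "<html><body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

For a live page, the same method could be fed an InputStreamReader wrapped around new URL(...).openStream() instead of a StringReader.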

Answer 2

Answered by Pascal Thivent

Use jsoup. This is currently the most elegant library for screen scraping.

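import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000);  // fetch and parse, 3-second timeout
String title = doc.title();                 // contents of the <title> element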

I just love its CSS selector syntax.
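
As a rough illustration of that selector syntax, and of getting at the page text the question asks for, here is a small sketch. It assumes jsoup is on the classpath; the class name JsoupTextExample, the helper bodyText, and the sample markup are made up for this example and do not come from the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class, named for illustration only.
public class JsoupTextExample {

    /** Returns the text of the page body with all tags stripped. */
    public static String bodyText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Inline sample markup, so the sketch runs without network access
        String html = "<html><head><title>Sample</title></head><body>"
                + "<div id=\"content\"><p>First paragraph.</p>"
                + "<p>Second, with a <a href=\"http://example.com/\">link</a>.</p></div>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title());
        System.out.println(bodyText(html));

        // CSS selector: every paragraph inside the element with id "content"
        for (Element p : doc.select("#content p")) {
            System.out.println(p.text());
        }
        // Attribute selector: every anchor that carries an href
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("href") + " -> " + a.text());
        }
    }
}
```

To run the same selectors against a live page, the Document could instead come from Jsoup.parse(url, timeout) as in the snippet above.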

Answer 3

Answered by Itay Maman

Here's a short method that nicely wraps the details of fetching a page's raw HTML (based on java.util.Scanner):

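import java.net.URL;
import java.util.Scanner;

public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the stream; UTF-8 is an assumption about the page encoding
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())   // hasNextLine() is the check that pairs with nextLine()
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}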

And this is how it is used:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
