How to extract web page textual content in Java?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3036638/
Asked by Radi
I am looking for a method to extract text from a web page (initially HTML) using the JDK or another library. Please help.
Thanks
Accepted answer by polygenelubricants
Use an HTML parser if at all possible; there are many available for Java.
Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
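As a rough illustration of the parser-based approach using only the JDK, the sketch below feeds HTML to the Swing HTML parser (`javax.swing.text.html.parser.ParserDelegator`) and collects the text it reports. The class name `HtmlTextSketch`, the method name `extractText`, and the choice to join text runs with single spaces are assumptions made for this example, not part of the original answer. How this parser treats script and style content, or malformed markup, should be checked against your own pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; names and the space-joining policy are illustrative.
final class HtmlTextSketch {

    /** Returns the character data found in the given HTML, joined by single spaces. */
    static String extractText(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Invoked by the parser for each run of text between tags.
                sb.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse whitespace runs so the result is a single line of text.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"));
    }
}
```

A dedicated third-party parser will usually cope better with real-world markup; this only shows that the JDK alone can do basic text extraction.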
Related questions
- Java HTML Parsing
- Which Html Parser is best?
- Any good Java HTML parsers?
- recommendations for a java HTML parser/editor
- What HTML parsing libraries do you recommend in Java
Text extraction:
Tag stripping:
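For the "very simplistic processing" case mentioned above, a regex-based tag stripper might look like the following sketch. The class name `TagStripSketch` and method name `stripTags` are hypothetical. The pattern treats anything between `<` and the next `>` as a tag, so it does not decode entities such as `&amp;`, does not drop script or style content, and can be fooled by a `>` inside an attribute value or comment.

```java
import java.util.regex.Pattern;

// Hypothetical helper class illustrating naive regex tag stripping.
final class TagStripSketch {

    // '<', then any characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Replaces every tag-like span with a space, then collapses whitespace. */
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

These limitations are the reason a real parser is the safer choice for anything beyond small, well-formed fragments.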
Answer by Pascal Thivent
Use jsoup. This is currently the most elegant library for screen scraping.
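    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    URL url = new URL("http://example.com/");
    Document doc = Jsoup.parse(url, 3*1000);  // fetch and parse, with a 3-second timeout
    String title = doc.title();               // contents of the <title> element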
I just love its CSS selector syntax.
Answer by Itay Maman
Here's a short method that nicely wraps these details (based on java.util.Scanner):
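    import java.net.URL;
    import java.util.Scanner;

    public static String get(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the scanner and the underlying stream
        try (Scanner sc = new Scanner(new URL(url).openStream())) {
            // hasNextLine() is the check that pairs with nextLine()
            while (sc.hasNextLine())
                sb.append(sc.nextLine()).append('\n');
        }
        return sb.toString();
    }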
And this is how it is used:
    public static void main(String[] args) throws Exception {
        System.out.println(get("http://www.yahoo.com"));
    }
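A possible variation on the Scanner-based method above is to accept an `InputStream` rather than a URL string, so the reading logic can be tried without network access, and to read the whole stream as one token by using `\A` (start of input) as the delimiter. The names `StreamTextSketch` and `readAll`, and the UTF-8 charset, are assumptions for this sketch; a real page may declare a different encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class; UTF-8 is assumed rather than detected.
final class StreamTextSketch {

    /** Reads the entire stream into a String and closes it. */
    static String readAll(InputStream in) {
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            // "\\A" matches only at the start of input, so the whole stream is one token.
            sc.useDelimiter("\\A");
            return sc.hasNext() ? sc.next() : "";
        }
    }

    public static void main(String[] args) {
        byte[] page = "<p>hi</p>\nbye".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

To read a live page, the same method could be called as `readAll(new URL(url).openStream())`. Note that this returns the raw HTML, which still needs a parser to turn into text.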