Java: how can I extract only the main textual content from an HTML page?
Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/7021260/
How can I extract only the main textual content from an HTML page?
Asked by Renato Dinhani
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions pointing to the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API can do this, retrieving the different textual parts/blocks, each split from the others in some way rather than returned as one merged text (a single undifferentiated text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of extra content like menus, advertisements, banners, etc.
I want to try to exclude everything that is not related to the content of the page.
Taking this page as an example, I want neither the menus at the top nor the links in the footer.
Important: All pages are HTML and come from various different sites. I need suggestions on how to exclude this kind of content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried one way of parsing the HTML content, described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Answered by Kurt Kaylor
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of possibilities.
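Putting those one-liners together, a minimal sketch might look like the following (the URL and HTML string are made-up placeholders; the boilerpipe jar and its dependencies are assumed to be on the classpath):

```java
import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // From a URL: boilerpipe fetches the page and extracts the article text
        String fromUrl = ArticleExtractor.INSTANCE.getText(
                new URL("http://example.com/some-article"));

        // From a String: useful when you have already downloaded the HTML yourself
        String myHtml = "<html><body><p>Some article text you downloaded.</p></body></html>";
        String fromHtml = ArticleExtractor.INSTANCE.getText(myHtml);

        System.out.println(fromUrl);
        System.out.println(fromHtml);
    }
}
```

ArticleExtractor is tuned for news-style article pages; boilerpipe also ships a more generic DefaultExtractor you can try when the page layout is unknown.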
Answered by Christian Kohlschütter
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
    if (block.isContent()) {
        System.out.println(block.getText());
    }
}
TextBlock has some more exciting methods, feel free to play around!
Answered by Stefan
There appears to be a possible problem with Boilerpipe. Why? Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
- a web page with a single article in it (Boilerpipe worthy!)
- a web page with multiple articles in it, such as the front page of the New York Times
- a web page that really doesn't have any article in it, but has some content in the form of links, but may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
Answered by Aaron Foltz
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements containing random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE value or be more selective with your selectors (i.e. not retrieve div elements)?
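As a sketch of that Jsoup approach (the HTML here is an invented example; only the jsoup library is assumed):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectDemo {
    public static void main(String[] args) {
        // A made-up page with a menu, an article paragraph, and a footer
        String html = "<html><body>"
                + "<div class=\"menu\"><a href=\"/\">Home</a><a href=\"/news\">News</a></div>"
                + "<p>This is the main article text we want to keep.</p>"
                + "<div class=\"footer\"><a href=\"/about\">About</a></div>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // Select only <p> elements; the menu, banner and footer divs
        // and their links are simply never matched
        Elements posts = doc.select("p");
        for (Element p : posts) {
            System.out.println(p.text());
        }
    }
}
```

The trade-off is the one discussed above: a fixed selector like "p" works only as long as the sites you crawl actually put their content in paragraph tags.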
Answered by getn_outchea
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with Java.
You use a provided application to design XML files, read by the RoboServer API, to parse webpages. The XML files are built by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site
Answered by Felipe Hummel
You can use some libraries like goose. It works best on articles/news. You can also check the JavaScript code of the readability bookmarklet, which does extraction similar to goose.
Answered by David L-R
You could use the textracto api; it extracts the main 'article' text and also offers the option to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.
Answered by Jared Ng
Answered by Tushar Sagar
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site. Refer to the link below to filter the HTML; I hope it helps. http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/