What algorithm does JavaScript Readability use for extracting text from URLs?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/3652657/

Date: 2020-08-23 05:36:08  Source: igfitidea

What algorithm does Readability use for extracting text from URLs?

Tags: javascript, asp.net, extraction

Asked by user300981

For a while, I've been trying to find a way of intelligently extracting the "relevant" text from a URL by eliminating the text related to ads and all the other clutter. After several months of researching, I gave up on it as a problem that cannot be solved accurately. (I've tried different ways, but none were reliable.)


A week back, I stumbled across Readability, a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.


Does anyone know how they do it? Or how I could do it reliably?


Answered by Christian Kohlschütter

Readability mainly consists of heuristics that "just somehow work well" in many cases.


I have written some research papers on this topic, and I would like to explain why it is easy to come up with a solution that works well, and why it gets hard to approach 100% accuracy.


There seems to be a linguistic law underlying human language that is also (but not exclusively) manifest in Web page content, and which already quite clearly separates two types of text (full-text vs. non-full-text, or, roughly, "main content" vs. "boilerplate").


To get the main content from HTML, it is in many cases sufficient to keep only the HTML text elements (i.e. blocks of text that are not interrupted by markup) which have more than about 10 words. It appears that humans choose from two types of text ("short" and "long", measured by the number of words they emit) for two different motivations of writing text. I would call them "navigational" and "informational" motivations.

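The word-count heuristic above can be sketched in a few lines of JavaScript. This is an illustrative simplification (it splits on tags with a regex instead of walking a real DOM, and the function name is made up), not Readability's actual code:

```javascript
// Keep only "long" (informational) text blocks: runs of text that are
// not interrupted by markup and contain more than ~10 words.
function extractMainContent(html) {
  // Split on tags so each chunk is a run of text uninterrupted by markup.
  const blocks = html
    .split(/<[^>]+>/)
    .map(b => b.replace(/\s+/g, ' ').trim())
    .filter(b => b.length > 0);
  // Short blocks ("Click here", menu items) are treated as navigational
  // and dropped; long blocks are kept as candidate main content.
  return blocks.filter(b => b.split(' ').length > 10);
}
```

Even this naive version discards most menus and footers on article-like pages, which illustrates why the heuristic "just somehow works well" so often.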

If an author wants you to quickly get what is written, he/she uses "navigational" text, i.e. few words (like "STOP", "Read this", "Click here"). This is the most prominent type of text in navigational elements (menus, etc.).


If an author wants you to deeply understand what he/she means, he/she uses many words. This way, ambiguity is removed at the cost of an increase in redundancy. Article-like content usually falls into this class as it has more than only a few words.


While this separation seems to work in a plethora of cases, it is getting tricky with headlines, short sentences, disclaimers, copyright footers etc.


There are more sophisticated strategies and features that help separate main content from boilerplate: for example, the link density (the number of linked words in a block versus the overall number of words in the block), features of the previous/next blocks, the frequency of a particular block's text across the "whole" Web, the DOM structure of the HTML document, the visual rendering of the page, etc.

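As a rough illustration, the link-density feature for a block of HTML could be computed like this (a hypothetical sketch, not boilerpipe's or Readability's implementation):

```javascript
// Link density: fraction of a block's words that sit inside <a> tags.
// Boilerplate (menus, footers) tends to be almost entirely linked text,
// while article prose has very low link density.
function linkDensity(blockHtml) {
  const countWords = s =>
    s.replace(/<[^>]+>/g, ' ').trim().split(/\s+/).filter(Boolean).length;
  // Collect the text inside all anchor elements of the block.
  const linkedText = (blockHtml.match(/<a\b[^>]*>([\s\S]*?)<\/a>/gi) || [])
    .join(' ');
  const total = countWords(blockHtml);
  return total === 0 ? 0 : countWords(linkedText) / total;
}
```

A block whose density is near 1.0 is almost certainly navigation; article paragraphs usually score close to 0.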

You can read my latest article "Boilerplate Detection using Shallow Text Features" to get some insight from a theoretical perspective. You may also watch the video of my paper presentation on VideoLectures.net.


"Readability" uses some of these features. If you carefully watch the SVN changelog, you will see that the number of strategies varied over time, and so did Readability's extraction quality. For example, the introduction of link density in December 2009 helped improve it considerably.


In my opinion, it therefore makes no sense to say "Readability does it like that" without mentioning the exact version number.


I have published an open-source HTML content extraction library called boilerpipe, which provides several different extraction strategies. Depending on the use case, one or the other extractor works better. You can try these extractors on pages of your choice using the companion boilerpipe-web app on Google AppEngine.


To let numbers speak, see the "Benchmarks" page on the boilerpipe wiki which compares some extraction strategies, including boilerpipe, Readability and Apple Safari.


I should mention that these algorithms assume that the main content is actually full text. There are cases where the "main content" is something else, e.g. an image, a table, a video etc. The algorithms won't work well for such cases.


Cheers,


Christian


Answered by Moin Zaman

Readability is a JavaScript bookmarklet, meaning it's client-side code that manipulates the DOM. Look at the JavaScript and you should be able to see what's going on.


Readability's workflow and code:


/*
     *  1. Prep the document by removing script tags, css, etc.
     *  2. Build readability's DOM tree.
     *  3. Grab the article content from the current dom tree.
     *  4. Replace the current DOM tree with the new one.
     *  5. Read peacefully.
*/

javascript: (function () {
    readConvertLinksToFootnotes = false;
    readStyle = 'style-newspaper';
    readSize = 'size-medium';
    readMargin = 'margin-wide';
    _readability_script = document.createElement('script');
    _readability_script.type = 'text/javascript';
    _readability_script.src = 'http://lab.arc90.com/experiments/readability/js/readability.js?x=' + (Math.random());
    document.documentElement.appendChild(_readability_script);
    _readability_css = document.createElement('link');
    _readability_css.rel = 'stylesheet';
    _readability_css.href = 'http://lab.arc90.com/experiments/readability/css/readability.css';
    _readability_css.type = 'text/css';
    _readability_css.media = 'all';
    document.documentElement.appendChild(_readability_css);
    _readability_print_css = document.createElement('link');
    _readability_print_css.rel = 'stylesheet';
    _readability_print_css.href = 'http://lab.arc90.com/experiments/readability/css/readability-print.css';
    _readability_print_css.media = 'print';
    _readability_print_css.type = 'text/css';
    document.getElementsByTagName('head')[0].appendChild(_readability_print_css);
})();

And if you follow the JS and CSS files that the above code pulls in you'll get the whole picture:


http://lab.arc90.com/experiments/readability/js/readability.js (this is pretty well commented; interesting reading)


http://lab.arc90.com/experiments/readability/css/readability.css


Answered by slhck

There's no 100% reliable way to do this, of course. You can have a look at the Readability source code here.


Basically, what they're doing is trying to identify positive and negative blocks of text. Positive identifiers (i.e. div IDs) would be something like:


  • article
  • body
  • content
  • blog
  • story

Negative identifiers would be:


  • comment
  • discuss

And then they have unlikely and maybe candidates. What they do is determine what is most likely to be the main content of the site; see line 678 in the Readability source. This is done by analyzing mostly the length of paragraphs, their identifiers (see above), the DOM tree (i.e. whether the paragraph is a last child node), then stripping out everything unnecessary, removing formatting, etc.

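A minimal sketch of that class/id scoring idea, assuming element-like objects with `className` and `id` properties; the regexes mirror the identifiers listed above, not Readability's exact patterns (which vary by version), and the score weights are made up for illustration:

```javascript
// Identifiers that suggest main content vs. boilerplate.
const POSITIVE = /article|body|content|blog|story/i;
const NEGATIVE = /comment|discuss/i;

// Score an element by its class and id attributes: positive hints push
// it toward "main content", negative hints toward "boilerplate".
function classIdScore(el) {
  const hints = `${el.className || ''} ${el.id || ''}`;
  let score = 0;
  if (POSITIVE.test(hints)) score += 25;
  if (NEGATIVE.test(hints)) score -= 25;
  return score;
}
```

In Readability this kind of score is combined with paragraph length, link density, and DOM position before the highest-scoring container is selected.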

The code has 1792 lines. It does seem like a non-trivial problem, so maybe you can get your inspiration from there.


Answered by user734063

Interesting. I have developed a similar PHP script. It basically scans articles and attaches parts of speech to all text (Brill Tagger). Then, grammatically invalid sentences are instantly eliminated. Then, sudden shifts in pronouns or past tense indicate the article is over, or hasn't started yet. Repeated phrases are searched for and eliminated, e.g. "Yahoo news sports finance" appearing ten times in the page. You can also get statistics on tone with a plethora of word banks relating to various emotions. Sudden changes in tone, from active/negative/financial to passive/positive/political, indicate a boundary. It's endless, really, however deep you want to dig.

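The repeated-phrase filter mentioned above could be sketched like this (a hypothetical illustration of the idea; the answer's actual PHP script isn't shown, and the repeat threshold is an assumption):

```javascript
// Drop text blocks that recur many times across a page: a phrase like
// "Yahoo news sports finance" appearing ten times is almost certainly
// boilerplate rather than article content.
function dropRepeatedBlocks(blocks, maxRepeats = 3) {
  const counts = new Map();
  for (const b of blocks) counts.set(b, (counts.get(b) || 0) + 1);
  return blocks.filter(b => counts.get(b) <= maxRepeats);
}
```

The same counting trick works across many pages of the same site, which is an even stronger boilerplate signal.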

The major issues are links, embedded anomalies, scripting styles and updates.
