Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow.
Original source: http://stackoverflow.com/questions/30661650/
How does Firefox Reader View operate?
Asked by Martin
Summary
I am looking for the criteria I need to meet when creating a webpage so that I can be [fairly] sure it will be available in Firefox Reader View, should the user want it.
Some sites have this option, some do not. Some pages with a lot of text lack the option, while others with much less text have it. Stack Overflow, for instance, displays only the question in Reader View, without any of the answers.
Question
I upgraded Firefox from 38.0.1 to 38.0.5 and found a new feature called Reader View, a sort of overlay that removes "page clutter" and makes text easier to read. On certain pages, Reader View appears as a clickable icon on the right-hand side of the address bar.
This is fine, but from a programming point of view I want to know how Reader View works and which criteria determine the pages it applies to. I explored the Mozilla Firefox website and found no clear answers (no programming answers of any sort, in fact). I have of course Googled and Binged this, which only returned references to Firefox add-ons; this is not an add-on but a built-in part of the new Firefox version.
I assumed that Reader View relied on HTML5 and would extract the contents of <article>, but this is not the case: it works on Wikipedia, which does not appear to use <article> or similar HTML5 tags. Instead, Reader View extracts certain <div>s and displays them alone. The feature works on some HTML5 pages, such as Wikipedia, but not on others.
If anyone knows how Firefox Reader View actually operates and how website developers can make use of it, could you share? Or, if you know where this information can be found, could you point me in the right direction? I have not been able to find it.
Accepted answer by rubo77
You need at least one <p> tag around the text that you want to see in Reader View, and at least 516 characters in 7 words inside the text.
For example, this will trigger Reader View:
<body>
<p>
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789 123456
</p>
</body>
See my example at https://stackoverflow.com/a/30750212/1069083
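To experiment with the threshold described in this answer, a small generator for test pages can help. The following is a minimal sketch: the helper name `buildTestPage` and its parameters are hypothetical, and the 516-character figure comes from the answer itself, not from anything verified here.

```javascript
// Hypothetical helper: builds a bare page with a single <p> element that
// holds `wordCount` words of `wordLength` digits each, one word per line.
function buildTestPage(wordCount, wordLength) {
  const digits = "1234567890";
  // Repeat the digit run often enough, then cut it to the requested length.
  const word = digits
    .repeat(Math.ceil(wordLength / digits.length))
    .slice(0, wordLength);
  const text = Array(wordCount).fill(word).join("\n");
  return {
    html: `<!DOCTYPE html>\n<html>\n<body>\n<p>\n${text}\n</p>\n</body>\n</html>`,
    // Length of the paragraph text, counting the separators between words.
    textLength: text.length,
  };
}

// One page above and one below the threshold quoted in the answer.
const longPage = buildTestPage(6, 99);
const shortPage = buildTestPage(3, 99);
console.log(longPage.textLength, shortPage.textLength);
```

Saving each `html` string to a file and opening it in Firefox should show whether the Reader View icon appears for one page and not the other.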
Answer by Martin
Reading through the GitHub code this morning, the process is that page elements are listed in order of likelihood, with <section>, <p>, <div>, <article> at the top of the list (i.e. most likely).
Then each of these "nodes" is given a score based on things such as comma counts and the class names that apply to the node. This is a somewhat multi-faceted process: scores are added for text chunks, but scores are also seemingly reduced for invalid parts or syntax. Scores in sub-parts of a node are reflected in the score of the node as a whole, i.e. the parent element contains the scores of all lower elements, I think.
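The class-name part of this scoring might look roughly like the sketch below. It is an illustration under assumptions: the two patterns are a small made-up subset, the weight of 25 is an illustrative value, and `classWeight` is a hypothetical name, so none of this should be read as the library's actual lists or API.

```javascript
// Assumed patterns: an illustrative subset, not the library's real lists.
const POSITIVE_HINTS = /article|body|content|entry|main|post|text/i;
const NEGATIVE_HINTS = /comment|footer|sidebar|sponsor|widget|banner/i;

// Returns a bonus or penalty derived from an element's class name and id.
function classWeight(className, id) {
  let weight = 0;
  for (const name of [className, id]) {
    if (!name) continue; // skip missing or empty attributes
    if (NEGATIVE_HINTS.test(name)) weight -= 25; // looks like page clutter
    if (POSITIVE_HINTS.test(name)) weight += 25; // looks like main content
  }
  return weight;
}

console.log(classWeight("article-body", ""));    // 25
console.log(classWeight("sidebar", ""));         // -25
console.log(classWeight("main-content", "footer")); // 0
```

The point of the sketch is only that naming matters: a container called `sidebar` starts at a disadvantage before any of its text is examined.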
This score value decides whether the HTML page can be shown in Reader View in Firefox.
I am not absolutely clear whether the score value is set by Firefox or by the readability function.
JavaScript is really not my strong point, and I think someone else should check over the link provided by Richard (https://github.com/mozilla/readability) and see if they can provide a more thorough answer.
What I expected to see, but did not, was a score based on the amount of text content in a <p>, <div>, or other relevant tag.
Any improvements to this question or answer, please share!
EDIT:
Images in <div> or <figure> tags (HTML5) within the <p> element appear to be retained in Reader View when the page text content is valid.
Answer by Sean Bone
I followed Martin's link to the Readability.js GitHub repository and had a look at the source code. Here's what I make of it.
The algorithm works with paragraph tags. First of all, it tries to identify parts of the page which are definitely not content - like forms and so on - and removes them. Then it goes through the paragraph nodes on the page and assigns a score based on content-richness: it gives them points for things like number of commas, length of content, etc. Notice that a paragraph with fewer than 25 characters is immediately discarded.
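The paragraph scoring described above can be sketched as follows. Only the ingredients (comma count, content length, the 25-character cutoff) come from this answer; the exact weights, the cap on length points, and the name `scoreParagraph` are assumptions made for illustration.

```javascript
// Scores one paragraph's text. The weights below are assumed values.
function scoreParagraph(text) {
  const trimmed = text.trim();
  // Paragraphs with fewer than 25 characters are discarded outright.
  if (trimmed.length < 25) return 0;

  let score = 1; // base point for being a usable paragraph
  score += (trimmed.match(/,/g) || []).length; // one point per comma
  // One point per 100 characters, capped at 3 (the cap is an assumption).
  score += Math.min(Math.floor(trimmed.length / 100), 3);
  return score;
}

console.log(scoreParagraph("Too short."));                    // 0
console.log(scoreParagraph("x".repeat(250) + ", y, z"));      // 5
```

In the second call the text is 256 characters long and holds two commas, so it earns 1 base point, 2 comma points, and 2 length points.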
Scores then "bubble up" the DOM tree: each paragraph will add part of its score to all of its parent nodes - a direct parent gets the full score added to its total, a grandparent only half, a great-grandparent a third and so on. This allows the algorithm to identify higher-level elements which are likely to be the main content section.
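The "bubbling up" step can be sketched with plain objects standing in for DOM elements. The divisors follow the description in this answer (full, half, a third); the function name `propagateScore` and the toy node shape are hypothetical.

```javascript
// Adds a paragraph's score to every ancestor, divided by the ancestor's
// distance from the paragraph: parent / 1, grandparent / 2, and so on.
function propagateScore(paragraph, score) {
  let ancestor = paragraph.parent;
  let level = 1;
  while (ancestor) {
    ancestor.score = (ancestor.score || 0) + score / level;
    ancestor = ancestor.parent;
    level += 1;
  }
}

// Toy tree: body > article > section > p
const bodyNode = { name: "body", parent: null };
const articleNode = { name: "article", parent: bodyNode };
const sectionNode = { name: "section", parent: articleNode };
const paragraphNode = { name: "p", parent: sectionNode };

propagateScore(paragraphNode, 6);
console.log(sectionNode.score, articleNode.score, bodyNode.score); // 6 3 2
```

With many paragraphs, the element that directly contains most of them accumulates the largest total, which is how a main content container can be picked out.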
Though this is just Firefox's algorithm, my guess is if it works well for Firefox, it'll work well for other browsers too.
In order for these Reader View algorithms to work for your website, you want them to correctly identify the content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in the algorithm.
So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:
- Use paragraph tags in your content! Many people tend to overlook them in favor of <br /> tags. While it may look similar, many content-related algorithms (not only Reader View ones) rely heavily on them.
- Use HTML5 semantic elements in your markup, like <article>, <nav>, <section>, <aside>. Even though they're not the only criterion (as you noted in the question), these are very useful to computers reading your page (not just Reader View) to distinguish different sections of your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
- Wrap your main content in one container, like an <article> or <div> element. This will receive score points from all the paragraph tags inside it, and be identified as the main content section.
- Keep your DOM tree shallow in content-dense areas. If you have a lot of elements breaking your content up, you're only making life harder for the algorithm: there won't be a single element that stands out as being parent of a lot of content-heavy paragraphs, but many separate ones with low scores.
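To see why a single shallow container helps, the toy comparison below scores the same three paragraphs in two layouts. It is a sketch under assumptions: nodes are plain objects, every paragraph is given an arbitrary score of 6, and `makeNode` and `findTopCandidate` are hypothetical names.

```javascript
// Plain object standing in for a DOM element.
function makeNode(name, parent = null) {
  return { name, parent, score: 0 };
}

// Bubbles each paragraph's score up to its ancestors (divided by distance)
// and returns the ancestor with the highest total.
function findTopCandidate(paragraphs) {
  const candidates = new Set();
  for (const { node, score } of paragraphs) {
    let ancestor = node.parent;
    for (let level = 1; ancestor; level += 1) {
      ancestor.score += score / level;
      candidates.add(ancestor);
      ancestor = ancestor.parent;
    }
  }
  let best = null;
  for (const candidate of candidates) {
    if (!best || candidate.score > best.score) best = candidate;
  }
  return best;
}

// Shallow layout: three paragraphs directly inside one <article>.
const shallowArticle = makeNode("article", makeNode("body"));
const shallowTop = findTopCandidate(
  [1, 2, 3].map(() => ({ node: makeNode("p", shallowArticle), score: 6 }))
);

// Fragmented layout: each paragraph wrapped in its own <div>.
const fragmentedArticle = makeNode("article", makeNode("body"));
const fragmentedTop = findTopCandidate(
  [1, 2, 3].map(() => ({
    node: makeNode("p", makeNode("div", fragmentedArticle)),
    score: 6,
  }))
);

console.log(shallowTop.name, shallowTop.score);       // article 18
console.log(fragmentedTop.name, fragmentedTop.score); // article 9
```

The article wins in both layouts here, but the extra wrapper level halves its total, so it stands out far less against its own children.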

