JavaScript: How to convert an HTML page to plain text in node.js?

Question

Asked by metalaureate

I know this has been asked before, but I can't find a good answer for node.js.

I need to extract the plain text (no tags, scripts, etc.) on the server side from an HTML page that has been fetched.

I know how to do it client-side with jQuery (get the .text() contents of the body tag), but do not know how to do this on the server side.


I've tried https://npmjs.org/package/html-to-text but this doesn't handle scripts.

    var htmlToText = require('html-to-text');
    var request = require('request');
    request.get(url, function (error, result) {
        var text = htmlToText.fromString(result.body, {
            wordwrap: 130
        });
    });

I've tried phantom.js but can't find a way to just get plain text.


Answer 1

Accepted answer by hgoebl

Use jsdom and jQuery (server-side).

With jQuery you can delete all scripts, styles, templates and the like, and then extract the text.

Example


(This is not tested with jsdom and node, only in Chrome.)

jQuery('script').remove()
jQuery('noscript').remove()
jQuery('body').text().replace(/\s{2,9999}/g, ' ')

Answer 2

Answered by Geroj

You can use TextVersionJS (http://textversionjs.com) to generate the plain text version of an HTML string. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in node.js as well.

This library may work for your needs, but it's NOT the same as getting the text of an element in the browser. Its purpose is to create a text version of an HTML email. This means that things like images are included. For example, given the following HTML and code snippet:

var textVersion = require("textversionjs");
var htmlText = "<html>" +
                    "<body>" +
                        "Lorem ipsum <a href=\"http://foo.foo\">dolor</a> sic <strong>amet</strong><br />" +
                        "Lorem ipsum <img src=\"http://foo.jpg\" alt=\"foo\" /> sic <pre>amet</pre>" +
                        "<p>Lorem ipsum dolor <br /> sic amet</p>" +
                        "<script>" +
                            "alert(\"nothing\");" +
                        "</script>" +
                    "</body>" +
                "</html>";
var plainText = textVersion.htmlToPlainText(htmlText);

The variable plainText will contain this string:

Lorem ipsum [dolor] (http://foo.foo) sic amet
Lorem ipsum ![foo] (http://foo.jpg) sic amet
Lorem ipsum dolor
sic amet

Note that it does properly ignore script tags. You'll find the latest version of the source code on GitHub.
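To give a feel for what a RegExp-based conversion involves, here is a minimal, dependency-free sketch. It is not TextVersionJS's actual implementation: the function name htmlToPlainTextSketch is made up for illustration, it does not produce link or image markers, and regular expressions cannot handle every valid HTML document, so treat it as a rough illustration rather than a replacement for a real parser.

```javascript
// Hypothetical helper, not part of TextVersionJS or any other library.
function htmlToPlainTextSketch(html) {
    return html
        // Drop script and style elements together with their contents.
        .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
        // Drop HTML comments.
        .replace(/<!--[\s\S]*?-->/g, ' ')
        // Replace every remaining tag with a space.
        .replace(/<[^>]+>/g, ' ')
        // Decode a handful of common entities (far from complete).
        // &amp; is decoded last so that "&amp;lt;" is not decoded twice.
        .replace(/&nbsp;/g, ' ')
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&amp;/g, '&')
        // Collapse runs of whitespace into single spaces.
        .replace(/\s+/g, ' ')
        .trim();
}

var sample = '<html><body><p>Lorem <strong>ipsum</strong></p>' +
    '<script>alert("nothing");</script></body></html>';
console.log(htmlToPlainTextSketch(sample)); // prints "Lorem ipsum"
```

Because every tag becomes a space, inline markup in the middle of a word (for example fo<b>o</b>) will be split into two words. A DOM-based approach avoids that.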


Answer 3

Answered by Brad

As another answer suggested, use JSDOM, but you don't need jQuery. Try this:


const { JSDOM } = require("jsdom");
JSDOM.fragment(sourceHtml).textContent

Answer 4

Answered by Grimnoff

Why not just get the textContent of the body tag?

var body = document.getElementsByTagName('body')[0];
var bodyText = body.textContent;
