javascript: How to convert HTML page to plain text in node.js?

Disclaimer: This page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/19985667/

Date: 2020-10-27 17:27:28  Source: igfitidea

How to convert HTML page to plain text in node.js?

Tags: javascript, node.js, screen-scraping

Asked by metalaureate

I know this has been asked before, but I can't find a good answer for node.js.

I need to extract the plain text (no tags, scripts, etc.) server-side from an HTML page that has been fetched.

I know how to do it client-side with jQuery (get the .text() contents of the body tag), but do not know how to do this on the server side.

I've tried https://npmjs.org/package/html-to-text but this doesn't handle scripts.

    var htmlToText = require('html-to-text');
    var request = require('request');
    request.get(url, function (error, result) {
        var text = htmlToText.fromString(result.body, {
            wordwrap: 130
        });
    });

I've tried phantom.js but can't find a way to just get plain text.

Accepted answer by hgoebl

Use jsdom and jQuery (server-side).

With jQuery you can delete all scripts, styles, templates and the like and then you can extract the text.

Example

(This is not tested with jsdom and node, only in Chrome)

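// Remove elements whose contents should not appear in the text
jQuery('script').remove();
jQuery('noscript').remove();
// Extract the remaining text and collapse runs of whitespace into one space
jQuery('body').text().replace(/\s{2,9999}/g, ' ');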
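
The steps described in this answer (drop script-like elements, take the remaining text, collapse whitespace) can also be sketched without any dependencies. Note that this swaps in regular expressions for jsdom and jQuery, so it is only an approximation: regexes are not an HTML parser and will likely misbehave on malformed markup. The function name `htmlToPlainText` and the sample input are hypothetical, chosen for illustration.

```javascript
// Minimal sketch, assuming reasonably well-formed HTML input.
// This is a regex approximation, NOT the jsdom/jQuery approach from the answer.
function htmlToPlainText(html) {
  return html
    // Drop elements whose contents are never visible text.
    .replace(/<(script|style|noscript|template)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Drop the remaining tags, keeping the text between them.
    .replace(/<[^>]+>/g, ' ')
    // Decode a few common entities (deliberately not a complete list).
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&') // last, so "&amp;lt;" is not decoded twice
    // Collapse whitespace runs, like the final replace() in the answer.
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical sample page.
const sample =
  '<html><head><style>p { color: red; }</style></head>' +
  '<body><p>Hello <b>world</b></p><script>alert("x");</script></body></html>';

console.log(htmlToPlainText(sample)); // prints "Hello world"
```

Unlike `.text()`, this sketch replaces every tag with a space, so a word split by inline tags (for example `wor<b>ld</b>`) comes out as `wor ld`. For real pages, a DOM-based approach as suggested in the answer is likely more robust.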

Answer by Geroj

You can use TextVersionJS (http://textversionjs.com) to generate the plain text version of an HTML string. It's pure javascript (with tons of RegExps) so you can use it in the browser and in node.js as well.

This library may work for your needs, but it's NOT the same as getting the text of an element in the browser. Its purpose is to create a text version of an HTML email. This means that things like images are included. For example, given the following HTML and code snippet:

var textVersion = require("textversionjs");
var htmlText = "<html>" +
                    "<body>" +
                        "Lorem ipsum <a href=\"http://foo.foo\">dolor</a> sic <strong>amet</strong><br />" +
                        "Lorem ipsum <img src=\"http://foo.jpg\" alt=\"foo\" /> sic <pre>amet</pre>" +
                        "<p>Lorem ipsum dolor <br /> sic amet</p>" +
                        "<script>" +
                            "alert(\"nothing\");" +
                        "</script>" +
                    "</body>" +
                "</html>";
var plainText = textVersion.htmlToPlainText(htmlText);

The variable plainText will contain this string:

Lorem ipsum [dolor] (http://foo.foo) sic amet
Lorem ipsum ![foo] (http://foo.jpg) sic amet
Lorem ipsum dolor
sic amet

Note that it does properly ignore script tags. You'll find the latest version of the source code on GitHub.

Answer by Brad

As another answer suggested, use JSDOM, but you don't need jQuery. Try this:

JSDOM.fragment(sourceHtml).textContent

Answer by Grimnoff

Why not just get textContent of the body tag?

var body = document.getElementsByTagName('body')[0];
var bodyText = body.textContent;