Original question: http://stackoverflow.com/questions/3813167/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not to this page): Stack Overflow
What is the most convenient way to convert HTML to plain text while preserving line breaks (with JavaScript)?
Asked by Danylo Mysak
Basically, I just need the effect of copying that HTML from a browser window and pasting it into a textarea element.
For example, I want this:
<p>Some</p>
<div>text<br />Some</div>
<div>text</div>
to become this:
Some
text
Some
text
Answered by Tim Down
If that HTML is visible within your web page, you could do it with the user selection (or just a TextRange in IE). This preserves line breaks, though not necessarily leading and trailing white space.
UPDATE 10 December 2012
However, the toString() method of Selection objects is not yet standardized and works inconsistently between browsers, so this approach rests on shaky ground and I don't recommend using it now. I would delete this answer if it weren't accepted.
Demo: http://jsfiddle.net/wv49v/
Code:
function getInnerText(el) {
    var sel, range, innerText = "";
    if (typeof document.selection != "undefined" && typeof document.body.createTextRange != "undefined") {
        // Old IE: read the text through a TextRange
        range = document.body.createTextRange();
        range.moveToElementText(el);
        innerText = range.text;
    } else if (typeof window.getSelection != "undefined" && typeof document.createRange != "undefined") {
        // Other browsers: select the element and stringify the selection
        sel = window.getSelection();
        sel.selectAllChildren(el);
        innerText = "" + sel;
        sel.removeAllRanges();
    }
    return innerText;
}
Answered by Kevin Wiskia
I tried to find some code I wrote and used for this a while back. It worked nicely. Let me outline what it did, and hopefully you can duplicate its behavior.
- Replace images with alt or title text.
- Replace links with "text[link]"
- Replace things that generally produce vertical white space. h1-h6, div, p, br, hr, etc. (I know, I know. These could actually be inline elements, but it works out well.)
- Strip out the rest of the tags and replace with an empty string.
You could even expand this more to format things like ordered and unordered lists. It really just depends on how far you'll want to go.
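The outline above could be sketched in JavaScript roughly as follows. This is a regex-based approximation, not a real HTML parser. The function name outlineToText is hypothetical, only alt text is used for images (no title fallback), and the "text [href]" link format is one reading of the second step.

```javascript
// Hypothetical helper: a regex-based sketch of the four steps listed above.
function outlineToText(html) {
  return html
    // 1. Replace images with their alt text (title fallback omitted).
    .replace(/<img\b[^>]*?\balt=["']?([^"'>]*)["']?[^>]*>/gi, "$1")
    // 2. Replace links with "text [href]".
    .replace(/<a\b[^>]*?\bhref=["']?([^"'\s>]*)["']?[^>]*>([\s\S]*?)<\/a>/gi, "$2 [$1]")
    // 3. Turn tags that usually produce vertical white space into newlines.
    .replace(/<(?:\/p|\/div|\/h[1-6]|br|hr)\s*\/?>/gi, "\n")
    // 4. Strip every remaining tag.
    .replace(/<[A-Za-z\/][^<>]*>/g, "");
}

console.log(outlineToText("<p>Some</p><div>text<br />Some</div><div>text</div>"));
```

Entities such as &amp; are left untouched by this sketch, so it only suits input where that does not matter.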
EDIT
Found the code!
using System.Text.RegularExpressions;

public static string Convert(string template)
{
    template = Regex.Replace(template, "<img .*?alt=[\"']?([^\"']*)[\"']?.*?/?>", "$1"); /* Use image alt text. */
    template = Regex.Replace(template, "<a .*?href=[\"']?([^\"']*)[\"']?.*?>(.*)</a>", "$2 [$1]"); /* Convert links to "text [link]". */
    template = Regex.Replace(template, "<(/p|/div|/h\\d|br)\\s*/?>", "\n"); /* Let's try to keep vertical whitespace intact. */
    template = Regex.Replace(template, "<[A-Za-z/][^<>]*>", ""); /* Remove the rest of the tags. */
    return template;
}
Answered by chrmcpn
I made a function based on this answer: https://stackoverflow.com/a/42254787/3626940
function htmlToText(html){
    // remove source line breaks and tabs
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    // keep HTML line breaks and tabs
    html = html.replace(/<\/td>/g, "\t");
    html = html.replace(/<\/table>/g, "\n");
    html = html.replace(/<\/tr>/g, "\n");
    html = html.replace(/<\/p>/g, "\n");
    html = html.replace(/<\/div>/g, "\n");
    html = html.replace(/<\/h[1-6]>/g, "\n");
    html = html.replace(/<br>/g, "\n");
    html = html.replace(/<br( )*\/>/g, "\n");

    // parse HTML into text
    var dom = (new DOMParser()).parseFromString('<!doctype html><body>' + html, 'text/html');
    return dom.body.textContent;
}
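As a possible variation, the tag-to-whitespace pass in that function could be driven by a small table instead of one replace call per tag. The names TAG_WHITESPACE and markStructure are hypothetical, and the sketch covers only the string pass: the result still contains opening tags and would still need the DOMParser step to become plain text.

```javascript
// Hypothetical table: which tags stand for which whitespace character.
const TAG_WHITESPACE = [
  [/<\/td>/gi, "\t"],
  [/<\/(?:table|tr|p|div|h[1-6])>/gi, "\n"],
  [/<br\s*\/?>/gi, "\n"],
];

function markStructure(html) {
  // Drop source-level newlines and tabs first, as the function above does.
  let out = html.replace(/[\n\t]/g, "");
  // Then substitute whitespace for each structural tag in the table.
  for (const [pattern, replacement] of TAG_WHITESPACE) {
    out = out.replace(pattern, replacement);
  }
  return out;
}

console.log(JSON.stringify(markStructure("<p>Some</p>\n<div>text<br />Some</div>")));
```

Adding another tag then means adding one table row rather than another replace line.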
Answered by holm50
Based on chrmcpn's answer: I had to convert a basic HTML email template into a plain-text version as part of a build script in Node.js. I had to use JSDOM to make it work, but here's my code:
const { JSDOM } = require("jsdom");

const htmlToText = (html) => {
    // remove source line breaks and tabs
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    // turn paragraph, heading and <br> tags into line breaks
    html = html.replace(/<\/p>/g, "\n\n");
    html = html.replace(/<\/h1>/g, "\n\n");
    html = html.replace(/<br>/g, "\n");
    html = html.replace(/<br( )*\/>/g, "\n");

    const dom = new JSDOM(html);
    let text = dom.window.document.body.textContent;

    // collapse runs of spaces and drop spaces at the start of a line
    text = text.replace(/ {2,}/g, " ");
    text = text.replace(/\n /g, "\n");
    text = text.trim();
    return text;
}
Answered by Serapth
Three steps.
First, get the HTML as a string.
Second, replace all <BR /> and <BR> with \r\n.
Third, use the regular expression "<(.|\n)*?>" to replace all markup with "".
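A minimal JavaScript sketch of these three steps might look like this. The function name stripMarkup is hypothetical, and matching the tags case-insensitively and allowing optional whitespace inside the BR tag are assumptions.

```javascript
// Hypothetical helper following the three steps above.
function stripMarkup(html) {
  // Step 1: `html` is already a string at this point.
  // Step 2: replace <BR> and <BR /> (any case) with \r\n.
  const withBreaks = html.replace(/<br\s*\/?>/gi, "\r\n");
  // Step 3: replace all remaining markup with "".
  return withBreaks.replace(/<(.|\n)*?>/g, "");
}

console.log(JSON.stringify(stripMarkup("text<BR />Some<br><b>bold</b>")));
```

Only BR tags become line breaks here, so block elements such as p and div contribute no break of their own.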

