Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/2875027/

Clean Microsoft Word Pasted Text using JavaScript

javascript, ms-word, paste

Asked by OneNerd

I am using a 'contenteditable' <div/> and enabling PASTE.

It is amazing how much markup gets pasted in from a clipboard copy from Microsoft Word. I am battling this, and have gotten about halfway there using Prototype's stripTags() function (which unfortunately does not seem to let me keep some tags).

However, even after that, I wind up with a mind-blowing amount of unneeded markup code.

So my question is, is there some function (using JavaScript), or approach I can use that will clean up the majority of this unneeded markup?

Accepted answer by OneNerd

Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).

I am certainly open to suggestions for improvement if anyone has any. Thanks.

function cleanWordPaste( in_word_text ) {
 // parse the pasted markup in a detached element and keep only its text
 var tmp = document.createElement("DIV");
 tmp.innerHTML = in_word_text;
 var newString = tmp.textContent||tmp.innerText;
 // this next piece converts blank lines into break tags, then strips
 // everything on a line up to and including a comment's closing -->
 newString  = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g,"");
 // this next piece removes any break tags (up to 10) at beginning
 for ( var i=0; i<10; i++ ) {
  if ( newString.substr(0,6)=="<br />" ) { 
   newString = newString.replace("<br />", ""); 
  }
 }
 return newString;
}

Hope this is helpful to some of you.
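
The function above discards every tag, while the question also asks about keeping some of them. One possible approach is an allowlist, sketched roughly below. `stripTagsExcept` and `KEEP_TAGS` are hypothetical names that are not part of the original answer, and regex-based tag handling like this will trip on attribute values that contain `>`.

```javascript
// Hypothetical allowlist; adjust it to the tags the editor should keep.
var KEEP_TAGS = ["p", "br", "b", "strong", "i", "em", "ul", "ol", "li"];

function stripTagsExcept(html, keep) {
  return html
    // drop comments and whole <style>/<script>/<xml> blocks first
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<(style|script|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // then rewrite every remaining tag: kept tags lose their attributes,
    // all other tags are removed (their text content stays)
    .replace(/<\/?([a-z][a-z0-9:]*)\b[^>]*>/gi, function (tag, name) {
      name = name.toLowerCase();
      if (keep.indexOf(name) === -1) return "";
      return tag.charAt(1) === "/" ? "</" + name + ">" : "<" + name + ">";
    });
}

var pasted = '<p class="MsoNormal" style="margin:0"><span lang="EN-US">' +
             'Hi <b>there</b><o:p></o:p></span></p>';
console.log(stripTagsExcept(pasted, KEEP_TAGS)); // <p>Hi <b>there</b></p>
```

Removing attributes from the kept tags is what gets rid of Word's class and style noise; walking a parsed DOM instead of using regular expressions would be sturdier.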

Answer by Daniel Sellers

I am using this:

$(body_doc).find('body').bind('paste',function(e){
    var rte = $(this);
    // snapshot of the content before the paste lands
    var _activeRTEData = $(rte).html();
    var beginLen = $.trim($(rte).html()).length;

    // the pasted content is only in the element once the event has finished
    setTimeout(function(){
        var text = $(rte).html();
        var newLen = $.trim(text).length;

        //identify the first char that changed to determine caret location
        var caret = 0;

        for(var i=0;i < newLen; i++){
            if(_activeRTEData[i] != text[i]){
                // clamp at 0 so a change in the very first char cannot give -1
                caret = Math.max(i-1, 0);
                break;
            }
        }

        var origText = text.slice(0,caret);
        var newText = text.slice(caret, newLen - beginLen + caret + 4);
        var tailText = text.slice(newLen - beginLen + caret + 4, newLen);

        // strip conditional comments, tags, &nbsp; entities and CSS rule blocks
        newText = newText.replace(/(.*(?:endif-->))|([ ]?<[^>]*>[ ]?)|(&nbsp;)|([^}]*})/g,'');

        // strip the middle-dot characters left over from bullets
        newText = newText.replace(/[·]/g,'');

        $(rte).html(origText + newText + tailText);
        $(rte).contents().last().focus();
    },100);
});

body_doc is the editable iframe; if you are using an editable div, you can drop the .find('body') part. Basically, it detects a paste event, works out where the paste landed, cleans the new text, and then puts the cleaned text back where it was pasted. (It sounds confusing, but it's not really as bad as it sounds.)
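
The "works out where the paste landed" step can also be written as a small standalone function. The sketch below is not the code from the answer: it compares the content before and after the paste by longest common prefix and suffix, and `findInsertedSpan` is a hypothetical name.

```javascript
// Hypothetical helper: locate the text a paste added by comparing the
// content before and after it, using the longest common prefix and suffix.
function findInsertedSpan(before, after) {
  var max = Math.min(before.length, after.length);
  var start = 0;
  while (start < max && before.charAt(start) === after.charAt(start)) start++;
  var tail = 0;
  // the suffix may not overlap the prefix that was already matched
  while (tail < max - start &&
         before.charAt(before.length - 1 - tail) ===
         after.charAt(after.length - 1 - tail)) tail++;
  return { start: start, end: after.length - tail };
}

var before = "<p>ab</p>";
var after = "<p>aXYZb</p>";
var span = findInsertedSpan(before, after);
console.log(after.slice(span.start, span.end)); // XYZ
```

Everything outside the returned span is unchanged content, so only `after.slice(span.start, span.end)` would need cleaning.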

The setTimeout is needed because you can't grab the text until it has actually been pasted into the element; paste events fire as soon as the paste begins.
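
In browsers that expose `clipboardData` on the paste event, an alternative to waiting with setTimeout is to read the clipboard's HTML inside the handler and insert a cleaned version yourself. This is a different technique from the answer's, sketched under assumptions: `makePasteHandler` is a hypothetical helper, and the cleaning function and insertion callback are supplied by the caller.

```javascript
// Hypothetical helper: builds a paste handler from a cleaning function and
// an insertion callback, both supplied by the caller.
function makePasteHandler(clean, insert) {
  return function (event) {
    var data = event.clipboardData;
    if (!data) return;                 // no clipboard access: default paste
    var html = data.getData("text/html");
    if (!html) return;                 // plain-text paste: leave it alone
    event.preventDefault();            // stop the browser inserting raw markup
    insert(clean(html));
  };
}

// Possible browser wiring (assumes an element with id "editor" and a
// cleaning function of your own called cleanHtml):
// document.getElementById("editor").addEventListener("paste",
//   makePasteHandler(cleanHtml, function (html) {
//     document.execCommand("insertHTML", false, html);
//   }));
```

Because the markup is cleaned before it reaches the element, there is no need to work out afterwards which part of the content is new.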

Answer by Todd Main

You can either use the full CKEditor, which cleans on paste, or look at the source.

Answer by Josh

How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? That way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.
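
If the plain text from such a textarea has to go back into an HTML editor, it needs escaping and some paragraph markup. A minimal sketch, assuming blank lines separate paragraphs; `escapeHtml` and `plainTextToHtml` are hypothetical names, not something the answer describes.

```javascript
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function plainTextToHtml(text) {
  return text
    .replace(/\r\n?/g, "\n")   // normalise Windows and old Mac line endings
    .split(/\n{2,}/)           // blank lines separate paragraphs
    .filter(function (block) { return block.trim() !== ""; })
    .map(function (block) {
      // single line breaks inside a paragraph become <br>
      return "<p>" + escapeHtml(block.trim()).replace(/\n/g, "<br>") + "</p>";
    })
    .join("");
}

console.log(plainTextToHtml("First line\nsecond line\n\nA < B"));
// <p>First line<br>second line</p><p>A &lt; B</p>
```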

Answer by user759463

This works great to remove any comments from HTML text, including those from Word:

function CleanWordPastedHTML(sTextHTML) {
  var sStartComment = "<!--", sEndComment = "-->";
  while (true) {
    var iStart = sTextHTML.indexOf(sStartComment);
    if (iStart == -1) break;
    var iEnd = sTextHTML.indexOf(sEndComment, iStart);
    if (iEnd == -1) break;
    sTextHTML = sTextHTML.substring(0, iStart) + sTextHTML.substring(iEnd + sEndComment.length);
  }
  return sTextHTML;
}
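
For comparison, the same comment removal can be sketched with a single lazy regular expression. `stripHtmlComments` is a hypothetical name; like the loop above, it leaves an unterminated comment untouched.

```javascript
// Alternative sketch: a lazy regular expression instead of the indexOf loop.
// [\s\S] matches any character including newlines, so comments that span
// several lines are covered as well.
function stripHtmlComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

var sample = "<p>Hello<!--[if gte mso 9]>\n<xml>junk</xml>\n<![endif]--> world</p>";
console.log(stripHtmlComments(sample)); // <p>Hello world</p>
```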

Answer by ericmotil

Had a similar issue with line-breaks being counted as characters and I had to remove them.

$(document).ready(function(){

  $(".section-overview textarea").bind({
    paste : function(){
      // wait until the pasted text is actually in the textarea
      setTimeout(function(){
        //textarea
        var text = $(".section-overview textarea").val();
        // look for any "\n" occurrences and remove them
        var newString = text.replace(/\n/g, '');
        // write the new string back
        $(".section-overview textarea").val(newString);
      },100);
    }
  });
  
});
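
Removing "\n" outright joins the last word of one line to the first word of the next, and it leaves any "\r" characters in place. A small variation that handles both is sketched here; `flattenLineBreaks` and its `replacement` parameter are hypothetical, not part of the answer.

```javascript
function flattenLineBreaks(text, replacement) {
  // \r\n (Windows), \n (Unix) and a lone \r each count as one line break
  return text.replace(/\r\n|\n|\r/g, replacement === undefined ? "" : replacement);
}

console.log(flattenLineBreaks("one\r\ntwo\nthree", " ")); // one two three
console.log(flattenLineBreaks("one\r\ntwo\nthree"));      // onetwothree
```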

Answer by rob

I did something like that long ago: I completely cleaned up the content of a rich text editor, converting font tags to styles, <br>s to <p>s, and so on, to keep it consistent between browsers and to prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic. This might be a good starting point, if that is what you need ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string):

var cleanDom = function(result, n) {
var nn = n.nodeName;
if(nn=="#text") {
    var text = n.nodeValue;

    }
else {
    if(nn=="A" && n.href)
        ...;
    else if(nn=="IMG" && n.src) {
        ....
        }
    else if(nn=="DIV") {
        if(n.className=="indent")
            ...
        }
    else if(nn=="FONT") {
        }       
    else if(nn=="BR") {
        }

    if(!UNSUPPORTED_ELEMENTS[nn]) {
        if(n.childNodes.length > 0)
            for(var i=0; i<n.childNodes.length; i++) 
                cleanDom(result, n.childNodes[i]);
        }
    }
}
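
To make the shape of that recursion concrete, here is a much smaller runnable walk in the same style that only collects text. It is an illustration, not a reconstruction of the elided branches above: `collectText`, the `parts` array and the `SKIPPED` list are hypothetical (the original UNSUPPORTED_ELEMENTS is not shown).

```javascript
// Hypothetical skip list; the original UNSUPPORTED_ELEMENTS is not shown.
var SKIPPED = ["STYLE", "SCRIPT", "META", "LINK", "#comment"];

// Walks a node tree and accumulates plain text into result.parts.
// Only nodeName, nodeValue and childNodes are used, so it works on DOM
// nodes and on plain objects of the same shape.
function collectText(result, n) {
  var nn = n.nodeName;
  if (nn === "#text") {
    result.parts.push(n.nodeValue);
  } else if (nn === "BR") {
    result.parts.push("\n");
  } else if (SKIPPED.indexOf(nn) === -1) {
    for (var i = 0; i < n.childNodes.length; i++) {
      collectText(result, n.childNodes[i]);
    }
    if (nn === "P") result.parts.push("\n"); // end each paragraph with a break
  }
  return result;
}

function text(value) { return { nodeName: "#text", nodeValue: value, childNodes: [] }; }
var tree = { nodeName: "DIV", childNodes: [
  text("Hi"),
  { nodeName: "STYLE", childNodes: [text("p{}")] },
  { nodeName: "BR", childNodes: [] },
  { nodeName: "P", childNodes: [text("there")] }
] };
console.log(JSON.stringify(collectText({ parts: [] }, tree).parts.join("")));
// "Hi\nthere\n"
```

Joining `result.parts` at the end is the "second pass" that turns the accumulated result into a string.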

Answer by souLTower

Could you paste to a hidden textarea, copy from same textarea, and paste to your target?

Answer by Amy B

Hate to say it, but I eventually gave up on making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US">, for example), and I correct it manually.
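
The detection half of that workflow might be sketched as follows. The marker list is an assumption based on patterns that commonly show up in HTML copied from Word, plus the span mentioned above; `looksLikeWordHtml` is a hypothetical name, and the notification step is left out.

```javascript
// Hypothetical marker list: patterns that commonly appear in Word HTML.
var WORD_MARKERS = [
  /class="?Mso/i,           // MsoNormal, MsoListParagraph, ...
  /<\/?o:p\b/i,             // Office namespace paragraph tags
  /style="[^"]*\bmso-/i,    // mso-* CSS properties
  /<span\s+lang="?en-us/i   // the marker mentioned in the answer
];

function looksLikeWordHtml(html) {
  return WORD_MARKERS.some(function (re) { return re.test(html); });
}

console.log(looksLikeWordHtml('<span lang="en-US">text</span>')); // true
console.log(looksLikeWordHtml("<p>plain paragraph</p>"));         // false
```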