Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/832020/
How do you deal with the "special" characters that MS Word adds?
Asked by Darryl Hein
I'm wondering how you clean up the special characters that MS Word adds, such as em and en dashes and curly quotes?
I often find myself copying content from clients from Word and pasting into a static HTML page, but the content ends up with weird characters because the special characters are not converted to their correct ACSII codes and therefore show up as garbled text. (For these basic websites, I'm using Dreamweaver.)
我经常发现自己从 Word 中复制客户端的内容并粘贴到静态 HTML 页面中,但内容最终会出现奇怪的字符,因为特殊字符没有转换为正确的 ACSII 代码,因此显示为乱码。(对于这些基本网站,我使用的是 Dreamweaver。)
I have seen a lot of similar problems when clients copy content from Word into text only fields (mostly textareas). When I put this into a PDF (through PHP) or it shows up on the page it too has garbled text.
当客户将 Word 中的内容复制到纯文本字段(主要是 textarea)时,我看到了很多类似的问题。当我将其放入 PDF(通过 PHP)或它显示在页面上时,它也出现乱码。
How do you deal with this? Is there a cleaning service or program you use?
你如何处理这个问题?是否有您使用的清洁服务或程序?
Answer by chazomaticus
With regard to clients posting text copied and pasted from Word into textareas:
The most reliable way to ensure that the client sends you text in a particular encoding (and so, hopefully, does any conversion from CP-1252 [or whatever Word uses] for you) is to add the accept-charset="..." attribute to all your <form>s. E.g.:
<form ... accept-charset="UTF-8">
...
</form>
Most browsers will obey that and make sure any "Word-specific" characters are converted to the appropriate character set before they get to your website.
Once invalid text gets to your website, there's very little you can do to fix it reliably, so it's best to simply check that all input is valid in whatever character set you use, and discard any requests that contain invalid text. This is necessary even with accept-charset, because undoubtedly there are some clients out there that will ignore it.
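The "check all input and discard invalid requests" step could be sketched in PHP roughly as follows. This is only an illustration, assuming the site uses UTF-8 and that the mbstring extension is available; the helper name all_fields_are_valid_utf8 is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns true only if
// every submitted value, including values in nested arrays, is valid UTF-8.
// Assumes the mbstring extension is loaded.
function all_fields_are_valid_utf8(array $fields): bool
{
    foreach ($fields as $value) {
        if (is_array($value)) {
            // Form fields such as name="tags[]" arrive as nested arrays.
            if (!all_fields_are_valid_utf8($value)) {
                return false;
            }
        } elseif (!mb_check_encoding((string) $value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Possible use at the top of a form handler:
// if (!all_fields_are_valid_utf8($_POST)) {
//     http_response_code(400);
//     exit('Invalid text encoding.');
// }
```

Rejecting the whole request, rather than trying to repair it, follows the answer's point that invalid text cannot be fixed reliably once it has arrived.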
Answer by Rutunj sheladiya
You can use the preg_replace function to remove all of Word's special characters (and any other non-ASCII characters) from your string:
// Strips every non-ASCII byte; note that this also removes legitimate accented characters.
$str = preg_replace('/[^\x00-\x7F]+/', '', $str);
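If deleting the characters outright is too lossy, one alternative is to map the most common Word punctuation to plain ASCII equivalents first. The sketch below assumes the input is already UTF-8 and PHP 7 or later (for the "\u{...}" escape syntax); the function name and the choice of replacements are illustrative, not from the original answer.

```php
<?php
// Hypothetical helper: replaces common Word "smart" punctuation with ASCII.
// Assumes $text is UTF-8 encoded.
function word_punctuation_to_ascii(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ];
    // strtr() with an array replaces each key wherever it occurs
    // and leaves all other bytes untouched.
    return strtr($text, $map);
}
```

Any non-ASCII characters that are not in the map (accented letters, for instance) pass through unchanged, so this can be followed by the preg_replace call above only if a strictly ASCII result is really required.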
Answer by Michael Borgwardt
Make sure you specify an encoding everywhere and use UTF-8; then those "special" characters should survive just fine. But once they've gone through an encoding that can't represent them, the information about which character each one originally was is lost, so it can't be repaired (except in some specific, though probably very common, cases such as a mix-up between Cp1252 and ISO-8859-1).
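For the repairable case where the raw bytes really are Cp1252 (for example, text that was stored or labelled as ISO-8859-1), a conversion to UTF-8 might look like the sketch below. It assumes the mbstring extension is available and that anything which is not valid UTF-8 is Cp1252, which is a guess about the data rather than something PHP can detect; the function name is hypothetical.

```php
<?php
// Hypothetical helper: returns UTF-8 text, treating any input that is not
// already valid UTF-8 as Windows-1252 (Cp1252). That fallback is an
// assumption about where the text came from.
function to_utf8_assuming_cp1252(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        // Already valid UTF-8 (this includes plain ASCII): leave it alone.
        return $bytes;
    }
    // In Cp1252, bytes such as 0x93 and 0x94 are curly double quotes and
    // 0x96 and 0x97 are the en and em dashes.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}
```

This only helps while the original bytes are still intact. If the characters have already been replaced with "?" or dropped by an earlier conversion, there is nothing left to recover, as the answer says.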
Answer by Adrien
You might try the Demoroniser.
Answer by JasonPlutext
Make sure Word is configured to use UTF-8 for "Save As..." HTML.
This is in Options > Word Options > Advanced > Web Options > Encoding
Answer by Scott
If it's a Word file that's just text (i.e., no graphics, tables, etc.), you might try saving as HTML from within Word, copying and pasting the resulting HTML into your document in Dreamweaver, and then using Dreamweaver's "Clean Up Word HTML" function (under the Commands menu).
As an alternative, you can try Fix My HTML, though I've not personally tried it with Word text, so results may vary.