Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1972362/

Time: 2020-09-09 18:48:33  Source: igfitidea

Why is my Bash script adding <feff> to the beginning of files?

Tags: linux, bash, sed, cp

Asked by SDGuero

I've written a script that cleans up .csv files by using sed to remove some bad commas and bad quotes ("bad" meaning they break an in-house program we use to transform these files):

# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed "$1" > "$1.1st"

# remove all quotes
sed 's/\"//g' "$1.1st" > "$1.tmp"

# add the good quotes around good commas
sed 's/\,/\"\,\"/g' "$1.tmp" > "$1.tmp1"

# add leading quotes
sed 's/^/\"/' "$1.tmp1" > "$1.tmp2"

# add trailing quotes
sed 's/$/\"/' "$1.tmp2" > "$1.tmp3"

# remove utf characters
sed 's/<feff>//' "$1.tmp3" > "$1.tmp4"

# replace original file with new stripped version and delete .tmp files
cp -rf "$1.tmp4" "quotes_$1"

Here is clean.sed:

s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;

Then it removes the temp files and voilà, we have a new file whose name starts with the word "quotes" that we can use for our other processes.

My question is:
Why do I have to add a sed statement to remove the feff tag in that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this, but if I put in the sed statement to remove it before the cp, it isn't there.

Maybe I'm just missing something...

Accepted answer by Mark Byers

U+FEFF is the code point for a byte order mark (BOM). Your files most likely contain data saved in UTF-16, and the BOM has been corrupted by your "cleaning process", which most likely expects ASCII. Rather than removing the BOM, it is probably better to fix your scripts so that they do not corrupt it in the first place.
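
As a rough way to check which BOM, if any, a file starts with, one can look at its first bytes. The sketch below is not part of the answer: `bom_type` is a hypothetical helper name, and it only recognizes the UTF-8 mark (EF BB BF) and the two UTF-16 marks (FF FE and FE FF).

```shell
# bom_type FILE: print which byte order mark FILE starts with, if any.
# Hypothetical helper; UTF-32 marks are not distinguished from UTF-16.
bom_type() {
    # Dump the first three bytes as lowercase hex with no separators.
    bom_hex=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$bom_hex" in
        efbbbf) echo "utf-8" ;;
        fffe*)  echo "utf-16le" ;;
        feff*)  echo "utf-16be" ;;
        *)      echo "none" ;;
    esac
}
```

Running `bom_type` on the original file and on each temp file in turn would show at which step the mark first appears.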

Answer by stinkoid

To get rid of these in GNU Emacs:

  1. Open Emacs
  2. Do a find-file-literally to open the file
  3. Delete the leading three bytes
  4. Save the file
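
The same three bytes can also be dropped from the command line. This is a minimal sketch, assuming the file carries a UTF-8 BOM (EF BB BF, which is what "three bytes" suggests); `strip_utf8_bom` is a hypothetical name, not something from the answer.

```shell
# strip_utf8_bom FILE: write FILE to stdout without a leading UTF-8 BOM.
# Hypothetical helper; files without a BOM pass through unchanged.
strip_utf8_bom() {
    # Build the three BOM bytes with printf so no sed-specific
    # escape syntax is needed.
    bom=$(printf '\357\273\277')
    # Only line 1 is touched, and only at its very start.
    sed "1s/^${bom}//" "$1"
}
```

A typical call would be `strip_utf8_bom input.csv > output.csv`, where both file names are placeholders.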

There is also a way to convert files from the DOS line-ending convention (CRLF) to the Unix convention (LF).
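
The answer does not spell that conversion out. One possible approach outside Emacs, given here only as a sketch, is to strip the carriage return at the end of each line with sed; `crlf_to_lf` is a hypothetical name.

```shell
# crlf_to_lf FILE: write FILE to stdout with CRLF line endings turned into LF.
# Hypothetical helper; carriage returns elsewhere in a line are left alone.
crlf_to_lf() {
    # Produce a literal carriage return portably instead of relying on
    # sed understanding the \r escape.
    cr=$(printf '\r')
    sed "s/${cr}\$//" "$1"
}
```

A typical call would be `crlf_to_lf dosfile.csv > unixfile.csv`, with both names being placeholders.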