macOS: stripping hex bytes with sed - no match

Question

Asked by G__

I have a text file with two non-ascii bytes (0xFF and 0xFE):

??58832520.3,ABC
348384,DEF

The hex for this file is:

FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
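To reproduce the problem, the sample file can be rebuilt from this hex dump. The sketch below is only an illustration: the temporary file name and the variable names are made up here, and octal escapes (`\377` for 0xFF, `\376` for 0xFE) are used because not every `printf` understands `\x`.

```shell
# Recreate the sample file from the question.
# \377 = 0xFF and \376 = 0xFE; the file has no trailing newline, as in the dump.
sample=$(mktemp "${TMPDIR:-/tmp}/bom.XXXXXX")
printf '\377\37658832520.3,ABC\n348384,DEF' > "$sample"

# Dump every byte as hex and join the output into one lowercase string.
hex=$(od -An -v -tx1 "$sample" | tr -d ' \n')
echo "$hex"

rm -f "$sample"
```

The printed string should correspond byte for byte to the dump above.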

It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).

I am trying to strip these bytes out with sed, but nothing I do seems to match them.

$ sed 's/[^a-zA-Z0-9\,]//g' test.csv 
??588325203,ABC
348384,DEF

$ sed 's/[a-zA-Z0-9\,]//g' test.csv 
??.

Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations of each other, so logically one of them has to filter out these bytes, right? Why do both of them leave the 0xFF and 0xFE bytes in place?

Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:

$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF

FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.
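One plausible reading of this output, assuming this sed does not expand `\xHH` escapes: inside a POSIX bracket expression a backslash is an ordinary character, so `[\x80-\xff]` would be read as the literal characters `\`, `x`, `8` and `f` plus the range from `0` to `\`. That range covers the digits and the uppercase letters, and without the `g` flag only the first match on each line is removed. The sketch below reproduces the same effect with the explicit range `[0-Z]`, which is a subset of that set; the variable name is made up for illustration.

```shell
# [0-Z] stands in for the range '0'..'\' that the misread expression would contain.
# Without the g flag, sed removes only the first matching character per line.
out=$(printf '\377\37658832520.3,ABC\n348384,DEF\n' \
      | LC_ALL=C sed 's/[0-Z]//' \
      | od -An -v -tx1 | tr -d ' \n')
echo "$out"
```

The 0xFF and 0xFE bytes survive because they are not in the set, while the leading "5" and "3" are.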

Bigger Update: This problem seems to be system-specific. The problem was observed on OSX, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.

A solution: This same task seems easy enough via Perl:

$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF

However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.

Answer 1

Accepted answer by deinst

sed 's/[^ -~]//g'

or as the other answer implies

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter entitled "Escapes".

Edit: on OS X, the native LANG setting is en_US.UTF-8

try

LANG='' sed 's/[^ -~]//g' myfile

This works on an OS X machine here; I'm not entirely sure why it does not work in a UTF-8 locale.
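A likely explanation, offered as an assumption rather than a confirmed diagnosis: 0xFF and 0xFE can never occur in valid UTF-8, so in a UTF-8 locale the regex engine may not treat them as characters at all, and then neither a bracket expression nor its negation matches them. In the C locale every byte counts as a character. Note also that `LC_ALL` takes precedence over `LC_CTYPE` and `LANG`, so setting it is more robust than clearing `LANG` alone. A minimal sketch, with a made-up variable name:

```shell
# Force the byte-oriented C locale for this one command, then delete
# everything outside printable ASCII (space through tilde).
clean=$(printf '\377\37658832520.3,ABC\n348384,DEF\n' | LC_ALL=C sed 's/[^ -~]//g')
printf '%s\n' "$clean"
```

The locale override applies only to that single sed invocation, so the rest of the shell session keeps its UTF-8 settings.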

Answer 2

Answered by Gary

This will strip out the specific byte sequence FF FE wherever it occurs (it removes the bytes, not the lines that contain them):

sed -e 's/\xff\xfe//g' hexquestion.txt

The reason your negated regexes aren't working is that [] specifies a character class. sed assumes a particular character set, probably ASCII. The bytes in your file aren't 7-bit ASCII characters (both start with the hex digit F), so sed doesn't know how to deal with them. The solution above doesn't use character classes, so it should be more portable between platforms and character sets.
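If the goal is only to delete specific byte values, `tr` is another option. This is a different tool from the sed command in this answer, shown as a sketch only. It uses octal escapes, and sets `LC_ALL=C` on the assumption that a UTF-8 locale could otherwise interfere; the sample data and variable name are illustrative.

```shell
# Delete every 0xFF (\377) and 0xFE (\376) byte, wherever it appears in the input.
stripped=$(printf '\377\37658832520.3,ABC\n\377\376348384,DEF\n' \
           | LC_ALL=C tr -d '\377\376')
printf '%s\n' "$stripped"
```

Because `tr` works on a list of characters rather than a regular expression, no bracket expression or escape handling is involved.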

Answer 3

Answered by polygenelubricants

The FF and FE bytes at the beginning of your file are what is called a "byte order mark (BOM)". It can appear at the start of Unicode text streams to indicate the endianness of the text. FF FE indicates little-endian UTF-16.
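When the mark sits only at the very start of a file, it can be detected and dropped without any regex. The sketch below is an illustration under that assumption; it does not handle marks that recur later in the file, as the question describes. The file and variable names are made up. Note also that the rest of the sample's hex dump is single-byte ASCII, so the data after the mark does not look like UTF-16.

```shell
f=$(mktemp "${TMPDIR:-/tmp}/bom.XXXXXX")
printf '\377\37658832520.3,ABC\n348384,DEF\n' > "$f"

# Read only the first two bytes (-N 2) as hex to see whether the file starts with FF FE.
lead=$(od -An -v -tx1 -N 2 "$f" | tr -d ' \n')

if [ "$lead" = "fffe" ]; then
    # tail -c +3 prints from the third byte onward, dropping the 2-byte mark.
    body=$(tail -c +3 "$f")
else
    body=$(cat "$f")
fi
printf '%s\n' "$body"

rm -f "$f"
```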

Here's an excerpt from the FAQ:

Q: How I should deal with BOMs?
A: Here are some guidelines to follow:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases,
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

References

unicode.org/FAQ/UTF BOM

Answer 4

Answered by dawg

On OS X, the Byte Order Mark is probably being read as a single word. Try either sed 's/^\xfffe//g' or sed 's/^\xfeff//g', depending on endianness.

Answer 5

Answered by Paused until further notice.

To show that this isn't an issue of the Unicode BOM, but an issue of eight-bit versus seven-bit characters and tied to the locale, try this:

Show all the bytes:

$ printf '123 abc\xff\xfe\x7f\x80' | hexdump -C
00000000  31 32 33 20 61 62 63 ff  fe 7f 80                 |123 abc....|

Have sed remove characters that aren't alphanumeric in the user's locale. Notice that only the space and 0x7f are removed:

$ printf '123 abc\xff\xfe\x7f\x80'|sed 's/[^[:alnum:]]//g' | hexdump -C
00000000  31 32 33 61 62 63 ff fe  80                       |123abc...|

Have sed remove characters that aren't alphanumeric in the C locale. Notice that only "123abc" remains:

$ printf '123 abc\xff\xfe\x7f\x80'|LANG=C sed 's/[^[:alnum:]]//g' | hexdump -C
00000000  31 32 33 61 62 63                                 |123abc|

Answer 6

Answered by bashfu

As an alternative, you may use ed(1):

printf '%s\n' H $'g/[\xff\xfe]/s///g' ',p' | ed -s test.csv

printf '%s\n' H $'g/[\xff\xfe]/s///g' wq | ed -s test.csv  # in-place edit

Answer 7

Answered by schoetbi

You can match the bytes by their hex codes, \xff and \xfe, and replace them with nothing.
