bash: replacing null bytes with `sed` and `tr`

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42592601/

Date: 2020-09-18 15:51:09  Source: igfitidea

Replacing null bytes with `sed` vs `tr`

bash sed

Asked by

Bash newbie; using this idiom to generate repeats of a string:

echo $(head -c $numrepeats /dev/zero | tr '\0' 'S')

I decided I wanted to replace each null byte with more than one character (e.g. 'MyString' instead of just 'S'), so I tried the following with sed:

echo $(head -c $numrepeats /dev/zero | sed 's/\0/MyString/g' )

But I just get an empty output. I realized I have to do

echo $(head -c $numrepeats /dev/zero | sed 's/\x0/MyString/g' )

or

echo $(head -c $numrepeats /dev/zero | sed 's/\x00/MyString/g' )

instead, but I don't understand why. What is the difference between the characters that `tr` and `sed` match? Is it because `sed` is matching against a regex?

Edit: Interesting discovery: `\0` in the replacement portion of the sed command 's/regexp/replacement' actually behaves the same as `&`. That still doesn't explain why `\0` in the regexp doesn't match the null byte, though (as it does in `tr` and most other regex implementations).

Accepted answer by linuxfan says Reinstate Monica

From the manual page of tr(1):

SETs are specified as strings of characters ... Interpreted sequences are:
\NNN character with octal value NNN (1 to 3 octal digits)
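
As a quick check of the quoted rule, the sketch below feeds tr a three-byte string containing a NUL and spells the NUL once as \0 and once as \000. The variable names are made up for illustration; both spellings should give the same result, since each is an octal escape of one to three digits.

```shell
# Input is the three bytes 'a', NUL, 'b' (printf understands \000 in its format).
short=$(printf 'a\000b' | tr '\0' '-')
long=$(printf 'a\000b' | tr '\000' '-')

echo "$short"   # a-b
echo "$long"    # a-b
```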

For sed(1), the manual page is less clear, so a few experiments help:

echo -n hi | sed 's/h/t/g' | hexdump -c    # 0000000   t   i

Easy. Then:

echo -n hi | sed 's/h//g' | hexdump -c     # 0000000   i

An empty replacement deletes the match. Again easy. Then:

echo -n hi | sed 's/h/\0/g' | hexdump -c   # 0000000   h   i

This \0 seems to do nothing. So try:

echo -n hi | sed 's/h/\00/g' | hexdump -c  # 0000000   h   0   i

Oh! Could it take \0 as a reference to the matched part? That would also explain the previous example. The sed man page talks about \1 to \9, not \0 (but \0 has a meaning anyway, even in the pattern specification).
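
One way to test that guess is to wrap the reference in brackets, so it is visible where the matched text ends up. This is only a sketch and assumes GNU sed: the \0 behaviour is not described in the man page, and other sed implementations may differ. The variable names are illustrative.

```shell
# & is the documented reference to the whole match.
with_amp=$(printf 'hi' | sed 's/h/[&]/')
# \0 is undocumented; on GNU sed it appears to behave like &.
with_zero=$(printf 'hi' | sed 's/h/[\0]/')

echo "$with_amp"    # [h]i
echo "$with_zero"   # expected to match the line above on GNU sed
```

If the two lines agree, \0 in the replacement is acting as a reference to the match rather than as a NUL character.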

So, to cut it short: for sed, \0 has a special meaning, which is not a NUL char. But it understands octal:

echo -n hi | sed 's/h/\o0/g' | hexdump -c  # 0000000  \0   i

and hexadecimal:

echo -n hi | sed 's/h/\x0/g' | hexdump -c  # 0000000  \0   i

As pointed out in the comments, tr and sed are different tools, designed differently. Yes, sed uses regexps while tr does not, but that is not the general explanation for why \0 is interpreted differently. In the messy world of unix there are, often, some conventions. In the messy world of unix there are, more often, exceptions to those conventions.

Answer by gregory

Specious question: there is no `tr` and `sed` per se. Rather, there are versions of these programs across time and OS platforms. Generally speaking, UNIX's history is a rapid florescence of variation; more specifically, `tr` was released for Version 4 Unix in 1973, while `sed` first appeared in Version 7 Unix in 1979. From the get-go, these were written by different authors, on different OSes, for different shells, with different purposes (note: Bash was written much later, in 1989, and is NOT the "owner" of either of these utilities). And things only get more varied and complex in terms of how these programs independently evolved, how they were maintained (again by different authors), how and which bugs were fixed, etc. While much effort has been made of late to standardize core utilities, assuming that `sed` and `tr` would treat characters in exactly the same way fails to grok the history, the troublesome lack of standards, and the strangely beneficial plurality of UNIX itself.

Answer by Scheff

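The latter two commands in the question do work:

$ sed --version
sed (GNU sed) 4.4
Packaged by Cygwin (4.4-1)

$ echo -e "Hello\0World" | hexdump.exe -c
0000000   H   e   l   l   o  \0   W   o   r   l   d  \n
000000c

$ echo -e "Hello\0World" | sed 's/\x0/MyString/g'
HelloMyStringWorld

$ echo -e "Hello\0World" | sed 's/\x00/MyString/g'
HelloMyStringWorld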

Octal sequences have to be prefixed by \o (thanks, Benjamin W., for this hint):

$ echo -e "Hello\0World" | sed 's/\o0/MyString/g'
HelloMyStringWorld

Thus, there must be another issue in the OP.
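
To check this on your own machine, here is a minimal sketch of the pipeline from the question, once with tr and once with the \x00 form. It assumes GNU sed and uses an arbitrary repeat count of 3; quoting "$numrepeats" and capturing the results in variables are my additions, not part of the original commands.

```shell
numrepeats=3   # arbitrary count for the demo

# One output character per NUL byte: tr is enough.
single=$(head -c "$numrepeats" /dev/zero | tr '\0' 'S')

# Several output characters per NUL byte: sed with a hex escape (GNU sed assumed).
multi=$(head -c "$numrepeats" /dev/zero | sed 's/\x00/MyString/g')

echo "$single"   # SSS
echo "$multi"    # MyStringMyStringMyString on GNU sed
```

If the second line comes out empty, the sed on that system probably handles the \x00 escape or NUL bytes differently from GNU sed.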