Bash: Split text-file into words with non-alphanumeric characters as delimiters
Original question: http://stackoverflow.com/questions/3791567/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Sv1
Let's say "textfile" contains the following:
lorem$ipsum-is9simply the.dummy text%of-printing
and that you want to print each word on a separate line. However, words should be defined not only by spaces, but by all non-alphanumeric characters. So the results should look like:
lorem
ipsum
is9simply
the
dummy
text
of
printing
How can I accomplish this using the Bash shell?
Some notes:
This is not a homework question.
The simpler case, where words are delimited only by spaces, is easy. Just writing:
for i in `cat textfile`; do echo $i; done;
will do the trick, and return:
lorem$ipsum-is9simply
the.dummy
text%of-printing
For splitting words by non-alphanumeric characters I have seen solutions that use the IFS environment variable (links below), but I would like to avoid using IFS for two reasons: 1) it would require (I think) setting IFS to a long list of non-alphanumeric characters. 2) I find it kind of ugly.
Here are the two related Q&As I found:
How do I split a string on a delimiter in Bash?
How to split a line into words separated by one or more spaces in bash?
Answered by Jonathan Leffler
Use the tr command:
tr -cs 'a-zA-Z0-9' '\n' <textfile
The '-c' is for the complement of the specified characters; the '-s' squeezes each run of repeated replacement characters into one; the 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add _ too?); the '\n' is the replacement character (newline). You could also use a character class which is locale sensitive (and may include more characters than the list above):
tr -cs '[:alnum:]' '\n' <textfile
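As a rough, self-contained sketch of the tr approach above, the snippet below feeds the sample line from the question through the same command. The function name split_words is a hypothetical helper introduced only for this illustration; it is not part of the original answer.

```shell
#!/usr/bin/env bash
# split_words is a hypothetical helper name, not part of the original answer.
split_words() {
  # -c: complement the set, i.e. match every character that is NOT alphanumeric
  # -s: squeeze each run of replaced characters into a single newline
  tr -cs 'a-zA-Z0-9' '\n'
}

# Sample line taken from the question; the result is one word per line.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words
```

Note that if the input starts with a delimiter, tr still emits one leading empty line, since it only squeezes repeats and does not strip them.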
Answered by DigitalRoss
$ awk -f splitter.awk < textfile
$ cat splitter.awk
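{
  # Split the current line ($0) on every non-alphanumeric character.
  count0 = split($0, asplit, "[^a-zA-Z0-9]")
  # Print each resulting field on its own line.
  for(i = 1; i <= count0; ++i) { print asplit[i] }
}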
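One thing to keep in mind with the awk script above: because the separator matches a single character, two adjacent delimiters (say "--" or a double space) produce an empty field and therefore a blank output line. A possible variant, offered only as a sketch, treats a run of delimiters as one separator and skips empty fields. The function name split_words_awk and the array name parts are hypothetical names chosen for this illustration.

```shell
#!/usr/bin/env bash
# split_words_awk is a hypothetical helper name used only for this sketch.
split_words_awk() {
  awk '{
    # The trailing + treats a run of delimiters as one separator.
    n = split($0, parts, "[^a-zA-Z0-9]+")
    for (i = 1; i <= n; ++i)
      if (parts[i] != "") print parts[i]   # skip empty leading/trailing fields
  }'
}

# Input with doubled delimiters; each word should appear once, with no blank lines.
printf '%s\n' 'lorem$ipsum--is9simply  the.dummy' | split_words_awk
```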

