Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3791567/

Time: 2020-09-17 22:41:46  Source: igfitidea

Bash: Split text-file into words with non-alphanumeric characters as delimiters

parsing, bash, scripting

Asked by Sv1

Let's say "textfile" contains the following:

lorem$ipsum-is9simply the.dummy text%of-printing

and that you want to print each word on a separate line. However, words should be defined not only by spaces, but by all non-alphanumeric characters. So the results should look like:

 lorem
 ipsum  
 is9simply  
 the  
 dummy  
 text  
 of  
 printing

How can I accomplish this using the Bash shell?

Some notes:

  • This is not a homework question.

  • The simpler case, where words are delimited only by whitespace, is easy. Just writing:

    for i in `cat textfile`; do echo $i; done;
    

    will do the trick, and return:

     lorem$ipsum-is9simply
     the.dummy
     text%of-printing
    

    For splitting words by non-alphanumeric characters, I have seen solutions that use the IFS variable (links below), but I would like to avoid IFS for two reasons: 1) it would require (I think) setting IFS to a long list of non-alphanumeric characters; 2) I find it kind of ugly.

  • Here are the two related Q&As I found:
    How do I split a string on a delimiter in Bash?
    How to split a line into words separated by one or more spaces in bash?

Answered by Jonathan Leffler

Use the tr command:

tr -cs 'a-zA-Z0-9' '\n' <textfile

The '-c' option takes the complement of the specified characters; '-s' squeezes runs of the replacement character into one; 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add '_' too?); '\n' is the replacement character (newline). You could also use a character class, which is locale-sensitive (and may include more characters than the list above):

tr -cs '[:alnum:]' '\n' <textfile
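
As a minimal sketch of how the tr approach might be wrapped for reuse, the snippet below defines a hypothetical helper, split_words (the name and the sed filter are assumptions for illustration, not part of the answer). It reads standard input, applies the same tr -cs call, and drops the empty first line that tr emits when the input begins with a non-alphanumeric character.

```shell
# Hypothetical helper (not from the answer): print each alphanumeric word
# from standard input on its own line.
split_words() {
  # -c: complement the set, -s: squeeze repeated newlines in the output.
  # sed removes the empty line produced when input starts with a delimiter.
  tr -cs '[:alnum:]' '\n' | sed '/^$/d'
}

# Demonstration with the sample line from the question.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words
```

With the question's file, this could be called as: split_words < textfile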

Answered by DigitalRoss

$ awk -f splitter.awk < textfile

$ cat splitter.awk
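{
  # Split the current line ($0) into asplit[] on each non-alphanumeric character.
  count0 = split($0, asplit, "[^a-zA-Z0-9]")
  # Print each piece on its own line.
  for(i = 1; i <= count0; ++i) { print asplit[i] }
}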
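
The script above splits on single characters, so a run of delimiters such as "--" produces empty output lines. A possible inline variant, sketched here under the assumption that empty pieces should be skipped, wraps awk in a hypothetical shell function split_words_awk; the "+" in the pattern and the length check are additions, not part of the original answer.

```shell
# Hypothetical inline variant of splitter.awk (name and changes are assumptions).
split_words_awk() {
  awk '{
    # Split on runs of one or more non-alphanumeric characters.
    n = split($0, parts, "[^a-zA-Z0-9]+")
    for (i = 1; i <= n; i++) {
      # Skip the empty piece left by a leading or trailing delimiter.
      if (length(parts[i]) > 0) print parts[i]
    }
  }'
}

# Demonstration with the sample line from the question.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words_awk
```

With the question's file, this could be called as: split_words_awk < textfile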