Bash: split a text file into words using non-alphanumeric characters as delimiters

Question

Asked by Sv1

Let's say "textfile" contains the following:

lorem$ipsum-is9simply the.dummy text%of-printing

and that you want to print each word on a separate line. However, words should be defined not only by spaces, but by all non-alphanumeric characters. So the results should look like:

 lorem
 ipsum  
 is9simply  
 the  
 dummy  
 text  
 of  
 printing

How can I accomplish this using the Bash shell?

Some notes:

This is not a homework question.
The simpler case, where words are delimited only by spaces, is easy. Just writing:
```
for i in `cat textfile`; do echo $i; done;
```
will do the trick, and return:
```
 lorem$ipsum-is9simply
 the.dummy
 text%of-printing
```
For splitting words by non-alphanumeric characters I have seen solutions that use the IFS shell variable (links below), but I would like to avoid using IFS for two reasons: 1) it would require (I think) setting IFS to a long list of non-alphanumeric characters; 2) I find it kind of ugly.
Here are the two related Q&As I found:
How do I split a string on a delimiter in Bash?
How to split a line into words separated by one or more spaces in bash?

Answer 1

Answered by Jonathan Leffler

Use the tr command:

tr -cs 'a-zA-Z0-9' '\n' <textfile

The '-c' option takes the complement of the specified characters; '-s' squeezes runs of the replacement character into a single occurrence; 'a-zA-Z0-9' is the set of alphanumeric characters (maybe add '_' too?); '\n' is the replacement character (newline). You could also use a character class, which is locale-sensitive (and may include more characters than the list above):

tr -cs '[:alnum:]' '\n' <textfile
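
As a quick check, the tr approach can be tried against the sample line from the question. This is only a minimal sketch: the temporary file and the variable names (textfile, words, words_underscore) are illustrative assumptions, not part of the original answer.

```shell
# Recreate the sample input from the question in a temporary file.
textfile=$(mktemp)
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' > "$textfile"

# -c complements the set; -s squeezes each run of replacement newlines into one.
words=$(tr -cs 'a-zA-Z0-9' '\n' < "$textfile")
printf '%s\n' "$words"

# Variant: also treat the underscore as a word character.
words_underscore=$(printf 'foo_bar baz\n' | tr -cs 'a-zA-Z0-9_' '\n')
printf '%s\n' "$words_underscore"

rm -f "$textfile"
```

One caveat worth keeping in mind: if the input begins with a non-alphanumeric character, the first output line will be empty, because the leading delimiters are still turned into a newline.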

Answer 2

Answered by DigitalRoss

$ awk -f splitter.awk < textfile

$ cat splitter.awk
{
  count0 = split($0, asplit, "[^a-zA-Z0-9]")
  for(i = 1; i <= count0; ++i) { print asplit[i] }
}
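
Splitting on a single non-alphanumeric character, as the script above does, produces empty pieces whenever two delimiters are adjacent (or a line starts or ends with one). The sample in the question has no adjacent delimiters, so it is unaffected, but a possible variation that skips the empty pieces is sketched below. The inline awk program, the sample string, and the names sample, words, and parts are assumptions made for illustration.

```shell
# Input with adjacent delimiters ("--" and ", ") to show the empty-field case.
sample='alpha--beta, gamma9'

# Split each line on single non-alphanumeric characters,
# then print only the non-empty pieces.
words=$(printf '%s\n' "$sample" | awk '{
  n = split($0, parts, "[^a-zA-Z0-9]")
  for (i = 1; i <= n; ++i) if (parts[i] != "") print parts[i]
}')
printf '%s\n' "$words"
```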
