bash: each word on a separate line

Question

Asked by Nathan Pk

I have a sentence like

This is for example

I want to write this to a file such that each word in the sentence is written on a separate line.

How can I do this in shell scripting?

Answer 1

Answered by sampson-chen

There are a couple of ways to go about it; choose your favorite!

echo "This is for example" | tr ' ' '\n' > example.txt

or simply do this to avoid using echo unnecessarily:

tr ' ' '\n' <<< "This is for example" > example.txt

The <<< notation is a here-string.

Or, use sed instead of tr (this relies on GNU sed, which interprets \n in the replacement as a newline):

sed "s/ /\n/g" <<< "This is for example" > example.txt

For still more alternatives, check others' answers =)
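The tr and sed commands above replace each single space, so input with doubled spaces or tabs would produce empty lines. A minimal sketch of one way around that, assuming a tr that supports POSIX character classes and the -s (squeeze) option; the irregularly spaced input is a made-up example:

```shell
# Hypothetical input with irregular spacing: two spaces and a tab.
# [:space:] matches spaces, tabs and newlines; -s squeezes each run of
# translated characters into a single newline, so no empty lines appear.
printf 'This  is\tfor example\n' |
    tr -s '[:space:]' '\n' > example.txt

cat example.txt
```

Whether this is preferable depends on whether empty fields in the input are meaningful to you.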

Answer 2

Answered by Sepero

$ echo "This is for example" | xargs -n1
This
is
for
example
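To get the same result into a file, a redirect can be added. One caveat worth knowing: xargs interprets quotes and backslashes in its input, so quoted text stays together as one argument. A small sketch of both points; the file names and the quoted input are assumptions for illustration:

```shell
# With no command given, xargs runs echo; -n1 passes one argument per call,
# so each word ends up on its own line.
echo "This is for example" | xargs -n1 > example.txt

# Hypothetical input with quotes: xargs treats "hello world" as ONE argument,
# so it is not split at the space.
echo 'say "hello world" now' | xargs -n1 > quoted.txt

cat example.txt quoted.txt
```

If the input may contain quotes or apostrophes, the tr approach from the first answer is likely the safer choice.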

Answer 3

Answered by Gilles Quenot

Try using:

string="This is for example"

printf '%s\n' $string > filename.txt

or, taking advantage of bash word splitting:

string="This is for example"

for word in $string; do
    echo "$word"
done > filename.txt
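Both variants rely on leaving $string unquoted, which also makes the shell expand glob characters such as * against the files in the current directory. A minimal sketch of one way to keep the word splitting while switching globbing off, assuming a made-up input that contains a *:

```shell
# Hypothetical input containing a glob character.
string="This is * for example"

set -f                                  # disable pathname expansion
printf '%s\n' $string > filename.txt    # unquoted on purpose: word splitting still applies
set +f                                  # restore pathname expansion

cat filename.txt
```

Without set -f, the * would be replaced by the names of the files in the working directory (if there are any).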

Answer 4

Answered by Jonathan Leffler

example="This is for example"
printf "%s\n" $example

Answer 5

Answered by koola

Try using:

str="This is for example"
echo -e "${str// /\\n}" > file.out

Output:

> cat file.out 
This
is
for
example
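The same bash parameter expansion can also insert real newlines directly, which avoids depending on echo -e. This is a sketch assuming bash (the ${var//pattern/replacement} form is not available in plain sh); the helper variable nl is introduced here only for illustration:

```shell
str="This is for example"

# nl holds a literal newline character ($'...' is bash ANSI-C quoting).
nl=$'\n'

# Replace every space in str with a newline, then print the result.
printf '%s\n' "${str// /$nl}" > file.out

cat file.out
```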

Answer 6

Answered by Pryftan

N.B. I wrote this in a few drafts, simplifying the regexp along the way, so if there's any inconsistency that's probably why.

Do you care about punctuation marks? For example, in some invocations you would see a 'word' like (etc) exactly like that, with the parentheses. Or the word would be 'parentheses.' rather than 'parentheses'. If you're parsing a file with proper sentences, that could be a problem, especially if you want to sort by word or get a count for each word.

There are ways to deal with this, but there are some caveats and certainly there's room for improvement. These have to do with numbers, dashes (in numbers) and decimal points/dots (in numbers). Perhaps having an exact set of rules would help resolve this, but the examples below can give you some things to work on. I have made some contrived input examples to demonstrate these flaws (or whatever you wish to call them).

$ echo "This is an example sentence with punctuation marks and digits i.e. , . ; \! 7 8 9" | grep -o -E '\<[A-Za-z0-9.]*\>'
This
is
an
example
sentence
with
punctuation
marks
and
digits
i.e
7
8
9

As you can see, i.e. turns out to be just i.e and the other punctuation marks are not shown. Okay, but this leaves out things like version numbers in the form major.minor.revision-release, e.g. 0.0.1-1; can this be shown too? Yes:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[-A-Za-z0-9.]*\>'
The
current
version
is
0.0.1-1
The
previous
version
was
current
from
2017-2018

Observe that the full stops at the end of the sentences are not part of the output. What happens if you add a space between the years and the dash? You won't have the dash, but each year will be on its own line:

$ echo "2017 - 2018" | grep -o -E '\<[-A-Za-z0-9.]*\>'
2017
2018

The question then becomes whether you want dashes by themselves to be counted; by the very nature of separating words, you won't have the years as a single string if there are spaces. Because a dash is not a word by itself, I would think not.

I am sure these could be simplified further. In addition, if you don't want any punctuation or numbers at all you could change it to:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is
The
previous
version
was
current
from

If you wanted to have the numbers:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
The
previous
version
was
current
from
2017
2018

As for 'words' with both letters and numbers, that's another thing that might or might not be a consideration, but to demonstrate with the pattern above:

$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
test1

That outputs them. But the following does not (because it doesn't consider numbers at all):

$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is

It's quite easy to disregard punctuation marks, but in some cases there might be a need or desire for them. In the case of e.g., I suppose you could use, say, sed to change lines like e.g to e.g., but that would be a personal preference, I guess.

I can summarise how it works, but only just; I'm far too tired to even consider much:

How does it work?

I will only explain the invocation grep -o -E '\<[-A-Za-z0-9.]*\>' but much of it is the same in the others (the vertical bar/pipe symbol in extended grep allows for more than one pattern):

The -o option is for printing only the matches rather than the entire line. The -E is for extended grep (I could just as well have used egrep). As for the regexp itself:

The \< and \> are word boundaries (beginning and ending respectively; you can specify only one if you want). I believe the -w option is the same as specifying both, but maybe the invocation is a bit different (I don't actually know).

The '\<[-A-Za-z0-9.]*\>' says dashes, upper and lower case letters, digits and a dot, zero or more times. As for why it then turns e.g. into e.g, at this time I can only say it is the pattern, but I do not have the faculties to consider it more.
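On the trailing dot: a likely explanation is that \> can only match directly after a word character (a letter, digit or underscore), so a match can never end in a dot. The sketch below contrasts the bounded pattern with an unbounded one. It assumes GNU grep (the \< and \> anchors are an extension), and the input sentence and output file names are made up for illustration:

```shell
# Hypothetical input with an abbreviation and a version number.
input="See e.g. version 0.0.1-1."

# With word boundaries: every match must end right after a word character,
# so the trailing dots are dropped.
echo "$input" | grep -o -E '\<[-A-Za-z0-9.]*\>' > bounded.txt

# Without word boundaries: the character class matches greedily,
# so the trailing dots are kept.
echo "$input" | grep -o -E '[-A-Za-z0-9.]+' > unbounded.txt

cat bounded.txt unbounded.txt
```

The unbounded form keeps e.g. intact but also keeps the full stop after 0.0.1-1, so neither pattern is right for every input.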

Bonus script for word frequency count

#!/bin/bash

# Print usage and exit when no file arguments are given.
if [ $# -eq 0 ]; then
    echo "Usage: $(basename "$0") <FILE> [FILE...]"
    exit 1
fi

# With no 'in' list, 'for' iterates over the positional parameters.
for file do
    if [ -e "${file}" ]
    then
        echo "** ${file}: "
        # Extract the words, count duplicates, then sort by frequency.
        grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}"|sort | uniq -c | sort -rn
    else
        echo >&2 "${file}: file not found"
        continue
    fi
done

Example:

$ cat example
The current version is 0.0.1-1 but the previous version was non-existent.

This sentence contains an abbreviation i.e. e.g. (so actually two abbreviations).

This sentence has no numbers and no punctuation
$ ./wordfreq example
** example: 
   2 version
   2 sentence
   2 no
   2 This
   1 was
   1 two
   1 the
   1 so
   1 punctuation
   1 previous
   1 numbers
   1 non-existent
   1 is
   1 i.e
   1 has
   1 e.g
   1 current
   1 contains
   1 but
   1 and
   1 an
   1 actually
   1 abbreviations
   1 abbreviation
   1 The
   1 0.0.1-1

N.B. I didn't transliterate upper case to lower case, so the words 'The' and 'the' show up as different words. If you wanted them to be all lower case you could change the grep invocation in the script to be piped to tr before sorting:

    grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}"|tr '[A-Z]' '[a-z]'|sort | uniq -c | sort -rn

Oh, and since you asked about writing it to a file, you can just add this to the command line (this is for the raw invocation):

> output_file

For the script you would use it like:

$ ./wordfreq file1 file2 file3 > output_file
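To try the case-folding variant without the full script, the pipeline can be run on a small file. This is only a sketch: sample.txt, counts.txt and the sentence are made up, GNU grep is assumed for \< and \>, and tr is given POSIX character classes rather than the A-Z ranges used above:

```shell
# Hypothetical one-line input file.
printf 'The cat saw the other cat.\n' > sample.txt

# Extract the words, fold them to lower case, then count and sort by frequency.
grep -o -E '\<[-A-Za-z0-9.]*\>' sample.txt |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn > counts.txt

cat counts.txt
```

Here 'The' and 'the' are counted together as one word with a count of 2.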
