bash: sort a text file by line length (including spaces)

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/5917576/

Date: 2020-09-09 20:29:26  Source: igfitidea

Sort a text file by line length including spaces

Tags: bash, sorting, text, awk

Asked by gnarbarian

I have a CSV file that looks like this

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56

I need to sort it by line length including spaces. The following command doesn't include spaces, is there a way to modify it so it will work for me?

cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'

Answered by neillb

Answer

cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:

cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-

In both cases, we have solved your stated problem by moving away from awk for your final cut.
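A minimal sketch of the stable pipeline above, wrapped in a shell function so it can be reused. The function name sort_by_length is hypothetical, and the sketch assumes a sort that supports -s (GNU and BSD sort both do):

```shell
# Hypothetical helper: reads lines on stdin, writes them shortest first.
sort_by_length() {
  awk '{ print length($0), $0 }' |  # decorate: "<length> <original line>"
    sort -n -s |                    # numeric sort on the length; -s keeps ties in input order
    cut -d' ' -f2-                  # undecorate: drop the length, keep the line untouched
}

printf '%s\n' 'a   b' 'xy' 'a b c d' 'pq' | sort_by_length
```

Because cut only removes the first space-delimited field, runs of spaces inside the line survive, and 'xy' stays ahead of 'pq' because the sort is stable.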

Lines of matching length - what to do in the case of a tie:

The question did not specify whether further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and to keep them in the relative order in which they occur in the input.

(Those who want more control over sorting these ties might look at sort's --key option.)
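As a rough illustration of the --key route, the sketch below sorts by length first and breaks ties by the text of the line. The function name sort_by_length_then_text is hypothetical, and LC_ALL=C is an assumption made here so that the tie-break is plain byte order:

```shell
# Hypothetical helper: length first, then byte-wise text order for ties.
sort_by_length_then_text() {
  awk '{ print length($0), $0 }' |  # decorate with the line length
    LC_ALL=C sort -k1,1n -k2 |      # key 1: length, numeric; key 2: rest of the line, as text
    cut -d' ' -f2-                  # drop the length again
}

printf '%s\n' 'ccz' 'ccb' 'g' 'cca' | sort_by_length_then_text
```

Here -k1,1n limits the numeric comparison to the first field, and -k2 compares everything from the second field to the end of the line.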

Why the question's attempted solution fails (awk line-rebuilding):

It is interesting to note the difference between:

echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'

They yield respectively

hello   awk   world
hello awk world

The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc.) when you change one field. I guess it's not crazy behaviour. It has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"

$1 = $1   # force record to be reconstituted
print $0  # or whatever else with $0

"This forces awk to rebuild the record."
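The rebuild is easier to see with a non-default OFS. The following is a small sketch; the variable names are made up for illustration:

```shell
line='a   b    c'

# Untouched record: $0 is printed exactly as it was read.
untouched=$(printf '%s\n' "$line" | awk '{ print }')

# Assigning to a field (even to itself) makes awk rebuild $0
# from the fields, joined by OFS (a single space by default).
rebuilt=$(printf '%s\n' "$line" | awk '{ $1 = $1; print }')

# With OFS set to "-" the rebuilt record uses dashes instead.
dashed=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')

printf '%s\n' "$untouched" "$rebuilt" "$dashed"
```

This is why clearing $1 in the question's final awk step collapses every run of spaces in the line: the whole record is rebuilt with single-space separators.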

Test input including some lines of equal length:

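aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g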

Answered by Caleb

The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there. But if you just want to get the job done quickly and don't care what tool you use, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:

perl -e 'print sort { length($a) <=> length($b) } <>'

You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.

In my case I needed the longest lines first, so I swapped $a and $b in the comparison.

Answered by anubhava

Try this command instead:

awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-

Answered by Chris Koknat

Benchmark results

Below are the results of a benchmark across solutions from other answers to this question.

Test method

  • 10 sequential runs on a fast machine, averaged
  • Perl 5.24
  • awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
  • The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)

Results

  1. Caleb's perl solution took 11.2 seconds
  2. my perl solution took 11.6 seconds
  3. neillb's awk solution #1 took 20 seconds
  4. neillb's awk solution #2 took 23 seconds
  5. anubhava's awk solution took 24 seconds
  6. Jonathan's awk solution took 25 seconds
  7. Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.

Extra Perl option

Also, I've added another Perl solution:

perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file

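Answered by Fritz G. Mehner

Pure Bash:

declare -a sorted

while read line; do
  if [ -z "${sorted[${#line}]}" ] ; then          # does line length already exist?
    sorted[${#line}]="$line"                      # element for new length
  else
    sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
  fi
done < data.csv

for key in ${!sorted[*]}; do                      # iterate over existing indices
  echo -e "${sorted[$key]}"                       # echo lines with equal length
done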

Answered by Jonathan Leffler

The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).

awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'

The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:

awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
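A small sketch of the colon-delimited variant, showing that leading spaces and lines that themselves begin with digits and colons come through intact. The function name sort_by_length_colon and the sample lines are made up for illustration:

```shell
# Hypothetical wrapper around the colon-delimited pipeline.
sort_by_length_colon() {
  awk '{ printf "%d:%s\n", length($0), $0 }' "$@" |  # decorate as "<length>:<line>"
    sort -n |                                        # order by the leading number
    sed 's/^[0-9]*://'                               # strip only the first "<digits>:" prefix
}

printf '%s\n' '12:34 x' '  7' 'abcdefghij' | sort_by_length_colon
```

The sed pattern is anchored at the start of the line and stops at the first colon, so a line such as '12:34 x' loses only the length prefix that awk added.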

Answered by Steven Penny

With POSIX Awk:

{
  c = length
  m[c] = m[c] ? m[c] RS $0 : $0
} END {
  for (c in m) print m[c]
}

Example

Answered by Markus Amalthea Magnuson

I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):

awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-

Answered by Michael Yuniverg

1) Pure awk solution (this prints only the shortest line). Suppose no line is longer than 1024 characters:

cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'

2) One-liner bash solution, assuming every line is a single word; it can be reworked for any case where all lines have the same number of words:

LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1

Answered by Quinn Comendant

Here is a multibyte-compatible method of sorting lines by length. It requires:

  1. wc -m is available to you (macOS has it).
  2. Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
  3. testfile has a character encoding matching your locale (e.g., UTF-8).

Here's the full command:

cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-

Explaining part-by-part:

  • l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
  • cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
  • cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
  • close(cmd); ← closes the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
  • sub(/ */, "", c); ← trims white space from the character count value returned by wc.
  • { print c, $0 } ← prints the line's character count value, a space, and the original line.
  • | sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintains stable sort order (-s).
  • | cut -d" " -f2- ← removes the prepended character count values.
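The quoting step is the least obvious part, so here is a small sketch of it in isolation. The variable names and the sample line are made up for illustration; the gsub call is the one from the command above:

```shell
line="it's a test"

# Same gsub as in the full command: every ' becomes '"'"'
escaped=$(printf '%s\n' "$line" |
  awk '{ l = $0; gsub(/\047/, "\047\"\047\"\047", l); print l }')

printf '%s\n' "$escaped"

# Wrapped in single quotes, the escaped text is a shell word that
# evaluates back to the original line.
roundtrip=$(eval "printf '%s\n' '$escaped'")
printf '%s\n' "$roundtrip"
```

Each '"'"' sequence closes the single-quoted string, adds a double-quoted single quote, and reopens the single-quoted string, which is why the line can be passed safely to echo inside the generated command.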

It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.

Alternatively, do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).