bash: Sort a text file by line length (including spaces)
Original question: http://stackoverflow.com/questions/5917576/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Sort a text file by line length including spaces
Asked by gnarbarian
I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length including spaces. The following command doesn't include spaces. Is there a way to modify it so that it works for me?
cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Answered by neillb
Answer
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key option.)
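A small sketch of the tie-handling difference, assuming a sort that supports -s (GNU and BSD sort both do). The function names by_length_stable and by_length_resorted and the sample lines are made up for illustration:

```shell
# Hypothetical helpers (not from the original answer) wrapping the two pipelines.

# Stable: lines of equal length keep their input order.
by_length_stable() {
  awk '{ print length, $0 }' "$@" | sort -n -s | cut -d" " -f2-
}

# Not stable: ties fall back to sort's last-resort comparison of the whole line.
by_length_resorted() {
  awk '{ print length, $0 }' "$@" | sort -n | cut -d" " -f2-
}

printf 'zz\nb\naa\n' | by_length_stable    # b, zz, aa (input order of the ties kept)
printf 'zz\nb\naa\n' | by_length_resorted  # b, aa, zz (ties sorted against each other)
```

With -s the two-character lines come out in the order they went in; without it they are compared as whole lines, which is the "perhaps unintentional" sub-sorting mentioned above.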
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'
They yield respectively
hello   awk   world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc.) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1   # force record to be reconstituted
print $0  # or whatever else with $0
"This forces awk to rebuild the record."
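To see the rebuild in action, here is a minimal sketch. The sample string and the OFS value are arbitrary choices for illustration, not taken from the answer:

```shell
# Untouched record: $0 is printed exactly as read, runs of spaces included.
printf 'a   b    c\n' | awk '{ print }'                                # a   b    c

# Assigning to any field rebuilds $0 from the fields, joined by OFS (one space by default).
printf 'a   b    c\n' | awk '{ $1 = $1; print }'                       # a b c

# The rebuilt record uses whatever OFS holds at that moment.
printf 'a   b    c\n' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }'   # a-b-c

# length is taken from $0, so it only changes once the record has been rebuilt.
printf 'a   b    c\n' | awk '{ before = length($0); $1 = $1; print before, length($0) }'   # 10 5
```

This is why blanking $1 in the question's final awk step also collapses the spacing of the rest of the line, while cut leaves it alone.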
Test input including some lines of equal length:
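aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some       spaces
1   dd equal len. Orig pos = 3
ff
5   dd equal len. Orig pos = 4
g
Answered by Caleb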
The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there. But if you just want to get the job done quickly and don't care which tool you use, one solution is to use Perl's sort() function with a custom comparison routine over the input lines. Here is a one-liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped $a and $b in the comparison.
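For comparison, the same longest-first ordering can be sketched with the awk pipeline from the earlier answers by reversing the numeric key. The function name longest_first is hypothetical, and -s is assumed to be supported by your sort:

```shell
# Hypothetical helper: longest lines first, ties kept in input order.
longest_first() {
  # -k1,1nr sorts on the prepended length only, numerically, in reverse.
  awk '{ print length, $0 }' "$@" | sort -k1,1nr -s | cut -d" " -f2-
}

printf 'a\nccc\nbb\nddd\n' | longest_first
# ccc
# ddd
# bb
# a
```

Restricting the key to the first field keeps the line contents out of the comparison, so only the length decides the order.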
Answered by anubhava
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Answered by Chris Koknat
Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
- 10 sequential runs on a fast machine, averaged
- Perl 5.24
- awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
- The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
- Caleb's perl solution took 11.2 seconds
- my perl solution took 11.6 seconds
- neillb's awk solution #1 took 20 seconds
- neillb's awk solution #2 took 23 seconds
- anubhava's awk solution took 24 seconds
- Jonathan's awk solution took 25 seconds
- Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Extra perl option
Also, I've added another Perl solution:
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Answered by Fritz G. Mehner
Pure Bash:
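declare -a sorted

while IFS= read -r line; do                             # IFS= and -r keep spaces and backslashes intact
  if [ -z "${sorted[${#line}]}" ] ; then                # does line length already exist?
    sorted[${#line}]="$line"                            # element for new length
  else
    sorted[${#line}]="${sorted[${#line}]}"$'\n'"$line"  # append to lines with equal length
  fi
done < data.csv

for key in "${!sorted[@]}"; do                          # iterate over existing indices
  printf '%s\n' "${sorted[$key]}"                       # print lines with equal length
done
Answered by Jonathan Leffler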
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
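Because the awk command is handed "$@", the same pipeline works on named files or on standard input. A minimal sketch of wrapping it in a function, where the name sort_by_length is hypothetical and -s is added on the assumption that ties should keep their input order:

```shell
# Hypothetical wrapper around the length:line pipeline.
# With no arguments awk reads standard input; otherwise it reads the named files.
sort_by_length() {
  awk '{ printf "%d:%s\n", length($0), $0 }' "$@" | sort -n -s | sed 's/^[0-9]*://'
}

# Usage: either form avoids the extra cat process.
#   sort_by_length data.csv
#   sort_by_length < data.csv
printf 'three\n1\n 22\n' | sort_by_length
```

Since sed strips only the digits and the colon, a leading space such as the one in " 22" survives and is counted in the length.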
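An awk-only variant (this block has no accompanying answer text in the source):
{
    c = length
    m[c] = m[c] ? m[c] RS $0 : $0
} END {
    for (c in m) print m[c]
}
Answered by Markus Amalthea Magnuson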
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
Answered by Michael Yuniverg
1) Pure awk solution that prints the shortest line, assuming no line is longer than 1024 characters:
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) One-liner bash solution assuming every line is a single word; it can be reworked for any case where all lines have the same number of words:
LINES=$(cat filename); for k in $LINES; do printf '%s ' "$k"; echo "$k" | wc -L; done | sort -n -k2 | head -n 1 | cut -d " " -f1
Answered by Quinn Comendant
Here is a multibyte-compatible method of sorting lines by length. It requires:
- wc -m is available to you (macOS has it).
- Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
- testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
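cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part: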
- l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
- cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
- cmd | getline c; ← executes the command and copies the returned character count into awk variable c.
- close(cmd); ← closes the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
- sub(/ */, "", c); ← trims white space from the character count value returned by wc.
- { print c, $0 } ← prints the line's character count value, a space, and the original line.
- | sort -ns ← sorts the lines (by prepended character count values) numerically (-n), maintaining stable sort order (-s).
- | cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).