Linux: finding the longest word in a text file

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/8962466/

Date: 2020-08-06 04:09:55  Source: igfitidea

Finding the longest word in a text file

Tags: linux, bash, unix

Asked by Mildred Shimz

I am trying to write a simple bash script that finds the longest word in a text file and its length. I know this is simple and straightforward with awk, but I want to try this method instead. Say a=wmememememe: I know I can get its length with echo ${#a} and the word itself with echo ${a}. I want to apply that to the loop below:

for i in `cat so.txt`; do

Where so.txt contains words, I hope it makes sense.

Accepted answer by Paused until further notice.

Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.

#!/bin/bash
longest=0
for word in $(<so.txt)
do
    len=${#word}
    if (( len > longest ))
    then
        longest=$len
        longword=$word
    fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
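
For comparison, here is a minimal sketch of the while read form mentioned above. The function name longest_word_in and the inner read -a split are my own choices, not part of the original answer. One practical difference: for word in $(<so.txt) also applies glob expansion, so a word such as * would be replaced by file names, whereas reading into an array avoids that.

```shell
#!/bin/bash
# Sketch only: longest_word_in is a hypothetical helper name.
longest_word_in() {
    local file=$1 line word longword='' longest=0
    local -a words
    # IFS= and -r keep each line intact; the "|| [[ -n $line ]]" part
    # also processes a final line that lacks a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # Split the line on whitespace without glob expansion.
        read -r -a words <<< "$line"
        for word in "${words[@]}"; do
            if (( ${#word} > longest )); then
                longest=${#word}
                longword=$word
            fi
        done
    done < "$file"
    printf '%s %d\n' "$longword" "$longest"
}
```

Called as longest_word_in so.txt, it prints the word followed by its length. On ties the first word found wins, because the comparison is strictly greater-than.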

Answer by Rob Wouters

longest=""
for word in $(cat so.txt); do
    if [ ${#word} -gt ${#longest} ]; then
        longest=$word
    fi
done

echo "$longest"

Answer by jbleners

for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1
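
One caveat, as far as I can tell: paste pairs the Nth length with the Nth line of so.txt, so the pipeline above only lines up when so.txt holds exactly one word per line. A hedged sketch that first puts each word on its own line (the name longest_by_sort is made up for illustration):

```shell
#!/bin/bash
# Sketch only: longest_by_sort is a hypothetical name.
longest_by_sort() {
    # Turn every run of whitespace into a single newline: one word per line.
    tr -s '[:space:]' '\n' < "$1" |
    while IFS= read -r word || [ -n "$word" ]; do
        # Skip the empty line that leading whitespace can produce.
        if [ -n "$word" ]; then
            printf '%d %s\n' "${#word}" "$word"
        fi
    done |
    sort -n | tail -n 1
}
```

Because the length and the word are printed together, nothing has to be re-paired afterwards.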

Answer by Fritz G. Mehner

Another solution:

for item in  $(cat "$infile"); do
  length[${#item}]=$item          # use word length as index
done
maxword=${length[@]: -1}          # select last array element

printf "longest word '%s', length %d\n" "$maxword" "${#maxword}"

Answer by jaypal singh

awk script:

#!/usr/bin/awk -f

# Initialize two variables
BEGIN {
  maxlength=0;
  maxword=0
} 

# Loop through each word on the line
{
  for(i=1;i<=NF;i++) 

  # Assign the maxlength variable if length of word found is greater. Also, assign
  # the word to maxword variable.
  if (length($i)>maxlength) 
  {
    maxlength=length($i); 
    maxword=$i;
  }
}

# Print out the maxword and the maxlength  
END {
  print maxword,maxlength;
}

Textfile:

[jaypal:~/Temp] cat textfile 
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language 
consisting of a set of actions to be taken against textual data (either in files or data streams) 
for the purpose of producing formatted reports. 
The language used by awk extensively uses the string datatype, 
associative arrays (that is, arrays indexed by key strings), and regular expressions.

Test:

[jaypal:~/Temp] ./script.awk textfile 
data_extraction 15
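
If a separate script file is not wanted, the same per-field loop can be inlined in a shell function. This is only a sketch; the name longest_awk is an assumption of mine.

```shell
# Sketch only: longest_awk is a hypothetical wrapper name.
longest_awk() {
    awk '
        BEGIN { max = 0 }
        {
            # Check every whitespace-separated field on the line.
            for (i = 1; i <= NF; i++)
                if (length($i) > max) { max = length($i); word = $i }
        }
        END { print word, max }
    ' "$@"
}
```

It accepts file names as arguments (longest_awk textfile) or reads standard input when none are given.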

Answer by BlessedKey

Bash one-liner:

cat YOUR_FILENAME | sed 's/ /\n/g' | sort | uniq | awk '{print length, $0}' | sort -nr | head
  1. print the file (via cat)
  2. split the words (via sed)
  3. remove duplicates (via sort | uniq)
  4. prefix each word with its length (awk)
  5. sort the list by the word length
  6. print the words with greatest length.

Yes, this will be slower than some of the above solutions, but it also doesn't require remembering the semantics of bash for loops.
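
The six steps can also be sketched as a small function. The name longest_words and the optional count argument are assumptions of mine. It uses tr instead of sed for the split because, as far as I know, \n in a sed replacement is a GNU sed behaviour, and a single-space pattern does not split on tabs or repeated spaces.

```shell
# Sketch only: longest_words and its second argument are hypothetical.
longest_words() {
    tr -s '[:space:]' '\n' < "$1" |      # steps 1-2: read file, one word per line
    sort -u |                            # step 3: remove duplicates
    awk 'NF { print length($0), $0 }' |  # step 4: prefix each word with its length
    sort -nr |                           # step 5: sort by length, longest first
    head -n "${2:-10}"                   # step 6: keep the longest few
}
```

For example, longest_words YOUR_FILENAME 3 would print the three longest distinct words with their lengths.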

Answer by jimis

Slow because of the gazillion of forks, but pure shell, does not require awk or special bash features:

$ cat /usr/share/dict/words | \
    xargs -n1 -I '{}' -d '\n'   sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
    sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize

You can easily parallelize, e.g. to 4 CPUs, by providing -P4 to xargs.

EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of the -d argument.

EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:

cat /usr/share/dict/words | tr '\n' '\0' | \
    xargs -0 -I {} -n1 -P4 sh -c 'echo ${#1} "$1"' wordcount {} | \
    sort -n | tail
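
A short sketch of why handing the word to sh as a positional parameter is the safer pattern: when {} is substituted into the script text, the word becomes part of the shell code, so a quote or $ inside a word can break or change the command. Passed as $1 it is plain data. The function name count_lengths and the label wordlen are made up for illustration, and -P is left out here so the output order stays deterministic.

```shell
# Sketch only: count_lengths and "wordlen" are hypothetical names.
count_lengths() {
    # NUL-separate the words so xargs -0 passes each one through untouched.
    tr '\n' '\0' |
    # "wordlen" fills $0 of the inner shell; xargs appends the word as $1.
    xargs -0 -n1 sh -c 'printf "%d %s\n" "${#1}" "$1"' wordlen |
    sort -n | tail -n "${1:-10}"
}
```

Usage would be count_lengths 5 < /usr/share/dict/words to show the five longest entries.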

Answer by agc

  1. Relatively speedy bash function using no external utils:

    # Usage: longcount <  textfile
    longcount () 
    { 
        declare -a c;
        while read x; do
            c[${#x}]="$x";
        done;
        echo ${#c[@]} "${c[${#c[@]}]}"
    }

    Example:

    longcount < /usr/share/dict/words

    Output:

    23 electroencephalograph's

  2. Modified POSIX shell version of jimis' xargs-based answer; still very slow, takes two or three minutes:

    tr "'" '_'  < /usr/share/dict/words |
    xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' | 
    sort -n | tail | tr '_' "'"

    Note the leading and trailing tr bit to get around GNU xargs difficulty with single quotes.
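
One thing worth noting about longcount: ${#c[@]} is the number of array elements, which equals the longest length only when every length from 1 up to the maximum occurs in the input (likely true for a dictionary, not for an arbitrary file). A hedged variant that looks up the highest index instead; the name longcount_sparse is my own:

```shell
#!/bin/bash
# Sketch only: longcount_sparse is a hypothetical name.
# Usage: longcount_sparse < textfile   (one word per line)
longcount_sparse() {
    local -a c idx
    local x max
    while IFS= read -r x || [[ -n $x ]]; do
        # Index by word length; a later word of the same length overwrites.
        if [[ -n $x ]]; then
            c[${#x}]=$x
        fi
    done
    idx=("${!c[@]}")                  # indices of an indexed array, ascending
    (( ${#idx[@]} )) || return 1      # nothing was read
    max=${idx[${#idx[@]}-1]}          # highest index = longest length
    printf '%d %s\n' "$max" "${c[$max]}"
}
```

With input containing only words of length 2 and 5, this reports the length-5 word, whereas counting elements would point at index 2.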
