如何使用linux命令从纯文本文件中删除重复的单词
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/952268/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to remove duplicate words from a plain text file using linux command
提问by cupakob
I have a plain text file with words, which are separated by comma, for example:
我有一个带有单词的纯文本文件,用逗号分隔,例如:
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3
I want to delete the duplicates so that it becomes:
我想删除重复项,使其变成:
word1, word2, word3, word4, word5, word6, word7
Any ideas? I think egrep can help me, but I'm not sure how to use it exactly...
有任何想法吗?我觉得 egrep 也许能帮上忙,但我不确定具体该怎么用……
采纳答案by Randy Orrison
Assuming that the words are one per line, and the file is already sorted:
假设每行一个单词,并且文件已经排序:
uniq filename
If the file's not sorted:
如果文件未排序:
sort filename | uniq
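(sort can also do both steps at once with its -u option, which a later answer below uses as well:)
(sort 也可以用 -u 选项一步完成这两步,下面的一个回答也用到了它:)
sort -u filename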
If they're not one per line, and you don't mind them being one per line:
如果它们不是每行一个,并且您不介意它们每行一个:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
但是,这不会删除标点符号,所以也许您想要:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that removes the hyphen from hyphenated words. "man tr" for more options.
但这会从带连字符的单词中删除连字符。“man tr”以获得更多选项。
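For example, assuming the sample line from the question is saved in a file called words.txt (the name is only for illustration), the last pipeline gives one word per line:
例如,假设问题中的示例行保存在一个名为 words.txt 的文件中(文件名仅作示意),上面最后一个管道会得到每行一个单词:
tr -s '[:space:][:punct:]' '\n' < words.txt | sort | uniq
# prints word1 through word7, one per line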
回答by Oliver N.
ruby -pi.bak -e '$_ = $_.chomp.split(", ").uniq.join(", ") + "\n"' filename
?
I'll admit the two kinds of quotations are ugly.
我承认这两种引号混在一起很丑。
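A quick way to try it (the file name below is only an example): -i.bak edits the file in place and keeps the original as a .bak backup.
一个快速试用的方式(下面的文件名只是示例):-i.bak 会就地修改文件,并把原文件保存为 .bak 备份。
ruby -pi.bak -e '$_ = $_.chomp.split(", ").uniq.join(", ") + "\n"' words.txt
cat words.txt       # deduplicated in place
cat words.txt.bak   # the original, kept by -i.bak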
回答by Beano
I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.
我假设您希望单词在一行中是唯一的,而不是在整个文件中。如果是这种情况,那么下面的 Perl 脚本就可以解决问题。
while (<DATA>)
{
    chomp;
    my %seen = ();                   # words already printed on this line
    my @words = split(m!,\s*!);
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;   # keep only the first occurrence
    print join(", ", @words), "\n";
}
__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3
If you want uniqueness over the whole file, you can just move the %seen hash outside the while (){} loop.
如果您想要整个文件范围内的唯一性,可以把 %seen 散列移到 while (){} 循环之外。
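A rough one-liner version of that whole-file variant might look something like this (a sketch, not part of the original answer):
那个针对整个文件的变体,大致可以写成下面这样的单行命令(只是一个示意,并非原回答的内容):
perl -ne 'chomp; for my $w (split /,\s*/) { push @out, $w unless $seen{$w}++ } END { print join(", ", @out), "\n" }' filename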
回答by Ryan Bright
Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that:
多亏了 uniq,创建一个唯一列表非常容易,不过大多数 Unix 命令喜欢每行一个条目而不是逗号分隔的列表,所以我们必须先把它转换成那样:
$ sed 's/, /\n/g' filename | sort | uniq
word1
word2
word3
word4
word5
word6
word7
The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)
更难的部分是将它再次放在一行上,用逗号作为分隔符而不是终止符。我使用 perl one-liner 来做到这一点,但如果有人有更惯用的东西,请编辑我。:)
$ sed 's/, /\n/g' filename | sort | uniq | perl -e '@a = <>; chomp @a; print((join ", ", @a), "\n")'
word1, word2, word3, word4, word5, word6, word7
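One possible alternative for the re-joining step uses paste (just a sketch, not necessarily more idiomatic):
重新拼接这一步还有一种可能的写法是用 paste(只是一个示意,未必更惯用):
sed 's/, /\n/g' filename | sort -u | paste -sd ',' - | sed 's/,/, /g'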
回答by Rob Wells
And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
如果您还想统计每个单词出现的次数,请不要忘记 uniq 实用程序的 -c 选项。
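For example, after converting to one word per line as above, the counts for the sample input would look roughly like this:
例如,按上面的方式转换成每行一个单词之后,示例输入的计数大致如下:
sed 's/, /\n/g' filename | sort | uniq -c
#   1 word1
#   2 word2
#   3 word3
#   1 word4
#   ...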
回答by mamboking
Here's an awk script that will leave each line intact, only removing the duplicate words:
这是一个 awk 脚本,它将保留每一行,只删除重复的单词:
BEGIN {
    FS = ", "
}
{
    out = ""
    for (i = 1; i <= NF; i++)       # walk the fields on this line
        if (!(used[$i]++))          # keep only the first occurrence
            out = out (out == "" ? "" : ", ") $i
    print out
    split("", used)                 # reset the seen-words table for the next line
}
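Assuming the script is saved as dedupe.awk (the name is only illustrative), it can be run like this:
假设把脚本保存为 dedupe.awk(文件名仅作示意),可以这样运行:
awk -f dedupe.awk filename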
回答by sudon't
Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of doubles. Also, many non-standard characters. I didn't really need them sorted, but it seemed that was gonna be necessary for uniq.
在尝试解决几乎相同的问题时偶然看到了这个帖子。我拼接了几个包含密码的文件,所以自然有很多重复项。此外,还有许多非标准字符。我并不真的需要对它们排序,但看起来 uniq 需要排好序的输入。
I tried:
我试过:
sort /Users/me/Documents/file.txt | uniq -u
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.
Tried:
尝试:
sort -u /Users/me/Documents/file.txt >> /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.
And even tried passing it through cat first, just so I could see if we were getting a proper input.
甚至尝试先通过 cat 传递它,这样我就可以看看我们是否得到了正确的输入。
cat /Users/me/Documents/file.txt | sort | uniq -u > /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `zon\351s' and `zoologie'.
I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t\203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".
我不确定发生了什么。文件中找不到字符串“t\203tonnement”和“t\203tonner”,虽然能找到“t\203”和“tonnement”,但它们在各自独立、互不相邻的行上。“zon\351s”的情况也一样。
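As the error message itself suggests, forcing the C locale is one possible workaround (a sketch, not something tried in this answer):
正如错误信息本身提示的那样,强制使用 C 语言环境是一种可能的变通办法(只是一个示意,并不是这个回答里实际尝试过的):
LC_ALL=C sort -u /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt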
What finally worked for me was:
最终对我有用的是:
awk '!x[$0]++' /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt
It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
它还保留了唯一不同的是大小写的单词,这正是我想要的。我不需要对列表进行排序,所以没有排序也很好。
回答by Biffinum
i had the very same problem today.. a word list with 238,000 words but about 40,000 of those were duplicates. I already had them in individual lines by doing
我今天遇到了同样的问题……一个包含 238,000 个单词的单词列表,但其中大约 40,000 个是重复的。我已经用下面的命令把它们放在了单独的行中:
cat filename | tr " " "\n" | sort
to remove the duplicates I simply did
要删除重复项,我只是做了:
cat filename | uniq > newfilename
Worked perfectly, no errors, and now my file is down from 1.45MB to 1.01MB
完美运行,没有任何错误,现在我的文件从 1.45MB 减少到 1.01MB
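The two steps could also be combined into a single pipeline, something along these lines (a sketch built from the commands above):
这两步也可以合并成一条管道,大致像下面这样(只是根据上面的命令拼出的示意):
cat filename | tr " " "\n" | sort | uniq > newfilename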
回答by meysam
Open the file with vim (vim filename) and run the sort command with the unique flag (:sort u).
用 vim 打开文件(vim filename),然后运行带唯一标志的排序命令(:sort u)。