如何使用linux命令从纯文本文件中删除重复的单词
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/952268/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to remove duplicate words from a plain text file using linux command
提问by cupakob
I have a plain text file with words, which are separated by comma, for example:
我有一个带有单词的纯文本文件,用逗号分隔,例如:
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3
I want to delete the duplicates so that it becomes:
我想删除重复项,使其变成:
word1, word2, word3, word4, word5, word6, word7
Any ideas? I think egrep can help me, but I'm not sure how to use it exactly...
有任何想法吗?我觉得 egrep 也许能帮上忙,但我不确定具体该怎么用……
采纳答案by Randy Orrison
Assuming that the words are one per line, and the file is already sorted:
假设每行一个单词,并且文件已经排序:
uniq filename
If the file's not sorted:
如果文件未排序:
sort filename | uniq
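(sort can also do both steps at once with its -u option, which a later answer below uses as well:)
(sort 也可以用 -u 选项一步完成这两步,下面的一个回答也用到了它:)
sort -u filename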
If they're not one per line, and you don't mind them being one per line:
如果它们不是每行一个,并且您不介意它们每行一个:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
但是,这不会删除标点符号,所以也许您想要:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that removes the hyphen from hyphenated words. "man tr" for more options.
但这会从带连字符的单词中删除连字符。“man tr”以获得更多选项。
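For example, assuming the sample line from the question is saved in a file called words.txt (the name is only for illustration), the last pipeline gives one word per line:
例如,假设问题中的示例行保存在一个名为 words.txt 的文件中(文件名仅作示意),上面最后一个管道会得到每行一个单词:
tr -s '[:space:][:punct:]' '\n' < words.txt | sort | uniq
# prints word1 through word7, one per line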
回答by Oliver N.
ruby -pi.bak -e '$_ = $_.chomp.split(", ").uniq.join(", ") + "\n"' filename
?
I'll admit the two kinds of quotations are ugly.
我承认这两种引号混在一起很丑。
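A quick way to try it (the file name below is only an example): -i.bak edits the file in place and keeps the original as a .bak backup.
一个快速试用的方式(下面的文件名只是示例):-i.bak 会就地修改文件,并把原文件保存为 .bak 备份。
ruby -pi.bak -e '$_ = $_.chomp.split(", ").uniq.join(", ") + "\n"' words.txt
cat words.txt       # deduplicated in place
cat words.txt.bak   # the original, kept by -i.bak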
回答by Beano
I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.
我假设您希望单词在一行中是唯一的,而不是在整个文件中。如果是这种情况,那么下面的 Perl 脚本就可以解决问题。
while (<DATA>)
{
    chomp;
    my %seen = ();                   # words already printed on this line
    my @words = split(m!,\s*!);
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;   # keep only the first occurrence
    print join(", ", @words), "\n";
}
__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3
If you want uniqueness over the whole file, you can just move the %seen hash outside the while (){} loop.
如果您想要整个文件范围内的唯一性,可以把 %seen 散列移到 while (){} 循环之外。
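A rough one-liner version of that whole-file variant might look something like this (a sketch, not part of the original answer):
那个针对整个文件的变体,大致可以写成下面这样的单行命令(只是一个示意,并非原回答的内容):
perl -ne 'chomp; for my $w (split /,\s*/) { push @out, $w unless $seen{$w}++ } END { print join(", ", @out), "\n" }' filename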
回答by Ryan Bright
Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that:
多亏了 uniq,创建一个唯一列表非常容易,不过大多数 Unix 命令喜欢每行一个条目而不是逗号分隔的列表,所以我们必须先把它转换成那样:
$ sed 's/, /\n/g' filename | sort | uniq
word1
word2
word3
word4
word5
word6
word7
The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)
更难的部分是将它再次放在一行上,用逗号作为分隔符而不是终止符。我使用 perl one-liner 来做到这一点,但如果有人有更惯用的东西,请编辑我。:)
$ sed 's/, /\n/g' filename | sort | uniq | perl -e '@a = <>; chomp @a; print((join ", ", @a), "\n")'
word1, word2, word3, word4, word5, word6, word7
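One possible alternative for the re-joining step uses paste (just a sketch, not necessarily more idiomatic):
重新拼接这一步还有一种可能的写法是用 paste(只是一个示意,未必更惯用):
sed 's/, /\n/g' filename | sort -u | paste -sd ',' - | sed 's/,/, /g'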
回答by Rob Wells
And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
如果您还想统计每个单词出现的次数,请不要忘记 uniq 实用程序的 -c 选项。
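For example, after converting to one word per line as above, the counts for the sample input would look roughly like this:
例如,按上面的方式转换成每行一个单词之后,示例输入的计数大致如下:
sed 's/, /\n/g' filename | sort | uniq -c
#   1 word1
#   2 word2
#   3 word3
#   1 word4
#   ...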
回答by mamboking
Here's an awk script that will leave each line intact, only removing the duplicate words:
这是一个 awk 脚本,它将保留每一行,只删除重复的单词:
BEGIN {
    FS = ", "
}
{
    out = ""
    for (i = 1; i <= NF; i++)       # walk the fields on this line
        if (!(used[$i]++))          # keep only the first occurrence
            out = out (out == "" ? "" : ", ") $i
    print out
    split("", used)                 # reset the seen-words table for the next line
}
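Assuming the script is saved as dedupe.awk (the name is only illustrative), it can be run like this:
假设把脚本保存为 dedupe.awk(文件名仅作示意),可以这样运行:
awk -f dedupe.awk filename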
回答by sudon't
Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of doubles. Also, many non-standard characters. I didn't really need them sorted, but it seemed that was gonna be necessary for uniq.
在尝试解决几乎相同的问题时偶然看到了这个帖子。我拼接了几个包含密码的文件,所以自然有很多重复项。此外,还有许多非标准字符。我并不真的需要对它们排序,但看起来 uniq 需要排好序的输入。
I tried:
我试过:
sort /Users/me/Documents/file.txt | uniq -u
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.
Tried:
尝试:
sort -u /Users/me/Documents/file.txt >> /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.
And even tried passing it through cat first, just so I could see if we were getting a proper input.
甚至尝试先通过 cat 传递它,这样我就可以看看我们是否得到了正确的输入。
cat /Users/me/Documents/file.txt | sort | uniq -u > /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `zon\351s' and `zoologie'.
I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t\203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".
我不确定发生了什么。文件中找不到字符串“t\203tonnement”和“t\203tonner”,虽然能找到“t\203”和“tonnement”,但它们在各自独立、互不相邻的行上。“zon\351s”的情况也一样。
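As the error message itself suggests, forcing the C locale is one possible workaround (a sketch, not something tried in this answer):
正如错误信息本身提示的那样,强制使用 C 语言环境是一种可能的变通办法(只是一个示意,并不是这个回答里实际尝试过的):
LC_ALL=C sort -u /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt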
What finally worked for me was:
最终对我有用的是:
awk '!x[$0]++' /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt
It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
它还保留了唯一不同的是大小写的单词,这正是我想要的。我不需要对列表进行排序,所以没有排序也很好。
回答by Biffinum
i had the very same problem today.. a word list with 238,000 words but about 40,000 of those were duplicates. I already had them in individual lines by doing
我今天遇到了同样的问题……一个包含 238,000 个单词的单词列表,但其中大约 40,000 个是重复的。我已经用下面的命令把它们放在了单独的行中:
cat filename | tr " " "\n" | sort
to remove the duplicates I simply did
要删除重复项,我只是做了:
cat filename | uniq > newfilename
Worked perfectly, no errors, and now my file is down from 1.45MB to 1.01MB
完美运行,没有任何错误,现在我的文件从 1.45MB 减少到 1.01MB
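The two steps could also be combined into a single pipeline, something along these lines (a sketch built from the commands above):
这两步也可以合并成一条管道,大致像下面这样(只是根据上面的命令拼出的示意):
cat filename | tr " " "\n" | sort | uniq > newfilename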
回答by meysam
Open the file with vim (vim filename) and run the sort command with the unique flag (:sort u).
用 vim 打开文件(vim filename),然后运行带唯一标志的排序命令(:sort u)。