Windows: batch-remove duplicate lines from a text file
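Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/11689689/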
Batch to remove duplicate rows from text file
Asked by Rocshy
Is it possible to remove duplicate rows from a text file? If yes, how?
Answer by dbenham
Sure can, but like most text file processing with batch, it is not pretty, and it is not particularly fast.
This solution ignores case when looking for duplicates, and it sorts the lines. The name of the file is passed in as the 1st and only argument to the batch script.
@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "sorted=%file%.sorted"
set "deduped=%file%.deduped"

::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove

sort "%file%" >"%sorted%"
>"%deduped%" (
  set "prev="
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%sorted%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    if /i "!ln!" neq "!prev!" (
      endlocal
      (echo %%A)
      set "prev=%%A"
    ) else endlocal
  )
)
>nul move /y "%deduped%" "%file%"
del "%sorted%"
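As a cross-check on the logic only (not a replacement for the batch file), the sort-then-compare-with-previous idea of the script above can be sketched in Python. The function name is hypothetical, and ordering by str.lower is an assumption standing in for the ordering of the Windows sort command, so which spelling of a case variant survives may differ from the batch result.

```python
def sorted_dedupe_ignore_case(lines):
    """Sort lines ignoring case, then keep the first line of each run of
    lines that are equal when case is ignored."""
    result = []
    prev = None
    for line in sorted(lines, key=str.lower):
        if line == "":
            continue  # the batch FOR /F loop drops empty lines as well
        if prev is None or line.lower() != prev.lower():
            result.append(line)
            prev = line
    return result


print(sorted_dedupe_ignore_case(["pear", "Apple", "apple", "", "PEAR"]))
# prints ['Apple', 'pear']
```

Because the input is sorted first, equal lines always end up adjacent, which is why a single "previous line" variable is enough.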
This solution is case sensitive and it leaves the lines in the original order (except for duplicates of course). Again the name of the file is passed in as the 1st and only argument.
@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "line=%file%.line"
set "deduped=%file%.deduped"

::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove

::Each backslash is doubled because FINDSTR treats backslash as an escape in search strings
>"%deduped%" (
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%file%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    >"%line%" (echo !ln:\=\\!)
    >nul findstr /xlg:"%line%" "%deduped%" || (echo !ln!)
    endlocal
  )
)
>nul move /y "%deduped%" "%file%"
2>nul del "%line%"
EDIT
Both solutions above strip blank lines. I didn't think blank lines were worth preserving when talking about distinct values.
I've modified both solutions to disable the FOR /F "EOL" option so that all non-blank lines are preserved, regardless of what the 1st character is. The modified code sets the EOL option to a linefeed character.
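For comparison, the second script above (case sensitive, original order kept, blank lines dropped) can be sketched in Python. This is only an illustration of the behavior, with a hypothetical function name: the batch version rescans the growing output file with findstr for every input line, whereas this sketch assumes the distinct lines fit in memory and remembers them in a set.

```python
def dedupe_keep_order(lines):
    """Keep the first occurrence of each line, comparing case-sensitively
    and preserving the original order. Empty lines are dropped."""
    seen = set()
    result = []
    for line in lines:
        if line == "" or line in seen:
            continue
        seen.add(line)
        result.append(line)
    return result


print(dedupe_keep_order(["b", "a", "B", "b", "", "a"]))
# prints ['b', 'a', 'B']
```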
New solution 2016-04-13: JSORT.BAT
You can use my JSORT.BAT hybrid JScript/batch utility to efficiently sort and remove duplicate lines with a simple one-liner (plus a MOVE to overwrite the original file with the final result). JSORT is pure script that runs natively on any Windows machine from XP onward.
@jsort file.txt /u >file.txt.new
@move /y file.txt.new file.txt >nul
Answer by PA.
You may use uniq (http://en.wikipedia.org/wiki/Uniq) from UnxUtils (http://sourceforge.net/projects/unxutils/).
Answer by genetix
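@echo off
setlocal EnableDelayedExpansion
set "file=%CD%\%~1"
sort "%file%" >"%file%.sorted"
del /q "%file%"
set "ln="
rem SETLOCAL is called once, before the loop: calling it on every iteration
rem nests until cmd.exe reaches its SETLOCAL depth limit
for /F "usebackq tokens=*" %%A in ("%file%.sorted") do (
  if not "%%A"=="!ln!" (
    set "ln=%%A"
    >>"%file%" echo %%A
  )
)
del /q "%file%.sorted"
endlocal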
This should work exactly the same. The dbenham example seemed way too hardcore for me, so I wrote and tested my own solution. Usage example: filedup.cmd filename.ext
Answer by Aacini
The batch file below does what you want:
@echo off
setlocal EnableDelayedExpansion
set "prevLine="
for /F "delims=" %%a in (theFile.txt) do (
  if "%%a" neq "!prevLine!" (
    echo %%a
    set "prevLine=%%a"
  )
)
If you need a more efficient method, try this Batch-JScript hybrid script, which is written as a filter, that is, similar to the Unix uniq program. Save it with a .bat extension, for example uniq.bat:
@if (@CodeSection == @Batch) @then
@CScript //nologo //E:JScript "%~F0" & goto :EOF
@end
var line, prevLine = "";
while ( ! WScript.Stdin.AtEndOfStream ) {
   line = WScript.Stdin.ReadLine();
   if ( line != prevLine ) {
      WScript.Stdout.WriteLine(line);
      prevLine = line;
   }
}
Both programs were copied from this post.
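Both programs above compare each line only with the line immediately before it, as the Unix uniq program does, so duplicates that are not adjacent survive unless the input is sorted first. A minimal Python sketch of that filter logic (the function name is hypothetical, chosen for illustration) makes the difference visible:

```python
def uniq(lines):
    """Drop a line only when it equals the line directly before it."""
    result = []
    prev = None
    for line in lines:
        if line != prev:
            result.append(line)
            prev = line
    return result


data = ["a", "a", "b", "a"]
print(uniq(data))          # prints ['a', 'b', 'a']
print(uniq(sorted(data)))  # prints ['a', 'b']
```

In practice this means such a filter is normally fed sorted input when every duplicate has to go.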
Answer by Magoo
Pure batch - 3 effective lines.
@ECHO OFF
SETLOCAL
:: remove variables starting $
FOR /F "delims==" %%a In ('set $ 2^>Nul') DO SET "%%a="
FOR /f "delims=" %%a IN (q34223624.txt) DO SET $%%a=Y
(FOR /F "delims=$=" %%a In ('set $ 2^>Nul') DO ECHO %%a)>u:\resultfile.txt
GOTO :EOF
Works happily if the data does not contain characters to which batch has a sensitivity.
The input file is named q34223624.txt because question 34223624 contained this data:
1.1.1.1
1.1.1.1
1.1.1.1
1.2.1.2
1.2.1.2
1.2.1.2
1.3.1.3
1.3.1.3
1.3.1.3
on which it works perfectly.
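The trick in this answer is to use each line as part of an environment variable name, so that repeated lines simply overwrite the same variable, and then to list the names. A rough Python analogy, using dictionary keys in place of variable names, is sketched below. The function name is hypothetical, and the analogy is loose: Windows environment variable names are case-insensitive and the listing order of SET is assumed rather than reproduced here, so the batch output can differ from this sketch for mixed-case data.

```python
def dedupe_via_keys(lines):
    """Use each line (prefixed with '$') as a dictionary key, so repeats
    collapse into one entry, then list the keys in sorted order."""
    table = {}
    for line in lines:
        if line:
            table["$" + line] = "Y"
    return [name[1:] for name in sorted(table)]


sample = ["1.1.1.1"] * 3 + ["1.2.1.2"] * 3 + ["1.3.1.3"] * 3
print(dedupe_via_keys(sample))
# prints ['1.1.1.1', '1.2.1.2', '1.3.1.3']
```

Note that the result comes back in key order rather than in the original file order.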
Answer by JasonXA
I came across this issue and had to resolve it myself because the use case was particular to my needs. I needed to find duplicate URLs, and the order of the lines was relevant, so it had to be preserved. The lines of text should not contain any double quotes and should not be very long, and sorting cannot be used.
Thus I did this:
setlocal enabledelayedexpansion
type nul>unique.txt
for /F "tokens=*" %%i in (list.txt) do (
find "%%i" unique.txt 1>nul
if !errorlevel! NEQ 0 (
echo %%i>>unique.txt
)
)
Auxiliary: if the text does contain double quotes then the FIND needs to use a filtered set variable as described in this post: Escape double quotes in parameter
So instead of:
find "%%i" unique.txt 1>nul
it would be more like:
set test=%%i
set test=!test:"=""!
find "!test!" unique.txt 1>nul
Thus find will look like find """what""" file and %%i will be unchanged.
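One caveat with this approach is that FIND matches its search string anywhere within a line, so a new line that is merely a substring of an already stored line is also treated as a duplicate. The Python sketch below (hypothetical function names and example URLs, for illustration only) contrasts substring matching with whole-line matching, which is what the /x option of findstr provides in dbenham's second script.

```python
def dedupe_by_substring(lines):
    """Mimic a substring search: a line counts as already present when it
    occurs anywhere inside a line that was kept earlier."""
    kept = []
    for line in lines:
        if not any(line in existing for existing in kept):
            kept.append(line)
    return kept


def dedupe_by_whole_line(lines):
    """Treat a line as a duplicate only when an identical line was kept."""
    kept = []
    for line in lines:
        if line not in kept:
            kept.append(line)
    return kept


urls = [
    "http://example.com/page",
    "http://example.com/",
    "http://example.com/page",
]
print(dedupe_by_substring(urls))   # prints ['http://example.com/page']
print(dedupe_by_whole_line(urls))  # prints ['http://example.com/page', 'http://example.com/']
```

With substring matching the shorter URL is lost even though it never appeared before as a full line.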
Answer by JasonXA
I have used a fake "array" to accomplish this:
@echo off
:: filter out all duplicate ip addresses
REM your file takes the place of %1
set file=%1
if [%1]==[] goto :EOF
setlocal EnableDelayedExpansion
set size=0
set cond=false
set max=0
for /F %%a IN ('type %file%') do (
if [!size!]==[0] (
set cond=true
set /a size="size+1"
set arr[!size!]=%%a
) ELSE (
call :inner
if [!cond!]==[true] (
set /a size="size+1"
set arr[!size!]=%%a&& ECHO > NUL
)
)
)
break> %file%
:: destroys old output
for /L %%b in (1,1,!size!) do echo !arr[%%b]!>> %file%
endlocal
goto :eof
:inner
for /L %%b in (1,1,!size!) do (
if "%%a" neq "!arr[%%b]!" (set cond=true) ELSE (set cond=false&&goto :break)
)
:break
Using a label for the inner loop is something specific to cmd.exe, and it is the only way I have managed to nest for loops within each other. Basically, this compares each new value read from the file with the values already stored, and if there is no match the program adds the value to memory. When it is done, it destroys the target file's contents and replaces them with the unique strings.
Answer by aschipfl
Some time ago I found an unexpectedly simple solution, but unfortunately it only works on Windows 10: the sort command features some undocumented options that can be adopted:
/UNIQ[UE] to output only unique lines;
/C[ASE_SENSITIVE] to sort case-sensitively;
So use the following line of code to remove duplicate lines (remove /C to do that in a case-insensitive manner):
sort /C /UNIQUE "incoming.txt" /O "outgoing.txt"
This removes duplicate lines from the text in incoming.txt and provides the result in outgoing.txt. Note that the original order is of course not going to be preserved (because, well, this is the main purpose of sort).
However, you should use these options with care as there might be some (un)known issues with them, because there is possibly a good reason for them not to be documented (so far).