Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31786287/

Date: 2020-09-08 19:11:37  Source: igfitidea

How to split large text file in windows?

Tags: windows, text, cmd, split, size

Asked by Albin

I have a log file that is 2.5 GB in size. Is there any way to split this file into smaller files using the Windows command prompt?

Answered by Josh Withee

If you have installed Git for Windows, you should have Git Bash installed, since that comes with Git.


Use the split command in Git Bash to split a file:

  • into files of size 500MB each: split myLargeFile.txt -b 500m

  • into files with 10000 lines each: split myLargeFile.txt -l 10000
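
The two commands above can be tried on a small file first. The following is only a sketch, assuming the GNU split that ships with Git Bash; the sample file built with seq, the scratch directory, the small sizes, and the bytes_ prefix are made up for the demo and are not from the original answer. Options are written before the file name here, which split also accepts.

```shell
# Work in a scratch directory so the generated pieces are easy to find and delete.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 25,000-line sample file (hypothetical stand-in for the real log).
seq 1 25000 > myLargeFile.txt

# Split by line count: 10,000 lines per piece, default names xaa, xab, xac.
split -l 10000 myLargeFile.txt
wc -l x??

# Split by size: 50 KiB per piece, written with the (made-up) prefix bytes_.
split -b 50k myLargeFile.txt bytes_
ls -l bytes_*

# Concatenating the pieces in name order should reproduce the original file.
cat x?? | cmp - myLargeFile.txt && echo "line-based pieces match the original"
cat bytes_* | cmp - myLargeFile.txt && echo "size-based pieces match the original"
```

Note that splitting by size can cut a line in half at a piece boundary, while splitting by line count never does.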


Tips:


  • If you don't have Git/Git Bash, download at https://git-scm.com/download

  • If you lost the shortcut to Git Bash, you can run it using C:\Program Files\Git\git-bash.exe


That's it!




I always like examples though...


Example:


[Screenshot: the files generated by split]

You can see in this image that the files generated by split are named xaa, xab, xac, etc.

These names are made up of a prefix and a suffix, which you can specify. Since I didn't specify what I want the prefix or suffix to look like, the prefix defaulted to x, and the suffix defaulted to a two-character alphabetical enumeration.


Another Example:


This example demonstrates

  • using a filename prefix of MySlice (instead of the default x),
  • the -d flag for using numerical suffixes (instead of aa, ab, ac, etc.),
  • and the option -a 5 to tell it I want the suffixes to be 5 digits long:

[Screenshot: split run with the MySlice prefix, -d, and -a 5]
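
Since the screenshot is not reproduced here, the options just described could be combined roughly as follows. This is a sketch under assumptions, not the exact command from the image: the sample file, the scratch directory, and the 10,000-line piece size are invented for the demo, and -d is an option of GNU split as shipped with Git Bash.

```shell
# Scratch directory and a 25,000-line sample file (both hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25000 > myLargeFile.txt

# -d        numeric suffixes instead of aa, ab, ac, ...
# -a 5      suffixes that are 5 characters long
# MySlice   output prefix, given after the input file name
split -d -a 5 -l 10000 myLargeFile.txt MySlice

# Lists MySlice00000, MySlice00001, MySlice00002
ls MySlice*
```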

Answered by bill

' Read all of stdin into an in-memory ADO recordset, one record per line.
Set Arg = WScript.Arguments
Set WshShell = CreateObject("WScript.Shell")
Set Inp = WScript.Stdin
Set Outp = WScript.Stdout
Set rs = CreateObject("ADODB.Recordset")
With rs
    .Fields.Append "LineNumber", 4      ' 4 = adSingle
    .Fields.Append "Txt", 201, 5000     ' 201 = adLongVarChar
    .Open
    LineCount = 0
    Do Until Inp.AtEndOfStream
        LineCount = LineCount + 1
        .AddNew
        .Fields("LineNumber").Value = LineCount
        .Fields("Txt").Value = Inp.ReadLine
        .Update
    Loop

    .Sort = "LineNumber ASC"

    ' Arg(0) is the command name ("cut"); Arg(1) = t|b, Arg(2) = i|x, Arg(3) = number of lines.
    If LCase(Arg(1)) = "t" Then
        If LCase(Arg(2)) = "i" Then
            .Filter = "LineNumber < " & LCase(Arg(3)) + 1
        ElseIf LCase(Arg(2)) = "x" Then
            .Filter = "LineNumber > " & LCase(Arg(3))
        End If
    ElseIf LCase(Arg(1)) = "b" Then
        If LCase(Arg(2)) = "i" Then
            .Filter = "LineNumber > " & LineCount - LCase(Arg(3))
        ElseIf LCase(Arg(2)) = "x" Then
            .Filter = "LineNumber < " & LineCount - LCase(Arg(3)) + 1
        End If
    End If

    ' Write the lines that pass the filter to stdout.
    Do While Not .EOF
        Outp.WriteLine .Fields("Txt").Value
        .MoveNext
    Loop
End With

Cut

filter cut {t|b} {i|x} NumOfLines

Cuts the number of lines from the top or bottom of file.


t - top of the file
b - bottom of the file
i - include n lines
x - exclude n lines

Example


cscript /nologo filter.vbs cut t i 5 < "%systemroot%\win.ini"

Another way: this outputs lines 5001 onward; adapt it for your use. It uses almost no memory.

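Set Inp = WScript.Stdin
Set Outp = WScript.Stdout
Count = 0
Do Until Inp.AtEndOfStream
    Count = Count + 1
    ' Read every line so the stream always advances; print only lines after the 5000th.
    Line = Inp.ReadLine
    If Count > 5000 Then
        Outp.WriteLine Line
    End If
Loop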
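
For readers who already have Git Bash from the first answer, the same selections can be sketched with head and tail instead of VBScript. This is a swapped-in technique, not part of this answer; the 6,000-line sample file and the output file names are hypothetical.

```shell
# Scratch directory and a 6,000-line sample file (both hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 6000 > sample.txt

# Lines 5001 onward, like the streaming loop above.
tail -n +5001 sample.txt > after_5000.txt

# First 5 lines, comparable to "cut t i 5".
head -n 5 sample.txt > top_5.txt

# Last 5 lines, comparable to "cut b i 5".
tail -n 5 sample.txt > bottom_5.txt

wc -l after_5000.txt top_5.txt bottom_5.txt
```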

Answered by Zimba

Of course there is! Win CMD can do a lot more than just split text files :)


Split a text file into separate files of 'max' lines each:


Split text file (max lines each):
REM The !var! syntax below needs delayed expansion: start the prompt with cmd /v:on
: Initialize
set input=file.txt
set max=10000

set /a line=1 >nul
set /a file=1 >nul
set out=!file!_%input%
set /a max+=1 >nul

echo Number of lines in %input%:
find /c /v "" < %input%

: Split file
for /f "tokens=* delims=[" %i in ('type "%input%" ^| find /v /n ""') do (

if !line!==%max% (
set /a line=1 >nul
set /a file+=1 >nul
set out=!file!_%input%
echo Writing file: !out!
)

REM Write next file
set a=%i
set a=!a:*]=]!
echo:!a:~1!>>!out!
set /a line+=1 >nul
)

If the above code hangs or crashes, this example code splits files faster (by writing data to intermediate files instead of keeping everything in memory):

E.g., to split a file with 7,600 lines into smaller files of at most 3,000 lines:

  1. Generate regexp string/pattern files with the set command, to be fed to the /g flag of findstr:

list1.txt

\[[0-9]\]
\[[0-9][0-9]\]
\[[0-9][0-9][0-9]\]
\[[0-2][0-9][0-9][0-9]\]

list2.txt

\[[3-5][0-9][0-9][0-9]\]

list3.txt

\[[6-9][0-9][0-9][0-9]\]

  2. Split the file into smaller files:
type "%input%" | find /v /n "" | findstr /b /r /g:list1.txt > file1.txt
type "%input%" | find /v /n "" | findstr /b /r /g:list2.txt > file2.txt
type "%input%" | find /v /n "" | findstr /b /r /g:list3.txt > file3.txt
  3. Remove the prefixed line numbers from each split file:
    e.g., for the 1st file:
for /f "tokens=* delims=[" %i in ('type "%cd%\file1.txt"') do (
set a=%i
set a=!a:*]=]!
echo:!a:~1!>>file_1.txt)

Notes:
Works with leading whitespace, blank lines & whitespace lines.


Tested on Win 10 x64 CMD, on 4.4GB text file, 5651982 lines.


Answered by Shaina Raza

You can split the file using third-party software such as http://www.hjsplit.org/. For example, give it your input file, which can be up to 9 GB, and then split it; in my case I split it into pieces of 10 MB each.

Answered by Wintermute

You can use the split command for this task. For example, this command entered into the command prompt

split YourLogFile.txt -b 500m

creates several files with a size of 500 MByte each. This will take several minutes for a file of your size. You can rename the output files (by default called "xaa", "xab", and so on) to *.txt to open them in the editor of your choice.
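
The renaming step can be done in one go. The loop below is a sketch that assumes a Bash-like shell such as Git Bash rather than plain cmd; the sample file and the small 50k piece size are stand-ins chosen to keep the demo quick, not values from this answer.

```shell
# Scratch directory and a small stand-in for the real log file (both hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25000 > YourLogFile.txt

# 50k instead of 500m, only so the demo finishes instantly.
split -b 50k YourLogFile.txt

# Append .txt to every piece: xaa -> xaa.txt, xab -> xab.txt, ...
for piece in x??; do
    mv "$piece" "$piece.txt"
done

ls x??.txt
```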

Make sure to check the help file for the command. You can also split the log file by number of lines or change the name of your output files.


(tested on Windows 7 64 bit)
