How to split a file into equal parts, without breaking individual lines?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/7764755/
Asked by Abdel
I was wondering if it was possible to split a file into equal parts (edit:= all equal except for the last), without breaking the line? Using the split command in Unix, lines may be broken in half. Is there a way to, say, split up a file in 5 equal parts, but have it still only consist of whole lines (it's no problem if one of the files is a little larger or smaller)? I know I could just calculate the number of lines, but I have to do this for a lot of files in a bash script. Many thanks!
Answer by paxdiablo
If you mean an equal number of lines, split has an option for this:
split --lines=75
If you need to know what that 75 should really be for N equal parts, it's:
lines_per_part = int((total_lines + N - 1) / N)
where total_lines can be obtained with wc -l.
See the following script for an example:
#!/usr/bin/bash
# Configuration stuff
fspec=qq.c
num_files=6
# Work out lines per file.
total_lines=$(wc -l <"${fspec}")
((lines_per_file = (total_lines + num_files - 1) / num_files))
# Split the actual file, maintaining lines.
split --lines="${lines_per_file}" "${fspec}" xyzzy.
# Debug information
echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*
This outputs:
Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total
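Since the question mentions doing this for many files in a bash script, the same calculation can be wrapped in a function and called in a loop. This is only a sketch: split_by_lines, the .part. suffix, and the generated demo files are hypothetical names chosen for illustration, not part of split.

```bash
#!/usr/bin/env bash
# Sketch: apply the ceiling-division approach to several files in a loop.
# split_by_lines is a hypothetical helper name, not part of split itself.
split_by_lines() {
    file=$1
    parts=$2
    total_lines=$(wc -l < "$file")
    # Ceiling division, so at most $parts output files are produced.
    lines_per_file=$(( (total_lines + parts - 1) / parts ))
    # An empty file has nothing to split, and split rejects a line count of 0.
    [ "$lines_per_file" -gt 0 ] || return 0
    split -l "$lines_per_file" "$file" "$file.part."
}

# Demo on two generated files in a scratch directory.
workdir=$(mktemp -d)
seq 1 70 > "$workdir/a.txt"
seq 1 23 > "$workdir/b.txt"
for f in "$workdir"/*.txt; do
    split_by_lines "$f" 6
done
wc -l "$workdir"/*.part.*
```

Note that rounding up can yield fewer than the requested number of files when the line count is small: 10 lines in 6 parts gives 2 lines per file and therefore only 5 files.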
More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:
split --number=l/6 ${fspec} xyzzy.
(that's ell-slash-six, meaning lines, not one-slash-six).
That will give you roughly equal files in terms of size, with no mid-line splits.
I mention that last point because it doesn't give you roughly the same number of lines in each file, but rather roughly the same number of characters.
So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.
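The size-versus-lines point can be checked directly on the file described above. This is a minimal sketch, assuming a split that supports -n (GNU coreutils 8.8 or later); sample.txt and the chunk. prefix are made-up names for illustration.

```bash
#!/usr/bin/env bash
# Build the file described above: one 20-character line, then 19 one-character lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' "aaaaaaaaaaaaaaaaaaaa" > sample.txt
i=0
while [ "$i" -lt 19 ]; do
    printf 'x\n' >> sample.txt
    i=$((i + 1))
done

# Requires a split that supports -n (GNU coreutils 8.8 or later).
split -n l/5 sample.txt chunk.

# Show lines and bytes per part; the line counts will generally not all be 4.
wc -lc chunk.*
```

Whatever the per-file counts turn out to be, no line is cut in half, so concatenating the parts in order gives back the original file.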
Answer by jbr
The script isn't even necessary; split(1) supports the wanted feature out of the box:
split -l 75 auth.log auth.log.
The above command splits the file into chunks of 75 lines apiece, and writes output files of the form auth.log.aa, auth.log.ab, ...
wc -l on the original file and the output gives:
321 auth.log
75 auth.log.aa
75 auth.log.ab
75 auth.log.ac
75 auth.log.ad
21 auth.log.ae
642 total
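A quick way to confirm that nothing was lost or reordered is to concatenate the parts and compare them with the original. The sketch below uses a generated 321-line stand-in for auth.log (an assumption, since the real log is not available) in a scratch directory.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 321 > auth.log          # stand-in for a real 321-line log
split -l 75 auth.log auth.log.

# Parts sort alphabetically (aa, ab, ...), so the glob restores the original order.
if cat auth.log.?? | cmp -s - auth.log; then
    echo "parts reassemble to the original"
fi
wc -l auth.log.??
```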
Answer by user3769065
split was updated in coreutils release 8.8 (announced 22 Dec 2010) with the --number option to generate a specific number of files. The option --number=l/n generates n files without splitting lines.
http://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html#split-invocation
http://savannah.gnu.org/forum/forum.php?forum_id=6662
Answer by Kuf
A simple solution for a simple question:
split -n l/5 your_file.txt
No need for scripting here.
From the man page, CHUNKS may be:
l/N split into N files without splitting lines
Update
Not all Unix distributions include this flag. For example, it will not work on OS X. To use it, you can consider replacing the Mac OS X utilities with the GNU core utilities.
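For a script that has to run on systems with and without the flag, one option is to try -n l/N first and fall back to a line count when split rejects it. This is a sketch under that assumption; split_parts and the file names are hypothetical. Keep in mind that the two paths balance differently: -n l/N evens out bytes, the fallback evens out lines.

```bash
#!/usr/bin/env bash
# split_parts is a hypothetical helper: try -n l/N first, fall back to -l.
split_parts() {
    file=$1
    parts=$2
    prefix=$3
    if split -n "l/$parts" "$file" "$prefix" 2>/dev/null; then
        return 0
    fi
    # Fallback for split implementations that do not accept -n l/N.
    total_lines=$(wc -l < "$file")
    lines_per_file=$(( (total_lines + parts - 1) / parts ))
    [ "$lines_per_file" -gt 0 ] || return 0
    split -l "$lines_per_file" "$file" "$prefix"
}

# Demo on a generated 10-line file in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > nums.txt
split_parts nums.txt 3 nums.part.
wc -l nums.part.*
```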
Answer by Jose Ricardo Bustos M.
I made a bash script that, given a number of parts as input, splits a file:
#!/bin/sh
# Usage: sh mysplitXparts.sh input_file number_of_parts
parts_total="$2"
input="$1"
parts=$((parts_total))
for i in $(seq 0 $((parts_total-2))); do
    lines=$(wc -l < "$input")
    # n is lines/parts rounded up: 1.3 to 2, 1.6 to 2, 1 to 1
    n=$(awk -v lines="$lines" -v parts="$parts" 'BEGIN {
        n = lines / parts;
        rounded = int(n);
        if (n > rounded) {
            print rounded + 1;
        } else {
            print rounded;
        }
    }')
    # The first n lines become this part; the rest is carried over.
    head -n "$n" "$input" > "split${i}"
    tail -n "$((lines - n))" "$input" > ".tmp${i}"
    input=".tmp${i}"
    parts=$((parts - 1))
done
# Whatever remains is the last part.
mv ".tmp$((parts_total-2))" "split$((parts_total-1))"
rm -f .tmp*
I used the head and tail commands, storing the intermediate results in temporary files, to split the file.
#10 means 10 parts
sh mysplitXparts.sh input_file 10
or with awk, where 0.1 is 10% => 10 parts, or 0.334 is 3 parts
awk -v size="$(wc -l < input)" -v perc=0.1 '{
    # NR is 1-based, so use NR - 1 to keep the parts the same length.
    nfile = int((NR - 1) / (size * perc));
    if (nfile >= 1 / perc) {
        nfile--;
    }
    print > ("split_" nfile)
}' input
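Expressing three parts as 0.334 is a little awkward, so the same idea can be written with an integer part count instead of a percentage. This is a sketch of a variant, not the answer's own code; the parts variable and the generated 10-line input file are assumptions made for the demo.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > input

# parts is the desired number of output files; size is the total line count.
awk -v size="$(wc -l < input)" -v parts=3 '{
    # Map line NR (1-based) to a file index in 0..parts-1.
    nfile = int((NR - 1) * parts / size)
    print > ("split_" nfile)
}' input
wc -l split_*
```

With 10 lines and 3 parts this puts 4 lines in split_0 and 3 each in split_1 and split_2. For more than a handful of parts, some awk implementations may hit a limit on simultaneously open output files.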
Answer by Prabu
var dict = File.ReadLines("test.txt")
.Where(line => !string.IsNullOrWhiteSpace(line))
.Select(line => line.Split(new char[] { '=' }, 2, 0))
.ToDictionary(parts => parts[0], parts => parts[1]);
or
line="[email protected][email protected]";
string[] tokens = line.Split(new char[] { '=' }, 2, 0);
ans:
tokens[0]=to
token[1][email protected][email protected]"