Linux: how to split a huge csv file based on the content of the first column?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/9489078/
How to split a huge csv file based on content of first column?
Asked by daydreamer
- I have a huge csv file (250MB+) to upload
- The file format is
group_id, application_id, reading
and the data could look like
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (lets say)
- I want to divide the file based on group_id, so the output should be n files, where n is the number of distinct group_ids
Output
File 1
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and
File2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
and
File n
n, x, 0.3 (lets say)
How can I do this effectively?
Accepted answer by Fred Foo
If the file is already sorted by group_id, you can do something like:
import csv
from itertools import groupby

# Assumes rows with the same group_id are contiguous, i.e. the file is
# already sorted by its first column.
with open("foo.csv") as source:
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        # key is the group_id; rows yields that group's rows one at a time
        with open("%s.txt" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")
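If the file isn't already sorted by group_id, one variant (not part of the original answer; a minimal sketch that assumes the whole file fits in memory) sorts the rows first and then applies the same grouping loop:

import csv
from itertools import groupby

# A sketch for unsorted input, assuming the whole file fits in memory.
# For files too large for that, an external sort before running the
# grouping loop above is the usual approach.
with open("foo.csv") as source:
    rows = sorted(csv.reader(source), key=lambda row: row[0])

for key, group in groupby(rows, lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in group:
            output.write(",".join(row) + "\n")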
Answered by DNA
How about:
- Read the input file a line at a time
- split() each line on , to get the group_id
- For each new group_id you find, open an output file
- Add each group_id to a set/dict as you find them so you can keep track
- Write the line to the appropriate file
- Done! (see the sketch after this list)
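A minimal sketch of this approach, assuming the input is named huge.csv and the output files are named group_<id>.csv (both filenames are illustrative, not from the original answer):

# One pass over the file; a dict of open file handles tracks the
# group_ids seen so far. Filenames here are illustrative assumptions.
output_files = {}
try:
    with open("huge.csv") as source:
        for line in source:
            group_id = line.split(",")[0].strip()
            if group_id not in output_files:
                # First time this group_id appears: open its output file
                output_files[group_id] = open("group_%s.csv" % group_id, "w")
            output_files[group_id].write(line)
finally:
    for handle in output_files.values():
        handle.close()

Note that this keeps one file handle open per distinct group_id, so a very large number of groups could hit the OS open-file limit.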
Answered by senderle
If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).
Answered by Odomontois
Here's some food for thought for you:
import csv
from collections import namedtuple

csvfile = namedtuple('csvfile', ('file', 'writer'))

class CSVFileCollections(object):
    """Lazily opens one csv writer per key and closes them all on exit."""

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix
        self.files = {}

    def __getitem__(self, item):
        if item not in self.files:
            # First request for this key: open its output file
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'",
                                quoting=csv.QUOTE_MINIMAL)
            self.files[item] = csvfile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self  # must return the collection for "with ... as output"

    def __exit__(self, exc_type, exc_value, traceback):
        for entry in self.files.values():
            entry.file.close()

with open('huge.csv') as readFile, CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)
Answered by Mike
Sed one-liner:
sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile
The only down-side is that you need to put in n -e statements (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.
The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no python is needed. Plus, it's a one-freaking-liner!
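Since typing out all n -e expressions by hand is tedious, here is a hypothetical helper (not from the original answer) that prints the full sed command for group ids 1..n:

# Hypothetical helper: print the sed command for group ids 1..n (n = 3 here).
n = 3
exprs = " ".join("-e '/^%d,/wFile%d'" % (i, i) for i in range(1, n + 1))
print("sed %s OriginalFile" % exprs)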
Answered by Zsolt Botykai
awk is capable:
awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv