Linux: how to split a huge csv file based on the content of the first column?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/9489078/

Date: 2020-08-06 04:53:51 | Source: igfitidea

How to split a huge csv file based on content of first column?

Tags: python, linux, unix, ubuntu

Asked by daydreamer

  • I have a 250MB+ huge csv file to upload
  • the file format is group_id, application_id, reading, and the data could look like:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
  • I want to divide the file based on group_id, so the output should be n files, one per distinct group_id

Output


File 1

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

File 2

2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....

and

File n

n, x, 0.3 (let's say)

How can I do this effectively?


Accepted answer by Fred Foo

If the file is already sorted by group_id, you can do something like:


import csv
from itertools import groupby

# groupby yields one (key, rows) pair per consecutive run of rows sharing
# the same first column, so this assumes the file is sorted by group_id.
for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    # one output file per group_id, e.g. "1.txt", "2.txt", ...
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
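
If the file is not already sorted by group_id, one possibility (a sketch that is not part of the original answer; it assumes the whole 250MB file fits in memory) is to sort the rows first and then apply the same groupby:

import csv
from itertools import groupby

# Not from the original answer: sort all rows by the first column in memory,
# then group consecutive runs exactly as above.
with open("foo.csv", newline="") as f:
    rows = sorted(csv.reader(f), key=lambda row: row[0])
for key, group in groupby(rows, lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in group:
            output.write(",".join(row) + "\n")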

Answer by aweis

If they are sorted by the group id, you can use the csv module to iterate over the rows in the file and write them out. You can find information about the module in the Python documentation.


Answer by DNA

How about:

怎么样:

  • Read the input file a line at a time
  • split() each line on "," to get the group_id
  • For each new group_id you find, open an output file
    • add each group_id to a set/dict as you find them so you can keep track
  • Write the line to the appropriate file
  • Done! (A sketch of these steps follows below.)
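
A minimal sketch of these steps (the file names and the dict of open files are illustrative assumptions, not from the original answer):

# Hypothetical file names; the dict doubles as the "set/dict" tracker above.
open_files = {}
with open("input.csv") as infile:
    for line in infile:
        group_id = line.split(",")[0]           # first column is the group_id
        if group_id not in open_files:          # new group_id -> open its file
            open_files[group_id] = open("group_%s.csv" % group_id, "w")
        open_files[group_id].write(line)        # write the line as-is
for f in open_files.values():
    f.close()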

Answer by senderle

If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).


Answer by Odomontois

Here's some food for thought for you:


import csv
from collections import namedtuple

# One record per output file: the file handle plus its csv writer.
CSVFile = namedtuple('CSVFile', ('file', 'writer'))

class CSVFileCollections(object):

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix  # was never assigned in the original
        self.files = {}

    def __getitem__(self, item):
        # Lazily open one output file (and writer) per key.
        if item not in self.files:
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'",
                                quoting=csv.QUOTE_MINIMAL)
            self.files[item] = CSVFile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self  # must return self, or "as output" below binds None

    def __exit__(self, exc_type, exc_value, traceback):
        for csvfile in self.files.values():
            csvfile.file.close()


with open('huge.csv', newline='') as readFile, \
     CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)
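
A design note: because every per-group file handle stays open in self.files until __exit__ closes it, this version does not require the input to be sorted by group_id, unlike the groupby-based answers, and the context manager guarantees the handles are closed even if the loop raises partway through.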

Answer by Mike

Sed one-liner:


sed -e '/^1,/w File1' -e '/^2,/w File2' -e '/^3,/w File3' ... OriginalFile

The only down-side is that you need to put in n -e statements (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.

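If the group ids are known in advance (say they are simply 1..n), the -e expressions don't have to be typed by hand; here is a sketch that generates and runs the command from Python (the numeric ids, the value of n, and the file names are assumptions for illustration):

import subprocess

n = 100  # illustrative; use the real number of groups
args = ["sed"]
for i in range(1, n + 1):
    args += ["-e", "/^%d,/w File%d" % (i, i)]
args.append("OriginalFile")
subprocess.run(args, check=True)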

The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!


Answer by Zsolt Botykai

awk is capable:


awk -F "," '{ print $0 >> ("FILE" $1) }' HUGE.csv
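
One caveat not mentioned in the original answer: with very many distinct group_ids, some awk implementations limit the number of files that can be open at once; GNU awk copes with more, and calling close() on an output file once its group is finished also avoids the limit.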