Linux: how to split a huge csv file based on the content of the first column?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/9489078/

Date: 2020-08-06 04:53:51 | Source: igfitidea

How to split a huge csv file based on content of first column?

Tags: python, linux, unix, ubuntu

Asked by daydreamer

  • I have a 250MB+ huge csv file to upload
  • the file format is group_id, application_id, reading, and the data could look like:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
  • I want to divide the file based on group_id, so the output should be n files, one per distinct group_id

Output


File 1

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

File 2

2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....

and

File n

n, x, 0.3 (let's say)

How can I do this effectively?


Accepted answer by Fred Foo

If the file is already sorted by group_id, you can do something like:


import csv
from itertools import groupby

# groupby yields one (key, rows) pair per consecutive run of rows sharing
# the same first column, so this assumes the file is sorted by group_id.
for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    # one output file per group_id, e.g. "1.txt", "2.txt", ...
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
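
If the file is not already sorted by group_id, one possibility (a sketch that is not part of the original answer; it assumes the whole 250MB file fits in memory) is to sort the rows first and then apply the same groupby:

import csv
from itertools import groupby

# Not from the original answer: sort all rows by the first column in memory,
# then group consecutive runs exactly as above.
with open("foo.csv", newline="") as f:
    rows = sorted(csv.reader(f), key=lambda row: row[0])
for key, group in groupby(rows, lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in group:
            output.write(",".join(row) + "\n")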

Answer by aweis

If they are sorted by the group id, you can use the csv module to iterate over the rows in the file and write them out. You can find information about the module in the Python documentation.


Answer by DNA

How about:

怎么样:

  • Read the input file a line at a time
  • split() each line on "," to get the group_id
  • For each new group_id you find, open an output file
    • add each group_id to a set/dict as you find them so you can keep track
  • Write the line to the appropriate file
  • Done! (A sketch of these steps follows below.)
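
A minimal sketch of these steps (the file names and the dict of open files are illustrative assumptions, not from the original answer):

# Hypothetical file names; the dict doubles as the "set/dict" tracker above.
open_files = {}
with open("input.csv") as infile:
    for line in infile:
        group_id = line.split(",")[0]           # first column is the group_id
        if group_id not in open_files:          # new group_id -> open its file
            open_files[group_id] = open("group_%s.csv" % group_id, "w")
        open_files[group_id].write(line)        # write the line as-is
for f in open_files.values():
    f.close()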

Answer by senderle

If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).


Answer by Odomontois

Here's some food for thought for you:


import csv
from collections import namedtuple

# One record per output file: the file handle plus its csv writer.
CSVFile = namedtuple('CSVFile', ('file', 'writer'))

class CSVFileCollections(object):

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix  # was never assigned in the original
        self.files = {}

    def __getitem__(self, item):
        # Lazily open one output file (and writer) per key.
        if item not in self.files:
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'",
                                quoting=csv.QUOTE_MINIMAL)
            self.files[item] = CSVFile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self  # must return self, or "as output" below binds None

    def __exit__(self, exc_type, exc_value, traceback):
        for csvfile in self.files.values():
            csvfile.file.close()


with open('huge.csv', newline='') as readFile, \
     CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)
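
A design note: because every per-group file handle stays open in self.files until __exit__ closes it, this version does not require the input to be sorted by group_id, unlike the groupby-based answers, and the context manager guarantees the handles are closed even if the loop raises partway through.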

Answer by Mike

Sed one-liner:


sed -e '/^1,/w File1' -e '/^2,/w File2' -e '/^3,/w File3' ... OriginalFile

The only down-side is that you need to put in n -e statements (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.

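If the group ids are known in advance (say they are simply 1..n), the -e expressions don't have to be typed by hand; here is a sketch that generates and runs the command from Python (the numeric ids, the value of n, and the file names are assumptions for illustration):

import subprocess

n = 100  # illustrative; use the real number of groups
args = ["sed"]
for i in range(1, n + 1):
    args += ["-e", "/^%d,/w File%d" % (i, i)]
args.append("OriginalFile")
subprocess.run(args, check=True)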

The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!


Answer by Zsolt Botykai

awk is capable:


awk -F "," '{ print $0 >> ("FILE" $1) }' HUGE.csv
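
One caveat not mentioned in the original answer: with very many distinct group_ids, some awk implementations limit the number of files that can be open at once; GNU awk copes with more, and calling close() on an output file once its group is finished also avoids the limit.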