Split big csv file by the value of a column in python

Warning: the contents below are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/33609322/
Asked by Mohamed Ali JAMAOUI
I have a large CSV file that I cannot handle in memory with Python. After grouping by the value of a specific column, I split it into multiple chunks using the following logic:
import csv
from itertools import groupby

def splitDataFile(self, data_file):
    self.list_of_chunk_names = []
    csv_reader = csv.reader(open(data_file, "rb"), delimiter="|")
    columns = csv_reader.next()
    # groupby starts a new group every time the value in column 1 changes
    for key, rows in groupby(csv_reader, lambda row: (row[1])):
        file_name = "data_chunk" + str(key) + ".csv"
        self.list_of_chunk_names.append(file_name)
        with open(file_name, "w") as output:
            output.write("|".join(columns) + "\n")
            for row in rows:
                output.write("|".join(row) + "\n")
    print "message: list of chunks ", self.list_of_chunk_names
    return
The logic works, but it's slow. How can I optimize this, for instance with pandas?
Edit
Further explanation: I am not looking for a simple split into same-size chunks (e.g. 1000 rows each); I want to split by the value of a column, which is why I am using groupby.
Accepted answer by Mohamed Ali JAMAOUI
I am going with something like the following, where I iterate over the unique values of the column to split by in order to filter the data chunks.
import pandas as pd

def splitWithPandas(data_file, split_by_column):
    # read only the split column, then reduce it to its unique values
    values_to_split_by = pd.read_csv(data_file, delimiter="|", usecols=[split_by_column])
    values_to_split_by = values_to_split_by.drop_duplicates()
    values_to_split_by = pd.unique(values_to_split_by.values.ravel())
    for i in values_to_split_by:
        # re-read the file in chunks, keeping only the rows for this value
        iter_csv = pd.read_csv(data_file, delimiter="|", chunksize=100000)
        df = pd.concat([chunk[chunk[split_by_column] == i] for chunk in iter_csv])
        df.to_csv("data_chunk_" + str(i), sep="|", index=False)
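A call might look like the following; the file name and column name here are made up for illustration and depend on your data:

# hypothetical example: split my_data.csv on the values of its "user_id" column
splitWithPandas("my_data.csv", "user_id")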
Answer by bigOther
Use this Python 3 program:
#!/usr/bin/env python3
import binascii
import csv
import os.path
import sys
from tkinter.filedialog import askopenfilename, askdirectory
from tkinter.simpledialog import askinteger


def split_csv_file(f, dst_dir, keyfunc):
    csv_reader = csv.reader(f)
    csv_writers = {}
    for row in csv_reader:
        k = keyfunc(row)
        if k not in csv_writers:
            csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k),
                                             mode='w', newline=''))
        csv_writers[k].writerow(row)


def get_args_from_cli():
    input_filename = sys.argv[1]
    column = int(sys.argv[2])
    dst_dir = sys.argv[3]
    return (input_filename, column, dst_dir)


def get_args_from_gui():
    input_filename = askopenfilename(
        filetypes=(('CSV', '.csv'),),
        title='Select CSV Input File')
    column = askinteger('Choose Table Column', 'Table column')
    dst_dir = askdirectory(title='Select Destination Directory')
    return (input_filename, column, dst_dir)


if __name__ == '__main__':
    if len(sys.argv) == 1:
        input_filename, column, dst_dir = get_args_from_gui()
    elif len(sys.argv) == 4:
        input_filename, column, dst_dir = get_args_from_cli()
    else:
        raise Exception("Invalid number of arguments")

    with open(input_filename, mode='r', newline='') as f:
        split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
    # if the column has funky values resulting in invalid filenames
    # replace the line from above with:
    # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')
Save it as split-csv.py and run it from Explorer or from the command line.
For example, to split superuser.csv based on column 1 and write the output files under dstdir, use:
python split-csv.py superuser.csv 1 dstdir
If you run it without arguments, a Tkinter-based GUI will prompt you to choose the input file, the column (1-based index) and the destination directory.
Answer by David Z
I suspect that your biggest bottleneck is opening and closing a file handle every time you process a new block of rows. A better approach, as long as the number of files you write to is not too large, is to keep all the files open. Here's an outline:
def splitDataFile(self, data_file):
    open_files = dict()
    input_file = open(data_file, "rb")
    try:
        ...
        csv_reader = csv.reader(input_file, ...)
        ...
        for key, rows in groupby(csv_reader, lambda row: (row[1])):
            ...
            try:
                output = open_files[key]
            except KeyError:
                # remember the handle so the file stays open for later groups
                output = open_files[key] = open(file_name, "w")
            output.write(...)
            ...
    finally:
        for open_file in open_files.itervalues():
            open_file.close()
        input_file.close()
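For concreteness, here is one way the ellipses might be filled in, reusing the question's "|" delimiter, header row, and data_chunk<key>.csv naming (Python 2, like the question's code). Treat it as a sketch of the keep-files-open idea, not necessarily the answerer's exact intent:

import csv
from itertools import groupby

def split_data_file(data_file):
    open_files = {}
    input_file = open(data_file, "rb")
    try:
        csv_reader = csv.reader(input_file, delimiter="|")
        columns = csv_reader.next()  # header row
        for key, rows in groupby(csv_reader, lambda row: row[1]):
            file_name = "data_chunk" + str(key) + ".csv"
            try:
                output = open_files[key]
            except KeyError:
                # first time this key is seen: open its file once and write the header
                output = open_files[key] = open(file_name, "w")
                output.write("|".join(columns) + "\n")
            for row in rows:
                output.write("|".join(row) + "\n")
    finally:
        for open_file in open_files.itervalues():
            open_file.close()
        input_file.close()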
Of course, if you only have one group with any given key, this will not help. (Actually it may make things worse, because you wind up holding a bunch of files open unnecessarily.) The more often you wind up writing to a single file, the more of a benefit you'll get from this change.
You can combine this with pandas, if you want, and use the chunking features of read_csv or read_table to handle the input processing.
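For example, a sketch of that combination might read the file once with chunksize and append each group's rows to its own output file; the function name, delimiter, and file naming below are assumptions, not from the original answer:

import pandas as pd

def split_with_chunks(data_file, split_by_column, chunksize=100000):
    # scan the file once in chunks and append each group's rows to its own
    # output file, writing the header only the first time a key is seen
    seen_keys = set()
    for chunk in pd.read_csv(data_file, delimiter="|", chunksize=chunksize):
        for key, group in chunk.groupby(split_by_column):
            file_name = "data_chunk_" + str(key) + ".csv"
            first_time = key not in seen_keys
            group.to_csv(file_name, sep="|", index=False,
                         mode="w" if first_time else "a",
                         header=first_time)
            seen_keys.add(key)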
Answer by Gabriel
You will probably get the best performance by using the built-in chunking features of pandas (the chunksize keyword argument to read_csv):
http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.read_csv.html
For example,
reader = pd.read_table('my_data.csv', chunksize=4)
for chunk in reader:
    print(chunk)
EDIT:
This might get you somewhere:
import pandas as pd

group_col_indx = 1
# read just the split column to collect the distinct keys
group_col = pd.read_csv('test.csv', usecols=[group_col_indx])
keys = group_col.iloc[:, 0].unique()

for key in keys:
    df_list = []
    # scan the file again in chunks, keeping only the rows for this key
    reader = pd.read_csv('test.csv', chunksize=2)
    for chunk in reader:
        good_rows = chunk[chunk.iloc[:, group_col_indx] == key]
        df_list.append(good_rows)
    # df_key now holds every row belonging to this key
    df_key = pd.concat(df_list)
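The snippet stops once df_key is assembled; presumably each group would then be written out inside the for key loop, for example (the output file name is just illustrative):

    df_key.to_csv('group_' + str(key) + '.csv', index=False)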