Pandas split CSV into multiple CSVs (or DataFrames) by a column
Notice: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/48007017/
Asked by Elias Cort Aguelo
I'm quite lost with a problem, and any help or tips would be appreciated.
The problem: I have a CSV file in which one column can take multiple values, like:
Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
I've loaded the data into a DataFrame, and I need to split that DataFrame into multiple DataFrames based on the value of the column "The_evil_column":
df1
Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
df2
Fruit;Color;The_evil_column
Orange;Green;something2
Apple;Red;something2
df3
Fruit;Color;The_evil_column
Apple;Red;something3
After reading some posts I'm even more confused, and I'd appreciate some tips on this.
Answered by MaxU
You can generate a dictionary of DataFrames:
d = {g:x for g,x in df.groupby('The_evil_column')}
In [95]: d.keys()
Out[95]: dict_keys(['something1', 'something2', 'something3'])
In [96]: d['something1']
Out[96]:
Fruit Color The_evil_column
0 Apple Red something1
1 Apple Green something1
2 Orange Orange something1
or a list of DataFrames:
In [103]: l = [x for _,x in df.groupby('The_evil_column')]
In [104]: l[0]
Out[104]:
Fruit Color The_evil_column
0 Apple Red something1
1 Apple Green something1
2 Orange Orange something1
In [105]: l[1]
Out[105]:
Fruit Color The_evil_column
3 Orange Green something2
4 Apple Red something2
In [106]: l[2]
Out[106]:
Fruit Color The_evil_column
5 Apple Red something3
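For a version that runs without an input file, here is a minimal sketch of the same groupby approach. It rebuilds the question's sample data in memory; the io.StringIO source and the names csv_text, frames_by_value and frames_list are assumptions made for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file (assumption).
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

df = pd.read_csv(io.StringIO(csv_text), sep=";")

# One DataFrame per distinct value, keyed by that value.
frames_by_value = {key: group for key, group in df.groupby("The_evil_column")}

# The same groups as a list; groupby sorts the group keys by default.
frames_list = [group for _, group in df.groupby("The_evil_column")]

for key, group in frames_by_value.items():
    print(key, len(group))
```

Each group keeps the row index of the original frame, which is why the second and third frames above start at index 3 and 5 rather than 0.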
UPDATE:
In [111]: g = pd.read_csv(filename, sep=';').groupby('The_evil_column')
In [112]: g.ngroups # number of unique values in the `The_evil_column` column
Out[112]: 3
In [113]: g.apply(lambda x: x.to_csv(r'c:\temp\{}.csv'.format(x.name)))
Out[113]:
Empty DataFrame
Columns: []
Index: []
This will produce 3 files:
In [115]: glob.glob(r'c:\temp\something*.csv')
Out[115]:
['c:\\temp\\something1.csv',
 'c:\\temp\\something2.csv',
 'c:\\temp\\something3.csv']
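The apply call above is used only for its side effect of writing files. An explicit loop over the groups is an alternative that some readers may find easier to follow; the following is a sketch under stated assumptions, not the answer's own code. It writes to a temporary directory rather than c:\temp, keeps ';' as the delimiter, and omits the row index; the names out_dir and written are hypothetical.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Sample data from the question, held in memory instead of a file (assumption).
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Temporary output directory (assumption; the answer writes to c:\temp).
out_dir = Path(tempfile.mkdtemp())

written = []
for value, group in df.groupby("The_evil_column"):
    path = out_dir / f"{value}.csv"
    # Keep the source delimiter and leave out the row index.
    group.to_csv(path, sep=";", index=False)
    written.append(path.name)

print(sorted(written))
```

If the column's values can contain characters that are not valid in file names, they would need to be sanitized before being used as part of the path.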
Answered by Bartłomiej
You can simply filter the frame by the value of the column:
frame=pd.read_csv('file.csv',delimiter=';')
frame['The_evil_column']=='something1'
This returns:
0 True
1 True
2 True
3 False
4 False
5 False
Name: The_evil_column, dtype: bool
You can therefore use this boolean Series to select the matching rows:
frame1 = frame[frame['The_evil_column']=='something1']
Later you can drop the column:
frame1 = frame1.drop('The_evil_column', axis=1)
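Put together, the filtering steps can be sketched as one runnable snippet. The in-memory sample data and the name mask are assumptions for illustration; the answer itself reads from 'file.csv'.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file (assumption).
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

frame = pd.read_csv(io.StringIO(csv_text), sep=";")

# Boolean mask: True for the rows whose value is 'something1'.
mask = frame["The_evil_column"] == "something1"

# Select the matching rows, then drop the now-constant column.
frame1 = frame[mask].drop(columns="The_evil_column")

print(frame1)
```

This handles one value at a time, so it suits cases where only a few known values are of interest.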
Answered by Rahul Chawla
A simpler but less efficient way is:
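import pandas as pd

data = pd.read_csv('input.csv', sep=';')  # the question's file is ';'-delimited
out = []
# unique() yields each distinct value once, in order of first appearance
for evil_element in data['The_evil_column'].unique():
    out.append(data[data['The_evil_column'] == evil_element])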
The list out will then hold one DataFrame per distinct value of The_evil_column.