pandas: Splitting a dataframe into separate CSV files
Original question: http://stackoverflow.com/questions/46124699/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Splitting a dataframe into separate CSV files
Asked by Steve Dallas
I have a fairly large csv, looking like this:
+---------+---------+
| Column1 | Column2 |
+---------+---------+
| 1 | 93644 |
| 2 | 63246 |
| 3 | 47790 |
| 3 | 39644 |
| 3 | 32585 |
| 1 | 19593 |
| 1 | 12707 |
| 2 | 53480 |
+---------+---------+
My intent is to
- Add a new column
- Insert a specific value into that column, 'NewColumnValue', on each row of the csv
- Sort the file based on the value in Column1
- Split the original CSV into new files based on the contents of 'Column1', removing the header
For example, I want to end up with multiple files that look like:
+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+
+---+-------+-----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+-----------------+
+---+-------+-----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+-----------------+
I have managed to do this using separate .py files:
Step 1
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)
Step 2
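import csv
from itertools import groupby

# newline="" lets the csv module handle line endings itself
with open("ready.csv", newline="") as source:
    # groupby only merges adjacent rows, so ready.csv must already be sorted by column 0
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        with open("%s.csv" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")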
But I'd really like to learn how to accomplish everything in a single .py file. I tried this:
# -*- coding: utf-8 -*-
#This processes a large CSV file.
#It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers
#Then it will split the single large CSV into multiple CSVs based on the value in column 0
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
but instead of working as intended, it's giving me multiple CSVs named after each column header.
Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.
Answered by cs95
Why not just group by Column1 and save each group?
df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)
   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue
for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)
Thanks to Unatiel for the improvement. header=False will not write headers and index=False will not write an index column. (index_label=False only suppresses the label of the index column; the index values themselves are still written.)
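The difference between the two keyword arguments is easy to check without touching the disk, since to_csv() returns the CSV text as a string when no path is given. A minimal sketch, assuming a two-row sample frame (the data and variable names are illustrative only):

```python
import pandas as pd

# Illustrative sample; the index labels (0 and 5) mimic the sorted frame above
df = pd.DataFrame({"Column1": [1, 1], "Column2": [93644, 19593]}, index=[0, 5])

# index_label=False only drops the index *label*; the index values are still written
with_index = df.to_csv(header=False, index_label=False)

# index=False drops the index column altogether
without_index = df.to_csv(header=False, index=False)

print(with_index)
print(without_index)
```

The first string still starts each row with the original row number, which is why index=False is the safer choice when the output files should contain only the data columns.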
This creates 3 files:
1.csv
2.csv
3.csv
Each file holds the rows of one Column1 group.
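Put together, the whole task fits in one script. The following is only a sketch under some assumptions: the helper name split_by_column is made up for illustration, the sample data is built in memory instead of being read from source.csv, and the files go to a temporary directory so nothing in the working directory is overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_by_column(df, column, out_dir):
    """Write one header-less CSV per distinct value of `column`; return the paths."""
    out_dir = Path(out_dir)
    paths = []
    # groupby sorts the group keys by default, so no explicit sort is needed
    for key, group in df.groupby(column):
        path = out_dir / f"{key}.csv"
        group.to_csv(path, header=False, index=False)
        paths.append(path)
    return paths


# Sample data copied from the question (stands in for pd.read_csv('source.csv'))
df = pd.DataFrame({
    "Column1": [1, 2, 3, 3, 3, 1, 1, 2],
    "Column2": [93644, 63246, 47790, 39644, 32585, 19593, 12707, 53480],
})
df = df.assign(NewColumn="NewColumnValue")

out_dir = tempfile.mkdtemp()  # hypothetical output location
written = split_by_column(df, "Column1", out_dir)
first_file_text = written[0].read_text()
print(first_file_text)
```

Within each file the rows keep the order they had in the input frame, so sorting beforehand only matters if a particular row order inside a file is wanted.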
Answered by SnoProblem
You don't need to switch to itertools for the filtering; pandas has all of the necessary functionality built in.
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1') # Sorting isn't needed
df['NewColumn'] = 'NewColumnValue'
for key in df['Column1'].unique():  # For each value in Column1
    # These two steps can be combined into a single call
    # I'll separate for clarity:
    # 1) filter the dataframe on the unique value
    dw = df[df['Column1'] == key]
    # 2) write the resulting dataframe without headers
    #    (index=False also leaves out the row index)
    dw.to_csv("%s.csv" % key, header=False, index=False)
Answered by Unatiel
pandas.DataFrame supports a method to write its data as a CSV: to_csv(). You have no need for the csv module in this case.
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1').set_index('Column1')
df['NewColumn'] = 'NewColumnValue'
for key in df.index.unique():
    # df.loc[[key]] (a list of labels) always returns a DataFrame, even for a single matching row
    df.loc[[key]].to_csv('%d.csv' % int(key), header=False)
for key in df.index.unique(): will loop over every unique value in the index. In your example, it will loop over (1, 2, 3). header=False will make sure the header isn't written to the output file.
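One thing to watch with label-based selection: df.loc[key] returns a Series rather than a DataFrame when exactly one row carries that label, and a Series is written to CSV one field per line. A small sketch of the difference, using made-up sample data in which the value 2 occurs only once:

```python
import pandas as pd

# Illustrative data: label 1 appears twice, label 2 only once
df = pd.DataFrame({"Column1": [1, 1, 2], "Column2": [93644, 19593, 63246]})
df = df.set_index("Column1")

single_label = df.loc[2]    # one matching row -> Series
list_label = df.loc[[2]]    # list of labels -> always a DataFrame

series_csv = single_label.to_csv(header=False)
frame_csv = list_label.to_csv(header=False)

print(series_csv)
print(frame_csv)
```

Passing the label inside a list keeps the row layout the same no matter how many rows match.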
To explain why you get the wrong output in your example, try print(list(df)). This prints the column labels of df, because iterating over a DataFrame yields its column labels rather than its rows. This is why groupby(csv.reader((df)), ...) in your code iterates over the columns of df.
Actually, you should get one CSV for every column in your dataframe, and their contents are likely something like ,[NAME_OF_COLUMN] or maybe ,<itertools.... object at 0x.....>.
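The iteration behaviour described above can be reproduced in a few lines. This sketch uses a small in-memory frame in place of source.csv (the sample values are taken from the question; the variable names are illustrative):

```python
import csv

import pandas as pd

df = pd.DataFrame({"Column1": [1, 2], "Column2": [93644, 63246]})
df["NewColumn"] = "NewColumnValue"

# Iterating a DataFrame yields its column labels, not its rows
column_labels = list(df)

# csv.reader accepts any iterable of strings, so it parses each label as one CSV line
rows_seen_by_reader = list(csv.reader(df))

print(column_labels)
print(rows_seen_by_reader)
```

Because every "row" the reader produces is a single column label, grouping on row[0] creates one group, and therefore one file, per column.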