pandas: splitting a dataframe into separate CSV files

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/46124699/

Date: 2020-09-14 04:25:22  Source: igfitidea

Splitting a dataframe into separate CSV files

Tags: python, pandas, dataframe, group-by, pandas-groupby

Asked by Steve Dallas

I have a fairly large csv, looking like this:

+---------+---------+
| Column1 | Column2 |
+---------+---------+
|       1 |   93644 |
|       2 |   63246 |
|       3 |   47790 |
|       3 |   39644 |
|       3 |   32585 |
|       1 |   19593 |
|       1 |   12707 |
|       2 |   53480 |
+---------+---------+

My intent is to:

  1. Add a new column
  2. Insert a specific value into that column, 'NewColumnValue', on each row of the csv
  3. Sort the file based on the value in Column1
  4. Split the original CSV into new files based on the contents of 'Column1', removing the header

For example, I want to end up with multiple files that look like:

+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+----------------+

I have managed to do this using separate .py files:

Step 1

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)

Step 2

import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("ready.csv")),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

But I'd really like to learn how to accomplish everything in a single .py file. I tried this:

# -*- coding: utf-8 -*-
#This processes a large CSV file.  
#It will add a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers
#Then it will split the single large CSV into multiple CSVs based on the value in column 0 
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

but instead of working as intended, it's giving me multiple CSVs named after each column header.

Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.

Answered by cs95

Why not just groupby Column1 and save each group?

df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)

   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue


for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)

Thanks to Unatiel for the improvement. header=False will not write headers, and index=False will not write the index column. Note that index_label=False would only suppress the label of the index column in the header, not the index column itself.

This creates 3 files:

1.csv
2.csv
3.csv

Each file contains the rows for the corresponding Column1 group.

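As a minimal end-to-end sketch of this groupby approach in a single script: the in-memory sample below stands in for source.csv, and the files go to a temporary directory. Both are assumptions made for illustration and are not part of the original answer. index=False is used because the expected files in the question contain no index column.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for source.csv, using the rows from the question.
source = io.StringIO(
    "Column1,Column2\n"
    "1,93644\n2,63246\n3,47790\n3,39644\n"
    "3,32585\n1,19593\n1,12707\n2,53480\n"
)

df = pd.read_csv(source)
df["NewColumn"] = "NewColumnValue"

# Assumption: write into a scratch directory instead of the working directory.
out_dir = Path(tempfile.mkdtemp())

# groupby sorts the group keys by default and keeps the original row order
# inside each group, so no explicit sort_values call is required here.
for key, group in df.groupby("Column1"):
    # header=False drops the header row; index=False drops the index column.
    group.to_csv(out_dir / f"{key}.csv", header=False, index=False)

written = sorted(p.name for p in out_dir.glob("*.csv"))
print(written)
print((out_dir / "2.csv").read_text())
```

Each output file then holds only the three data columns, one row per line.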

Answered by SnoProblem

You don't need to switch to itertools for the filtering; pandas has all of the necessary functionality built in.

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')  # Sorting isn't needed
df['NewColumn'] = 'NewColumnValue'
for key in df['Column1'].unique():  # For each value in Column1
    # These two steps can be combined into a single call
    # I'll separate for clarity:  
    # 1) filter the dataframe on the unique value
    dw = df[df['Column1']==key]   
    # 2) write the resulting dataframe without headers or the index column
    dw.to_csv("%s.csv" % key, header=False, index=False)
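
The comment about combining the two steps can be sketched as follows. This is only an illustration: the three-row sample and the temporary output directory are assumptions, and index=False is added so that the row index is not written, matching the files the question asks for.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical three-row sample standing in for source.csv.
source = io.StringIO("Column1,Column2\n1,93644\n2,63246\n1,19593\n")
df = pd.read_csv(source)
df["NewColumn"] = "NewColumnValue"

# Assumption: write into a scratch directory.
out_dir = tempfile.mkdtemp()

for key in df["Column1"].unique():
    path = os.path.join(out_dir, "%s.csv" % key)
    # Filter on the key and write the result in a single expression.
    df[df["Column1"] == key].to_csv(path, header=False, index=False)

with open(os.path.join(out_dir, "1.csv")) as f:
    first_file = f.read()
print(first_file)
```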

Answered by Unatiel

pandas.DataFrame has a method for writing its data as CSV: to_csv(). You have no need for the csv module in this case.

import pandas as pd

df = pd.read_csv('source.csv')
df = df.sort_values('Column1').set_index('Column1')
df['NewColumn'] = 'NewColumnValue'
for key in df.index.unique():
    # The list selector [[key]] always returns a DataFrame, even when only one row matches
    df.loc[[key]].to_csv('%d.csv' % int(key), header=False)

for key in df.index.unique(): will loop over every unique value in the index. In your example, it will loop over (1, 2, 3). header=False will make sure the header isn't written to the output file.

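One caveat with selecting by index label, shown here as a small sketch on a made-up three-row frame (an assumption for illustration): when a key matches exactly one row, df.loc[key] returns a Series rather than a DataFrame, and a Series is written to CSV in a different shape. Passing the key inside a list keeps the result a DataFrame.

```python
import pandas as pd

# Hypothetical sample in which the key 2 appears only once.
df = pd.DataFrame(
    {"Column1": [1, 1, 2], "Column2": [93644, 19593, 63246]}
).set_index("Column1")

single = df.loc[2]      # scalar label matching one row: a Series
as_frame = df.loc[[2]]  # list selector: always a DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)

# With Column1 as the index, the index is the first field of each CSV row.
csv_text = as_frame.to_csv(header=False)
print(csv_text)
```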

To see why your example produces the wrong output, try print(list(df)). This prints all the column names in df, because iterating over a DataFrame yields its column labels. That is why the loop over csv.reader((df)) iterates over the columns in df rather than over its rows.

In fact, you should get one CSV for every column in your dataframe, and their contents are likely something like ,[NAME_OF_COLUMN] or maybe ,<itertools.... object at 0x.....>.

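The behaviour described above can be checked with a small sketch. The two-row frame is an assumption for illustration. csv.reader accepts any iterable of strings, so it receives the column labels and turns each one into a single-field row; grouping on row[0] therefore produces one group, and one file, per column name.

```python
import csv

import pandas as pd

# Hypothetical two-row sample with the same columns as in the question.
df = pd.DataFrame({"Column1": [1, 2], "Column2": [93644, 63246]})
df["NewColumn"] = "NewColumnValue"

# Iterating over a DataFrame yields its column labels, not its rows.
columns = list(df)

# csv.reader parses each label as one line of text with a single field.
rows = list(csv.reader(df))

print(columns)
print(rows)
```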