Combine columns from several CSV files into a single file with Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14530748/
Combine columns from several CSV files into a single file
Asked by Stylize
I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns and add them as new columns to an existing CSV file.
Thus far I have:
import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1, 3):  # example has 2 csv files, this will be automated
    n = 0
    while n < 2:
        f = open(str(x) + ".csv")
        new_column = [item[n] for item in csv.reader(f)]
        f.close()
        #print d
        new_data = []
        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
                print i
            except IndexError, e:
                item.append("")
            new_data.append(item)
        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n = n + 1
This works, it is not pretty, but it works. However, I have three minor annoyances:
1. I open each CSV file twice (once for each column), that is hardly elegant.
2. When I print the combined.csv file, it prints an empty row following each row (a likely cause is sketched just after this list).
3. I have to provide a combined.csv file that has at least as many rows in it as the largest file I may have. Since I do not really know what that number may be, that kinda sucks.
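As an aside (not part of the original question), the blank rows in annoyance 2 are the classic Python 2 on Windows symptom: the csv module already writes '\r\n' line endings, and an output file opened in text mode ('w') translates the '\n' again, producing '\r\r\n', which reads back as an empty row after each record. A minimal sketch of the usual fix, assuming Python 2 on Windows (the accepted answer below does the same by opening with 'wb'):

import csv

rows = [['1', 'a'], ['2', 'b']]  # any list of rows

# Open the output in binary mode so the csv module controls line endings
# itself; text mode would re-translate '\n' and add a blank row per record.
f = open('combined.csv', 'wb')
csv.writer(f).writerows(rows)
f.close()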
As always, any help is much appreciated!!
As requested: 1.csv looks like (mock data)
1,a
2,b
3,c
4,d
2.csv looks like
5,e
6,f
7,g
8,h
9,i
the combined.csv file should look like
1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i
Accepted answer by unutbu
import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:2]  # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*2)
        writer.writerow(combined_row)

for f in handles:
    f.close()
The line for rows in IT.izip_longest(*readers, fillvalue=['']*2): can be understood with an example:
In [1]: import itertools as IT
In [2]: readers = [(1,2,3), ('a','b','c','d'), (10,20,30,40)]
In [3]: list(IT.izip_longest(readers[0], readers[1], readers[2]))
Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]
As you can see, IT.izip_longest behaves very much like zip, except that it does not stop until the longest iterable is consumed. It fills in missing items with None by default.
Now what would happen if there were more than 3 items in readers?
We would want to write
list(IT.izip_longest(readers[0], readers[1], readers[2], ...))
but that's laborious, and if we did not know len(readers) in advance, we wouldn't even be able to replace the ellipsis (...) with something explicit.
Python has a solution for this: the star (aka argument unpacking) syntax:
In [4]: list(IT.izip_longest(*readers))
Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]
Notice the result Out[4] is identical to the result Out[3].
The *readers tells Python to unpack the items in readers and send them along as individual arguments to IT.izip_longest.
This is how Python allows us to send an arbitrary number of arguments to a function.
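For reference, a rough Python 3 port of the accepted answer's idea (my sketch, not part of the original answer): itertools.izip_longest was renamed to itertools.zip_longest, and the CSV files are opened in text mode with newline='' instead of the Python 2 binary modes.

import csv
import itertools

filenames = ['1.csv', '2.csv']  # same example inputs as above
handles = [open(name, newline='') for name in filenames]
readers = [csv.reader(f) for f in handles]

with open('combined.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    # zip_longest is the Python 3 name for izip_longest
    for rows in itertools.zip_longest(*readers, fillvalue=['', '']):
        combined_row = []
        for row in rows:
            row = row[:2]  # keep only the first two columns
            combined_row.extend(row if len(row) == 2 else ['', ''])
        writer.writerow(combined_row)

for f in handles:
    f.close()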
Answered by georg
Here's an example (I'm using string io instead of files for simplicity, but that's not essential):
a = u"""
1,a
2,b
3,c
4,d
"""
b = u"""
5,e
6,f
7,g
8,h
9,i
"""
c = u"""
11,x
12,y
13,z
"""
import io, csv, itertools

data = []
expand = lambda it, size: it + [[''] * len(it[0])] * size

for f in [a, b, c]:
    with io.StringIO(f.strip()) as fp:
        d = list(csv.reader(fp))
    t = len(d) - len(data)
    data = d if not data else [
        x + y for x, y in itertools.izip_longest(
            expand(data, t), expand(d, -t))]

for r in data:
    print ','.join(r)
# 1,a,5,e,11,x
# 2,b,6,f,12,y
# 3,c,7,g,13,z
# 4,d,8,h,,
# ,,9,i,,
with real files (named 1.csv, 2.csv etc) the main loop will look like this:
for n in range(...):
    with open(str(n) + '.csv') as fp:
        d = list(csv.reader(fp))
    t = len(d) - len(data)
    data = d if not data else [
        x + y for x, y in itertools.izip_longest(
            expand(data, t), expand(d, -t))]
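The expand helper is the trick that makes this fold work: it pads the shorter of the two tables with blank rows so they line up before the rows are concatenated, and padding by a negative size is a no-op because multiplying a list by a negative number yields an empty list. A small illustration (my example values, not from the original answer):

expand = lambda it, size: it + [[''] * len(it[0])] * size

data = [['1', 'a'], ['2', 'b']]           # rows accumulated so far
d = [['5', 'e'], ['6', 'f'], ['7', 'g']]  # rows from the next file
t = len(d) - len(data)                    # t = 1: data is one row short

print(expand(data, t))  # [['1', 'a'], ['2', 'b'], ['', '']]
print(expand(d, -t))    # unchanged, since [['', '']] * -1 == []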
Answered by steveha
Here is a program I wrote to solve your problem. It makes a class that holds the information about each CSV file to read, including which columns you want from it. Then there is simply a list of CSV files to read, and a line is read from each.
Since you said it needs to keep returning rows until all the input files are read, it returns dummy values for input files that have reached the end. It keeps reading rows until all input files are done.
Also, this program only needs to hold one row at a time in memory. So it could process even large CSV files without needing much memory.
Originally I had a dummy value of -1 for missing data. Now I see you added an example and you just want no value. I've changed the program from using -1 to using an empty string when there is no data.
One of the design goals was to make this extendable. Right now you need the first two columns, but what if you later need columns 0, 3, and 7 from one of the files? So each file has a list with the columns to take.
I didn't actually write the code to rename the output file to the original filename but that is easy to add.
Ideally this whole thing would be wrapped up into a class, where you can iterate a class instance and get back one row put together using columns from all input files. I didn't take the extra time to do that, but if you will be using this over the long term you might want to do that. Also, I never bothered to close any of the input files, since I figure the program will end after we write the output file and everything will close then; but ideally we should close all files after we use them!
import csv

fname_in = "combined.csv"
fname_out = "combined.tmp"
lst_other_fnames = [str(x) + ".csv" for x in range(1, 3)]

no_data = ''

def _no_data_list(columns):
    return [no_data for _ in columns]

class DataCsvFile(object):
    def __init__(self, fname, columns=None):
        self.fname = fname
        self.f = open(fname)
        self.reader = csv.reader(self.f)
        self.columns = columns
        self.done = False

    def next_columns(self):
        if self.done:
            return _no_data_list(self.columns)
        try:
            item = next(self.reader)
        except StopIteration:
            self.done = True
            return _no_data_list(self.columns)
        return [item[i] for i in self.columns]

# want all columns from original file
data_csv_files = [DataCsvFile(fname_in, range(5))]

# build list of filenames and columns: want first two columns from each
data_csv_files.extend(DataCsvFile(fname, range(2)) for fname in lst_other_fnames)

with open(fname_out, "w") as out_f:
    writer = csv.writer(out_f)
    while True:
        values = []
        for df in data_csv_files:
            columns = df.next_columns()
            values.extend(columns)
        if not all(df.done for df in data_csv_files):
            writer.writerow(values)
        else:
            break
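As the answer notes, picking different columns from one of the inputs only requires changing the column list handed to DataCsvFile. A short usage sketch (wide_file.csv is a hypothetical name, not from the original answer):

# Take columns 0, 3 and 7 from a wider file, plus the first two
# columns of 1.csv and 2.csv, all using the class defined above.
data_csv_files = [
    DataCsvFile("wide_file.csv", [0, 3, 7]),
    DataCsvFile("1.csv", range(2)),
    DataCsvFile("2.csv", range(2)),
]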
Answered by DSM
These days it seems almost obligatory for someone to give a pandas-based solution to any data processing problem in Python. So here's mine:
import pandas as pd

to_merge = ['{}.csv'.format(i) for i in range(4)]

dfs = []
for filename in to_merge:
    # read the csv, making sure the first two columns are str
    df = pd.read_csv(filename, header=None, converters={0: str, 1: str})
    # throw away all but the first two columns
    df = df.ix[:, :1]
    # change the column names so they won't collide during concatenation
    df.columns = [filename + str(cname) for cname in df.columns]
    dfs.append(df)

# concatenate them horizontally
merged = pd.concat(dfs, axis=1)
# write it out
merged.to_csv("merged.csv", header=None, index=None)
which for the files
~/coding/pand/merge$ cat 0.csv
0,a,6,5,3,7
~/coding/pand/merge$ cat 1.csv
1,b,7,6,7,0
2,c,0,1,8,7
3,d,6,8,4,5
4,e,8,4,2,4
~/coding/pand/merge$ cat 2.csv
5,f,6,2,9,1
6,g,0,3,2,7
7,h,6,5,1,9
~/coding/pand/merge$ cat 3.csv
8,i,9,1,7,1
9,j,0,9,3,9
gives
In [21]: !cat merged.csv
0,a,1,b,5,f,8,i
,,2,c,6,g,9,j
,,3,d,7,h,,
,,4,e,,,,
In [22]: pd.read_csv("merged.csv", header=None)
Out[22]:
     0    1  2  3    4    5    6    7
0    0    a  1  b    5    f    8    i
1  NaN  NaN  2  c    6    g    9    j
2  NaN  NaN  3  d    7    h  NaN  NaN
3  NaN  NaN  4  e  NaN  NaN  NaN  NaN
which I think is the right alignment.
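A note for readers on current pandas versions (my addition, not part of the original answer): DataFrame.ix has since been removed, so the column selection would now be written with .iloc, and dtype=str is a simpler way to keep the values as strings. A minimal sketch of the changed lines, assuming pandas 1.x or later:

import pandas as pd

df = pd.read_csv("1.csv", header=None, dtype=str)  # keep all values as strings
df = df.iloc[:, :2]  # first two columns, replacing df.ix[:, :1]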

