Merging multiple CSV files without repeating the header (using Python)

Question

Asked by dotpy_novice

I am a beginner with Python. I have multiple CSV files (more than 10), and all of them have the same number of columns. I would like to merge all of them into a single CSV file in which the headers are not repeated.

So essentially I need just the first row to contain the headers, and from then on I need all the rows from all the CSV files merged. How do I do this?

Here's what I tried so far.

import glob
import csv



with open('output.csv','wb') as fout:
    wout = csv.writer(fout,delimiter=',') 
    interesting_files = glob.glob("*.csv") 
    for filename in interesting_files: 
        print 'Processing',filename 
    # Open and process file
        h = True
        with open(filename,'rb') as fin:
                fin.next()#skip header
        for line in csv.reader(fin,delimiter=','):
                wout.writerow(line)

Answer 1

Accepted answer by m.wasowski

While I think that the best answer is the one from @valentin, you can do this without using the csv module at all:

import glob

interesting_files = glob.glob("*.csv") 

header_saved = False
with open('output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename, 'rb') as fin:  # binary mode, to match the 'wb' output (required on Python 3)
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)

Answer 2

Answered by valentin

If you are on a Linux system:

head -n 1 director/one_file.csv > output.csv   ## write the header to the final file
tail -q -n +2 director/*.csv >> output.csv     ## append every file's content from the second line on; -q (GNU tail) suppresses the "==> file <==" banners

Answer 3

Answered by Padraic Cunningham

Your indentation is wrong: you need to put the loop inside the with block. You can also pass the csv.reader object straight to writer.writerows instead of looping over the rows yourself.

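import csv
import glob

with open('output.csv', 'w', newline='') as fout:  # use 'wb' in Python 2
    wout = csv.writer(fout)
    # output.csv already exists at this point, so keep it out of the inputs
    interesting_files = [f for f in glob.glob("*.csv") if f != 'output.csv']
    for filename in interesting_files:
        print('Processing ' + filename)
        with open(filename, newline='') as fin:  # use 'rb' in Python 2
            reader = csv.reader(fin)
            next(reader)  # skip header (note: no header is written to the output here)
            wout.writerows(reader)  # writerows needs parsed rows, not raw lines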
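A point worth checking with writerows: it expects an iterable of rows, where each row is itself a sequence of fields. Raw text lines are strings, so each character of a line would be treated as a separate field. The sketch below illustrates the difference on a small in-memory sample (the sample lines and the variable names are made up for illustration).

```python
import csv
import io

# Raw lines, as iterating over an open text file would yield them
lines = ['a,b\n', '1,2\n']

# Passing raw lines: every character of each line becomes its own field
wrong = io.StringIO()
csv.writer(wrong, lineterminator='\n').writerows(lines)
print(repr(wrong.getvalue()))

# Passing parsed rows from csv.reader: fields are preserved
right = io.StringIO()
csv.writer(right, lineterminator='\n').writerows(csv.reader(lines))
print(repr(right.getvalue()))   # 'a,b\n1,2\n'
```

Wrapping the file object in csv.reader before handing it to writerows avoids the problem.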

Answer 4

Answered by P.R.

If you don't mind the overhead, you could use pandas, which ships with many common Python distributions. If you plan to do more with spreadsheet-like tables, I recommend using pandas rather than trying to write your own libraries.

import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)

full_df.to_csv('output.csv', index=False)  # index=False keeps the row index out of the file

Just a little more on pandas. Because it is made to deal with spreadsheet-like data, it treats the first line as a header by default. When reading a CSV it separates the data from the header, which becomes the column labels of the DataFrame, the standard data type in pandas. If you concatenate several DataFrames whose headers are the same, only the data parts are stacked, so the header is not repeated. If the headers are not the same, pd.concat does not raise an error by default: it aligns on column names and fills the missing cells with NaN. A stray CSV file from another source therefore shows up as extra, mostly empty columns rather than as a failure, so compare the headers yourself if you want a hard error.
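If a hard failure on mismatched headers is what you want, an explicit check before merging is one option. The following is a minimal sketch using only the csv module; the function names read_header and check_same_headers are hypothetical, not part of any library.

```python
import csv


def read_header(path):
    """Return the first CSV record of the file as a list of column names."""
    with open(path, newline='') as fin:
        return next(csv.reader(fin))


def check_same_headers(paths):
    """Raise ValueError if any file's header differs from the first file's."""
    paths = list(paths)
    if not paths:
        raise ValueError('no input files given')
    expected = read_header(paths[0])
    for path in paths[1:]:
        found = read_header(path)
        if found != expected:
            raise ValueError(
                '{}: header {} does not match {}'.format(path, found, expected))
    return expected
```

Calling check_same_headers(sorted(glob.glob("*.csv"))) before the merge would stop on the first file whose header differs, assuming output.csv itself is not among the matches.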

Another thing: I added sorted() around interesting_files. I assume your files are named in order and that this order should be kept. glob.glob returns its matches in arbitrary order, as do the os directory-listing functions, so sort explicitly if the order matters.
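One caveat with plain sorted(): it compares names as text, so numbered files without zero padding end up out of numeric order. A natural-sort key is one way around that. The sketch below uses a hypothetical helper natural_key and made-up file names standing in for the result of glob.glob("*.csv").

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'part2' sorts before 'part10'."""
    # re.split with a capturing group puts the digit runs at the odd positions
    return [int(chunk) if i % 2 else chunk.lower()
            for i, chunk in enumerate(re.split(r'(\d+)', name))]


# Hypothetical file names
names = ['part10.csv', 'part2.csv', 'part1.csv']

print(sorted(names))                   # ['part1.csv', 'part10.csv', 'part2.csv']
print(sorted(names, key=natural_key))  # ['part1.csv', 'part2.csv', 'part10.csv']
```

If the files are already zero padded (part01.csv, part02.csv, ...), plain sorted() is enough.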

Answer 5

Answered by Jean-François Fabre

Your attempt is almost working, but the issues are:

you're opening the file for reading but closing it before writing the rows.
you never write the header. You have to write it exactly once.
you also have to exclude output.csv from the glob, otherwise the output file is also treated as an input!

Here's the corrected code, passing the csv.reader object directly to the writer's writerows method for shorter and faster code. It also writes the header from the first file to the output file.

import glob
import csv

output_file = 'output.csv'
header_written = False

with open(output_file,'w',newline="") as fout:  # just "wb" in python 2
    wout = csv.writer(fout,delimiter=',')
    # filter out output
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename, newline="") as fin:
            cr = csv.reader(fin,delimiter=",")
            header = next(cr)  # read this file's header row
            if not header_written:
                wout.writerow(header)
                header_written = True
            wout.writerows(cr)

Note that solutions using raw line-by-line processing miss an important point: if the header spans multiple lines (a quoted field containing a line break), they fail miserably, botching the header line or repeating part of it several times, effectively corrupting the file.

The csv module (and pandas, too) handles those cases gracefully.
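To make the multi-line header problem concrete, here is a small sketch comparing the two approaches on an in-memory sample. The sample data is invented for illustration: its second header field contains a line break inside quotes, so the header occupies two physical lines.

```python
import csv
import io

# Hypothetical sample: the header is one logical record spread over two lines
sample = 'id,"long\nname"\n1,alice\n2,bob\n'

# Line-based approach: next() returns only the first physical line
raw = io.StringIO(sample, newline='')
first_line = next(raw)
print(repr(first_line))   # 'id,"long\n'

# csv.reader approach: next() returns the whole logical header record
reader = csv.reader(io.StringIO(sample, newline=''))
header = next(reader)
rows = list(reader)
print(header)             # ['id', 'long\nname']
print(rows)               # [['1', 'alice'], ['2', 'bob']]
```

With the line-based approach, the second half of the header would be copied into the output as if it were data, once per input file.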
