Merging multiple CSV files without headers being repeated (using Python)
Original source: http://stackoverflow.com/questions/30335474/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by dotpy_novice
I am a beginner with Python. I have multiple CSV files (more than 10), and all of them have the same number of columns. I would like to merge them into a single CSV file in which the headers are not repeated.
So essentially I need just the first row with all the headers, and after that all the rows from all the CSV files merged. How do I do this?
Here's what I tried so far.
import glob
import csv
with open('output.csv','wb') as fout:
    wout = csv.writer(fout,delimiter=',')
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print 'Processing',filename
        # Open and process file
        h = True
        with open(filename,'rb') as fin:
            fin.next()#skip header
        for line in csv.reader(fin,delimiter=','):
            wout.writerow(line)
Accepted answer by m.wasowski
While I think the best answer is the one from @valentin, you can do this without using the csv module at all:
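import glob

output_file = 'output.csv'
# Skip the output file in case it already exists from an earlier run
interesting_files = [f for f in glob.glob("*.csv") if f != output_file]
header_saved = False
# newline='' copies each line with its original line ending untouched
with open(output_file, 'w', newline='') as fout:
    for filename in interesting_files:
        with open(filename, newline='') as fin:
            header = next(fin)  # the first line of every file is its header
            if not header_saved:
                fout.write(header)  # keep only the first file's header
                header_saved = True
            for line in fin:
                fout.write(line)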
Answer by valentin
If you are on a Linux system:
head -n 1 director/one_file.csv > output.csv ## write the header to the final file
tail -q -n +2 director/*.csv >> output.csv ## append every CSV from its second line on; -q stops tail from printing a file-name banner per file
Answer by Padraic Cunningham
Your indentation is wrong: you need to put the loop inside the with block. You can also pass a csv.reader over the file object to writer.writerows.
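import csv
import glob

with open('output.csv', 'w', newline='') as fout:
    wout = csv.writer(fout)
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print('Processing', filename)
        with open(filename, newline='') as fin:
            next(fin)  # skip header (note: no header row is written to the output)
            # writerows needs parsed rows, so wrap the file in a csv.reader
            wout.writerows(csv.reader(fin))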
Answer by P.R.
If you don't mind the overhead, you could use pandas, which ships with common scientific Python distributions. If you plan to do more with spreadsheet-style tables, I recommend using pandas rather than writing your own libraries.
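import glob
import pandas as pd

output_file = 'output.csv'
# Skip the output file in case it is left over from an earlier run
interesting_files = [f for f in glob.glob("*.csv") if f != output_file]
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
# ignore_index renumbers the rows; index=False keeps that index out of the CSV
full_df = pd.concat(df_list, ignore_index=True)
full_df.to_csv(output_file, index=False)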
A little more on pandas. Because it is built for spreadsheet-like data, it treats the first line as a header by default. When reading a CSV it separates the data from the header, which is kept as the column labels of the DataFrame, the standard data type in pandas. If you concat several of these DataFrames, the rows are stacked and columns are matched by header name, so the header is not repeated. Note that if the headers differ, concat does not raise an error by default: it keeps the union of the columns and fills the gaps with NaN. If your directory may be polluted with CSV files from another source, compare the column names yourself before concatenating.
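To check explicitly that every file shares the same header before merging, the header rows can be compared up front. The following is a minimal sketch using only the standard library; the helper names read_header and find_mismatched and the sample file names are made up for illustration and are not part of any answer here.

```python
import csv
import os
import tempfile


def read_header(path):
    """Return the first CSV record of a file as a list of column names."""
    with open(path, newline='') as fin:
        return next(csv.reader(fin))


def find_mismatched(paths):
    """Return the paths whose header differs from the first file's header."""
    if not paths:
        return []
    expected = read_header(paths[0])
    return [p for p in paths[1:] if read_header(p) != expected]


# Small demonstration with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'a.csv': 'id,name\n1,x\n',
        'b.csv': 'id,name\n2,y\n',
        'c.csv': 'id,price\n3,9.5\n',  # stray file with a different header
    }
    paths = []
    for name, text in sorted(samples.items()):
        path = os.path.join(tmp, name)
        with open(path, 'w', newline='') as fout:
            fout.write(text)
        paths.append(path)
    mismatched = [os.path.basename(p) for p in find_mismatched(paths)]
    print(mismatched)  # ['c.csv']
```

Files reported by find_mismatched can then be skipped or inspected before anything is written to the merged output.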
Another thing: I just added sorted() around interesting_files. I assume your files are named in order and that this order should be kept. glob.glob does not guarantee any particular order, and the os functions do not necessarily return files sorted by name either.
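One caveat with sorted(): it compares names as plain strings, so data10.csv sorts before data2.csv. If the files are numbered without zero padding, a natural sort key may be closer to the intended order. This is a small sketch; natural_key and the sample names are hypothetical, not something from the original answer.

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so numbers compare by value."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', filename)]


names = ['data10.csv', 'data2.csv', 'data1.csv']
print(sorted(names))                   # ['data1.csv', 'data10.csv', 'data2.csv']
print(sorted(names, key=natural_key))  # ['data1.csv', 'data2.csv', 'data10.csv']
```

In the merging loop this would amount to iterating over sorted(interesting_files, key=natural_key).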
Answer by Jean-François Fabre
Your attempt is almost working, but the issues are:
- you're opening the file for reading but closing it before writing the rows.
- you're never writing the title. You have to write it once.
- you also have to exclude output.csv from the "glob", otherwise the output is also part of the input!
Here's the corrected code, passing the csv reader object directly to the writer's writerows method for shorter and faster code. It also writes the title from the first file to the output file.
import glob
import csv

output_file = 'output.csv'
header_written = False
with open(output_file, 'w', newline="") as fout:  # just "wb" in python 2
    wout = csv.writer(fout, delimiter=',')
    # filter out output
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename, newline="") as fin:
            cr = csv.reader(fin, delimiter=",")
            header = next(cr)  # read the header, which may span several lines
            if not header_written:
                wout.writerow(header)
                header_written = True
            wout.writerows(cr)
Note that solutions using raw line-by-line processing miss an important point: if the header spans several lines (a quoted field containing a line break), they fail miserably, botching the title line or repeating part of it several times, effectively corrupting the file.
The csv module (and pandas, too) handles those cases gracefully.
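To make the multi-line header case concrete, here is a small sketch comparing raw line iteration with csv.reader on the same input. The sample data is invented for the demonstration, and io.StringIO stands in for a real file.

```python
import csv
import io

# One header field is quoted and contains a line break
sample = 'id,"unit\nprice"\n1,9.5\n'

# Raw line iteration sees only the first physical line of the header
raw_first_line = next(io.StringIO(sample, newline=''))
print(repr(raw_first_line))  # 'id,"unit\n'

# csv.reader returns the whole header as one record
reader = csv.reader(io.StringIO(sample, newline=''))
header = next(reader)
rows = list(reader)
print(header)  # ['id', 'unit\nprice']
print(rows)    # [['1', '9.5']]
```

A script that skips the header with next() on the raw file would leave the second half of the header (price") in the data, which is the corruption described above.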