pandas 将多个 csv 文件连接成具有相同标头的单个 csv - Python
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/44791212/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Concatenating multiple csv files into a single csv with the same header - Python
提问by mattblack
I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
我目前正在使用以下代码导入 6,000 个 csv 文件(带标题)并将它们导出到单个 csv 文件(带单个标题行)。
#import csv files from folder
import glob
import pandas as pd

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
这段代码工作正常,但速度很慢。最多可能需要 2 天的时间来处理。
I was given a one-line shell script for the Terminal command line that does the same (but with no headers). This script takes 20 seconds.
我得到了一个终端命令行的单行脚本,它执行相同的操作(但没有标题)。此脚本需要 20 秒。
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
有谁知道我如何加速第一个 Python 脚本?为了缩短时间,我想过不将它导入到 DataFrame 中,而只是连接 CSV,但我无法弄清楚。
Thanks.
谢谢。
回答by ShadowRanger
If you don't need the CSV in memory, and are just copying from input to output, it'll be a lot cheaper to avoid parsing at all and to copy without building anything up in memory:
如果您不需要内存中的 CSV,只需从输入复制到输出,那么完全避免解析和复制而不在内存中构建会便宜很多:
import shutil
import glob

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
That's it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python-level work of parsing and reserializing.
就是这样;shutil.copyfileobj 会高效地复制数据,大大减少 Python 层面解析和重新序列化的工作。
This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
这假设所有 CSV 文件具有相同的格式、编码、行尾等,并且标头不包含嵌入的换行符,但如果是这种情况,它比替代方案要快得多。
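One failure mode of this blind byte copy is an input file whose last line lacks a trailing newline: the next file's first data row gets glued onto it. A minimal sketch guarding against that (the helper name merge_csvs and the glob pattern are illustrative, not part of the original answer):

```python
import glob
import os
import shutil

def merge_csvs(pattern, out_path):
    """Concatenate CSVs matching pattern, keeping only the first header."""
    files = sorted(glob.glob(pattern))
    with open(out_path, 'wb') as outfile:
        for i, fname in enumerate(files):
            with open(fname, 'rb') as infile:
                if i != 0:
                    infile.readline()  # skip this file's header row
                shutil.copyfileobj(infile, outfile)
                # If the file didn't end with a newline, add one so the next
                # file's first row isn't glued onto this file's last row.
                if infile.seek(0, os.SEEK_END) > 0:
                    infile.seek(-1, os.SEEK_END)
                    if infile.read(1) != b'\n':
                        outfile.write(b'\n')
```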
回答by Peter Leimbigler
Are you required to do this in Python? If you are open to doing it entirely in the shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
您是否必须在 Python 中执行此操作?如果您愿意完全在 shell 中完成,只需在运行单行脚本之前,先用 cat 将随机选择的一个输入 .csv 文件的标题行写入 merged.csv:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
回答by Alexander
You don't need pandas for this; the simple csv module would work fine.
你不需要 Pandas,简单的 csv 模块就可以正常工作。
import csv
import glob

allFiles = glob.glob(r'data/US/market/merged_data' + "/*.csv")
df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
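If the input files might list the same columns in different orders, a hedged variant using csv.DictReader/csv.DictWriter matches fields by header name instead of by position. The helper name merge_csvs_by_name is hypothetical; the column set is taken from the first file, and DictWriter raises ValueError if a later file introduces extra columns:

```python
import csv
import glob

def merge_csvs_by_name(pattern, out_path):
    """Merge CSVs whose columns may appear in different orders,
    matching fields by header name (columns taken from the first file)."""
    files = sorted(glob.glob(pattern))
    with open(out_path, 'w', newline='') as fout:
        writer = None
        for fname in files:
            with open(fname, newline='') as fin:
                reader = csv.DictReader(fin)
                if writer is None:
                    # Fix the output columns from the first file's header.
                    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                    writer.writeheader()
                writer.writerows(reader)  # rows are dicts keyed by header name
```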
回答by markroxor
Here's a simpler approach: you can still use pandas (though I am not sure how it will help with RAM usage):
这是一种更简单的方法:您仍然可以使用 Pandas(尽管我不确定它对 RAM 使用有何帮助):
import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)
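Note that calling pd.concat inside the loop re-copies the accumulated frame on every iteration, which is part of why the original script is so slow. A sketch of the usual faster pattern, collecting the frames in a list and concatenating once at the end (merge_with_pandas is an illustrative name, not from the answer):

```python
import glob
import pandas as pd

def merge_with_pandas(pattern, out_path):
    """Read every matching CSV, concatenate once, and write a single CSV."""
    frames = [pd.read_csv(f) for f in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)  # one header row, no index column
```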