Concatenating multiple csv files into a single csv with the same header - Python

Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/44791212/


Concatenating multiple csv files into a single csv with the same header - Python

Tags: python, csv, pandas, terminal, concatenation

Asked by mattblack

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).


import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.


I was given a one-line shell script for the Terminal that does the same thing (but outputs no header row at all). This script takes 20 seconds.


 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.


Thanks.


Answer by ShadowRanger

If you don't need the CSV data in memory and are just copying from input to output, it's much cheaper to avoid parsing at all and copy the files without building anything up in memory:


import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj handles copying the data efficiently, dramatically reducing the Python-level work needed to parse and reserialize.


This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.

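If you want a cheap safeguard for that assumption, one option (my addition, not part of the original answer) is to compare each file's header line against the first file's before copying; a minimal sketch:

import shutil
import glob

path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

with open('someoutputfile.csv', 'wb') as outfile:
    expected_header = None
    for fname in allFiles:
        with open(fname, 'rb') as infile:
            header = infile.readline()  # consume this file's header line
            if expected_header is None:
                expected_header = header
                outfile.write(header)  # keep the header from the first file only
            elif header != expected_header:
                raise ValueError(fname + " has a mismatched header")
            # Block copy the remaining rows without parsing
            shutil.copyfileobj(infile, outfile)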

Answer by Peter Leimbigler

Are you required to do this in Python? If you are open to doing it entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:


cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Answer by Alexander

You don't need pandas for this; the simple csv module from the standard library works fine.


import csv
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # Read (and consume) this file's header row
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
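If the files share the same columns but possibly in a different order, a hedged variant (my addition, not from the answer) can realign rows by column name using csv.DictReader and csv.DictWriter:

import csv
import glob

path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

with open('df_out.csv', 'w', newline='') as fout:
    writer = None
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.DictReader(fin)
            if writer is None:
                # Use the first file's column order for the merged output
                writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                writer.writeheader()
            for row in reader:
                # Fields are matched by name; a file with extra, unknown
                # columns would raise a ValueError here.
                writer.writerow(row)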

Answer by markroxor

Here's a simpler approach using pandas (though I am not sure how it will help with RAM usage):


import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()

for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)

# Write out the combined result
stockstats_data.to_csv('merged.csv', index=False)
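Note that concatenating inside the loop re-copies the accumulated frame on every iteration, which is the main reason the original script was so slow. If everything fits in memory, a sketch of the faster pattern (my consolidation, not from the answer) is to read all files first and concatenate exactly once:

import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

# Read every file once, then concatenate a single time at the end
frames = [pd.read_csv(f) for f in allFiles]
stockstats_data = pd.concat(frames, ignore_index=True)
stockstats_data.to_csv('merged.csv', index=False)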