Python: Import multiple csv files into pandas and concatenate into one DataFrame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link the original address: http://stackoverflow.com/questions/20906474/
Asked by jonas
I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
import glob
import pandas as pd
# get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop???
Accepted answer by Gaurav Singh
If you have the same columns in all your csv files, then you can try the code below.
I have added header=0 so that after reading the csv, the first row can be assigned as the column names.
import pandas as pd
import glob
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Answer by SKG
Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However, of late I am finding it faster to do any manipulation using numpy and then assign it once to a dataframe, rather than manipulating the dataframe itself iteratively, and it seems to work in this solution too.
I do sincerely want anyone hitting this page to consider this approach, but don't want to attach this huge piece of code as a comment and make it less readable.
You can leverage numpy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np
path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))
np_array_list = []
for file_ in allFiles:
df = pd.read_csv(file_,index_col=None, header=0)
np_array_list.append(df.as_matrix())
comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1","col2"....]
Timing stats:
total files: 192
avg lines per file: 8492
--- approach 1 without numpy --- 8.248656988143921 seconds ---
total records old: 1630571
--- approach 2 with numpy --- 2.289292573928833 seconds ---
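The answer does not include the timing code itself; the following is a minimal sketch of how such a comparison might be reproduced with time.perf_counter (the directory path is a placeholder and the column handling is an assumption):

import glob
import time
import numpy as np
import pandas as pd

files = glob.glob("my_dir_full_path/*.csv")  # placeholder path

start = time.perf_counter()
frame_concat = pd.concat((pd.read_csv(f, index_col=None, header=0) for f in files),
                         ignore_index=True)
print(f"approach 1 (pd.concat): {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
arrays = [pd.read_csv(f, index_col=None, header=0).to_numpy() for f in files]
frame_numpy = pd.DataFrame(np.vstack(arrays), columns=frame_concat.columns)
print(f"approach 2 (np.vstack): {time.perf_counter() - start:.3f} s")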
Answer by Sid
An alternative to darindaCoder's answer:
import os
import glob
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
Answer by Nim J
If the multiple csv files are zipped, you may use zipfile to read them all and concatenate as below:
import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train = []
for f in range(0, len(ziptrain.namelist())):
    if f == 0:
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = pd.DataFrame(np.concatenate((train, my_df), axis=0),
                             columns=list(my_df.columns.values))
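As a side note, the same result can usually be obtained more compactly by handing a generator over the archive's members straight to pd.concat. This is a sketch, not part of the original answer, and it assumes every member of the zip is a csv file with identical columns:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')  # placeholder path
train = pd.concat((pd.read_csv(ziptrain.open(name)) for name in ziptrain.namelist()),
                  ignore_index=True)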
Answer by Jose Antonio Martin H
import glob, os
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Answer by toto_tico
If you want to search recursively (Python 3.5 or above), you can do the following:
from glob import iglob
import pandas as pd
path = r'C:\user\your\path\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
Note that the last three lines can be expressed in one single line:
df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
You can find the documentation for ** in the Python glob docs. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
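A tiny sketch of that difference (the pattern here is just a placeholder): glob.glob builds the whole list of matches up front, while glob.iglob yields one path at a time.

from glob import glob, iglob

paths_list = glob('*.csv')       # a list, built eagerly
paths_iter = iglob('*.csv')      # a generator, consumed lazily
first = next(paths_iter, None)   # pulls only the first match (None if there are no matches)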
EDIT: Multiplatform recursive function:
You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:
df = read_df_rec(r'C:\user\your\path', r'*.csv')
Here is the function:
from glob import iglob
from os.path import join
import pandas as pd
def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
Answer by Jouni K. Seppänen
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: http://dask.pydata.org/en/latest/examples/dataframe-csv.html)
The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
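A minimal sketch combining the two steps, assuming the csv files share a schema and the combined data fits in memory (the 'data*.csv' pattern is a placeholder):

import dask.dataframe as dd

ddf = dd.read_csv('data*.csv')  # lazy: nothing is read yet
pdf = ddf.compute()             # materializes everything as a single pandas DataFrame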
Answer by robmsmt
Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.
For a few files - a one-liner:
df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))
For many files:
import pandas as pd
from os import listdir

# join the directory back on so read_csv gets full paths, not bare file names
filepaths = ["./data/" + f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
This pandas line which sets the df utilizes 3 things:
- Python's map(function, iterable) sends to the function (the pd.read_csv()) the iterable (our list), which is every csv element in filepaths.
- Pandas' read_csv() function reads in each CSV file as normal.
- Pandas' concat() brings all these under one df variable.
Answer by muon
A one-liner using map, but if you'd like to specify additional args, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                   glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional args.
Answer by mjspier
Another one-liner with a list comprehension, which allows you to use arguments with read_csv.
另一个具有列表理解的在线,它允许使用 read_csv 的参数。
df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

