Python: Import multiple csv files into pandas and concatenate into one DataFrame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link the original address: http://stackoverflow.com/questions/20906474/
Asked by jonas
I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
import glob
import pandas as pd
# get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop???
Accepted answer by Gaurav Singh
If you have the same columns in all your csv files, then you can try the code below.
I have added header=0 so that after reading the csv, the first row can be assigned as the column names.
import pandas as pd
import glob
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Answer by SKG
Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However, of late I am finding it faster to do any manipulation using numpy and then assign it once to a dataframe, rather than manipulating the dataframe itself iteratively, and it seems to work in this solution too.
I do sincerely want anyone hitting this page to consider this approach, but don't want to attach this huge piece of code as a comment and make it less readable.
You can leverage numpy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np
path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))
np_array_list = []
for file_ in allFiles:
df = pd.read_csv(file_,index_col=None, header=0)
np_array_list.append(df.as_matrix())
comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1","col2"....]
Timing stats:
total files: 192
avg lines per file: 8492
--- approach 1 without numpy --- 8.248656988143921 seconds ---
total records old: 1630571
--- approach 2 with numpy --- 2.289292573928833 seconds ---
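The answer does not include the timing code itself; the following is a minimal sketch of how such a comparison might be reproduced with time.perf_counter (the directory path is a placeholder and the column handling is an assumption):

import glob
import time
import numpy as np
import pandas as pd

files = glob.glob("my_dir_full_path/*.csv")  # placeholder path

start = time.perf_counter()
frame_concat = pd.concat((pd.read_csv(f, index_col=None, header=0) for f in files),
                         ignore_index=True)
print(f"approach 1 (pd.concat): {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
arrays = [pd.read_csv(f, index_col=None, header=0).to_numpy() for f in files]
frame_numpy = pd.DataFrame(np.vstack(arrays), columns=frame_concat.columns)
print(f"approach 2 (np.vstack): {time.perf_counter() - start:.3f} s")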
Answer by Sid
An alternative to darindaCoder's answer:
import os
import glob
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
Answer by Nim J
If the multiple csv files are zipped, you may use zipfile to read them all and concatenate as below:
import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train = []
for f in range(0, len(ziptrain.namelist())):
    if f == 0:
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = pd.DataFrame(np.concatenate((train, my_df), axis=0),
                             columns=list(my_df.columns.values))
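As a side note, the same result can usually be obtained more compactly by handing a generator over the archive's members straight to pd.concat. This is a sketch, not part of the original answer, and it assumes every member of the zip is a csv file with identical columns:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')  # placeholder path
train = pd.concat((pd.read_csv(ziptrain.open(name)) for name in ziptrain.namelist()),
                  ignore_index=True)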
Answer by Jose Antonio Martin H
import glob, os
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Answer by toto_tico
If you want to search recursively (Python 3.5 or above), you can do the following:
from glob import iglob
import pandas as pd
path = r'C:\user\your\path\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
Note that the last three lines can be expressed in one single line:
df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
You can find the documentation for ** in the Python glob docs. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
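A tiny sketch of that difference (the pattern here is just a placeholder): glob.glob builds the whole list of matches up front, while glob.iglob yields one path at a time.

from glob import glob, iglob

paths_list = glob('*.csv')       # a list, built eagerly
paths_iter = iglob('*.csv')      # a generator, consumed lazily
first = next(paths_iter, None)   # pulls only the first match (None if there are no matches)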
EDIT: Multiplatform recursive function:
You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:
df = read_df_rec(r'C:\user\your\path', r'*.csv')
Here is the function:
from glob import iglob
from os.path import join
import pandas as pd
def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
Answer by Jouni K. Seppänen
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: http://dask.pydata.org/en/latest/examples/dataframe-csv.html)
The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
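A minimal sketch combining the two steps, assuming the csv files share a schema and the combined data fits in memory (the 'data*.csv' pattern is a placeholder):

import dask.dataframe as dd

ddf = dd.read_csv('data*.csv')  # lazy: nothing is read yet
pdf = ddf.compute()             # materializes everything as a single pandas DataFrame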
Answer by robmsmt
Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.
For a few files - a one-liner:
df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))
For many files:
import pandas as pd
from os import listdir

# join the directory back on so read_csv gets full paths, not bare file names
filepaths = ["./data/" + f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
This pandas line which sets the df utilizes 3 things:
- Python's map(function, iterable) sends to the function (the pd.read_csv()) the iterable (our list), which is every csv element in filepaths.
- Pandas' read_csv() function reads in each CSV file as normal.
- Pandas' concat() brings all these under one df variable.
Answer by muon
A one-liner using map, but if you'd like to specify additional args, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                   glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional args.
Answer by mjspier
Another one-liner with a list comprehension, which allows you to use arguments with read_csv.
另一个具有列表理解的在线,它允许使用 read_csv 的参数。
df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

