Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15843123/

Time: 2020-08-18 21:09:06  Source: igfitidea

Reading Multiple CSV Files into Python Pandas Dataframe

python pandas

Asked by user892627

The general use case behind the question is to read multiple CSV log files from a target directory into a single Python Pandas DataFrame for quick turnaround statistical analysis & charting. The idea for utilizing Pandas vs MySQL is to conduct this data import or append + stat analysis periodically throughout the day.

The script below attempts to read all of the CSV (same file layout) files into a single Pandas dataframe & adds a year column associated with each file read.

The problem with the script is that it currently reads only the very last file in the directory, whereas the desired outcome is to read all files within the targeted directory.

# Assemble all of the data files into a single DataFrame & add a year field
# 2010 is the last available year
years = range(1880, 2011)

for year in years:
    path = r'C:\Documents and Settings\Foo\My Documents\pydata-book\pydata-book-master`\ch02\names\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenates everything into a single Dataframe
names = pd.concat(pieces, ignore_index=True)

# Expected row total should be 1690784
names
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33838 entries, 0 to 33837
Data columns:
name      33838  non-null values
sex       33838  non-null values
births    33838  non-null values
year      33838  non-null values
dtypes: int64(2), object(2)

# Start aggregating the data at the year & gender level using groupby or pivot
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
# Prints pivot table
total_births.tail()

Out[35]:
sex     F   M
year        
2010    1759010     1898382

Answered by Greg Reda

The append method on an instance of a DataFrame does not function the same as the append method on an instance of a list. DataFrame.append() does not operate in place; instead, it returns a new object.
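
A minimal sketch of that difference, with toy data and variable names invented purely for illustration. It uses pd.concat as a stand-in for DataFrame.append (which was removed in pandas 2.0); both return a new object rather than modifying the original.

```python
import pandas as pd

# list.append mutates the list in place and returns None
pieces = []
result = pieces.append(pd.DataFrame({"births": [1]}))

# pd.concat leaves its inputs untouched and returns a new DataFrame,
# so the result has to be assigned to a name to be kept
names = pd.DataFrame({"births": [1, 2]})
extra = pd.DataFrame({"births": [3]})
combined = pd.concat([names, extra], ignore_index=True)

print(len(pieces), len(names), len(combined))
```

Calling the function without assigning its result would leave names unchanged, which is why the loop has to rebind the name on every pass.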

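years = range(1880, 2011)

names = pd.DataFrame()
for year in years:
    # raw string, so backslashes such as the \n in \names are not treated as escapes
    path = r'C:\Documents and Settings\Foo\My Documents\pydata-book\pydata-book-master`\ch02\names\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    # reassign, because append returns a new DataFrame
    # (DataFrame.append was removed in pandas 2.0; use pd.concat there)
    names = names.append(frame, ignore_index=True)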

Alternatively, you can use concat, which takes a list of DataFrames:

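years = range(1880, 2011)

names = pd.DataFrame()
for year in years:
    # raw string, so backslashes such as the \n in \names are not treated as escapes
    path = r'C:\Documents and Settings\Foo\My Documents\pydata-book\pydata-book-master`\ch02\names\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    # pd.concat expects a list (or other iterable) of DataFrames
    names = pd.concat([names, frame], ignore_index=True)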
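
The question's own idea of collecting each frame in a list and concatenating once at the end also works, provided the list exists before the loop and the whole loop body runs on every iteration. Below is a rough sketch under stated assumptions: the helper name read_years, the temporary directory, and the tiny sample files are all hypothetical stand-ins for the real data. Wrapping the loop in a function also sidesteps any issue with blank lines when pasting into an interactive prompt.

```python
import os
import tempfile

import pandas as pd

columns = ["name", "sex", "births"]


def read_years(directory, years):
    """Read one yob<year>.txt file per year and return a single DataFrame."""
    pieces = []  # must exist before the loop starts
    for year in years:
        path = os.path.join(directory, "yob%d.txt" % year)
        frame = pd.read_csv(path, names=columns)
        frame["year"] = year
        pieces.append(frame)  # list.append works in place
    # one concat at the end avoids copying the growing frame on every pass
    return pd.concat(pieces, ignore_index=True)


# Tiny sample files standing in for the real data set
samples = [(1880, "Mary,F,7065\nJohn,M,9655\n"), (1881, "Mary,F,6919\n")]
with tempfile.TemporaryDirectory() as tmp:
    for year, text in samples:
        with open(os.path.join(tmp, "yob%d.txt" % year), "w") as handle:
            handle.write(text)
    names = read_years(tmp, range(1880, 1882))

print(names)
```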

Answered by cromastro

I could not get either of the above answers to work. The first answer was close, but the blank line between the second and third lines after the for statement wasn't right. I used the code snippet below in Canopy. Also, for those who are interested, this problem came from an example in "Python for Data Analysis" (an enjoyable book so far otherwise).

import pandas as pd

years = range(1880,2011)
columns = ['name','sex','births']
names = pd.DataFrame()

for year in years:
    path = 'C:/PythonData/pydata-book-master/pydata-book-master/ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
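    # DataFrame.append was removed in pandas 2.0; there, use pd.concat([names, frame], ignore_index=True)
    names = names.append(frame, ignore_index=True)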

Answered by user3290447

Remove the blank line between:

    frame = pd.read_csv(path, names=columns)

and

    frame['year'] = year

so that it reads:

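    for year in years:
        # raw string, so backslashes such as the \n in \names are not treated as escapes
        path = r'C:\Documents and Settings\Foo\My Documents\pydata-book\pydata-book-master`\ch02\names\yob%d.txt' % year
        frame = pd.read_csv(path, names=columns)
        frame['year'] = year
        # concatenate onto the running DataFrame and rebind the name
        names = pd.concat([names, frame], ignore_index=True)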
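
For the general use case in the question (every matching file in a target directory rather than a fixed range of years), one possible sketch uses glob to discover the files and takes the year from each file name. The helper name read_directory, the yob<year>.txt naming pattern check, and the sample files with made-up counts are assumptions for illustration only.

```python
import glob
import os
import re
import tempfile

import pandas as pd

columns = ["name", "sex", "births"]


def read_directory(directory):
    """Combine every yob<year>.txt file found in a directory."""
    pieces = []
    for path in sorted(glob.glob(os.path.join(directory, "yob*.txt"))):
        match = re.search(r"yob(\d{4})\.txt$", os.path.basename(path))
        if match is None:
            continue  # skip files that do not follow the naming pattern
        frame = pd.read_csv(path, names=columns)
        frame["year"] = int(match.group(1))
        pieces.append(frame)
    if not pieces:
        # pd.concat raises ValueError on an empty list, so return an empty frame
        return pd.DataFrame(columns=columns + ["year"])
    return pd.concat(pieces, ignore_index=True)


# Sample files with made-up counts, written to a temporary directory
samples = {"yob2009.txt": "Emma,F,100\n", "yob2010.txt": "Emma,F,90\nNoah,M,80\n"}
with tempfile.TemporaryDirectory() as tmp:
    for filename, text in samples.items():
        with open(os.path.join(tmp, filename), "w") as handle:
            handle.write(text)
    names = read_directory(tmp)
    empty = read_directory(os.path.join(tmp, "missing"))

print(names)
```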