Original question: http://stackoverflow.com/questions/48051100/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
python pandas merge multiple csv files
Asked by Sayed Gouda
I have around 600 csv datasets, all with the very same column names ['DateTime', 'Actual', 'Consensus', 'Previous', 'Revised']. All of them are economic indicators and all are time-series datasets.
The aim is to merge them all into one csv file, with 'DateTime' as the index.
I want this file to be ordered as a timeline. For example, say the first event in the first csv is dated 12/18/2017 10:00:00, the first event in the second csv is dated 12/29/2017 09:00:00, and the first event in the third csv is dated 12/20/2017 09:00:00.
So I want the rows ordered by date, the earlier first and the newer after it, and so on, regardless of which source csv each row originally came from.
I tried to merge just 3 of them as an experiment, and the problem is 'DateTime': it prints the 3 of them together like this ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00'). Here is the code:
import pandas as pd
df1 = pd.read_csv("E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv("E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv("E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)
print(df.head())
df.to_csv('df.csv')
Answered by Parfait
Consider using the read_csv() arguments index_col and parse_dates to create the index during import and parse it as datetime. Then run your needed horizontal merge. The code below assumes the date is in the first column of each csv. At the end, call sort_index() on the final dataframe to sort by datetime.
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()
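To see what index_col and parse_dates do on their own, here is a minimal sketch that reads from an in-memory string instead of a file. The sample rows are made up for illustration and only mimic the column layout described in the question.

```python
import io
import pandas as pd

# Hypothetical sample data in the question's column layout (values are made up).
csv_text = """DateTime,Actual,Consensus,Previous,Revised
12/29/2017 09:00:00,1.1,1.0,0.9,
12/18/2017 10:00:00,0.9,0.9,1.1,
"""

# index_col=[0] makes the first column the index;
# parse_dates=[0] parses that column as datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), index_col=[0], parse_dates=[0])

print(type(df.index).__name__)  # the index type after parsing
print(df.sort_index())          # rows ordered chronologically
```

Because the index holds real timestamps rather than text, sort_index() orders the rows chronologically.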
For a DRY-er approach, especially across hundreds of csv files, use a list comprehension:
import os
...
# Raw string so the backslashes in the Windows path are not treated as escapes
os.chdir(r'E:\Business\Economic Indicators')
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
       for f in os.listdir(os.getcwd()) if f.endswith('csv')]
finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
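The same pattern can be tried without touching the real data folder. The sketch below is only an illustration: it builds a temporary directory with two tiny csv files (the file names and values are hypothetical) and uses pathlib instead of os.chdir, which is a substitution on my part rather than part of the answer.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Build a throwaway folder with two tiny CSV files (hypothetical names and values).
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "indicator_a.csv").write_text(
    "DateTime,Actual\n12/18/2017 10:00:00,1.0\n12/20/2017 09:00:00,2.0\n")
(tmp_dir / "indicator_b.csv").write_text(
    "DateTime,Actual\n12/18/2017 10:00:00,3.0\n12/20/2017 09:00:00,4.0\n")

# Same list-comprehension pattern, using full paths so no chdir is needed.
csv_paths = sorted(tmp_dir.glob("*.csv"))
dfs = [pd.read_csv(p, index_col=[0], parse_dates=[0]) for p in csv_paths]

# axis=1 with join='inner' keeps only the timestamps present in every file.
finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
print(finaldf)
```

Note that the resulting columns share the same names across files. Passing keys= to pd.concat is one way to label each file's columns if that matters.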
Answered by John Smith Optional
You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. You also don't need to specify a type of join: it has no effect, since the column names are the same for each dataframe.
df = pd.concat([df1, df2, df3])
should be enough to concatenate the datasets.
(see https://pandas.pydata.org/pandas-docs/stable/merging.html)
Your call to set_index to define an index using the values in the DateTime column should then work.
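The difference between the two axes is easy to check on toy data. This is a minimal sketch with made-up values, not the asker's files:

```python
import pandas as pd

# Two small frames with identical column names (made-up values).
df1 = pd.DataFrame({"DateTime": ["12/18/2017 10:00:00"], "Actual": [1.0]})
df2 = pd.DataFrame({"DateTime": ["12/29/2017 09:00:00"], "Actual": [2.0]})

# axis=1 places the frames side by side, as in the question's code.
side_by_side = pd.concat([df1, df2], axis=1, join="inner")
# axis=0 (the default) stacks the rows under one set of columns.
stacked = pd.concat([df1, df2])

print(side_by_side.columns.tolist())  # 'DateTime' appears once per input frame
print(stacked.columns.tolist())       # each column name appears once

# With a single 'DateTime' column, set_index has one unambiguous column to use.
stacked = stacked.set_index("DateTime")
print(stacked)
```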
Answered by bolirev
The problem is twofold: merging the csv files into a single dataframe, and then ordering it by date.
As John Smith pointed out, to merge dataframes along rows you need to use:
df = pd.concat([df1,df2,df3])
Then set an index and reorder your dataframe according to that index:
df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)
or, in descending order:
df.sort_index(inplace=True,ascending=False)
(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)
import numpy as np
import pandas as pd
timeindex = pd.date_range('2018/01/01','2018/01/10')
randtimeindex = np.random.permutation(timeindex)
# Create three dataframes
df1 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                   columns=['Actual','Consensus','DateTime'])
df1.DateTime=randtimeindex[:3]
df2 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                   columns=['Actual','Consensus','DateTime'])
df2.DateTime=randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4),data=np.random.rand(4,3),
                   columns=['Actual','Consensus','DateTime'])
df3.DateTime=randtimeindex[6:]
# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)
# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)
print(df4.head())
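One caveat when applying this to csv files: if the dates are read as plain text, sorting the index compares strings, so '01/05/2018' would sort before '12/18/2017'. The sketch below converts the column with pd.to_datetime before sorting. The rows are made up, and the format string is an assumption based on the sample dates in the question.

```python
import io
import pandas as pd

# Made-up rows standing in for two source files.
csv_a = "DateTime,Actual\n12/18/2017 10:00:00,1.0\n01/05/2018 09:00:00,2.0\n"
csv_b = "DateTime,Actual\n12/29/2017 09:00:00,3.0\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]
merged = pd.concat(frames)

# Convert the text dates to real timestamps before sorting; the format
# string is an assumption based on the sample dates in the question.
merged["DateTime"] = pd.to_datetime(merged["DateTime"], format="%m/%d/%Y %H:%M:%S")
merged = merged.set_index("DateTime").sort_index()

print(merged)
# With no path argument, to_csv returns the csv text; pass a filename to write a file.
csv_out = merged.to_csv()
```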