Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/31674557/

Date: 2020-08-19 10:24:02  Source: igfitidea

How to append rows in a pandas dataframe in a for loop?

Tags: python, for-loop, pandas, dataframe

Asked by Blue Moon

I have the following for loop:

for i in links:
     data = urllib2.urlopen(str(i)).read()
     data = json.loads(data)
     data = pd.DataFrame(data.items())
     data = data.transpose()
     data.columns = data.iloc[0]
     data = data.drop(data.index[[0]])

Each dataframe created this way has most columns in common with the others, but not all of them. Moreover, each has just one row. What I need to do is add all the distinct columns, and each row from each dataframe produced by the for loop, to a single dataframe.

I tried pandas concatenate or similar but nothing seemed to work. Any idea? Thanks.

Accepted answer by unutbu

Suppose your data looks like this:

import pandas as pd
import numpy as np

np.random.seed(2015)
df = pd.DataFrame([])
for i in range(5):
    data = dict(zip(np.random.choice(10, replace=False, size=5),
                    np.random.randint(10, size=5)))
    data = pd.DataFrame(data.items())
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
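    # Note: DataFrame.append was removed in pandas 2.0; on newer versions
    # use df = pd.concat([df, data]) here instead.
    df = df.append(data)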
print('{}\n'.format(df))
# 0   0   1   2   3   4   5   6   7   8   9
# 1   6 NaN NaN   8   5 NaN NaN   7   0 NaN
# 1 NaN   9   6 NaN   2 NaN   1 NaN NaN   2
# 1 NaN   2   2   1   2 NaN   1 NaN NaN NaN
# 1   6 NaN   6 NaN   4   4   0 NaN NaN NaN
# 1 NaN   9 NaN   9 NaN   7   1   9 NaN NaN

Then it could be replaced with

np.random.seed(2015)
data = []
for i in range(5):
    data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                         np.random.randint(10, size=5))))
df = pd.DataFrame(data)
print(df)

In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.
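
Applied to the question's situation, the same idea might look like the sketch below. The JSON strings in `payloads` are made-up stand-ins for the responses fetched from `links` (an assumption, so the snippet runs without network access); each parsed dict becomes one row, and keys missing from a row show up as NaN.

```python
import json

import pandas as pd

# Hypothetical JSON payloads standing in for the bodies downloaded from `links`.
payloads = [
    '{"a": 1, "b": 2}',
    '{"b": 3, "c": 4}',
    '{"a": 5, "c": 6, "d": 7}',
]

rows = []
for text in payloads:
    # json.loads returns a dict here; keep it as-is, one dict per row.
    rows.append(json.loads(text))

# Build the DataFrame once, outside the loop. Columns are the union of all keys.
df = pd.DataFrame(rows)
print(df)
```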

Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient: the time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better, since the time cost of copying grows linearly with the number of rows.
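
If each iteration unavoidably produces a DataFrame rather than a dict, the same copy-once reasoning suggests collecting the pieces in a list and concatenating a single time. A minimal sketch, with made-up column names (`id`, `col_0`, ...) chosen only for illustration:

```python
import pandas as pd

frames = []
for i in range(3):
    # Each one-row frame shares the "id" column but also has a column of its own.
    frames.append(pd.DataFrame({"id": [i], f"col_{i}": [i * 10]}))

# One concat call copies each piece once, instead of re-copying the
# accumulated result on every iteration.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```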

Answer by kztd

There are two reasons you might append rows in a loop: (1) to add to an existing df, or (2) to create a new df.

To create a new df, I think it's well documented that you should either create your data as a list and then create the data frame:

cols = ['c1', 'c2', 'c3']
lst = []
for a in range(2):
    lst.append([1, 2, 3])
df1 = pd.DataFrame(lst, columns=cols)
df1
Out[3]: 
   c1  c2  c3
0   1   2   3
1   1   2   3

Or create the dataframe with an index and then fill it in:

cols = ['c1', 'c2', 'c3']
df2 = pd.DataFrame(columns=cols, index=range(2))
for a in range(2):
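    # Use a single .loc[row, column] call; chained df2.loc[a].c1 = ... may
    # assign to a temporary copy instead of df2.
    df2.loc[a, 'c1'] = 4
    df2.loc[a, 'c2'] = 5
    df2.loc[a, 'c3'] = 6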
df2
Out[4]: 
  c1 c2 c3
0  4  5  6
1  4  5  6

If you want to add to an existing dataframe, you could use either method above and then append the df's together (with or without the index):

df3 = df2.append(df1, ignore_index=True)
df3
Out[6]: 
  c1 c2 c3
0  4  5  6
1  4  5  6
2  1  2  3
3  1  2  3
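
On pandas 2.0 and later, DataFrame.append no longer exists. A minimal sketch of the same combination using pd.concat follows; df1 and df2 are rebuilt inline (with plain integer data, an assumption) so the snippet runs by itself.

```python
import pandas as pd

cols = ['c1', 'c2', 'c3']
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=cols)
df2 = pd.DataFrame([[4, 5, 6], [4, 5, 6]], columns=cols)

# pd.concat takes a list of frames; ignore_index=True renumbers the rows 0..3,
# as df2.append(df1, ignore_index=True) did.
df3 = pd.concat([df2, df1], ignore_index=True)
print(df3)
```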

Or, you can also create a list of dictionary entries and append those as in the answer above.

lst_dict = []
for a in range(2):
    lst_dict.append({'c1':2, 'c2':2, 'c3': 3})
df4 = df1.append(lst_dict)
df4
Out[7]: 
   c1  c2  c3
0   1   2   3
1   1   2   3
0   2   2   3
1   2   2   3

Using dict(zip(cols, vals)):

lst_dict = []
for a in range(2):
    vals = [7, 8, 9]
    lst_dict.append(dict(zip(cols, vals)))
df5 = df1.append(lst_dict)
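
Where df.append is unavailable (pandas 2.0+), one way to get a similar result is to turn the list of dicts into a DataFrame first and concatenate it with the existing one. This is a sketch, not the answer's original code; passing ignore_index=True is a choice made here to get a clean 0..n-1 index, whereas the append calls above kept the repeated 0, 1 labels.

```python
import pandas as pd

cols = ['c1', 'c2', 'c3']
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=cols)

lst_dict = []
for a in range(2):
    vals = [7, 8, 9]
    lst_dict.append(dict(zip(cols, vals)))  # {'c1': 7, 'c2': 8, 'c3': 9}

# Convert the collected dicts to a frame once, then concatenate once.
df5 = pd.concat([df1, pd.DataFrame(lst_dict)], ignore_index=True)
print(df5)
```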

Answer by JKC

I created a data frame in a for loop with the help of a temporary empty data frame, because on every iteration of the for loop a new data frame is created, overwriting the contents of the previous iteration.

Hence I need to move the contents of the new data frame into the empty data frame that was created already. It's as simple as that. We just need to use the .append function, as shown below:

temp_df = pd.DataFrame() #Temporary empty dataframe
for sent in Sentences:
    New_df = pd.DataFrame({'words': sent.words}) #Creates a new dataframe and contains tokenized words of input sentences
    temp_df = temp_df.append(New_df, ignore_index=True) #Moving the contents of newly created dataframe to the temporary dataframe

Outside the for loop, you can copy the contents of the temporary data frame into the master data frame and then delete the temporary data frame if you don't need it.
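
For this particular case, the per-iteration frames can probably be avoided altogether. The sketch below assumes each sentence is simply a list of tokens (the real `Sentences` objects with a `.words` attribute are not shown in the answer, so `sentences` here is a hypothetical stand-in) and gathers the words in a plain list before building one DataFrame.

```python
import pandas as pd

# Hypothetical tokenized sentences standing in for `sent.words`.
sentences = [
    ["the", "cat", "sat"],
    ["hello", "world"],
]

words = []
for sent in sentences:
    words.extend(sent)  # accumulate tokens in a Python list, not a DataFrame

# Build the single-column frame once, after the loop.
words_df = pd.DataFrame({'words': words})
print(words_df)
```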

Answer by Ayanava Sarkar

A more compact and efficient way would be perhaps:

cols = ['frame', 'count']
N = 4
dat = pd.DataFrame(columns=cols)
for i in range(N):
    dat = dat.append({'frame': str(i), 'count': i}, ignore_index=True)

The output would be:

>>> dat
   frame count
0     0     0
1     1     1
2     2     2
3     3     3
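
The same table can also be produced without calling append inside the loop, which matters on pandas 2.0+ where DataFrame.append has been removed. A sketch along the lines of the accepted answer, reusing this answer's `cols` and `N`:

```python
import pandas as pd

cols = ['frame', 'count']
N = 4

# Build every row as a dict first, then construct the DataFrame in one call.
rows = [{'frame': str(i), 'count': i} for i in range(N)]
dat = pd.DataFrame(rows, columns=cols)
print(dat)
```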