pandas: add a column of empty lists to a DataFrame

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/31466769/

Time: 2020-09-13 23:38:01  Source: igfitidea

Add column of empty lists to DataFrame

python pandas

Asked by vk1011

Similar to the question "How to add an empty column to a dataframe?", I am interested in knowing the best way to add a column of empty lists to a DataFrame.

What I am trying to do is basically initialize a column and then, as I iterate over the rows and process some of them, replace the initialized value in this new column with a filled list.

For example, if below is my initial DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})  # Sample DataFrame

>>> df
   a  b
0  1  5
1  2  6
2  3  7

Then I want to ultimately end up with something like this, where each row has been processed separately (sample results shown):

>>> df
   a  b          c
0  1  5     [5, 6]
1  2  6     [9, 0]
2  3  7  [1, 2, 3]

Of course, if I try to initialize like df['e'] = [] as I would with any other constant, pandas thinks I am trying to add a sequence of items with length 0, and hence fails.

If I try initializing a new column as None or NaN, I run into the following issues when trying to assign a list to a location.

df['d'] = None

>>> df
   a  b     d
0  1  5  None
1  2  6  None
2  3  7  None

Issue 1 (it would be perfect if I can get this approach to work! Maybe something trivial I am missing):

>>> df.loc[0,'d'] = [1,3]

...
ValueError: Must have equal len keys and value when setting with an iterable

Issue 2 (this one works, but not without a warning because it is not guaranteed to work as intended):

>>> df['d'][0] = [1,3]

C:\Python27\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

Hence I resort to initializing with empty lists and extending them as needed. There are a couple of methods I can think of to initialize this way, but is there a more straightforward way?

Method 1:

df['empty_lists1'] = [list() for x in range(len(df.index))]

>>> df
   a  b   empty_lists1
0  1  5             []
1  2  6             []
2  3  7             []

Method 2:

df['empty_lists2'] = df.apply(lambda x: [], axis=1)

>>> df
   a  b   empty_lists1   empty_lists2
0  1  5             []             []
1  2  6             []             []
2  3  7             []             []

Summary of questions:

Is there any minor syntax change to the approach in Issue 1 that would allow a list to be assigned to a None/NaN-initialized field?

If not, then what is the best way to initialize a new column with empty lists?

Answered by ComputerFellow

One more way is to use np.empty:

import numpy as np
df['empty_list'] = np.empty((len(df), 0)).tolist()
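
To see why this works, here is a minimal sketch using the sample DataFrame from the question (the variable name empty_rows is only for illustration). An array of shape (n, 0) has n rows and no columns, so tolist() turns it into n separate empty lists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

# A (3, 0) array has three rows and zero columns, so tolist() yields
# one empty list per row.
empty_rows = np.empty((len(df), 0)).tolist()
print(empty_rows)  # [[], [], []]

df['empty_list'] = empty_rows

# Each cell holds its own list object, so appending to one row
# leaves the other rows untouched.
df.loc[0, 'empty_list'].append(5)
print(df['empty_list'].tolist())  # [[5], [], []]
```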


You could also knock off .index in your "Method 1" when trying to find the len of df.

df['empty_list'] = [[] for _ in range(len(df))]
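
As a rough sketch of the use case described in the question (initialize, then fill some rows while iterating), the following could work. The column name 'c' comes from the question; the processing rule (collect 'a' and 'b' for rows where 'a' is odd) is a made-up placeholder, not something the question specifies:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

# One fresh list per row, as in the comprehension above.
df['c'] = [[] for _ in range(len(df))]

# Hypothetical processing rule, for illustration only: for rows where
# 'a' is odd, store the values of 'a' and 'b' in that row's list.
for idx, a, b in zip(df.index, df['a'], df['b']):
    if a % 2 == 1:
        df.at[idx, 'c'].extend([int(a), int(b)])

print(df['c'].tolist())  # [[1, 5], [], [3, 7]]
```

Because every row starts with its own list, extending one row's list in place does not affect the others.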


Turns out, np.empty is faster...

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.rand(1000000, 5))

In [4]: %timeit df['empty1'] = np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop

In [5]: %timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop

In [6]: %timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop
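
Regarding Issue 1 in the question (assigning a list to a None-initialized cell), one option that is not part of this answer is the scalar accessor DataFrame.at. The sketch below assumes the column has object dtype, which is what df['d'] = None produces; behavior may differ on other dtypes or on old pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df['d'] = None  # object-dtype column filled with None

# .at addresses exactly one cell, so the list is stored as a single
# value instead of being matched element by element against the keys.
df.at[0, 'd'] = [1, 3]

print(df['d'].tolist())  # [[1, 3], None, None]
```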

Answered by tozCSS

I timed all three methods in the accepted answer; the fastest one took 216 ms on my machine. However, this took only 28 ms:

df['empty4'] = [[]] * len(df)

Note: Similarly, df['e5'] = [set()] * len(df) also took 28 ms.
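
One caveat worth checking before relying on this: list multiplication repeats a reference to the same inner list rather than creating new ones, which is likely part of why it is so fast. The sketch below (column names 'shared' and 'separate' are only for illustration) compares it with a list comprehension when a cell is modified in place:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

# [[]] * n repeats a reference to one and the same list object.
shared = [[]] * len(df)
print(shared[0] is shared[1])  # True

df['shared'] = shared
df['separate'] = [[] for _ in range(len(df))]

# Append to the list in row 0 of each column.
df.loc[0, 'shared'].append(1)
df.loc[0, 'separate'].append(1)

print(df['shared'].tolist())    # [[1], [1], [1]]
print(df['separate'].tolist())  # [[1], [], []]
```

If the plan is to replace each cell with a new list (rather than extend the existing one in place), the shared reference does not matter; if the lists are mutated in place, the comprehension or np.empty variants are the safer choice.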
