Python: How to convert a pandas DataFrame subset of columns and rows into a numpy array?

Question

Asked by John Prior

I'm wondering if there is a simpler, memory efficient way to select a subset of rows and columns from a pandas DataFrame.

For instance, given this dataframe:

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
print(df)

          a         b         c         d         e
0  0.945686  0.000710  0.909158  0.892892  0.326670
1  0.919359  0.667057  0.462478  0.008204  0.473096
2  0.976163  0.621712  0.208423  0.980471  0.048334
3  0.459039  0.788318  0.309892  0.100539  0.753992

I want only those rows in which the value for column 'c' is greater than 0.5, but I only need columns 'b' and 'e' for those rows.

This is the method that I've come up with - perhaps there is a better "pandas" way?

locs = [df.columns.get_loc(_) for _ in ['a', 'd']]
# .iloc selects by integer position; recent pandas treats df[...][locs] as a label lookup
print(df[df.c > 0.5].iloc[:, locs])

          a         d
0  0.945686  0.892892

My final goal is to convert the result to a numpy array to pass into an sklearn regression algorithm, so I will use the code above like this:

training_set = np.array(df[df.c > 0.5].iloc[:, locs])

... and that peeves me since I end up with a huge array copy in memory. Perhaps there's a better way for that too?

Answer 1

Accepted answer by Jeff

.loc accepts row and column selectors simultaneously (as do .ix and .iloc, FYI). This is done in a single pass as well.

In [1]: df = DataFrame(np.random.rand(4,5), columns = list('abcde'))

In [2]: df
Out[2]: 
          a         b         c         d         e
0  0.669701  0.780497  0.955690  0.451573  0.232194
1  0.952762  0.585579  0.890801  0.643251  0.556220
2  0.900713  0.790938  0.952628  0.505775  0.582365
3  0.994205  0.330560  0.286694  0.125061  0.575153

In [5]: df.loc[df['c']>0.5,['a','d']]
Out[5]: 
          a         d
0  0.669701  0.451573
1  0.952762  0.643251
2  0.900713  0.505775
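
For anyone running this outside an IPython session, here is a minimal self-contained sketch of the same single-call selection. The seeded generator and the names mask and subset are my own additions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the frame is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 5)), columns=list("abcde"))

mask = df["c"] > 0.5                # boolean Series, one entry per row
subset = df.loc[mask, ["a", "d"]]   # rows by mask, columns by label, in one call
print(subset)
```

Naming the mask separately is only a readability choice; passing the comparison inline, as in the answer, does the same thing.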

And if you want the values (though the frame should pass directly to sklearn as is, since frames support the array interface):

In [6]: df.loc[df['c']>0.5,['a','d']].values
Out[6]: 
array([[ 0.66970138,  0.45157274],
       [ 0.95276167,  0.64325143],
       [ 0.90071271,  0.50577509]])
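
To illustrate the "array interface" remark, the sketch below compares three ways of getting an ndarray out of the selected frame. It assumes a pandas version that has DataFrame.to_numpy() (added after this answer was written); the variable names and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame(rng.random((4, 5)), columns=list("abcde"))
subset = df.loc[df["c"] > 0.5, ["a", "d"]]

via_values = subset.values        # attribute used in the answer
via_to_numpy = subset.to_numpy()  # method form recommended by newer pandas docs
via_asarray = np.asarray(subset)  # conversion through the frame's array interface

# All three hold the same numbers
print(np.array_equal(via_values, via_to_numpy))
print(np.array_equal(via_values, via_asarray))
```

Because np.asarray works on the frame itself, code that converts its inputs to arrays internally can usually be handed the frame without calling .values first.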

Answer 2

Answered by waitingkuo

Use its value directly:

In [79]: df[df.c > 0.5][['b', 'e']].values
Out[79]: 
array([[ 0.98836259,  0.82403141],
       [ 0.337358  ,  0.02054435],
       [ 0.29271728,  0.37813099],
       [ 0.70033513,  0.69919695]])
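
Note that the chained form first builds an intermediate frame containing every column of the matching rows, and only then picks 'b' and 'e'. A rough sketch comparing it with a single .loc call follows; the frame, the seed and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame(rng.random((4, 5)), columns=list("abcde"))

# Two steps: filter the rows (all five columns kept), then pick the columns
chained = df[df["c"] > 0.5][["b", "e"]].to_numpy()

# One step: rows and columns selected together
single = df.loc[df["c"] > 0.5, ["b", "e"]].to_numpy()

print(np.array_equal(chained, single))
```

Both spellings give the same array; the .loc version simply skips the intermediate frame, which may matter when the original frame is wide.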

Answer 3

Answered by Daniel

For the first problem, perhaps something like this: you can simply access the columns by their names.

>>> df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df[df['c']>.5][['b','e']]
          b         e
1  0.071146  0.132145
2  0.495152  0.420219

For the second problem:

>>> df[df['c']>.5][['b','e']].values
array([[ 0.07114556,  0.13214495],
       [ 0.49515157,  0.42021946]])
