Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/26069235/

Time: 2020-08-19 00:00:41  Source: igfitidea

How do I select an element in an array column of a data frame?

Tags: python, arrays, numpy, pandas

Asked by jankos

I have the following data frame:

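# dtype=object is required on recent NumPy versions, which reject ragged nested lists otherwise
pa = pd.DataFrame({'a': np.array([[1., 4.], [2.], [3., 4., 5.]], dtype=object)})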

I want to select the column 'a' and then only a particular element (i.e. first: 1., 2., 3.)

What do I need to add to:

pa.loc[:,['a']]

?

Accepted answer by b10n

pa.loc[row] selects the row with label row.

pa.loc[row, col] selects the cells at the intersection of row and col.

pa.loc[:, col] selects all rows and the column named col. Note that although this works, it is not the idiomatic way to refer to a column of a DataFrame. For that, use pa['a'].
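
As a rough sketch of the selection forms described above, the snippet below applies each one to the frame from the question. It assumes the column is built from plain Python lists rather than np.array (so it runs regardless of NumPy version); the variable names are only illustrative.

```python
import pandas as pd

# Assumption: same data as the question, built from plain Python lists
pa = pd.DataFrame({'a': [[1., 4.], [2.], [3., 4., 5.]]})

row_0 = pa.loc[0]            # Series: row labelled 0, across all columns
cell = pa.loc[0, 'a']        # single cell: the list [1.0, 4.0]
col_loc = pa.loc[:, 'a']     # Series: all rows of column 'a'
col_idx = pa['a']            # same column, the more idiomatic spelling
frame = pa.loc[:, ['a']]     # list of labels keeps a one-column DataFrame

print(type(col_loc).__name__, type(frame).__name__)
```

Note that passing a list of labels, as in the question's pa.loc[:, ['a']], returns a DataFrame rather than a Series.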

Now, you have lists in the cells of your column, so you can use the vectorized string methods (the .str accessor) to access the elements of those lists like so:

pa['a'].str[0] #first value in lists
pa['a'].str[-1] #last value in lists
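
For reference, here is a small self-contained sketch of what those two lines produce on the question's data. It assumes the column is built from plain Python lists; the apply-based variant at the end is an alternative added for comparison, not part of the original answer.

```python
import pandas as pd

# Assumption: same data as the question, built from plain Python lists
pa = pd.DataFrame({'a': [[1., 4.], [2.], [3., 4., 5.]]})

first = pa['a'].str[0]     # first value of each list: 1.0, 2.0, 3.0
last = pa['a'].str[-1]     # last value of each list: 4.0, 2.0, 5.0
second = pa['a'].str[1]    # NaN for rows whose list has no index 1

# Alternative for comparison: plain Python indexing applied per row
first_via_apply = pa['a'].apply(lambda values: values[0])

print(first.tolist(), last.tolist())
```

The .str form tolerates lists that are too short by yielding NaN, whereas the apply variant would raise an IndexError in that case.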

Answer by unutbu

Storing lists as values in a Pandas DataFrame tends to be a mistake because it prevents you from taking advantage of fast NumPy or Pandas vectorized operations.

Therefore, you might be better off converting your DataFrame of lists of numbers into a wider DataFrame with native NumPy dtypes:

import numpy as np
import pandas as pd

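# dtype=object is required on recent NumPy versions, which reject ragged nested lists otherwise
pa = pd.DataFrame({'a': np.array([[1., 4.], [2.], [3., 4., 5.]], dtype=object)})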
df = pd.DataFrame(pa['a'].values.tolist())
#      0    1    2
# 0  1.0  4.0  NaN
# 1  2.0  NaN  NaN
# 2  3.0  4.0  5.0

Now, you could select the first column like this:

In [36]: df.iloc[:, 0]
Out[36]: 
0    1.0
1    2.0
2    3.0
Name: 0, dtype: float64

or the first row like this:

In [37]: df.iloc[0, :]
Out[37]: 
0    1.0
1    4.0
2    NaN
Name: 0, dtype: float64

If you wish to drop NaNs, use .dropna():

In [38]: df.iloc[0, :].dropna()
Out[38]: 
0    1.0
1    4.0
Name: 0, dtype: float64

and .tolist() to retrieve the values as a list:

In [39]: df.iloc[0, :].dropna().tolist()
Out[39]: [1.0, 4.0]

But if you wish to leverage NumPy/Pandas for speed, you'll want to express your calculation as vectorized operations on df itself, without converting back to Python lists.
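
To illustrate what such vectorized operations might look like, the sketch below computes a few per-row quantities directly on the wide frame. The particular calculations (sums, means, counts) are only examples chosen here, and the frame is assumed to be built from plain Python lists.

```python
import pandas as pd

# Assumption: same data as above, built from plain Python lists
pa = pd.DataFrame({'a': [[1., 4.], [2.], [3., 4., 5.]]})
df = pd.DataFrame(pa['a'].tolist())   # ragged rows are padded with NaN

# Row-wise reductions skip NaN by default, so the padding does not distort them
row_sums = df.sum(axis=1)     # 5.0, 2.0, 12.0
row_means = df.mean(axis=1)   # 2.5, 2.0, 4.0
lengths = df.count(axis=1)    # number of non-NaN values per row: 2, 1, 3

# Element-wise arithmetic applies to the whole frame at once
doubled = df * 2

print(row_sums.tolist(), row_means.tolist(), lengths.tolist())
```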
