Python: extracting non-NaN values from multiple rows in a pandas DataFrame
Original question: http://stackoverflow.com/questions/16017034/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
To extract non-nan values from multiple rows in a pandas dataframe
Asked by user2179627
I am working on several taxi datasets. I have used pandas to concatenate all the datasets into a single dataframe.
My dataframe looks something like this.
                     675                   1039                  # and the rest: 125 taxis
                     longitude  latitude   longitude  latitude
date
2008-02-02 13:31:21  116.56359  40.06489         NaN       NaN
2008-02-02 13:31:51  116.56486  40.06415         NaN       NaN
2008-02-02 13:32:21  116.56855  40.06352   116.58243   39.6313
2008-02-02 13:32:51  116.57127  40.06324         NaN       NaN
2008-02-02 13:33:21  116.57120  40.06328   116.55134   39.6313
2008-02-02 13:33:51  116.57121  40.06329   116.55126   39.6123
2008-02-02 13:34:21        NaN       NaN   116.55134   39.5123
Here 675 and 1039 are the taxi IDs. In total there are 127 taxis, each with its own longitude and latitude columns.
I have several ways to extract the non-null values of a single row.
df.ix[k,df.columns[np.isnan(df.irow(0))!=1]]
(or)
df.irow(0)[np.isnan(df.irow(0))!=1]
(or)
df.irow(0)[np.where(df.irow(0)[df.columns].notnull())[0]]
Any of the above commands returns:
675 longitude 116.56359
latitude 40.064890
4549 longitude 116.34642
latitude 39.96662
Name: 2008-02-02 13:31:21
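For readers on a current pandas release, where irow and .ix no longer exist, the single-row extraction can be sketched roughly as below. The two-taxi frame is made up for illustration; only the taxi IDs and the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical two-taxi frame shaped like the one in the question
columns = pd.MultiIndex.from_product([[675, 1039], ["longitude", "latitude"]])
index = pd.to_datetime(["2008-02-02 13:31:21", "2008-02-02 13:32:21"])
df = pd.DataFrame(
    [[116.56359, 40.06489, np.nan, np.nan],
     [116.56855, 40.06352, 116.58243, 39.6313]],
    index=index,
    columns=columns,
)

# Select the first row by position, then drop its NaN entries
first_row = df.iloc[0].dropna()
print(first_row)
```

The result is a Series indexed by (taxi id, coordinate) pairs, so only taxis that reported a position at that timestamp remain.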
Now I want to extract all the non-null values from the first few rows (say, from row 1 to row 6).
How do I do that?
I could probably loop over the rows, but I want a non-looped way of doing it.
Any help or suggestions are welcome. Thanks in advance! :)
Answered by Dan Allan
df.ix[1:6].dropna(axis=1)
As a heads up, irow will be deprecated in the next release of pandas. New methods, with clearer usage, replace it.
http://pandas.pydata.org/pandas-docs/dev/indexing.html#deprecations
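On current pandas the same idea can be written with .iloc instead of .ix. The sketch below uses invented data and is only meant to show one caveat: dropna(axis=1) discards a whole column as soon as it has a single NaN in the selected rows, whereas how="all" only discards columns that are entirely NaN.

```python
import numpy as np
import pandas as pd

# Invented data: column "b" is partly missing, column "c" is missing in the first rows
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 5.0, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan, 7.0],
})

rows = df.iloc[0:3]                       # first three rows, selected by position

complete = rows.dropna(axis=1)            # keep only columns without any NaN
partial = rows.dropna(axis=1, how="all")  # drop only columns that are all NaN

print(list(complete.columns))
print(list(partial.columns))
```

Which variant fits depends on whether a taxi that reported in only some of the selected rows should still appear in the result.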
Answered by Jeff
In 0.11 (0.11rc1 is out now), this is very easy: use .iloc to select the first 6 rows, then dropna to drop any row containing a NaN (you can also pass options to dropna to control exactly which columns are considered).
I realize you wanted 1:6; I used 0:6 in my answer.
In [8]: df = DataFrame(randn(10,3),columns=list('ABC'),index=date_range('20130101',periods=10))
In [9]: df.ix[6,'A'] = np.nan
In [10]: df.ix[6,'B'] = np.nan
In [11]: df.ix[2,'A'] = np.nan
In [12]: df.ix[4,'B'] = np.nan
In [13]: df.iloc[0:6]
Out[13]:
                   A         B         C
2013-01-01  0.442692 -0.109415 -0.038182
2013-01-02  1.217950  0.006681 -0.067752
2013-01-03       NaN -0.336814 -1.771431
2013-01-04 -0.655948  0.484234  1.313306
2013-01-05  0.096433       NaN  1.658917
2013-01-06  1.274731  1.909123 -0.289111
In [14]: df.iloc[0:6].dropna()
Out[14]:
                   A         B         C
2013-01-01  0.442692 -0.109415 -0.038182
2013-01-02  1.217950  0.006681 -0.067752
2013-01-04 -0.655948  0.484234  1.313306
2013-01-06  1.274731  1.909123 -0.289111
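The dropna options mentioned above can be illustrated with a small sketch. The values are invented; subset and thresh are existing parameters of DataFrame.dropna, and the choice of column A as the deciding column is just an assumption for the example.

```python
import numpy as np
import pandas as pd

# Invented values in the same A/B/C layout as the session above
df = pd.DataFrame(
    {"A": [0.4, 1.2, np.nan, -0.7],
     "B": [-0.1, np.nan, -0.3, 0.5],
     "C": [0.0, -0.1, -1.8, 1.3]},
    index=pd.date_range("20130101", periods=4),
)

# Only column A decides whether a row is dropped
by_a = df.dropna(subset=["A"])

# Keep rows that have at least 3 non-NaN values (here: the complete rows)
full = df.dropna(thresh=3)

print(by_a)
print(full)
```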
Answered by karen
Using Jeff's dataframe:
import numpy as np
import pandas as pd
from numpy.random import randn
df = pd.DataFrame(randn(10,3),columns=list('ABC'),index=pd.date_range('20130101',periods=10))
# .ix was removed in pandas 1.0; .iloc with integer positions (A=0, B=1) sets the same cells
df.iloc[6,0] = np.nan
df.iloc[6,1] = np.nan
df.iloc[2,0] = np.nan
df.iloc[4,1] = np.nan
We can replace the NaNs with a number that we know is not in the dataframe:
df = df.fillna(999)
If you want to keep only the non-null values without iterating, you can do:
df_nona = df.apply(lambda x: list(filter(lambda y: y != 999, x)))
df_na = df.apply(lambda x: list(filter(lambda y: y == 999, x)))
The problem with this approach is that the results are lists, so you lose the index information.
df_nona
A [-1.9804955861, 0.146116306853, 0.359075672435...
B [-1.01963803293, -0.829747654648, 0.6950551455...
C [2.40122968044, 0.79395493777, 0.484201174184,...
dtype: object
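If the index matters, one possible variation (not part of the original answer) is to call dropna on each column separately, which gives one Series per column and keeps the dates. It still iterates, but over columns rather than rows. The data below is made up.

```python
import numpy as np
import pandas as pd

# Made-up frame with NaNs in different rows of each column
df = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]},
    index=pd.date_range("20130101", periods=3),
)

# One Series per column, each keeping the dates of its non-NaN values
non_null = {name: col.dropna() for name, col in df.items()}

print(non_null["A"])
print(non_null["B"])
```

Because the columns can have different numbers of valid values, a dict of Series is used here instead of a single rectangular frame.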
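Another option is:
# assumes df still contains NaNs (i.e. before the fillna(999) step)
df1 = df.dropna()
# rows present in df but not in df1, i.e. rows with at least one NaN
index_na = df.index.symmetric_difference(df1.index)
df_na = df.loc[index_na]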
In this case you don't lose the index information, although this is very similar to the previous answers.
Hope it helps!
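A related way to split the frame without any index arithmetic is a boolean row mask. This is a minimal sketch on invented data, using isna and any, both of which exist in current pandas.

```python
import numpy as np
import pandas as pd

# Invented frame: the second and third rows each contain one NaN
df = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0, 4.0], "B": [5.0, 6.0, np.nan, 8.0]},
    index=pd.date_range("20130101", periods=4),
)

has_nan = df.isna().any(axis=1)  # True for rows with at least one NaN

df_na = df[has_nan]              # rows that contain a NaN
df_nona = df[~has_nan]           # complete rows, same result as df.dropna()

print(df_na)
print(df_nona)
```

Both pieces keep their original dates as the index, so they can be recombined or compared with the source frame later.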

