First non-null value per row from a list of Pandas columns

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31828240/
Asked by Dave Challis
If I've got a DataFrame in pandas which looks something like:
     A    B    C
0    1  NaN    2
1  NaN    3  NaN
2  NaN    4    5
3  NaN  NaN  NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or an equivalent Series).
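For reference, the answers below all assume the question's frame; a minimal construction (note the integer columns are promoted to float, because NaN is a float sentinel):

```python
import numpy as np
import pandas as pd

# The question's example frame; NaN forces the columns to float dtype.
df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})
print(df)
```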
Accepted answer by EdChum
This is a really messy way to do this: first use first_valid_index to get the valid columns, convert the returned series to a DataFrame so we can call apply row-wise, and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]

pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0     1
1     3
2     4
3   NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

df.apply(func, axis=1)
Out[12]:
0     1
1     3
2     4
3   NaN
dtype: float64
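Put together, the cleaner version runs like this (a self-contained sketch on the question's frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

def func(x):
    # first_valid_index() is None when the whole row is null
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

result = df.apply(func, axis=1)
print(result)
```

The returned None values are coerced to NaN because the result is a float Series, so the all-null row comes back as NaN under its original index.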
Answered by JoeCondron
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
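On the question's frame this gives the expected values (np.isnan requires a numeric array, which holds here since the NaNs make every column float):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

def get_first_non_null(df):
    a = df.values
    # argmin over booleans finds the first False, i.e. the first
    # non-NaN entry; an all-NaN row falls back to column 0 (still NaN)
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]

result = get_first_non_null(df)
print(result)  # [1.0, 3.0, 4.0, nan]
```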
EDIT: Here's a fully vectorized solution, which can be a good deal faster again depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
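The flat-index arithmetic relies on C-order ravelling: element (row, col) of an n_rows x n_cols array lands at position n_cols * row + col in the flattened view. A quick check on the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    # element (row, col) sits at n_cols * row + col in the ravelled array
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]

out = get_first_non_null_vec(df)
print(out)  # [ 1.  3.  4. nan]
```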
If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Answered by unutbu
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0     1
1     3
2     4
3   NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0    1
1    3
2    4
dtype: float64
All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
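End to end, the whole pipeline is a single expression (a sketch on the question's frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

# stack() drops the NaNs, first() keeps one value per original row,
# and reindex() restores the rows that were entirely NaN.
result = df.stack().groupby(level=0).first().reindex(df.index)
print(result)
```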
Answered by LondonRob
This is nothing new, but it's a combination of the best bits of @yangjie's approach with a list comprehension, and @EdChum's df.apply approach, which I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0       A
1       B
2       B
3    None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()})
Out[98]:
0     1
1     3
2     4
3   NaN
dtype: float64
Answered by yangjie
Here is a one line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() is None, which cannot be used as an index, so the if-else statement is needed.
I packed everything into a list comprehension for brevity.
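Run end to end (with an explicit `is not None` check, since a falsy first valid label, e.g. an integer column named 0, would otherwise be treated as missing):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

# first_valid_index() per row; None marks an all-null row
result = [row[row.first_valid_index()] if row.first_valid_index() is not None
          else None
          for _, row in df.iterrows()]
print(result)  # [1.0, 3.0, 4.0, None]
```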
Answered by Pietro Battiston
JoeCondron's answer (EDIT: before his last edit!) is cool, but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 μs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
Answered by piRSquared
groupby in axis=1

If we pass a callable that returns the same value, we group all columns together. This allows us to use groupby.agg, which gives us the first method that makes this easy.
df.groupby(lambda x: 'Z', 1).first()

     Z
0  1.0
1  3.0
2  4.0
3  NaN
This returns a dataframe whose column is named after the value I returned in my callable.
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
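A caveat for newer pandas: DataFrame.lookup was deprecated in 1.2 and removed in 2.0. The same fancy-indexing idea still works on the underlying array (a sketch combining notna/idxmax with positional indexing):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                   'B': [np.nan, 3, 4, np.nan],
                   'C': [2, np.nan, 5, np.nan]})

# idxmax over the boolean mask gives the first non-null label per row;
# an all-null row falls back to the first column, which still yields NaN.
cols = df.notna().idxmax(axis=1)
out = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(cols)]
print(out)  # [ 1.  3.  4. nan]
```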
Answered by bhamu
df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan], 'B': [numpy.nan, 3, 4, numpy.nan], 'C': [2, numpy.nan, 5, numpy.nan]})
df
     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]

