First non-null value per row from a list of Pandas columns

Question

Asked by Dave Challis

If I've got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).

Answer 1

Accepted answer by EdChum

This is a really messy way to do this: first use first_valid_index to get the valid columns, then convert the returned Series to a DataFrame so we can call apply row-wise and use this to index back into the original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
Out[160]:
0     1
1     3
2     4
3   NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)

Out[12]:
0     1
1     3
2     4
3   NaN
dtype: float64

Answer 2

Answered by Andy Jones

Back-fill the NaNs along each row with fillna (each NaN takes the next valid value to its right), then take the leftmost column:

df.fillna(method='bfill', axis=1).iloc[:, 0]
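
On recent pandas releases the `method=` argument of `fillna` is, as far as I know, deprecated (from 2.1 onward), and the dedicated `bfill` method does the same job. A minimal sketch, assuming the example frame from the question (the variable name `first_non_null` is mine):

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Back-fill along each row, then take the leftmost column.
first_non_null = df.bfill(axis=1).iloc[:, 0]
print(first_non_null.tolist())
```

Note that this fills a full copy of the frame just to read one column, so it trades memory for brevity.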

Answer 3

Answered by JoeCondron

I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:

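import numpy as np

def get_first_non_null(df):
    a = df.values
    # argmin over the boolean mask gives the position of the first False (non-NaN) per row
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]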

EDIT: Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
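
Both functions lean on the fact that `argmin` over a boolean array returns the position of the first `False`, and falls back to position 0 when a row is all `True`. A small sketch on the question's data (the array `a` is typed in by hand here) shows the intermediate values:

```python
import numpy as np

a = np.array([
    [1, np.nan, 2],
    [np.nan, 3, np.nan],
    [np.nan, 4, 5],
    [np.nan, np.nan, np.nan],
])

mask = np.isnan(a)               # True where the value is missing
col_index = mask.argmin(axis=1)  # position of the first False in each row
print(col_index)                 # the all-NaN row falls back to position 0

first = a[np.arange(a.shape[0]), col_index]
print(first)
```

Since position 0 of an all-NaN row itself holds NaN, such a row yields NaN without any special handling.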

If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:

df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop

df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop

Answer 4

Answered by unutbu

Here is another way to do it:

In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]: 
0     1
1     3
2     4
3   NaN
dtype: float64

The idea here is to use stack to move the columns into a row index level:

In [184]: df.stack()
Out[184]: 
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64

Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:

In [185]: df.stack().groupby(level=0).first()
Out[185]: 
0    1
1    3
2    4
dtype: float64

All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:

df.stack().groupby(level=0).first().reindex(df.index)
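
For reference, here is the whole pipeline as a self-contained sketch on the question's data (the intermediate variable names are mine). `GroupBy.first` skips missing values by default, so the result should not depend on whether `stack` drops the NaN entries; depending on the pandas version, `stack` may emit a deprecation warning about its implementation.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

stacked = df.stack()                      # columns become an inner index level
first = stacked.groupby(level=0).first()  # first non-null value per original row
result = first.reindex(df.index)          # restore rows that had no valid value
print(result.tolist())
```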

Answer 5

Answered by LondonRob

This is nothing new, but it's a combination of the best bits of @yangjie's approach with a list comprehension, and @EdChum's df.apply approach, which I think is easiest to understand.

First, which columns do we want to pick our values from?

In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

In [96]: pick_cols
Out[96]: 
0       A
1       B
2       B
3    None
dtype: object

Now how do we pick the values?

In [100]: [df.loc[k, v] if v is not None else None 
    ....:     for k, v in pick_cols.iteritems()]
Out[100]: [1.0, 3.0, 4.0, None]

This is ok, but we really want the index to match that of the original DataFrame:

In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
   ....:     for k, v in pick_cols.iteritems()})
Out[98]: 
0     1
1     3
2     4
3   NaN
dtype: float64
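
`Series.iteritems` was removed in pandas 2.0, so on current releases the same idea needs `items()`. A sketch under that assumption, using `pd.notna` for the missing-label check (my substitution, so that it works whether the missing label is stored as `None` or as `NaN`); the name `first_values` is also mine:

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Label of the first valid column per row (missing for an all-NaN row)
pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

# Look each value up by (row label, column label), keeping the original index
first_values = pd.Series({
    k: df.loc[k, v] if pd.notna(v) else None
    for k, v in pick_cols.items()
})
print(first_values)
```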

Answer 6

Answered by yangjie

Here is a one-line solution:

[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]

Edit:

This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as the index to get the first non-null item in each row.

If there is no non-null value in the row, row.first_valid_index() would be None and thus cannot be used as an index, so I need an if-else expression.

I packed everything into a list comprehension for brevity.
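
One caveat: testing the label for truthiness rather than comparing it with `None` misfires when the first valid label is itself falsy, for example the integer column label 0. A small sketch with a hypothetical frame `df_int` (not from the question) that has default integer column labels:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with default integer column labels 0 and 1
df_int = pd.DataFrame([[1.0, np.nan], [np.nan, 2.0], [np.nan, np.nan]])

# Truthiness test: label 0 is falsy, so the first row is wrongly reported as None
by_truthiness = [
    row[row.first_valid_index()] if row.first_valid_index() else None
    for _, row in df_int.iterrows()
]

# Explicit None test: handles label 0 correctly
by_none_check = [
    row[row.first_valid_index()] if row.first_valid_index() is not None else None
    for _, row in df_int.iterrows()
]

print(by_truthiness)
print(by_none_check)
```

With string column labels such as A, B, C the two variants agree, which is why the difference is easy to miss.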

Answer 7

Answered by Pietro Battiston

JoeCondron's answer (EDIT: before his last edit!) is cool, but there is room for significant improvement by avoiding the non-vectorized enumeration:

def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]

The improvement is small if the DataFrame is relatively flat:

In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop

... but can be relevant on slim DataFrames:

In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 μs per loop

Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
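
Note that `np.isnan` only accepts numeric arrays; on an object-dtype array (for example a frame that mixes strings and missing values) it raises a TypeError. A possible variant, not part of the answers here, builds the mask with `DataFrame.isna` instead. The function name and the test frame `df_mixed` are made up for illustration:

```python
import numpy as np
import pandas as pd

def get_first_non_null_any_dtype(df):
    values = df.to_numpy()
    # isna() handles None/NaN for any dtype, unlike np.isnan
    col_index = df.isna().to_numpy().argmin(axis=1)
    return values[np.arange(values.shape[0]), col_index]

# Hypothetical frame holding strings and missing values
df_mixed = pd.DataFrame({"A": [None, "x", None], "B": ["y", None, None]})
result = get_first_non_null_any_dtype(df_mixed)
print(result)
```

No timing is claimed for this variant; building the mask through pandas may cost more than calling `np.isnan` directly on a float array.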

Answer 8

Answered by piRSquared

`groupby` with `axis=1`

If we pass a callable that returns the same value for every column, all columns are grouped together. This allows us to use groupby.agg, which gives us the first method that makes this easy:

df.groupby(lambda x: 'Z', 1).first()

     Z
0  1.0
1  3.0
2  4.0
3  NaN

This returns a dataframe with the column name of the thing I was returning in my callable
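
Grouping along `axis=1` is, to my knowledge, deprecated in recent pandas releases (2.1 onward). A sketch of the same idea that transposes first, groups the rows of the transposed frame under a single key, and transposes back:

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# After transposing, the original columns are rows; group them all under "Z"
result = df.T.groupby(lambda label: "Z").first().T
print(result)
```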

`lookup`, `notna`, and `idxmax`

df.lookup(df.index, df.notna().idxmax(1))

array([ 1.,  3.,  4., nan])
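
`DataFrame.lookup` was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent for current releases translates the column labels from `idxmax` into positions with `Index.get_indexer` and then indexes the underlying array. It assumes unique column labels, and the variable names are mine:

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

first_cols = df.notna().idxmax(axis=1)       # label of the first non-null column per row
col_pos = df.columns.get_indexer(first_cols) # labels -> integer positions
result = df.to_numpy()[np.arange(len(df)), col_pos]
print(result)
```

For a row with no valid value, `idxmax` over an all-False mask returns the first column, whose entry is NaN, so the result for that row is NaN.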

`argmin` and slicing

v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]

array([ 1.,  3.,  4., nan])

Answer 9

Answered by bhamu

df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})

df
     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN

df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
