Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16616141/

Time: 2020-08-18 23:10:06  Source: igfitidea

Keep certain columns in a pandas DataFrame, deleting everything else

Tags: python, pandas

Asked by Matt

Say I have a data table

    1  2  3  4  5  6 ..  n
A   x  x  x  x  x  x ..  x
B   x  x  x  x  x  x ..  x
C   x  x  x  x  x  x ..  x

And I want to slim it down so that I only have, say, columns 3 and 5, deleting all the others and maintaining the structure. How could I do this with pandas? I think I understand how to delete a single column, but I don't know how to keep a select few and delete all the others.

Accepted answer by Andy Hayden

If you have a list of columns you can just select those:

In [11]: df
Out[11]:
   1  2  3  4  5  6
A  x  x  x  x  x  x
B  x  x  x  x  x  x
C  x  x  x  x  x  x

In [12]: col_list = [3, 5]

In [13]: df = df[col_list]

In [14]: df
Out[14]:
   3  5
A  x  x
B  x  x
C  x  x
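
As a minimal runnable sketch of the selection above (the frame is rebuilt here to mirror the example table; the name slim is only illustrative and not part of the original answer):

```python
import pandas as pd

# Rebuild the example table: rows A..C, integer column labels 1..6.
df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))

col_list = [3, 5]

# Indexing with a list of labels returns a new DataFrame holding only those columns.
slim = df[col_list]
print(slim)
```

The columns come back in the order given in col_list, so the same idiom can also be used to reorder columns.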

Answer by unutbu

You could reassign a new value to your DataFrame, df:

df = df.loc[:,[3, 5]]

As long as there are no other references to the original DataFrame, the old DataFrame will get garbage collected.

Note that when using df.loc, the index is specified by labels. Thus 3 and 5 above are not ordinals; they are the label names of the columns. If you wish to specify the columns by ordinal position, use df.iloc.
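
To make the label-versus-position distinction concrete, here is a small sketch, assuming the same example table with integer column labels 1 to 6 (the names by_label and by_position are illustrative):

```python
import pandas as pd

df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))

# Label-based: picks the columns whose labels are 3 and 5.
by_label = df.loc[:, [3, 5]]

# Position-based: positions are zero-based, so labels 3 and 5 sit at positions 2 and 4.
by_position = df.iloc[:, [2, 4]]

# Both selections contain the same columns for this particular frame.
print(by_label.equals(by_position))
```

The two only coincide here because the labels happen to be consecutive integers starting at 1; with string labels or a shuffled column order, only df.iloc would accept positions.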

Answer by Fjolnir Dvorak

For those who are searching for a method to do this in place:

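from typing import Any, Set

from pandas import DataFrame


def remove_others(df: DataFrame, columns: Set[Any]) -> None:
    """Drop, in place, every column of df that is not in columns."""
    cols_total: Set[Any] = set(df.columns)
    # Columns present in df but not requested; these are the ones to drop.
    diff: Set[Any] = cols_total - columns
    df.drop(list(diff), axis=1, inplace=True)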

This computes the set difference between all the columns in the DataFrame and the columns that should be kept. The columns in that difference can safely be removed. drop works even on an empty set, and requested columns that do not exist are simply ignored.

>>> df = DataFrame({"a":[1,2,3],"b":[2,3,4],"c":[3,4,5]})
>>> df
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5

>>> remove_others(df, {"a","b","c"})
>>> df
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5

>>> remove_others(df, {"a"})
>>> df
   a
0  1
1  2
2  3

>>> remove_others(df, {"a","not","existent"})
>>> df
   a
0  1
1  2
2  3

Answer by cs95

How do I keep certain columns in a pandas DataFrame, deleting everything else?

The answer to this question is the same as the answer to "How do I delete certain columns in a pandas DataFrame?" Here are some additional options to those mentioned so far, along with timings.

DataFrame.loc

One simple option is selection, as mentioned in other answers:

# Setup.
df
   1  2  3  4  5  6
A  x  x  x  x  x  x
B  x  x  x  x  x  x
C  x  x  x  x  x  x

cols_to_keep = [3,5]

df[cols_to_keep]

   3  5
A  x  x
B  x  x
C  x  x

Or,

df.loc[:, cols_to_keep]

   3  5
A  x  x
B  x  x
C  x  x


DataFrame.reindex with axis=1 or 'columns' (0.21+)

However, we also have reindex; in recent versions you specify axis=1 so that it acts on the columns and drops the ones not listed:

df.reindex(cols_to_keep, axis=1)
# df.reindex(cols_to_keep, axis='columns')

# for versions < 0.21, use
# df.reindex(columns=cols_to_keep)

   3  5
A  x  x
B  x  x
C  x  x

On older versions, you can also use reindex_axis: df.reindex_axis(cols_to_keep, axis=1). Note that reindex_axis was deprecated in 0.21 and is no longer available in pandas 1.0 and later.
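
One behavioral difference is worth keeping in mind when choosing between reindex and selection: reindex does not raise on a label that is absent from the frame but adds that column filled with NaN, whereas list-based selection with loc raises KeyError in recent pandas versions. A small sketch (the frame and the missing label "z" are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# "z" is not a column of df: reindex keeps going and fills it with NaN.
reindexed = df.reindex(["a", "z"], axis=1)
print(reindexed)

# The same labels passed to .loc raise KeyError (pandas 1.0 and later).
try:
    df.loc[:, ["a", "z"]]
    loc_raised = False
except KeyError:
    loc_raised = True
print(loc_raised)
```

If silently created NaN columns would hide a typo in a column name, plain selection is the safer choice.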



DataFrame.drop

Another alternative is to use drop, computing the columns to remove with pd.Index.difference:

# df.drop(cols_to_drop, axis=1)
df.drop(df.columns.difference(cols_to_keep), axis=1)

   3  5
A  x  x
B  x  x
C  x  x
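
The two steps can be spelled out separately, as in this sketch (it assumes the same example table; cols_to_drop and slim are illustrative names):

```python
import pandas as pd

df = pd.DataFrame([["x"] * 6] * 3, index=list("ABC"), columns=range(1, 7))
cols_to_keep = [3, 5]

# Index.difference returns the labels in df.columns that are not in cols_to_keep.
cols_to_drop = df.columns.difference(cols_to_keep)
print(list(cols_to_drop))

# drop returns a new DataFrame; the surviving columns keep their original order.
slim = df.drop(cols_to_drop, axis=1)
print(list(slim.columns))
```

Unlike df[cols_to_keep], which returns the columns in the order of the list, this approach preserves the column order of the original frame.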


Performance

[Figure: performance plot comparing __getitem__, loc, reindex, and drop as a function of N, with a logarithmic Y-axis]

The methods are roughly the same in terms of performance; reindex is faster for smaller N, while drop is faster for larger N. Keep in mind that the Y-axis is logarithmic, so differences in the plot are relative rather than absolute.

Setup and Code

import numpy as np  # needed by make_sample below
import pandas as pd
import perfplot

def make_sample(n):
    np.random.seed(0)
    df = pd.DataFrame(np.full((n, n), 'x'))
    cols_to_keep = np.random.choice(df.columns, max(2, n // 4), replace=False)

    return df, cols_to_keep 

perfplot.show(
    setup=lambda n: make_sample(n),
    kernels=[
        lambda inp: inp[0][inp[1]],
        lambda inp: inp[0].loc[:, inp[1]],
        lambda inp: inp[0].reindex(inp[1], axis=1),
        lambda inp: inp[0].drop(inp[0].columns.difference(inp[1]), axis=1)
    ],
    labels=['__getitem__', 'loc', 'reindex', 'drop'],
    n_range=[2**k for k in range(2, 13)],
    xlabel='N',   
    logy=True,
    equality_check=lambda x, y: (x.reindex_like(y) == y).values.all()
)