Python: Selecting/excluding sets of columns in pandas

Question

Asked by Amelio Vazquez-Reina

I would like to create views or dataframes from an existing dataframe based on column selections.

For example, I would like to create a dataframe df2 from a dataframe df1 that holds all of df1's columns except two. I tried the following, but it didn't work:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Try to create a second dataframe df2 from df with all columns except 'B' and 'D'
my_cols = set(df.columns)
my_cols.remove('B').remove('D')

# This returns an error ("unhashable type: set")
df2 = df[my_cols]

What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusion of arbitrary sets of columns from a dataframe?

Answer 1

Accepted answer by Amrita Sawant

You can either drop the columns you do not need or select the ones you need:

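# Using DataFrame.drop: drop by position (here the 2nd and 3rd columns), modifying df in place
df.drop(df.columns[[1, 2]], axis=1, inplace=True)

# Drop by name, returning a new dataframe
df1 = df1.drop(['B', 'C'], axis=1)

# Select the ones you want (column labels are case-sensitive)
df1 = df[['A', 'D']]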
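The two approaches above can be tried side by side. This is a minimal sketch on a small made-up dataframe (the integer data is an assumption for illustration); it uses the `columns=` keyword of `DataFrame.drop`, which is equivalent to passing `axis=1`.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with columns A, B, C and D
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# Exclude by name; drop returns a new dataframe and leaves df untouched
without_bd = df.drop(columns=["B", "D"])

# Select by name; a list of labels returns the columns in the order given
only_ad = df[["A", "D"]]

print(list(without_bd.columns))  # ['A', 'C']
print(list(only_ad.columns))     # ['A', 'D']
print(list(df.columns))          # ['A', 'B', 'C', 'D']
```

Without `inplace=True`, the original dataframe keeps all four columns.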

Answer 2

Answered by tacaswell

You just need to convert your set to a list:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
# set.remove returns None, so the calls cannot be chained
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
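A short sketch of why the chained call in the question fails before the indexing step is even reached, assuming a small made-up dataframe. Sorting the set is one possible way (not part of the answer above) to get a reproducible column order, since a set has no guaranteed order.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with columns A, B, C and D
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

my_cols = set(df.columns)

# set.remove mutates the set in place and returns None, so chaining a
# second .remove() onto its result would raise AttributeError.
returned = my_cols.remove("B")
my_cols.remove("D")

# A set has no guaranteed order; sorting gives a reproducible column order.
df2 = df[sorted(my_cols)]

print(returned)           # None
print(list(df2.columns))  # ['A', 'C']
```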

Answer 3

Answered by piggybox

You don't really need to convert that into a set:

cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]

Answer 4

Answered by LondonRob

Here's how to create a copy of a DataFrame excluding a list of columns:

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)

But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)

This method doesn't achieve that:

>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
     A         B         C         D
0  999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
          A         C
0  0.251262 -1.980673

Note that this is also true of @piggybox's method. (Although that method is nice and slick and Pythonic. I'm not doing it down!!)

For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.

Answer 5

Answered by Frank

Also have a look at the built-in DataFrame.filter function.

Minimalistic but greedy approach (sufficient for the given df):

df.filter(regex="[^BD]")

Conservative/lazy approach (exact matches only):

df.filter(regex="^(?!(B|D)$).*$")

Conservative and generic:

exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
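The difference between the character-class pattern and the exact-match pattern only shows up with multi-character column names. The sketch below assumes a made-up dataframe that has a column named "BD"; wrapping each name in `re.escape` is an addition not in the answer, intended to keep names containing regex metacharacters from being misread.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical column names, chosen so that the two patterns give different results
df = pd.DataFrame(np.zeros((2, 4)), columns=["A", "B", "BD", "D"])

# Character-class pattern: keeps any name containing a character other than B or D
greedy = df.filter(regex="[^BD]")

# Exact-match pattern built from a list, with each name escaped first
exclude_cols = ["B", "D"]
pattern = "^(?!(?:{0})$).*$".format("|".join(re.escape(c) for c in exclude_cols))
exact = df.filter(regex=pattern)

print(list(greedy.columns))  # ['A']
print(list(exact.columns))   # ['A', 'BD']
```

The first pattern also discards "BD", because that name consists only of the letters B and D; the second pattern removes only the names that match exactly.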

Answer 6

Answered by IanS

There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.

Here, the result is used to remove columns B and D from df:

df2 = df[df.columns.difference(['B', 'D'])]

Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.

Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:

# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns

# does not create a copy of the dataframe
subset = df.columns.difference(['B', 'D'])

df = df.drop_duplicates(subset=subset)
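The reordering mentioned above can be seen on a dataframe whose columns are not in alphabetical order. This is a sketch on made-up data; it assumes a pandas version in which `Index.difference` accepts the `sort` parameter (added in 0.24, as far as I know), and shows a boolean mask as an order-preserving alternative.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe with columns deliberately out of alphabetical order
df = pd.DataFrame(np.zeros((2, 4)), columns=["D", "C", "B", "A"])

# Default behaviour: the resulting labels are sorted
sorted_cols = df.columns.difference(["B"])

# sort=False keeps the labels in their original order
ordered_cols = df.columns.difference(["B"], sort=False)

# A boolean mask also keeps the original order
masked_cols = df.columns[~df.columns.isin(["B"])]

print(list(sorted_cols))   # ['A', 'C', 'D']
print(list(ordered_cols))  # ['D', 'C', 'A']
print(list(masked_cols))   # ['D', 'C', 'A']
```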

Answer 7

Answered by pylang

In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.

As of pandas 0.20.0, usecols now accepts callables.¹ This update allows more flexible options for reading columns:

skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)

The latter pattern is essentially the inverse of the traditional usecols method: only the specified columns are skipped.

Given

Data in a file

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

filename = "foo.csv"
df.to_csv(filename)

Code

skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1

Output

          A         C
0  0.062350  0.076924
1 -0.016872  1.091446
2  0.213050  1.646109
3 -1.196928  1.153497
4 -0.628839 -0.856529
...

Details

A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).

Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.

¹ A request was raised for a "skipcols" option in this issue and was addressed in a later issue.

Answer 8

Answered by Kapil Marwaha

You have 4 columns A,B,C,D

Here is a better way to select the columns you need for the new dataframe:

df2 = df1[['A','D']]

If you wish to use column numbers instead, use:

df2 = df1.iloc[:, [0, 3]]
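Selecting by column number depends on how the columns are labelled. The sketch below, on made-up data, assumes string labels as in the question: plain bracket indexing with integers looks the integers up as labels and fails, whereas `.iloc` selects by position regardless of the labels.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with string column labels
df1 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# Selection by label
by_label = df1[["A", "D"]]

# Selection by position (first and fourth columns)
by_position = df1.iloc[:, [0, 3]]

# Bracket indexing treats 0 and 3 as labels, which do not exist here
try:
    df1[[0, 3]]
    bracket_failed = False
except KeyError:
    bracket_failed = True

print(by_label.equals(by_position))  # True
print(bracket_failed)                # True
```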

Answer 9

Answered by MrE

Another option, without dropping or filtering in a loop:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
