Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/20613396/

Date: 2020-09-13 21:27:17  Source: igfitidea

How to select and delete columns with duplicate names in a pandas DataFrame

python pandas dataframe duplicates multiple-columns

Asked by user3107640

I have a huge DataFrame in which some columns have the same name. When I try to pick a column that exists twice (e.g. del df['col name'] or df2 = df['col name']), I get an error. What can I do?

Answered by Roman Pekar

You can address columns by position:

>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
   a  a
0  1  2
1  3  4
2  5  6
>>> df.iloc[:,0]
0    1
1    3
2    5

Or you can rename the columns, for example:

>>> df.columns = ['a','b']
>>> df
   a  b
0  1  2
1  3  4
2  5  6
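
When there are many columns, typing out a full list of new names is tedious. The following is a minimal sketch of renaming only the repeated names automatically. The helper make_unique and its _1, _2 suffix scheme are assumptions made for illustration and are not part of pandas; the sketch also does not guard against a generated name colliding with an existing column such as a_1.

```python
import pandas as pd


def make_unique(names):
    """Return the names with _1, _2, ... appended to repeated entries."""
    seen = {}      # name -> how many times it has appeared so far
    unique = []
    for name in names:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones get a suffix.
        unique.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return unique


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])
df.columns = make_unique(df.columns)
print(list(df.columns))
```

Once the names are unique, ordinary label-based access such as df["a_1"] or del df["a_1"] is unambiguous.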

Answered by ely

This is not a good situation to be in. Best would be to create a hierarchical column labeling scheme (Pandas allows for multi-level column labeling or row index labels). Determine what it is that makes the two different columns that have the same name actually different from each other and leverage that to create a hierarchical column index.
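
As a rough illustration of the hierarchical labeling idea, the sketch below attaches a second column level to a frame with two columns named a. The second-level labels raw and cleaned and the level names name and source are hypothetical; in practice they would come from whatever actually distinguishes the two columns.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "a"])

# Hypothetical second level that says what distinguishes the two "a" columns.
df.columns = pd.MultiIndex.from_arrays(
    [["a", "a"], ["raw", "cleaned"]],
    names=["name", "source"],
)

# A full (name, source) tuple now identifies exactly one column.
raw = df[("a", "raw")]

# Selecting only the top level returns both columns as a DataFrame.
both = df["a"]
print(raw.tolist())
print(list(both.columns))
```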

In the meantime, if you know the position of a column in the ordered list of columns (e.g. from dataframe.columns), you can use explicit positional indexing such as .iloc[] (or, in pandas versions before 1.0, .ix[]) to retrieve the column's values.
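
A short sketch of positional access, assuming a small example frame made up for illustration: it selects the second of two columns named a and builds a frame without the first one. The names drop_position, keep and trimmed are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 7], [3, 4, 8], [5, 6, 9]], columns=["a", "a", "b"])

# Select the second "a" column by position instead of by name.
second_a = df.iloc[:, 1]

# "Delete" the column at position 0 by keeping every other position.
drop_position = 0
keep = np.arange(df.shape[1]) != drop_position   # boolean mask over positions
trimmed = df.iloc[:, keep]

print(second_a.tolist())
print(list(trimmed.columns))
```

Selecting with a boolean mask returns a new frame, so the original df is left unchanged.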

You can also create copies of the columns with new names, such as:

dataframe["new_name"] = dataframe.iloc[:, column_position].values

where column_position is the positional location of the column you're trying to get (not its name).

Creating copies may not be practical if the data is very large, however. So the best approach is to modify the way the DataFrame is constructed so that it gets a hierarchical column index from the start.

Answered by leitungswasser

Another solution:

import numpy as np
import pandas as pd

def remove_dup_columns(frame):
    # keep only the first column for each name, selected by position
    keep_names = set()
    keep_icols = list()
    for icol, name in enumerate(frame.columns):
        if name not in keep_names:
            keep_names.add(name)
            keep_icols.append(icol)
    return frame.iloc[:, keep_icols]

frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])

print(frame)
print(remove_dup_columns(frame))

The output is (the exact values vary, since the data is random):

    A   A   B   B
0  18  44  13  47
1  41  19  35  28
2  49   0  30  16
3  39  29  43  41
4  26  19  48  13
    A   B
0  18  13
1  41  35
2  49  30
3  39  43
4  26  48
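
The same "keep the first column of each name" result can also be sketched without an explicit loop, using Index.duplicated, which marks every repeat of an earlier label. The example frame reuses the first two rows of the output above; the names is_repeat and deduped are placeholders.

```python
import pandas as pd

frame = pd.DataFrame(
    [[18, 44, 13, 47], [41, 19, 35, 28]],
    columns=["A", "A", "B", "B"],
)

# True for every column whose name already appeared further left.
is_repeat = frame.columns.duplicated()

# Keep only the first column of each name.
deduped = frame.loc[:, ~is_repeat]
print(deduped)
```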

Answered by horseshoe

The following function removes columns with duplicate names and keeps only one. It is not exactly what you asked for, but you can use parts of it to solve your problem. The idea is to find the positional indices of the columns, which you can then address directly. The positions are unique even though the column names are not.

def remove_multiples(df, varname):
    """
    Keep only the first of all columns named varname.

    Copies the first column with that name, drops every column with that
    name, and re-inserts the copy (so it ends up as the last column).
    """
    dfout = df.copy()
    if varname in dfout.columns:
        # positions of all columns named varname; min() picks the first
        first = min([i for i, x in enumerate(dfout.columns == varname) if x])
        tmp = dfout.iloc[:, first].copy()
        dfout = dfout.drop(columns=varname)
        dfout[varname] = tmp
    return dfout

where

[i for i,x in enumerate(dfout.columns == varname) if x]

is the part you need
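
As a rough equivalent of that list comprehension, numpy.flatnonzero returns the positions at which the comparison is true. The sketch below uses a made-up one-row frame and picks the last column named x; positions and last_x are placeholder names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["x", "y", "x", "x"])

# Positions of every column whose name equals "x".
positions = np.flatnonzero(df.columns == "x")

# Address one specific duplicate directly, here the last one.
last_x = df.iloc[:, positions[-1]]
print(positions.tolist())
print(last_x.tolist())
```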
