python pandas remove duplicate columns

Warning: the content below is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/14984119/

Tags: python, pandas

Asked by Onlyjus

What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

import pandas as pd

df=pd.read_table(fname)

The column names are:

Time, Time Relative, N2, Time, Time Relative, H2, etc...

All the Time and Time Relative columns contain the same data. I want:

Time, Time Relative, N2, H2

All my attempts at dropping, deleting, etc., such as:

df=df.T.drop_duplicates().T

Result in uniquely valued index errors:

Reindexing only valid with uniquely valued index objects

Sorry for being a Pandas noob. Any suggestions would be appreciated.


Additional Details

Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)

data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):

Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005    
2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005    
2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005    
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005    
2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005

Accepted answer by Gene Burinsky

There's a one line solution to the problem. This applies if some column names are duplicated and you wish to remove them:

df = df.loc[:,~df.columns.duplicated()]

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, the column name is unique up to that point; if it is True, the column name is duplicated earlier. For the example above, the returned value would be [False, False, True].

Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False,False,True])

Finally, df.loc[:, [True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

Note: the above only checks column names, not column values.

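For example, a minimal self-contained demonstration of the one-liner (a sketch with made-up data):

import pandas as pd

# A frame with a duplicated column name, as in the explanation above
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])

print(df.columns.duplicated())   # [False False  True]
df = df.loc[:, ~df.columns.duplicated()]
print(list(df.columns))          # ['alpha', 'beta']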

Answer by beardc

It sounds like you already know the unique column names. If that's the case, then df = df[['Time', 'Time Relative', 'N2']] would work.

If not, your solution should work:

In [101]: vals = np.random.randint(0,20, (4,3))
          vals
Out[101]:
array([[ 3, 13,  0],
       [ 1, 15, 14],
       [14, 19, 14],
       [19,  5,  1]])

In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
          df
Out[106]:
   Time  H1  N2  Time Relative  N2  Time
0     3  13   0              3  13     0
1     1  15  14              1  15    14
2    14  19  14             14  19    14
3    19   5   1             19   5     1

In [107]: df.T.drop_duplicates().T
Out[107]:
   Time  H1  N2
0     3  13   0
1     1  15  14
2    14  19  14
3    19   5   1

You probably have something specific to your data that's messing it up. We could give more help if you could give us more details about the data.

Edit: Like Andy said, the problem is probably with the duplicate column titles.

For a sample table file 'dummy.csv' I made up:

Time    H1  N2  Time    N2  Time Relative
3   13  13  3   13  0
1   15  15  1   15  14
14  19  19  14  19  14
19  5   5   19  5   1

Using read_table gives unique columns and works properly:

In [151]: df2 = pd.read_table('dummy.csv')
          df2
Out[151]:
         Time  H1  N2  Time.1  N2.1  Time Relative
      0     3  13  13       3    13              0
      1     1  15  15       1    15             14
      2    14  19  19      14    19             14
      3    19   5   5      19     5              1
In [152]: df2.T.drop_duplicates().T
Out[152]:
             Time  H1  Time Relative
          0     3  13              0
          1     1  15             14
          2    14  19             14
          3    19   5              1  

If your version doesn't let you, you can hack together a solution to make them unique:

In [169]: df2 = pd.read_table('dummy.csv', header=None)
          df2
Out[169]:
              0   1   2     3   4              5
        0  Time  H1  N2  Time  N2  Time Relative
        1     3  13  13     3  13              0
        2     1  15  15     1  15             14
        3    14  19  19    14  19             14
        4    19   5   5    19   5              1
In [171]: from collections import defaultdict
          col_counts = defaultdict(int)
          col_ix = df2.first_valid_index()
In [172]: cols = []
          for col in df2.ix[col_ix]:
              cnt = col_counts[col]
              col_counts[col] += 1
              suf = '_' + str(cnt) if cnt else ''
              cols.append(col + suf)
          cols
Out[172]:
          ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
          df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
          Time  H1  N2 Time_1 N2_1 Time Relative
        1    3  13  13      3   13             0
        2    1  15  15      1   15            14
        3   14  19  19     14   19            14
        4   19   5   5     19    5             1
In [178]: df2.T.drop_duplicates().T
Out[178]:
          Time  H1 Time Relative
        1    3  13             0
        2    1  15            14
        3   14  19            14
        4   19   5             1 

Answer by kalu

Transposing is inefficient for large DataFrames. Here is an alternative:

def duplicate_columns(frame):
    # Group column names by dtype so only comparable columns are compared
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        # list() is needed on Python 3, where dict views are not indexable
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break

    return dups

Use it like this:

dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)

Edit

A memory-efficient version that treats NaNs like any other value:

from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:,i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:,j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups
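
On recent pandas versions the array_equivalent import above has moved; a compatibility shim (the fallback path is correct for pandas 1.x/2.x, an assumption for anything newer):

# array_equivalent moved out of pandas.core.common in later releases
try:
    from pandas.core.common import array_equivalent           # old pandas
except ImportError:
    from pandas.core.dtypes.missing import array_equivalent   # newer pandas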

Answer by Elliott Collins

If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution, and with fewer lines than @kalu's function, keeping the first of any identically named columns.

cols = list(df.columns)
for i, item in enumerate(df.columns):
    # Mark every repeat of a name already seen earlier
    if item in df.columns[:i]:
        cols[i] = "toDROP"
df.columns = cols
df = df.drop("toDROP", axis=1)

Answer by kamran kausar

First step: read the first row (i.e. all the column names) and drop the duplicate names.

Second step: read the file again, keeping only those columns.

cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
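
The same two-pass idea for the tab-separated file from the question would look like this (a sketch; fname is the question's file, and it assumes usecols matches the first occurrence of each duplicated name):

import pandas as pd

# header=None keeps the raw, duplicated names in the first row
cols = pd.read_table(fname, header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_table(fname, usecols=cols)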

Answer by Edmund's Echo

I ran into this problem where the one-liner provided by the accepted answer worked well. However, I had the extra complication that the second copy of each duplicated column had all of the data, while the first copy did not.

The solution was to create two data frames by splitting the one data frame, toggling the negation operator on the mask. Once I had the two data frames, I ran a join statement using the lsuffix parameter. This way, I could then reference and delete the column without the data.

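A sketch of that approach, assuming duplicates are flagged by name as in the accepted answer (the suffix and the dropped column name are illustrative):

# Split on the duplicated-name mask, then join with a suffix so both
# copies survive and the empty copy can be inspected and dropped.
mask = df.columns.duplicated()
first, second = df.loc[:, ~mask], df.loc[:, mask]
joined = first.join(second, lsuffix='_first')
# e.g. joined = joined.drop('Time_first', axis=1) after checking which copy is empty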

- E

Answer by Tony B

It looks like you were on the right path. Here is the one-liner you were looking for:

df.reset_index().T.drop_duplicates().T

But since there is no example data frame that produces the referenced error message, Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:

original_index = df.index.names  # assumes the index levels are named
df = df.reset_index().T.drop_duplicates().T.set_index(original_index)

Answer by Joe

The snippet below identifies duplicate column names, so you can review what went wrong when the dataframe was originally built.

dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
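
With the question's header, for instance, this flags the repeated names together with their positions:

import pandas as pd

# The question's column names, for illustration
dupes = pd.DataFrame(['Time', 'Time Relative', 'N2',
                      'Time', 'Time Relative', 'H2'])
print(dupes[dupes.duplicated()])
#                0
# 3           Time
# 4  Time Relative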

Answer by jaqm

A fast and easy way to drop duplicated columns by their values:

df = df.T.drop_duplicates().T

More info: Pandas DataFrame drop_duplicates manual.
