pandas DataFrame - renaming multiple identically named columns

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/24685012/

Tags: python, pandas

Asked by Lamakaha

I have several columns with the same name in a df and need to rename them. The usual rename() renames all of them. Is there any way I can rename the blah columns below to blah1, blah4 and blah5?

    In [6]:

    import numpy as np
    import pandas as pd

    df=pd.DataFrame(np.arange(2*5).reshape(2,5))
    df.columns=['blah','blah2','blah3','blah','blah']
    df
    Out[6]:
       blah  blah2  blah3  blah  blah
    0     0      1      2     3     4
    1     5      6      7     8     9

    In [7]:

    df.rename(columns = {'blah':'blah1'})
    Out[7]:
       blah1  blah2  blah3  blah1  blah1
    0      0      1      2      3      4
    1      5      6      7      8      9

Answered by MaxU

Starting with pandas 0.19.0, pd.read_csv() has improved support for duplicate column names.

So we can try to use the parser's internal method. It is private (note the leading underscore), so it may change or disappear in other pandas versions:

In [137]: pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
Out[137]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']

This is the "magic" function:

def _maybe_dedup_names(self, names):
    # see gh-7160 and gh-9424: this helps to provide
    # immediate alleviation of the duplicate names
    # issue and appears to be satisfactory to users,
    # but ultimately, not needing to butcher the names
    # would be nice!
    if self.mangle_dupe_cols:
        names = list(names)  # so we can index
        counts = {}

        for i, col in enumerate(names):
            cur_count = counts.get(col, 0)

            if cur_count > 0:
                names[i] = '%s.%d' % (col, cur_count)

            counts[col] = cur_count + 1

    return names
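
Because the method above is private, the same counting logic can also be written as a plain function that does not depend on pandas internals. This is only a minimal sketch; the name dedup_names is a hypothetical helper, not part of pandas:

```python
def dedup_names(names):
    """Append '.1', '.2', ... to the second and later occurrences of a name.

    Hypothetical standalone helper that mirrors the counting loop of the
    private pandas method quoted above; it is not a pandas API.
    """
    names = list(names)  # copy, so the list can be modified by position
    counts = {}          # how many times each name has been seen so far

    for i, col in enumerate(names):
        cur_count = counts.get(col, 0)
        if cur_count > 0:
            # second and later occurrences receive a numeric suffix
            names[i] = '%s.%d' % (col, cur_count)
        counts[col] = cur_count + 1

    return names


result = dedup_names(['blah', 'blah2', 'blah3', 'blah', 'blah'])
print(result)  # ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
```

The result can be assigned back with df.columns = dedup_names(df.columns).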

Answered by Lamakaha

I was looking for a solution within pandas rather than a general Python solution. When a column name is duplicated, the get_loc() function of the columns Index returns a boolean mask whose True values mark the positions of the duplicates. I then use the mask to assign new values to those positions. In my case I know ahead of time how many duplicates I will get and what I will assign to them, but it looks like df.columns.get_duplicates() returns a list of all duplicated names, which you can use together with get_loc() if you need a more generic way of weeding out duplicates:

cols=pd.Series(df.columns)
for dup in df.columns.get_duplicates(): 
    cols[df.columns.get_loc(dup)] = ([dup + '.' + str(d_idx) 
                                     if d_idx != 0 
                                     else dup 
                                     for d_idx in range(df.columns.get_loc(dup).sum())]
                                    )
df.columns=cols

   blah  blah2  blah3  blah.1  blah.2
0     0      1      2       3       4
1     5      6      7       8       9
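
The mask behaviour described in this answer can be inspected on its own. The snippet below is a small sketch that uses the column names from the question; it assumes the index is not sorted, which is the case here:

```python
import numpy as np
import pandas as pd

columns = pd.Index(['blah', 'blah2', 'blah3', 'blah', 'blah'])

# For a name that occurs several times, get_loc() returns a boolean mask
# marking every position of that name.
mask = np.asarray(columns.get_loc('blah'))
print(mask)

# For a name that occurs only once, get_loc() returns its integer position.
single = columns.get_loc('blah2')
print(single)

# duplicated() marks only the second and later occurrences of each name.
repeats = columns.duplicated()
print(repeats)
```

Note the difference between the two masks: get_loc('blah') also marks the first occurrence, while duplicated() leaves it unmarked.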

New, better method (update 03 Dec 2019)

The code below is better than the code above. It is copied from another answer below (by @SatishSK):

#sample df with duplicate blah column
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df

# you just need the following 4 lines to rename duplicates
# df is the dataframe that you want to rename duplicated columns

cols=pd.Series(df.columns)

for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]

# rename the columns with the cols list.
df.columns=cols

df

Output:

   blah  blah2  blah3  blah.1  blah.2
0     0      1      2       3       4
1     5      6      7       8       9

Answered by Glen Thompson

You could use this:

def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(item, counter)
        new_columns.append(newitem)
    df.columns = new_columns
    return df

Then

import numpy as np
import pandas as pd

df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']

so that df is:

   blah  blah2  blah3   blah   blah
0     0      1      2      3      4
1     5      6      7      8      9

then

df = df_column_uniquify(df)

so that df is now:

   blah  blah2  blah3  blah_1  blah_2
0     0      1      2       3       4
1     5      6      7       8       9

Answered by EdChum

You could assign directly to the columns:

In [12]:

df.columns = ['blah','blah2','blah3','blah4','blah5']
df
Out[12]:
   blah  blah2  blah3  blah4  blah5
0     0      1      2      3      4
1     5      6      7      8      9

[2 rows x 5 columns]

If you want to rename only the duplicate columns dynamically, you could do something like the following (code taken from answer 2 to the question "Index of duplicates items in a python list"):

In [25]:

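import collections
dups = collections.defaultdict(list)
dup_indices = []
col_list = list(df.columns)
# map every column name to the list of positions where it occurs
for i, e in enumerate(col_list):
    dups[e].append(i)
# collect the positions of every name that occurs more than once
for k, v in sorted(dups.items()):
    if len(v) >= 2:
        dup_indices.extend(v)

# append the position to each duplicated name to make it unique
for i in dup_indices:
    col_list[i] = col_list[i] + ' ' + str(i)
col_list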
Out[25]:
['blah 0', 'blah2', 'blah3', 'blah 3', 'blah 4']

You could then assign this list back to df.columns. You could also write a function that generates a unique name not already present in the columns before renaming.
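
One possible shape for such a function is sketched below. The names make_unique_name and rename_duplicates, and the "_" separator, are hypothetical choices for illustration. The sketch reserves every original column name up front, so a generated name never collides with a column that already existed before renaming:

```python
def make_unique_name(base, taken, sep="_"):
    """Return base + sep + n for the smallest n >= 1 not already in taken."""
    counter = 1
    candidate = f"{base}{sep}{counter}"
    while candidate in taken:
        counter += 1
        candidate = f"{base}{sep}{counter}"
    return candidate


def rename_duplicates(columns):
    """Keep the first occurrence of each name and rename later ones."""
    taken = set(columns)  # every name present before renaming
    seen = set()          # names already emitted to the result
    result = []
    for name in columns:
        if name in seen:
            # repeated name: generate one that is not used anywhere
            name = make_unique_name(name, taken)
            taken.add(name)
        seen.add(name)
        result.append(name)
    return result


renamed = rename_duplicates(['blah', 'blah2', 'blah3', 'blah', 'blah'])
print(renamed)  # ['blah', 'blah2', 'blah3', 'blah_1', 'blah_2']
```

With columns such as ['blah', 'blah', 'blah_1'], the existing 'blah_1' is left alone and the repeated 'blah' becomes 'blah_2'.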

Answered by SatishSK

Thank you @Lamakaha for the solution. Your idea gave me a chance to modify it and make it work in all cases.

I am using Python 3.7.3.

I tried your piece of code on my data set, which had only one duplicated column, i.e. two columns with the same name. Unfortunately, the column names remained as they were, without being renamed. On top of that, I got a warning that get_duplicates() is deprecated and will be removed in a future version. I used duplicated() together with unique() in place of get_duplicates(), which did not yield the expected result.

I have modified your piece of code a little, and it now works for my data set as well as in other general cases.

Here are runs of the code with and without the modification on the example data set from the question, along with the results:



df=pd.DataFrame(np.arange(2*5).reshape(2,5))

df.columns=['blah','blah2','blah3','blah','blah']
df

cols=pd.Series(df.columns)

for dup in df.columns.get_duplicates(): 
    cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else dup for d_idx in range(df.columns.get_loc(dup).sum())]
df.columns=cols

df

f:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release. You can use idx[idx.duplicated()].unique() instead

Output:

   blah  blah2  blah3  blah  blah.1
0     0      1      2     3       4
1     5      6      7     8       9

Two of the three "blah" columns are not renamed properly.
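
The replacement suggested by the FutureWarning can be tried on its own before it is used in a renaming loop. This is a small sketch; the second index with the names 'a', 'b' and 'c' is made up for illustration:

```python
import pandas as pd

idx = pd.Index(['blah', 'blah2', 'blah3', 'blah', 'blah'])

# duplicated() marks the second and later occurrences of each name;
# unique() then reduces them to one entry per duplicated name.
dup_names = list(idx[idx.duplicated()].unique())
print(dup_names)  # ['blah']

# Hypothetical index with two different duplicated names.
other = pd.Index(['a', 'b', 'a', 'c', 'b', 'a'])
other_dups = list(other[other.duplicated()].unique())
print(other_dups)  # ['a', 'b']
```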



Modified code

df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df

cols=pd.Series(df.columns)

for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
df.columns=cols

df

Output:

   blah  blah2  blah3  blah.1  blah.2
0     0      1      2       3       4
1     5      6      7       8       9



Here is a run of the modified code on another example:

cols = pd.Series(['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M'])

for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]

cols

Output:
0       X
1       Y
2       Z
3       A
4       B
5       C
6     A_1
7     A_2
8       L
9       M
10    A_3
11    Y_1
12    M_1
dtype: object
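
As an aside, the same numbering can probably be produced without an explicit loop over the duplicated names. The sketch below uses a different technique from the code above, namely groupby() with cumcount(), which numbers the occurrences of each name in order of appearance. The function name dedup_with_cumcount and its sep parameter are hypothetical:

```python
import pandas as pd


def dedup_with_cumcount(names, sep="."):
    """Suffix the second and later occurrences of each name with a counter.

    Hypothetical helper: cumcount() gives 0 for the first occurrence of a
    name, 1 for the second, and so on, in the original order.
    """
    s = pd.Series(list(names))
    occurrence = s.groupby(s).cumcount()
    return [name if k == 0 else f"{name}{sep}{k}"
            for name, k in zip(s, occurrence)]


example = ['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M']
deduped = dedup_with_cumcount(example, sep='_')
print(deduped)
```

For the example above this should give the same names as the loop: A_1, A_2 and A_3 for the repeated A columns, and Y_1 and M_1 for the repeated Y and M columns.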

I hope this helps anybody who is looking for an answer to this question.

Answered by normanius

Since the accepted answer (by Lamakaha) is not working for recent versions of pandas, and because the other suggestions looked a bit clumsy, I worked out my own solution:

def dedupIndex(idx, fmt=None, ignoreFirst=True):
    # fmt:          A string format that receives two arguments: 
    #               name and a counter. By default: fmt='%s.%03d'
    # ignoreFirst:  Disable/enable postfixing of first element.
    idx = pd.Series(idx)
    duplicates = idx[idx.duplicated()].unique()
    fmt = '%s.%03d' if fmt is None else fmt
    for name in duplicates:
        dups = idx==name
        ret = [ fmt%(name,i) if (i!=0 or not ignoreFirst) else name
                      for i in range(dups.sum()) ]
        idx.loc[dups] = ret
    return pd.Index(idx)

Use the function as follows:

df.columns = dedupIndex(df.columns)
# Result: ['blah', 'blah2', 'blah3', 'blah.001', 'blah.002']

# Applied instead to the original, still-duplicated columns:
df.columns = dedupIndex(df.columns, fmt='%s #%d', ignoreFirst=False)
# Result: ['blah #0', 'blah2', 'blah3', 'blah #1', 'blah #2']

Answered by T. Jewell


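# dataset is the DataFrame whose duplicated columns should be renamed
duplicated_idx = dataset.columns.duplicated()
duplicated = dataset.columns[duplicated_idx].unique()

rename_cols = []
counts = {}  # occurrences seen so far for each duplicated name
for col in dataset.columns:
    if col in duplicated:
        # number every occurrence of a duplicated name: col_1, col_2, ...
        counts[col] = counts.get(col, 0) + 1
        rename_cols.append(col + '_' + str(counts[col]))
    else:
        rename_cols.append(col)

dataset.columns = rename_cols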