Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, cite the original address, and attribute it to the original authors (not the translator): http://stackoverflow.com/questions/45685254/

Date: 2020-09-14 04:14:46  Source: igfitidea

Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Tags: python, pandas, dataframe, indexing

Asked by Simon Sharp

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:

      FirstName LastName  id
0     Tom       Jones     1
1     Tom       Jones     1
2     David     Smith     1
3     Alex      Thompson  1
4     Alex      Thompson  1

So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.

So I already have one solution, which is a dead simple python loop that tracks two values (one for the id, one for the index) and assigns each individual an id based on whether they match the previous individual:

x = 1
i = 1

while i < len(df_test):
    if ((df_test.LastName[i] == df_test.LastName[i-1]) &
            (df_test.FirstName[i] == df_test.FirstName[i-1])):
        df_test.loc[i, 'id'] = x
        i = i + 1
    else:
        x = x + 1
        df_test.loc[i, 'id'] = x
        i = i + 1

The problem I'm running into is that the dataframe has about 9 million entries, so with that loop it would have taken a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right solution yet. Thanks!

Accepted answer by Alexander

You could join the last name and first name, convert it to a category, and then get the codes.

Of course, multiple people with the same name would have the same id.

df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
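
As a side note (my addition, not part of the original answer), pd.factorize offers a similar one-liner. Unlike category codes, which number names alphabetically, factorize numbers them in order of first appearance; for this example the two happen to agree:

```python
import pandas as pd

df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName':  ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})

# factorize returns (codes, uniques); codes number each distinct
# name in order of first appearance: Tom Jones -> 0, David Smith -> 1, ...
df['id'] = pd.factorize(df['LastName'] + '_' + df['FirstName'])[0]
print(df)
```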

Answered by Craig

This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

df['id'] = df.groupby(['LastName','FirstName']).ngroup()
>>> df

  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
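
One small tweak worth noting (my addition, not the answerer's): the question asked for ids starting at 1, while ngroup() numbers groups from 0, so you can simply shift the result:

```python
import pandas as pd

df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName':  ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})

# ngroup() numbers groups from 0; adding 1 yields the 1-based ids
# the question asked for (Tom Jones -> 1, David Smith -> 2, ...).
df['id'] = df.groupby(['LastName', 'FirstName']).ngroup() + 1
print(df)
```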

I checked timings and, for the small dataset in this example, Alexander's answer is faster:

%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 μs per loop

%timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop

However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

import faker
import pandas as pd

fakenames = faker.Faker()
first = [fakenames.first_name() for _ in range(5000)]
last = [fakenames.last_name() for _ in range(5000)]
df2 = pd.DataFrame({'FirstName': first, 'LastName': last})
# Re-append the first 2000 rows so some names are duplicates.
df2 = pd.concat([df2, df2.iloc[:2000]])

Running the timing on this larger data set gives:

%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
100 loops, best of 3: 5.22 ms per loop

%timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
100 loops, best of 3: 3.1 ms per loop

You may want to test both approaches on your data set to determine which one works best given the size of your data.

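
If faker is not available, a self-contained way to run that comparison on synthetic data looks roughly like this (the name pool and row count below are arbitrary stand-ins, not from the original answer):

```python
import time

import numpy as np
import pandas as pd

# Build a synthetic 200k-row name table from a pool of made-up names.
rng = np.random.default_rng(0)
pool_first = [f'First{i}' for i in range(5000)]
pool_last = [f'Last{i}' for i in range(5000)]
df_big = pd.DataFrame({
    'FirstName': rng.choice(pool_first, size=200_000),
    'LastName': rng.choice(pool_last, size=200_000),
})

# Time the category-codes approach.
t0 = time.perf_counter()
cat_ids = (df_big['LastName'] + '_' + df_big['FirstName']).astype('category').cat.codes
t_cat = time.perf_counter() - t0

# Time the groupby/ngroup approach.
t0 = time.perf_counter()
grp_ids = df_big.groupby(['LastName', 'FirstName']).ngroup()
t_grp = time.perf_counter() - t0

print(f'category codes: {t_cat:.4f}s, ngroup: {t_grp:.4f}s')

# Sanity check: both methods find the same number of distinct individuals.
assert cat_ids.nunique() == grp_ids.nunique()
```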

Answered by DougR

This method allows the 'id' column name to be defined with a variable. Plus, I find it a little easier to read than the assign or groupby methods.

# Create DataFrame
import pandas as pd

df = pd.DataFrame(
    {'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
     'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

newIdName = 'id'   # Set the new column name here.

df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes

Output:

>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2