Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, cite the original address, and attribute it to the original authors (not the translator): http://stackoverflow.com/questions/45685254/

Date: 2020-09-14 04:14:46  Source: igfitidea

Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Tags: python, pandas, dataframe, indexing

Asked by Simon Sharp

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:

      FirstName LastName  id
0     Tom       Jones     1
1     Tom       Jones     1
2     David     Smith     1
3     Alex      Thompson  1
4     Alex      Thompson  1

So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.

So I already have one solution, which is a dead simple python loop that tracks two values (one for the id, one for the index) and assigns each individual an id based on whether they match the previous individual:

x = 1
i = 1

while i < len(df_test):
    if ((df_test.LastName[i] == df_test.LastName[i-1]) &
            (df_test.FirstName[i] == df_test.FirstName[i-1])):
        df_test.loc[i, 'id'] = x
        i = i + 1
    else:
        x = x + 1
        df_test.loc[i, 'id'] = x
        i = i + 1

The problem I'm running into is that the dataframe has about 9 million entries, so with that loop it would have taken a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right solution yet. Thanks!

Accepted answer by Alexander

You could join the last name and first name, convert it to a category, and then get the codes.

Of course, multiple people with the same name would have the same id.

df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
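
As a side note (my addition, not part of the original answer), pd.factorize offers a similar one-liner. Unlike category codes, which number names alphabetically, factorize numbers them in order of first appearance; for this example the two happen to agree:

```python
import pandas as pd

df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName':  ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})

# factorize returns (codes, uniques); codes number each distinct
# name in order of first appearance: Tom Jones -> 0, David Smith -> 1, ...
df['id'] = pd.factorize(df['LastName'] + '_' + df['FirstName'])[0]
print(df)
```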

Answered by Craig

This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

df['id'] = df.groupby(['LastName','FirstName']).ngroup()
>>> df

  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
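
One small tweak worth noting (my addition, not the answerer's): the question asked for ids starting at 1, while ngroup() numbers groups from 0, so you can simply shift the result:

```python
import pandas as pd

df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName':  ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})

# ngroup() numbers groups from 0; adding 1 yields the 1-based ids
# the question asked for (Tom Jones -> 1, David Smith -> 2, ...).
df['id'] = df.groupby(['LastName', 'FirstName']).ngroup() + 1
print(df)
```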

I checked timings and, for the small dataset in this example, Alexander's answer is faster:

%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 μs per loop

%timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop

However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

import faker
import pandas as pd

fakenames = faker.Faker()
first = [fakenames.first_name() for _ in range(5000)]
last = [fakenames.last_name() for _ in range(5000)]
df2 = pd.DataFrame({'FirstName': first, 'LastName': last})
# Re-append the first 2000 rows so some names are duplicates.
df2 = pd.concat([df2, df2.iloc[:2000]])

Running the timing on this larger data set gives:

%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
100 loops, best of 3: 5.22 ms per loop

%timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
100 loops, best of 3: 3.1 ms per loop

You may want to test both approaches on your data set to determine which one works best given the size of your data.

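
If faker is not available, a self-contained way to run that comparison on synthetic data looks roughly like this (the name pool and row count below are arbitrary stand-ins, not from the original answer):

```python
import time

import numpy as np
import pandas as pd

# Build a synthetic 200k-row name table from a pool of made-up names.
rng = np.random.default_rng(0)
pool_first = [f'First{i}' for i in range(5000)]
pool_last = [f'Last{i}' for i in range(5000)]
df_big = pd.DataFrame({
    'FirstName': rng.choice(pool_first, size=200_000),
    'LastName': rng.choice(pool_last, size=200_000),
})

# Time the category-codes approach.
t0 = time.perf_counter()
cat_ids = (df_big['LastName'] + '_' + df_big['FirstName']).astype('category').cat.codes
t_cat = time.perf_counter() - t0

# Time the groupby/ngroup approach.
t0 = time.perf_counter()
grp_ids = df_big.groupby(['LastName', 'FirstName']).ngroup()
t_grp = time.perf_counter() - t0

print(f'category codes: {t_cat:.4f}s, ngroup: {t_grp:.4f}s')

# Sanity check: both methods find the same number of distinct individuals.
assert cat_ids.nunique() == grp_ids.nunique()
```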

Answered by DougR

This method allows the 'id' column name to be defined with a variable. Plus, I find it a little easier to read than the assign or groupby methods.

# Create DataFrame
import pandas as pd

df = pd.DataFrame(
    {'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
     'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

newIdName = 'id'   # Set the new column name here.

df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes

Output:

>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2