Source: http://stackoverflow.com/questions/45685254/
Notice: This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors on Stack Overflow.
Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
Asked by Simon Sharp
I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:
  FirstName  LastName  id
0       Tom     Jones   1
1       Tom     Jones   1
2     David     Smith   1
3      Alex  Thompson   1
4      Alex  Thompson   1
So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.
So I already have one solution, which is a dead simple Python loop that tracks two values (one for the id, one for the index) and assigns each individual an id based on whether they match the previous individual:
x = 1
i = 1
while i < len(df_test):
    # Same person as the previous row: reuse the current id
    if ((df_test.LastName[i] == df_test.LastName[i-1]) and
            (df_test.FirstName[i] == df_test.FirstName[i-1])):
        df_test.loc[i, 'id'] = x
        i = i+1
    # Different person: move on to the next id
    else:
        x = x+1
        df_test.loc[i, 'id'] = x
        i = i+1
The problem I'm running into is that the dataframe has about 9 million entries, so this loop would take a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right one yet. Thanks!
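For illustration only (this is not part of the original question), the row-by-row comparison in the loop can be sketched in vectorized form: compare each row's name with the previous row using shift(), then take a running sum of the "name changed" flags. Like the loop, this sketch assumes that all rows for one person are adjacent; the sample data is copied from the question.

```python
import pandas as pd

df_test = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# True wherever the name differs from the previous row. Row 0 is always
# True because shift() leaves a missing value in the first position.
names = df_test[["LastName", "FirstName"]]
is_new_person = (names != names.shift()).any(axis=1)

# A running count of the "new person" flags yields ids 1, 2, 3, ...
df_test["id"] = is_new_person.cumsum()
print(df_test)
```

If the same name can reappear later in the data, this sketch (like the loop) would give it a second id, so it is only appropriate for data already sorted or grouped by name.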
Accepted answer by Alexander
You could join the last name and first name, convert it to a category, and then get the codes.
Of course, multiple people with the same name would have the same id.
df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
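The codes above start at 0, while the question asks for ids starting at 1. A minimal sketch of one way to adjust for that, assuming the same sample data as in the question, is to add 1 to the codes. The intermediate name `key` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# Combined name key, e.g. "Jones_Tom"
key = df["LastName"] + "_" + df["FirstName"]

# cat.codes is zero-based and follows the sorted order of the keys;
# adding 1 makes the ids start at 1 instead.
df["id"] = key.astype("category").cat.codes + 1
print(df)
```

Note that the ids follow the sorted order of the combined key, not the order in which names first appear; for this sample the two happen to coincide.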
Answer by Craig
This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:
df['id'] = df.groupby(['LastName','FirstName']).ngroup()
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
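By default, groupby() sorts the group keys, so the ids follow the sorted order of (LastName, FirstName). As a small sketch, assuming you would rather number people in order of first appearance, passing sort=False to groupby() changes the numbering. The sample data here is made up for illustration and deliberately has non-adjacent duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["David", "Tom", "Tom", "David"],
    "LastName": ["Smith", "Jones", "Jones", "Smith"],
})

# Default (sort=True): ids follow the sorted order of the group keys
by_key = df.groupby(["LastName", "FirstName"]).ngroup()

# sort=False: ids follow the order in which each name first appears
by_appearance = df.groupby(["LastName", "FirstName"], sort=False).ngroup()

print(by_key.tolist())         # [1, 0, 0, 1]
print(by_appearance.tolist())  # [0, 1, 1, 0]
```

Either way, rows with the same name receive the same id even when they are not next to each other.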
I checked timings and, for the small dataset in this example, Alexander's answer is faster:
%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 μs per loop
%timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop
However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.
import faker
import pandas as pd

fakenames = faker.Faker()
first = [fakenames.first_name() for _ in range(5000)]
last = [fakenames.last_name() for _ in range(5000)]
df2 = pd.DataFrame({'FirstName': first, 'LastName': last})
# Append the first 2000 rows again so that 2000 of the 7000 rows are duplicates
df2 = pd.concat([df2, df2.iloc[:2000]])
Running the timing on this larger data set gives:
%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
100 loops, best of 3: 5.22 ms per loop
%timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
100 loops, best of 3: 3.1 ms per loop
You may want to test both approaches on your data set to determine which one works best given the size of your data.
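Outside of IPython, where %timeit is not available, a comparison along these lines could be sketched with the standard timeit module. Everything in this block is an assumption made for illustration: the synthetic data (random numeric strings standing in for names), the size n, and the helper names with_category and with_ngroup. Substitute your own dataframe for a meaningful result.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in data: random numeric strings instead of real names
rng = np.random.default_rng(0)
n = 50_000  # hypothetical size; adjust to match your data
df_big = pd.DataFrame({
    "FirstName": rng.integers(0, 500, n).astype(str),
    "LastName": rng.integers(0, 500, n).astype(str),
})

def with_category(df):
    """Ids from the category codes of the combined name key."""
    return (df["LastName"] + "_" + df["FirstName"]).astype("category").cat.codes

def with_ngroup(df):
    """Ids from the group numbers of a two-column groupby."""
    return df.groupby(["LastName", "FirstName"]).ngroup()

# Average wall-clock time over a few runs of each approach
for fn in (with_category, with_ngroup):
    seconds = timeit.timeit(lambda: fn(df_big), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

The two approaches group the rows identically here, but the id numbers themselves can differ, because one sorts the combined string and the other sorts the (LastName, FirstName) pairs.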
Answer by DougR
This method allows the 'id' column name to be defined with a variable. I also find it a little easier to read than the assign or groupby methods.
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})
newIdName = 'id'  # Set the new column name here.
# Join the names into one key, convert it to a category, and use its integer codes as ids
df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes
Output:
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2