pandas: add a sequence number to each element in a group using Python

Question

Asked by DKA

I have a dataframe of individuals who each have multiple records. I want to number each individual's records in sequence using Python. Essentially, I would like to create the 'sequence' column in the following table:

patient  date      sequence
145      20Jun2009        1
145      24Jun2009        2
145      15Jul2009        3
582      09Feb2008        1
582      21Feb2008        2
987      14Mar2010        1
987      02May2010        2
987      12May2010        3

This is essentially the same question as here, but I am working in Python and am unable to implement the SQL solution. I suspect I can use a groupby statement with an iterable count, but have so far been unsuccessful. Thanks!

Answer 1

Accepted answer by Jonathan

The question is how do I sort on multiple columns of data.

One simple trick is to use the key parameter of the sorted function.

You'll be sorting by a string built from the columns of the array.

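from datetime import datetime

rows = ...  # your source data: (patient_id, date_string) pairs

def date_to_sortable_string(date):
    # Use the datetime package to convert the string to a sortable date.
    # Assumes dates look like '20Jun2009'; ISO format (YYYY-MM-DD) sorts chronologically.
    return datetime.strptime(date, "%d%b%Y").strftime("%Y-%m-%d")

# Assume x[0] == patient_id and x[1] == encounter date

# Sort by patient_id and date (patient_id is zero-padded so that it sorts numerically)
rows_sorted = sorted(rows, key=lambda x: "%05d-%s" % (x[0], date_to_sortable_string(x[1])))

for row in rows_sorted:
    print(row)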
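
The sort above orders the rows but does not yet number them. As a rough sketch of the remaining step in plain Python, the sorted rows can be grouped with itertools.groupby and numbered with enumerate. The sample rows, the parse_date helper, and the '20Jun2009' date format are assumptions made for illustration; parsing month abbreviations with %b also assumes an English locale.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical sample rows: (patient_id, date_string), deliberately unsorted
rows = [
    (582, "21Feb2008"),
    (145, "24Jun2009"),
    (582, "09Feb2008"),
    (145, "20Jun2009"),
    (145, "15Jul2009"),
]

def parse_date(text):
    # Assumes dates look like '20Jun2009' (day, abbreviated month, year)
    return datetime.strptime(text, "%d%b%Y")

# Sort by patient, then by parsed date; a tuple key avoids building a string
rows_sorted = sorted(rows, key=lambda r: (r[0], parse_date(r[1])))

# groupby() requires input already sorted by the grouping key;
# enumerate(..., start=1) numbers the rows within each group from 1
numbered = [
    (patient, date, seq)
    for patient, group in groupby(rows_sorted, key=lambda r: r[0])
    for seq, (_, date) in enumerate(group, start=1)
]

for row in numbered:
    print(row)
```

Using a tuple as the sort key sidesteps the zero-padding needed when the key is a string.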

Answer 2

Answered by DKA

I stumbled upon the answer, which was embarrassingly simple. The groupby object has a cumcount() method that enumerates the items in each group.

# cumcount() starts at 0, so add 1 to get the 1-based sequence shown in the question
df['sequence'] = df.groupby('patient').cumcount() + 1

The caveat is that the records have to be in the order you want them enumerated.
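
A minimal sketch of handling that caveat, assuming the column names from the question and a small made-up sample: parse the dates, sort by patient and date, then number each group. The data below is hypothetical and deliberately out of order.

```python
import pandas as pd

# Hypothetical sample data, deliberately out of order
df = pd.DataFrame({
    "patient": [582, 145, 987, 145, 582, 145],
    "date": ["21Feb2008", "24Jun2009", "14Mar2010",
             "20Jun2009", "09Feb2008", "15Jul2009"],
})

# Parse the strings so that sorting is chronological rather than alphabetical
df["date"] = pd.to_datetime(df["date"], format="%d%b%Y")

# Put the records in the order they should be enumerated
df = df.sort_values(["patient", "date"])

# cumcount() numbers rows within each group starting at 0; add 1 for a 1-based sequence
df["sequence"] = df.groupby("patient").cumcount() + 1

print(df)
```

Sorting on both columns also keeps each patient's rows together, although cumcount() only needs the order within each group to be right.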

Answer 3

Answered by Andy Hayden

Firstly you want to convert the date column to be a pandas datetime (rather than strings):

In [11]: pd.to_datetime(df['date'], format='%d%b%Y')
Out[11]:
0   2009-06-20
1   2009-06-24
2   2009-07-15
3   2008-02-09
4   2008-02-21
5   2010-03-14
6   2010-05-02
7   2010-05-12
Name: date, dtype: datetime64[ns]

Note: see the docs for possible format options.

In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

In [13]: df
Out[13]:
   patient       date  sequence
0      145 2009-06-20         1
1      145 2009-06-24         2
2      145 2009-07-15         3
3      582 2008-02-09         1
4      582 2008-02-21         2
5      987 2010-03-14         1
6      987 2010-05-02         2
7      987 2010-05-12         3

If this isn't in date order (for each patient), I would sort it first:

In [14]: df = df.sort_values('date')

Now you can groupby and cumcount:

In [15]: g = df.groupby('patient')

In [16]: g.cumcount() + 1
Out[16]:
2    1
3    2
0    1
1    2
4    1
5    2
6    3
dtype: int64

Which is what you want (although it's out of order):

In [17]: df['sequence'] = g.cumcount() + 1

In [18]: df
Out[18]:
   patient       date  sequence
2      582 2008-02-09         1
3      582 2008-02-21         2
0      145 2009-06-24         1
1      145 2009-07-15         2
4      987 2010-03-14         1
5      987 2010-05-02         2
6      987 2010-05-12         3

To rearrange (though you may not need to), use sort_index (or we could reindex if we had saved the initial DataFrame's index):

In [19]: df.sort_index()
Out[19]:
   patient       date  sequence
0      145 2009-06-24         1
1      145 2009-07-15         2
2      582 2008-02-09         1
3      582 2008-02-21         2
4      987 2010-03-14         1
5      987 2010-05-02         2
6      987 2010-05-12         3
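
The reindex alternative mentioned above might look like the following sketch. The sample data and the original_index name are assumptions for illustration; the idea is to save the index before sorting and restore that order afterwards, which also works when the original index is not itself sorted.

```python
import pandas as pd

# Hypothetical sample data, not in date order
df = pd.DataFrame({
    "patient": [145, 582, 145, 582],
    "date": pd.to_datetime(
        ["24Jun2009", "21Feb2008", "20Jun2009", "09Feb2008"],
        format="%d%b%Y",
    ),
})

# Save the initial row order before sorting
original_index = df.index

df = df.sort_values("date")
df["sequence"] = df.groupby("patient").cumcount() + 1

# Restore the initial row order; each row keeps its computed sequence number
df = df.reindex(original_index)

print(df)
```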
