Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42330624/

Date: 2020-09-14 03:01:20  Source: igfitidea

Pandas map string to int based on value in a column

Tags: python, pandas, dataframe

Asked by Vijay P R

I have a data frame with two columns:

state  total_sales
AL      16714
AR      6498
AZ      107296
CA      33717

Now I want to map the strings in the state column to integers from 1 to N (where N is the number of rows, here 4), ranked by the values in total_sales. In the desired output below, the largest total gets 1. The result should be stored in another column (say, label). That is, I want a result like this:

state  total_sales label
AL      16714         3
AR      6498          4
AZ      107296        1
CA      33717         2

Please suggest a vectorised implementation.

Accepted answer by jezrael

You can use rank with a cast to int:

df['label'] = df['total_sales'].rank(method='dense', ascending=False).astype(int)
print(df)
  state  total_sales  label
0    AL        16714      3
1    AR         6498      4
2    AZ       107296      1
3    CA        33717      2
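
As a self-contained sketch of the rank approach above, the snippet below rebuilds the question's frame and also compares two tie-handling methods. The `tied` series is hypothetical data added only for illustration; it is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AL", "AR", "AZ", "CA"],
    "total_sales": [16714, 6498, 107296, 33717],
})

# Rank from largest to smallest, so the largest total gets label 1.
df["label"] = df["total_sales"].rank(method="dense", ascending=False).astype(int)
print(df)

# Hypothetical series with a tie, to compare tie-handling methods.
tied = pd.Series([100, 200, 200, 50])

# "dense": tied values share a rank and no rank is skipped.
dense = tied.rank(method="dense", ascending=False).astype(int)

# "first": tied values get distinct ranks, so every label is unique.
first = tied.rank(method="first", ascending=False).astype(int)

print(dense.tolist())
print(first.tolist())
```

If every row needs a distinct label from 1 to N even when totals repeat, method="first" may be the safer choice; method="dense" produces fewer than N distinct labels when there are ties.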

Answer by Magnus Persson

After running into the same issue while working with Fitbit sleep stages, I worked out another solution (one where I can control the mapping to ints). Here I use the pandas way of representing categorical variables. The following is a simple example showing the solution for your MWE.

import pandas as pd

df = pd.DataFrame(data={'state': ['AL', 'AR', 'AZ', 'CA'],
                        'total_sales': [16714, 6498, 107296, 33717]})

Then we simply pull out the "state" column as a categorical variable and take its integer codes:

df['label'] = df.state.astype("category").cat.codes
print(df)
  state  total_sales  label
0    AL        16714      0
1    AR         6498      1
2    AZ       107296      2
3    CA        33717      3
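
A small sketch of how the codes are assigned when no category list is given: pandas infers the categories and sorts them, so the codes follow the sorted labels rather than the order of appearance. The `states` series here is hypothetical and deliberately unsorted.

```python
import pandas as pd

# Hypothetical data, deliberately not in alphabetical order.
states = pd.Series(["CA", "AL", "AZ", "AL"])

cat = states.astype("category")

# The inferred categories are sorted, and each code is the
# position of the value within that sorted list.
categories = list(cat.cat.categories)
codes = cat.cat.codes.tolist()

print(categories)
print(codes)
```

Note that these codes start at 0, so a mapping from 1 to N needs 1 added to them.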

If you need to control the order of the codes (for example, if the order you want differs from the default sorted order of the labels), you can supply a list of allowed categories in the order you want:

df_cats = ['CA','AZ' ,'AL','AR']
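# Recent pandas versions no longer accept categories= in astype; use a CategoricalDtype instead.
df['label'] = df.state.astype(pd.CategoricalDtype(categories=df_cats)).cat.codes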
print(df)
  state  total_sales  label
0    AL        16714      2
1    AR         6498      3
2    AZ       107296      1
3    CA        33717      0
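
One possible way to tie this back to the question is to derive the category order from total_sales instead of typing it by hand. This is a sketch rather than part of the original answer; it assumes each state appears only once, and the name `order` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AL", "AR", "AZ", "CA"],
    "total_sales": [16714, 6498, 107296, 33717],
})

# Category order: states sorted from the largest total to the smallest.
# Assumes each state occurs once; duplicate categories would raise an error.
order = df.sort_values("total_sales", ascending=False)["state"].tolist()

# Codes start at 0, so add 1 to get labels from 1 to N.
dtype = pd.CategoricalDtype(categories=order)
df["label"] = df["state"].astype(dtype).cat.codes + 1

print(df)
```

For this particular task the rank approach is shorter; the categorical route is mainly useful when the mapping has to be reused or fixed in advance.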

Any label not in the category list will yield "-1". There is also an ordered=True keyword that you can use, but I don't think it matters here. For more information about the pandas categorical dtype, see: https://pandas.pydata.org/pandas-docs/stable/categorical.html
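
The "-1" behaviour can be checked with a short sketch. The extra "TX" row is hypothetical and is added only to show a label missing from the category list; `state_order` is likewise an illustrative name.

```python
import pandas as pd

# "TX" is a hypothetical extra row that is not in the category list.
df = pd.DataFrame({"state": ["AL", "AR", "AZ", "CA", "TX"]})

state_order = ["CA", "AZ", "AL", "AR"]
dtype = pd.CategoricalDtype(categories=state_order, ordered=True)

# Values outside the category list become missing, and their code is -1.
df["label"] = df["state"].astype(dtype).cat.codes

print(df)
```

Checking for -1 after the conversion is a cheap way to catch labels that were left out of the list.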