Pandas map string to int based on value in a column
Original question: http://stackoverflow.com/questions/42330624/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me).
Asked by Vijay P R
I have a data frame with two columns:
state total_sales
AL 16714
AR 6498
AZ 107296
CA 33717
Now I want to map the strings in the state column to integers from 1 to N (where N is the number of rows, here 4) based on the increasing order of the values in total_sales. The result should be stored in another column (say, label). That is, I want a result like this:
state total_sales label
AL 16714 3
AR 6498 4
AZ 107296 1
CA 33717 2
Please suggest a vectorised implementation.
Accepted answer by jezrael
Answer by Magnus Persson
After running into the same issue while working with Fitbit sleep stages, I worked out another solution (one where I can control the mapping to ints). Here I use the pandas way of representing categorical variables. The following is a simple example showing the solution for your MWE.
import pandas as pd

df = pd.DataFrame(data={'state': ['AL', 'AR', 'AZ', 'CA'],
                        'total_sales': [16714, 6498, 107296, 33717]})
Then we simply convert the "state" column to a categorical variable and take its integer codes:
df['label'] = df.state.astype("category").cat.codes
print(df)
state total_sales label
0 AL 16714 0
1 AR 6498 1
2 AZ 107296 2
3 CA 33717 3
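To check which integer each state received, the categories themselves can be inspected: a category's position in the category index is its code. The following is a minimal sketch on the same example data; the names `cat` and `mapping` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({'state': ['AL', 'AR', 'AZ', 'CA'],
                   'total_sales': [16714, 6498, 107296, 33717]})

cat = df['state'].astype('category')

# The position of each category in cat.cat.categories is its integer code
mapping = {state: code for code, state in enumerate(cat.cat.categories)}
df['label'] = cat.cat.codes

print(mapping)
print(df)
```

Building the dictionary this way makes the string-to-int mapping explicit, which can help when the codes need to be translated back to states later.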
If you need to control the order of the codes (by default the categories are sorted, so here the codes follow the alphabetical order of the states), you can supply a list of allowed categories in the order you want:
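df_cats = ['CA', 'AZ', 'AL', 'AR']
# Pass the categories through a CategoricalDtype; recent pandas versions no longer accept astype("category", categories=...)
df['label'] = df.state.astype(pd.CategoricalDtype(categories=df_cats)).cat.codes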
print(df)
state total_sales label
0 AL 16714 2
1 AR 6498 3
2 AZ 107296 1
3 CA 33717 0
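The category list does not have to be typed by hand. To get the labels asked for in the question, one option is to derive the list from total_sales. This is a sketch under two assumptions: the expected output in the question gives label 1 to the largest total_sales, so the sort is descending, and every state appears only once (a CategoricalDtype requires unique categories). The rank-based line is a different technique, included only as a cross-check.

```python
import pandas as pd

df = pd.DataFrame({'state': ['AL', 'AR', 'AZ', 'CA'],
                   'total_sales': [16714, 6498, 107296, 33717]})

# Order the states from highest to lowest total_sales
# (assumption: label 1 goes to the largest value, as in the question's expected output)
df_cats = df.sort_values('total_sales', ascending=False)['state'].tolist()
dtype = pd.CategoricalDtype(categories=df_cats)

# Codes start at 0, so add 1 to obtain labels 1..N
df['label'] = df['state'].astype(dtype).cat.codes + 1

# Cross-check with a plain rank of the sales column (no categoricals involved)
rank_label = df['total_sales'].rank(method='first', ascending=False).astype(int)

print(df)
```

If only the numeric labels are needed, the rank-based line alone may be enough; the categorical route is mainly useful when the mapping should also be kept as a dtype on the state column.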
Any label not in the category list yields -1. There is also an ordered=True keyword that you can use, but I don't think it matters here.
For more information about Pandas categorical data dtype see: https://pandas.pydata.org/pandas-docs/stable/categorical.html
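The two remarks above about -1 and ordered=True can be checked with a short sketch. The extra value 'NY' is made up for illustration and is not part of the original data; the variable names are likewise hypothetical.

```python
import pandas as pd

# 'NY' is a made-up value that is absent from the category list
states = pd.Series(['AL', 'AR', 'AZ', 'CA', 'NY'])
cats = ['CA', 'AZ', 'AL', 'AR']

unordered = states.astype(pd.CategoricalDtype(categories=cats, ordered=False)).cat.codes
ordered = states.astype(pd.CategoricalDtype(categories=cats, ordered=True)).cat.codes

# Values outside the category list become missing and get the code -1
print(unordered.tolist())

# The ordered flag does not change the integer codes
print(unordered.equals(ordered))
```

In this sketch the ordered flag only affects operations such as comparisons between categories, not the codes themselves.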