Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/43882652/

Time: 2020-09-14 03:34:26  Source: igfitidea

Mapping string categories to numbers using pandas and numpy

Tags: python, pandas, numpy

Asked by Michael Hackman

I have an array of data: each row represents a sample (5 samples), and each column represents a feature (6 features per sample).

I'm trying to quantify the number of states each column contains, then map them to a set of numbers. This should only be done if the column is not currently numeric.

This is easier to explain through example:

example input (Input is of type numpy.ndarray):

In = array([['x', 's', 3, 'k', 's', 'u'],
            ['x', 's', 2, 'n', 'n', 'g'],
            ['b', 's', 0, 'n', 'n', 'm'],
            ['k', 'y', 1, 'w', 'v', 'l'],
            ['x', 's', 2, 'o', 'c', 'l']], dtype=object)

For the first column:

curr_column = 0
colset = set()
for row in In:
    curr_element = row[curr_column]
    if curr_element not in colset:
        colset.add(curr_element)

# now colset = {'x', 'b', 'k'}, so 3 possible states
collist = list(colset)  # make it indexable
coldict = {}
for i in range(len(collist)):
    coldict[collist[i]] = i

This produces a dictionary that I can use to re-encode the original data, like so (assuming coldict = {'x':0, 'b':1, 'k':2}):

for i in range(len(In)): #loop over each row
    curr_element = In[i][curr_column] #get current element
    In[i][curr_column] = coldict[curr_element] #use it to find the numerical value
'''
now
In = array([[0, 's', 3, 'k', 's', 'u'],
            [0, 's', 2, 'n', 'n', 'g'],
            [1, 's', 0, 'n', 'n', 'm'],
            [2, 'y', 1, 'w', 'v', 'l'],
            [0, 's', 2, 'o', 'c', 'l']], dtype=object)
'''

Now repeat the process for every column.

I'm aware that I could speed this up by populating all the column dictionaries in one pass over the dataset, and then replacing all the values in a single loop as well. I left that out to keep the process clear.
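
For reference, the one-pass idea could look roughly like the sketch below. The function name encode_columns and the rule "a column is non-numeric if any value is a str" are assumptions made for illustration, not part of the original post. It still uses plain Python loops, so it mainly saves passes over the data rather than making each step faster.

```python
import numpy as np

In = np.array([['x', 's', 3, 'k', 's', 'u'],
               ['x', 's', 2, 'n', 'n', 'g'],
               ['b', 's', 0, 'n', 'n', 'm'],
               ['k', 'y', 1, 'w', 'v', 'l'],
               ['x', 's', 2, 'o', 'c', 'l']], dtype=object)

def encode_columns(data):
    """Return (encoded copy, per-column dicts); numeric columns get None."""
    n_rows, n_cols = data.shape
    # Assumption: a column is non-numeric if any of its values is a string.
    is_text = [any(isinstance(v, str) for v in data[:, j]) for j in range(n_cols)]
    mappings = [{} if text else None for text in is_text]
    encoded = data.copy()
    for i in range(n_rows):
        for j in range(n_cols):
            mapping = mappings[j]
            if mapping is not None:
                # setdefault hands out the next free code on first sight of a value
                encoded[i, j] = mapping.setdefault(data[i, j], len(mapping))
    return encoded, mappings

encoded, mappings = encode_columns(In)
```

Codes are assigned in order of first appearance, so the first column here gets x=0, b=1, k=2, and the numeric third column is left untouched.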

This is horribly inefficient in both space and time, and it takes a long time on large data. In which ways could this algorithm be improved? Is there a mapping function in numpy or pandas that could either accomplish this or help me?

I considered something similar to

np.unique(Input, axis=1)

but I need this to be portable, and not everyone has the 1.13.0 development version of numpy.
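
As a possible workaround that does not need the axis argument, np.unique can be called on one column at a time with return_inverse=True, which, as far as I know, is a much older option. This is only a sketch for a single string column; note that the codes follow the sorted order of the states, not the order in which they first appear.

```python
import numpy as np

In = np.array([['x', 's', 3, 'k', 's', 'u'],
               ['x', 's', 2, 'n', 'n', 'g'],
               ['b', 's', 0, 'n', 'n', 'm'],
               ['k', 'y', 1, 'w', 'v', 'l'],
               ['x', 's', 2, 'o', 'c', 'l']], dtype=object)

# Column 0 only; the same call works for any column that holds strings.
col = In[:, 0].astype(str)

# states: sorted unique values; codes: index into states for every row
states, codes = np.unique(col, return_inverse=True)
codes = codes.ravel()  # keep the codes 1-D regardless of numpy version

# Equivalent of coldict from the loop version, but built by numpy
coldict = {state: i for i, state in enumerate(states.tolist())}
```

Here states comes out as b, k, x, so the column is encoded as 2, 2, 0, 1, 2.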

Also, how would I differentiate between columns that are numeric and ones that aren't to decide which columns I should apply this to?

Accepted answer by Andy Hayden

You can use Categorical codes. See the Categorical section of the pandas docs.

In [11]: df
Out[11]:
   0  1  2  3  4  5
0  x  s  3  k  s  u
1  x  s  2  n  n  g
2  b  s  0  n  n  m
3  k  y  1  w  v  l
4  x  s  2  o  c  l

In [12]: for col in df.columns:
     ...:     df[col] = pd.Categorical(df[col], categories=df[col].unique()).codes

In [13]: df
Out[13]:
   0  1  2  3  4  5
0  0  0  0  0  0  0
1  0  0  1  1  1  1
2  1  0  2  1  1  2
3  2  1  3  2  2  3
4  0  0  1  3  3  3


I suspect there's a small change that would allow doing this without passing the categories explicitly. (Note: pandas does guarantee that .unique() returns values in the order they are first seen.)
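
One option along those lines (not from the original answer, so treat it as a sketch) is pd.factorize, which returns integer codes in order of first appearance together with the unique values, so no categories need to be passed:

```python
import pandas as pd

df = pd.DataFrame([['x', 's', 3, 'k', 's', 'u'],
                   ['x', 's', 2, 'n', 'n', 'g'],
                   ['b', 's', 0, 'n', 'n', 'm'],
                   ['k', 'y', 1, 'w', 'v', 'l'],
                   ['x', 's', 2, 'o', 'c', 'l']])

# factorize returns (codes, uniques); keep uniques to decode later
lookup = {}
for col in df.columns:
    codes, uniques = pd.factorize(df[col])
    df[col] = codes
    lookup[col] = list(uniques)
```

With this input the result matches the Categorical version above, and lookup[0] holds x, b, k in that order, so code i maps back to lookup[col][i].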

Note: To "differentiate between columns that are numeric and ones that aren't", you can use select_dtypes before iterating:

for col in df.select_dtypes(exclude=['int']).columns:
    ...
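
A fuller sketch of that loop, starting from the object-dtype array in the question. Two things here are my assumptions rather than part of the answer: infer_objects() is called first, because a DataFrame built from an object array keeps every column as object (so nothing would be excluded), and exclude=['number'] is used instead of 'int' so that float columns are skipped too.

```python
import numpy as np
import pandas as pd

In = np.array([['x', 's', 3, 'k', 's', 'u'],
               ['x', 's', 2, 'n', 'n', 'g'],
               ['b', 's', 0, 'n', 'n', 'm'],
               ['k', 'y', 1, 'w', 'v', 'l'],
               ['x', 's', 2, 'o', 'c', 'l']], dtype=object)

# infer_objects() turns the all-integer object column into a real int column
df = pd.DataFrame(In).infer_objects()

# Only the non-numeric columns are selected for encoding
text_cols = df.select_dtypes(exclude=['number']).columns

for col in text_cols:
    df[col] = pd.Categorical(df[col], categories=df[col].unique()).codes
```

Column 2 keeps its original values 3, 2, 0, 1, 2, while the other five columns are replaced by their codes.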

Answer by Ben

Pandas also has a map function that you can use. So, if for example you have this dictionary that maps the strings to codes:

codes = {'x':0, 'b':1, 'k':2}

You can use the map function to map the column in the pandas dataframe:

df[col] = df[col].map(codes)
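
One thing to watch with map: values that are missing from the dictionary become NaN, which also turns the result into a float column. A small sketch (the extra 'z' value is made up here to show the effect):

```python
import pandas as pd

codes = {'x': 0, 'b': 1, 'k': 2}

# 'z' is a hypothetical value that has no entry in codes
s = pd.Series(['x', 'x', 'b', 'k', 'x', 'z'])

mapped = s.map(codes)

# Rows whose value was not in the dictionary
unmapped = s[mapped.isna()]
```

Checking mapped.isna() after the call is a cheap way to catch categories the dictionary does not cover.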