pandas: converting text features to numeric values

Question

Asked by Don Smythe

I can convert all text features in a pandas DataFrame by casting them to 'category' with the df.astype() method, as below. However, I find the category dtype hard to work with (e.g., when plotting data) and would prefer to create a new column of integers.

#convert all objects to categories
object_types = dataset.select_dtypes(include=['O'])
for col in object_types:
    dataset['{0}_category'.format(col)] = dataset[col].astype('category')

I can convert the text to integers using this hack:

#convert all objects to int values
object_types = dataset.select_dtypes(include=['O'])

new_cols = {}
for col in object_types:
    data_set = set(dataset[col].tolist())
    data_indexed = {}
    for i, item in enumerate(data_set):
        data_indexed[item] = i
    new_list = []
    for item in dataset[col].tolist():
        new_list.append(data_indexed[item])
    new_cols[col]=new_list

for key, val in new_cols.items():
    dataset['{0}_int_value'.format(key)] = val

But is there a better (or existing) way to do the same?

Answer 1

Answered by MaxU

I would use the factorize method, which is designed for exactly this task:

In [90]: x
Out[90]:
    A  B
9   c  z
10  c  z
4   b  x
5   b  y
1   a  w
7   b  z

In [91]: x.apply(lambda col: pd.factorize(col, sort=True)[0])
Out[91]:
    A  B
9   2  3
10  2  3
4   1  1
5   1  2
1   0  0
7   1  3

Or, without sorting, so that the codes follow the order in which each value first appears:

In [92]: x.apply(lambda col: pd.factorize(col)[0])
Out[92]:
    A  B
9   0  0
10  0  0
4   1  1
5   1  2
1   2  3
7   1  0
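
Applied to the question's setup, the same idea might look like the sketch below. The sample `dataset`, the `lookup` dictionary, and the numeric column `n` are hypothetical and only for illustration; `pd.factorize` returns both the integer codes and the unique values, so the original text can be recovered later.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's `dataset`.
# dtype=object is set explicitly so the text columns match include=['object'].
dataset = pd.DataFrame({
    "A": pd.Series(list("ccbbab"), dtype=object),
    "B": pd.Series(list("zzxywz"), dtype=object),
    "n": [1, 2, 3, 4, 5, 6],
})

# Labels of the text (object-dtype) columns only; numeric columns are skipped.
text_cols = dataset.select_dtypes(include=["object"]).columns

lookup = {}  # hypothetical helper: column name -> unique values, indexed by code
for col in text_cols:
    codes, uniques = pd.factorize(dataset[col], sort=True)
    dataset["{0}_int_value".format(col)] = codes
    lookup[col] = uniques

# Map the codes back to the original labels.
restored_A = lookup["A"].take(dataset["A_int_value"].to_numpy())
```

Keeping `uniques` around is optional, but it makes the encoding reversible, which helps when labelling plot axes.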

Answer 2

Answered by piRSquared

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))

df

You can convert to integers like this:

def intify(s):
    u = np.unique(s)
    i = np.arange(len(u))
    return s.map(dict(zip(u, i)))

Or a shorter version:

def intify(s):
    u = np.unique(s)
    return s.map({k: i for i, k in enumerate(u)})

df.apply(intify)

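Or in a single line (note that s.unique() numbers the values by order of first appearance, whereas np.unique sorts them first):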

df.apply(lambda s: s.map({k:i for i,k in enumerate(s.unique())}))
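
The two variants above agree on this df only because its values already appear in sorted order. A minimal sketch of the difference on unsorted data follows; the series `s` and the variable names are made up for illustration. The comparison with `.cat.codes` assumes string values with no missing entries, in which case `astype('category')` orders its categories lexically.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted sample column.
s = pd.Series(list("ccbbab"), dtype=object)

# np.unique sorts the distinct values: a -> 0, b -> 1, c -> 2.
sorted_codes = s.map({k: i for i, k in enumerate(np.unique(s))})

# s.unique() keeps order of first appearance: c -> 0, b -> 1, a -> 2.
appearance_codes = s.map({k: i for i, k in enumerate(s.unique())})

# The category dtype from the question exposes integer codes directly.
category_codes = s.astype("category").cat.codes

print(sorted_codes.tolist())      # [2, 2, 1, 1, 0, 1]
print(appearance_codes.tolist())  # [0, 0, 1, 1, 2, 1]
print(category_codes.tolist())    # [2, 2, 1, 1, 0, 1]
```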
