pandas and category replacement

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/29709918/

Date: 2020-09-13 23:13:13  Source: igfitidea

pandas and category replacement

Tags: python, arrays, pandas, categories

Asked by andrewgcross

I'm trying to reduce the size of ~300 csv files (about a billion rows) by replacing lengthy fields with shorter, categorical, values.


I'm making use of pandas, and I've iterated through each of the files to build an array that includes all of the unique values I'm trying to replace. I can't just use pandas.factorize on each file individually, because I need (for example) '3001958145' to map to the same value in file1.csv as well as file244.csv. I've created an array of what I'd like to replace these values with, simply by creating another array of incremented integers.


In [1]: toreplace = data['col1'].unique()
Out[1]: array([1000339602, 1000339606, 1000339626, ..., 3001958145, 3001958397,
   3001958547], dtype=int64)

In [2]: replacewith = range(0,len(data['col1'].unique()))
Out[2]: [0, 1, 2,...]
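The collection step described above can be sketched as follows (the file list and the column name are placeholders; this is an assumption about the setup, not code from the question):

```python
import numpy as np
import pandas as pd

def collect_uniques(paths, col="col1"):
    """Gather the union of unique values of `col` across many CSV files."""
    seen = set()
    for path in paths:
        # usecols keeps memory down: read only the column being recoded
        seen.update(pd.read_csv(path, usecols=[col])[col].unique())
    toreplace = np.array(sorted(seen))
    # one global integer code per unique value, shared by every file
    replacewith = np.arange(len(toreplace))
    return toreplace, replacewith
```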

Now, how do I go about efficiently swapping in my 'replacewith' variable for each corresponding 'toreplace' value for each of the files I need to iterate through?


As capable as pandas is at dealing with categories, I assume there has to be a method out there that can accomplish this which I simply can't find. The function I wrote to do this works (it relies on the output of pandas.factorize rather than the arrangement I described above), but it relies on the replace function and iterates through the series, so it's quite slow.


import numpy as np

def powerreplace(pdseries, factorized):
    uniques = pdseries.unique()  # compute once instead of on every iteration
    for i, unique in enumerate(uniques):
        print('%i/%i' % (i, len(uniques)))
        pdseries.replace(to_replace=unique,
                         value=np.where(factorized[1] == unique)[0][0],
                         inplace=True)

Can anyone recommend a better way to go about doing this?

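For reference, the swap itself can be vectorized with `Series.map` and a precomputed dict, which avoids the per-value `replace` calls; the sample values below are borrowed from the session above, and the variable names are assumptions:

```python
import numpy as np
import pandas as pd

toreplace = np.array([1000339602, 1000339606, 3001958145], dtype=np.int64)
mapping = {v: i for i, v in enumerate(toreplace)}  # value -> global code

s = pd.Series([3001958145, 1000339602, 1000339606])
codes = s.map(mapping)  # vectorized lookup in a single pass
print(codes.tolist())   # [2, 0, 1]
```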

Answered by Jeff

This requires at least pandas 0.15.0 (however, the .astype syntax is a bit friendlier in 0.16.0, so it's better to use that). Here are the docs for categoricals.


Imports


In [101]: import pandas as pd
In [102]: import string
In [103]: import numpy as np    
In [104]: np.random.seed(1234)
In [105]: pd.set_option('max_rows',10)

Create a sample set to create some data


In [106]: uniques = np.array(list(string.ascii_letters))
In [107]: len(uniques)
Out[107]: 52

Create some data


In [109]: df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})

In [110]: df1.head()
Out[110]: 
   A
0  p
1  t
2  g
3  v
4  m

In [111]: df1.A.nunique()
Out[111]: 31

In [112]: df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})

In [113]: df2.head()
Out[113]: 
   A
0  I
1  j
2  b
3  A
4  m
In [114]: df2.A.nunique()
Out[114]: 52

So we now have 2 frames that we want to categorize; the first frame happens to have less than the full set of categories. This is on purpose; you don't have to know the complete set upfront.


Convert the A column to a B column that is a Categorical


In [116]: df1['B'] = df1['A'].astype('category')

In [118]: i = df1['B'].cat.categories

In [124]: i
Out[124]: Index([u'A', u'B', u'C', u'D', u'E', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], dtype='object')

If we are iteratively processing these frames, we start from the first one's categories. For each successive frame, we add the symmetric difference with the existing set. This keeps the categories in the same order, so when we factorize we get the same numbering scheme.


In [119]: cats = i.tolist() + i.sym_diff(df2['A'].astype('category').cat.categories).tolist()

We have now gained back the original set


In [120]: (np.array(sorted(cats)) == sorted(uniques)).all()
Out[120]: True

Set the next frame's B column to be a categorical, BUT we specify the categories, so that when it is factorized the same values are used.


In [121]: df2['B'] = df2['A'].astype('category',categories=cats)

To prove it, we select the codes (the factorized map) from each. These codes match; df2 has an additional code (as Z is in the 2nd frame but not the first).


In [122]: df1[df1['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[122]: array([30,  0,  5])

In [123]: df2[df2['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[123]: array([ 0, 30,  5, 51])

You can then simply store the codes in lieu of the object-dtyped data.

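Getting at those codes is a one-liner via the `.cat` accessor; a small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["b", "a", "b", "c"]).astype("category")
codes = s.cat.codes               # small integer dtype; -1 marks missing values
print(codes.tolist())             # [1, 0, 1, 2]
print(list(s.cat.categories))     # ['a', 'b', 'c']
```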

Note that it is actually quite efficient to serialize these to HDF5, as Categoricals are stored natively; see here.

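A minimal sketch of such a round trip (this assumes the optional PyTables dependency is installed; the file name is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"B": pd.Series(list("aabc")).astype("category")})
# format='table' lets HDFStore keep the categorical as codes + categories
df.to_hdf("demo.h5", key="df", format="table")
back = pd.read_hdf("demo.h5", "df")
print(back["B"].dtype)  # category
```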

Note that this is also a pretty memory-efficient way of storing the data. Looking at the memory usage in [154], the object dtype costs MUCH more the longer the strings get, because the column only holds pointers; the actual values live on the heap. The figure in [155], by contrast, is ALL of the memory used.


In [153]: df2.dtypes
Out[153]: 
A      object
B    category
dtype: object

In [154]: df2.A.to_frame().memory_usage()
Out[154]: 
A    8000000
dtype: int64

In [155]: df2.B.to_frame().memory_usage()
Out[155]: 
B    1000416
dtype: int64

Answered by Alexander

First, let's create some random 'categorical' data.


# Create some data
random_letters = list('ABCDEFGHIJ')
s_int = pd.Series(np.random.randint(0, 10, 100))  # np.random.random_integers is deprecated
s = pd.Series([random_letters[i] for i in s_int])
>>> s.unique()
array(['J', 'G', 'D', 'C', 'F', 'B', 'H', 'A', 'I', 'E'], dtype=object)

Now we'll create a mapping of the unique categories to integers.


# Create a mapping of integers to the relevant categories.
mapping = {k: v for v, k in enumerate(s.unique())}

>>> mapping
{'A': 7,
 'B': 5,
 'C': 3,
 'D': 2,
 'E': 9,
 'F': 4,
 'G': 1,
 'H': 6,
 'I': 8,
 'J': 0}

Then we use a list comprehension to do an in-place replacement of the categories with their mapped integers (the underscore assignment captures an unused dummy value).


_ = [s.replace(cat, mapping[cat], inplace=True) for cat in mapping]

>>> s.head()
0    0
1    1
2    2
3    3
4    4
dtype: int64

If you wish to reverse the process and obtain the original categories:


reverse_map = {v: k for k, v in mapping.items()}  # .iteritems() is Python 2 only

reverse_map
{0: 'J',
 1: 'G',
 2: 'D',
 3: 'C',
 4: 'F',
 5: 'B',
 6: 'H',
 7: 'A',
 8: 'I',
 9: 'E'}

_ = [s.replace(code, reverse_map[code], inplace=True) for code in reverse_map]

>>> s.head()
0    J
1    G
2    D
3    C
4    F
dtype: object