Pandas Multiindex:我做错了什么?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26221033/

Date: 2020-09-13 22:32:35  Source: igfitidea

Pandas Multiindex: what am I doing wrong?

python, numpy, pandas, cluster-analysis

Asked by Nathan Lloyd

I have a program with a large pandas dataframe of pairwise interactions (one per row) that I do random walks through. The list of options for each successive step is narrowed down from the entire dataframe by specific values in two columns; basically,


df_options = df[(df.A == x) & (df.B == y)]

I had the thing working using syntax like above, but it seemed like it would be a great idea in terms of speed (which was limiting) to index df by A, B like so:


df.sort(['A', 'B'], inplace=True)
df.index = df.index.rename('idx')
df = df.set_index(['A', 'B'], drop=False, append=True, verify_integrity=True)

(note I'm keeping the original index as 'idx' because that was how I was recording the random walks and accessing specific rows)
So then I replaced the original df_options code with, firstly,
df.xs((x, y), level=('A', 'B'))
and after having problems with that,
df.loc(axis=0)[:,A,B]

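To make the intended cross-section concrete, here is a minimal, self-contained sketch with made-up data shaped like the question's setup (original index kept as 'idx', then indexed by the two lookup columns):

```python
import pandas as pd

# toy stand-in for the question's frame
df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "y", "x"], "sim": [0.1, 0.2, 0.3]})
df.index = df.index.rename("idx")
df = df.set_index(["A", "B"], drop=False, append=True)

# all rows where A == 1 and B == "x"; the 'idx' level survives the cross-section
sub = df.xs((1, "x"), level=("A", "B"))
print(sub.shape[0])  # prints 1
```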

Also, where I needed specific values, the original syntax changed from


df_options.loc[new, 'sim']

to

df_options.xs(new, level='idx')['sim'].values[0]

or


df_options.loc(axis=0)[new,:,:]['sim'].values[0]

("new" is the randomly chosen next index of df, and 'sim' is a column of pairwise similarity scores.)
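The .values[0] calls are needed because a cross-section by level comes back as a one-row DataFrame rather than a scalar. A small illustration with made-up values:

```python
import pandas as pd

df = pd.DataFrame(
    {"sim": [0.5, 0.7]},
    index=pd.MultiIndex.from_tuples(
        [(10, 1, "a"), (11, 2, "b")], names=["idx", "A", "B"]
    ),
)
row = df.xs(10, level="idx")   # still a 1-row DataFrame, not a scalar
print(row["sim"].values[0])    # prints 0.5
```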


As I hacked away trying to get this to work, I kept getting errors like '... not hashable' and AttributeError: 'Int64Index' object has no attribute 'get_loc_level'


Which brings me to the question in the title: what am I doing wrong? More specifically:
1) Does MultiIndex really have the potential to speed this process up like I think?
2) If so, what are the correct idioms to use here (it feels like I'm up a creek with .xs and .loc)?
3) Or should I use something else, like raw numpy?


EDIT: In the process of creating an example with code, I managed to get it working. I would say that I had to jump through some awkward hoops though, like .values[0] in row.p2.values[0] and df.index[rand_pair][0][0].


In response to Jeff: pandas 0.14.1


df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 561567 entries, (0, 0, 003) to (561566, 26127, 011)
Data columns (total 14 columns):
p1              561567 non-null int64
smp1            561567 non-null object
rt1             561567 non-null float64
cas1            561567 non-null object
sim1            561567 non-null float64
p2              561567 non-null int64
smp2            561567 non-null object
rt2             561567 non-null float64
cas2            561567 non-null object
sim2            561567 non-null float64
nlsim1          561567 non-null float64
sum_spec_sq1    561567 non-null float64
sum_spec_sq2    561567 non-null float64
sum_s1s2        561567 non-null float64
dtypes: float64(8), int64(2), object(4)

Note: "p1", "smp2", and "nlsim1" correspond to "A", "B", and "sim" in my question above. Enough data to walk a couple of steps:


df = pd.DataFrame({u'nlsim1': {174513: 0.8782, 270870: 0.9461, 478503: 0.8809},
 u'p1': {174513: 8655, 270870: 13307, 478503: 22276},
 u'p2': {174513: 13307, 270870: 22276, 478503: 2391},
 u'smp1': {174513: u'007', 270870: u'010', 478503: u'016'},
 u'smp2': {174513: u'010', 270870: u'016', 478503: u'002'}})
df.index = df.index.rename('idx')
df = df.set_index(['p1', 'smp2'], drop=False, append=True, verify_integrity=True)

def weighted_random_choice():
    options = df_options.index.tolist()
    tot = df_options.nlsim1.sum()
    options_weight = df_options.nlsim1 / tot
    return np.random.choice(options, p=list(options_weight))
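The weighting step itself can be sketched in isolation: normalize the similarity scores to probabilities that sum to 1, then draw. (Made-up scores below; drawing an integer row position also sidesteps passing index tuples to np.random.choice, which only accepts 1-D inputs.)

```python
import numpy as np

scores = np.array([0.8782, 0.9461, 0.8809])  # e.g. nlsim1 values
probs = scores / scores.sum()                # normalize so the weights sum to 1
rng = np.random.default_rng(0)               # seeded for reproducibility
pos = rng.choice(len(scores), p=probs)       # weighted draw of a row position
print(0 <= pos < len(scores))                # prints True
```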

Initiates the walk:


samples = set([c for a, b, c in df.index.values])
df_numbered = range(df.shape[0])
#rand_pair = random.sample(df_numbered, 1)
rand_pair = [0]
path = [df.index[rand_pair][0][0]]
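The df.index[rand_pair][0][0] chain is just "take the first selected index entry, then its first level"; with tuples shaped like the example frame's:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(174513, 8655, "007"), (270870, 13307, "010")],
    names=["idx", "p1", "smp2"],
)
rand_pair = [0]
entry = idx[rand_pair]   # a 1-element MultiIndex
print(entry[0][0])       # first level ('idx') of that entry: prints 174513
```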

The walk (iterate it):


row = df.loc[path[-1],:,:]
p = row.p2.values[0]
smp = row.smp2.values[0]
print p, smp
samples.discard(smp)
print sorted(list(samples))
pick_sample = random.sample(samples, 1)[0]
print pick_sample
df_options = df.xs((p, pick_sample), level=('p1', 'smp2'))
if df_options.shape[0] < 1:
    print "out of options, stop iterating"
    print "path=", path
else:
    print "# options: ", df_options.shape[0]
    new = weighted_random_choice()
    path.append(new)
    print path
    print "you should keep going"

output, 1st step:


13307 010
[u'002', u'016']
016
# options:  1
[174513, 270870]
you should keep going

2nd step:


22276 016
[u'002']
002
# options:  1
[174513, 270870, 478503]
you should keep going

The 3rd step errors as expected because it runs out of samples.
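One subtlety here: df.xs raises KeyError when the (p1, smp2) pair is absent rather than returning an empty frame, so the shape check in the loop never actually sees an empty df_options; the final step errors instead of reaching the "out of options" branch. A sketch of one way to guard it (toy data):

```python
import pandas as pd

df = pd.DataFrame(
    {"nlsim1": [0.9]},
    index=pd.MultiIndex.from_tuples(
        [(0, 8655, "007")], names=["idx", "p1", "smp2"]
    ),
)
try:
    df_options = df.xs((9999, "zzz"), level=("p1", "smp2"))
except KeyError:
    df_options = df.iloc[0:0]  # empty frame stands in for "out of options"
print(df_options.shape[0])     # prints 0
```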


Accepted answer by Nathan Lloyd

Well, the simple fix was to use two copies of the data frame, the original and another one indexed by 'A' and 'B':


dfi = df.set_index(['A', 'B'])

By changing the "select specific A, B" paradigm from


df_options = df[(df.A == x) & (df.B == y)]

to

df_options = dfi.loc(axis=0)[x, y]

I was able to gain a 5x improvement in speed. It should scale better with the size of df.
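In current pandas the same two-copy pattern might be sketched like this (names are illustrative; sort_index matters because label lookups on a sorted MultiIndex avoid a full scan):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "y", "x"], "sim": [0.1, 0.2, 0.3]})

# second copy indexed by the lookup columns, sorted for fast label access
dfi = df.set_index(["A", "B"]).sort_index()

# a list-of-tuples selector always returns a DataFrame, even for one match
df_options = dfi.loc[[(1, "x")]]
print(df_options["sim"].iloc[0])  # prints 0.1
```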
