Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/54307300/

What causes "indexing past lexsort depth" warning in Pandas?

Tags: python, pandas

Asked by Josh Friedlander

I'm indexing a large multi-index Pandas df using df.loc[(key1, key2)]. Sometimes I get a Series back (as expected), but other times I get a DataFrame. I'm trying to isolate the cases which cause the latter, but so far all I can see is that it's correlated with getting a "PerformanceWarning: indexing past lexsort depth may impact performance" warning.

I'd like to reproduce it to post here, but I can't generate another case that gives me the same warning. Here's my attempt:

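import numpy as np
import pandas as pd

# Draw n random timestamps (second resolution) between start and end.
def random_dates(start, end, n=10):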
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

np.random.seed(0)
df = pd.DataFrame(np.random.random(3255000).reshape(465000,7))  # same shape as my data
df['date'] = random_dates(pd.to_datetime('1990-01-01'), pd.to_datetime('2018-01-01'), 465000)
df = df.set_index([0, 'date'])
df = df.sort_values(by=[3])  # unsort indices, just in case
df.index.lexsort_depth
> 0
df.index.is_monotonic
> False
df.loc[(0.9987185534991936, pd.to_datetime('2012-04-16 07:04:34'))]
# no warning

So my question is: what causes this warning? How do I artificially induce it?

Answered by cs95

I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.

         col
one two     
c   u      9

You'll see the same behaviour with xs:

df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.

         col
one two     
c   u      9
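
To check for the warning programmatically rather than by eye, it can be captured with the standard warnings module. The sketch below is an illustration, not part of the original answer: the helper warns_on_lookup and the frames dup, dup_sorted and uniq are names made up for the example. In the pandas versions I have looked at, a full key on a unique index appears to take a direct lookup and stay silent even when the index is unsorted, which may be why the attempt in the question produced no warning; treat that as an observation that could differ between versions.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning


def warns_on_lookup(frame, key):
    """Return True if frame.loc[key] emits a PerformanceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeats are not filtered out
        frame.loc[key]
    return any(issubclass(w.category, PerformanceWarning) for w in caught)


# Same index as in the answer: ('b', 't') appears twice and is out of order.
mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")], names=["one", "two"]
)
dup = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)
dup_sorted = dup.sort_index()

# Hypothetical unique index, reversed so that it is definitely unsorted.
uniq_mux = pd.MultiIndex.from_product(
    [list("abcd"), list("tuvw")], names=["one", "two"]
)
uniq = pd.DataFrame({"col": np.arange(len(uniq_mux))}, index=uniq_mux).iloc[::-1]

print("unsorted, duplicated:", warns_on_lookup(dup, ("c", "u")))
print("sorted, duplicated:  ", warns_on_lookup(dup_sorted, ("c", "u")))
print("unsorted, unique:    ", warns_on_lookup(uniq, ("c", "u")))
```

Recording with simplefilter("always") matters here because Python's default filters may show a given warning only once per code location, which can make a second lookup look clean when it is not.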

The docs, backed by a timing test I once did, seem to suggest that handling unsorted indexes imposes a slowdown: indexing is O(N) time when it could/should be O(1).

If you sort the index before slicing, you'll notice the difference:

df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]

         col
one two     
c   u      9


%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]

802 μs ± 12.1 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 μs ± 20.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
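
On a 16-row frame the gap is small. A rough way to see how it scales is to repeat the comparison on a larger index with duplicates. The sketch below is an illustration with made-up names and sizes (big, big_sorted, n); timings depend on the machine and the pandas version, so no numbers are quoted here.

```python
import timeit
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

rng = np.random.default_rng(0)
n = 100_000  # arbitrary size chosen for the illustration

# Two integer levels with many repeated pairs, in random (unsorted) order.
mux = pd.MultiIndex.from_arrays(
    [rng.integers(0, 1000, n), rng.integers(0, 100, n)], names=["one", "two"]
)
big = pd.DataFrame({"col": np.arange(n)}, index=mux)
big_sorted = big.sort_index()

key = big.index[0]  # a key that is guaranteed to exist
repeats = 20

with warnings.catch_warnings():
    # Silence the warning so it does not clutter the timing output.
    warnings.simplefilter("ignore", PerformanceWarning)
    t_unsorted = timeit.timeit(lambda: big.loc[key], number=repeats)
    t_sorted = timeit.timeit(lambda: big_sorted.loc[key], number=repeats)

print(f"unsorted: {t_unsorted / repeats * 1e6:.0f} us per lookup")
print(f"sorted:   {t_sorted / repeats * 1e6:.0f} us per lookup")
```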

Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.

df.index.is_lexsorted()
# False

df2.index.is_lexsorted()
# True
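
As far as I know, MultiIndex.is_lexsorted was deprecated in pandas 1.3 and is absent from pandas 2.x, so the call above may fail on a recent install. The sketch below is not from the original answer; it runs the same check on the example data with is_monotonic_increasing, which I understand to be the suggested replacement.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")], names=["one", "two"]
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)
df2 = df.sort_index()

# Non-strict check: repeated keys are fine as long as nothing goes backwards.
unsorted_ok = df.index.is_monotonic_increasing
sorted_ok = df2.index.is_monotonic_increasing
print(unsorted_ok, sorted_ok)
```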


As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:

df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]

If your index is not unique, add a cumcounted level first,

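# Add a per-duplicate counter as an extra level so every index entry is unique.
df2 = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)
# Shuffle the now-unique index labels and select the rows in that order.
df2 = df2.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]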
df2 = df2.reset_index(level=-1, drop=True)
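
A positional shuffle is an alternative that sidesteps the uniqueness question, since it permutes row positions instead of looking rows up by label. This is a swapped-in technique, not the method from the answer, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")], names=["one", "two"]
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)

rng = np.random.default_rng(0)
# Permute row positions; duplicate index entries do not matter to iloc.
shuffled = df.iloc[rng.permutation(len(df))]

# Inspect whether the shuffled index still happens to be sorted.
print(shuffled.index.is_monotonic_increasing)
```

DataFrame.sample(frac=1) would do much the same job in one call.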

Answered by Andrew Naguib

According to the pandas advanced indexing docs (Sorting a MultiIndex):

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex

And also:

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

According to them, you may need to ensure that indices are sorted properly.
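
Beyond the warning, the same section of the docs notes that indexing an index that is not fully lexsorted can raise an UnsortedIndexError, which shows up when slicing by label ranges. A minimal sketch on the example data from the earlier answer follows; the variable names are illustrative and the exact error message may vary by version.

```python
import numpy as np
import pandas as pd
from pandas.errors import UnsortedIndexError

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")], names=["one", "two"]
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)

# A label-range slice needs both levels sorted; here only the first one is.
try:
    df.loc[("a", "t"):("c", "u")]
    slice_failed = False
except UnsortedIndexError as err:
    slice_failed = True
    print("unsorted index:", err)

# After sorting, the same slice goes through.
df_sorted = df.sort_index()
subset = df_sorted.loc[("a", "t"):("c", "u")]
print(len(subset))
```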