Python: shuffling/permuting a DataFrame in pandas

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/15772009/

Time: 2020-08-18 20:57:49  Source: igfitidea

shuffling/permutating a DataFrame in pandas

python numpy pandas

Question

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e., how do I write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times?

Edit: the key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index, that loses all that information. I want the resulting df to be the same as the original except with the order of rows or the order of columns different.

Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:

for 1...n:
  for each col in df: shuffle column
return new_df

But hopefully more efficient than naive looping. This does not work for me:

def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
        return shuffled_df

df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)

Accepted answer by root

In [16]: def shuffle(df, n=1, axis=0):     
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:     

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: shuffle(df)

In [19]: df
Out[19]: 
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
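
The function above depends on apply handing np.random.shuffle data that it can mutate in place, which may not hold in every pandas version. Below is a minimal sketch of the same idea (each column, or each row, shuffled independently while the labels stay put). It assumes NumPy 1.20 or later for Generator.permuted and a DataFrame whose columns share one dtype. The name shuffle_independently and the seed parameter are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd


def shuffle_independently(df, n=1, axis=0, seed=None):
    """Return a copy of df whose values are shuffled n times along `axis`.

    axis=0 shuffles within each column, axis=1 within each row.
    Row and column labels are left exactly as they were.
    """
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # work on a copy, never on df itself
    for _ in range(n):
        # permuted() shuffles every 1-D slice along `axis` independently
        values = rng.permuted(values, axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({"A": range(10), "B": range(10)})
result = shuffle_independently(df, n=5, axis=0, seed=0)
```

Because the values are rebuilt into a new DataFrame, the original df is untouched and the pairing between A and B is broken, which is what Edit2 of the question asks for.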

Answer by Zelazny7

Use numpy's random.permutation function:

In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9


In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4
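
The reindex call above reorders whole rows by label. As a hedged sketch, the same reordering can be done by position along either axis with DataFrame.take, which also covers the column case from the question. The helper name permute_labels and its seed parameter are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def permute_labels(df, axis=0, seed=None):
    """Return a copy with whole rows (axis=0) or whole columns (axis=1) reordered."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(df.shape[axis])  # random positions along the chosen axis
    return df.take(order, axis=axis)         # labels travel with their data


df = pd.DataFrame({"A": range(10), "B": range(10)})
by_rows = permute_labels(df, axis=0, seed=0)
by_cols = permute_labels(df, axis=1, seed=0)
```

Each row (or column) moves as a unit here, so the association between A and B within a row is preserved.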

Answer by Midnighter

I resorted to adapting @root 's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing but it works perfectly for just shuffling the data.

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})    

In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 μs per loop

In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 22.8 μs per loop

In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 μs per loop

In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 23.4 μs per loop

Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.

In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)

In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)

Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:

def shuffle(df, n=1, axis=0):     
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df

Answer by JeromeZhao

This might be more useful when you want your index shuffled.

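import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix was removed in pandas 1.0; .loc selects by label
    df = df.reset_index()  # returns a new frame, so assign it; pass drop=True to discard the old labels
    return df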

It selects the rows in the shuffled index order, then resets the index.

Answer by Evan Zamir

From the docs, use sample():

In [79]: s = pd.Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]: 
0    0
dtype: int64

# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]: 
5    5
2    2
4    4
dtype: int64

# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]: 
5    5
4    4
1    1
dtype: int64

Answer by W.P. McNeill

Sampling randomizes, so just sample the entire data frame.

df.sample(frac=1)
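
A short sketch of how this one-liner might be used in practice, assuming a reproducible shuffle is wanted. The random_state and axis arguments are existing parameters of DataFrame.sample; the example frame and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": range(5), "B": range(5, 10), "C": range(10, 15)})

# Whole rows move together; the original index labels are kept.
rows_shuffled = df.sample(frac=1, random_state=0)

# Whole columns move together when sampling along axis=1.
cols_shuffled = df.sample(frac=1, axis=1, random_state=0)

# Optionally renumber the rows 0..n-1 after shuffling.
renumbered = rows_shuffled.reset_index(drop=True)
```

Note that this reorders complete rows or columns, so it does not break the association between values in the same row.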

Answer by ashimashi

Here is a workaround I found if you only want to shuffle a subset of the DataFrame:

shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])

Answer by Franck Dernoncourt

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):

# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))

# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4


df:    A  B
1  1  1
0  0  0
3  3  3
4  4  4
2  2  2

Then you can use df.reset_index() to reset the index column, if need be:

df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  1  1
1  0  0
2  4  4
3  2  2
4  3  3

Answer by Raphvanns

I know the question is about a pandas df, but if the shuffle occurs by row (column order changed, row order unchanged), the column names no longer matter and it could be interesting to use an np.array instead; np.apply_along_axis() will then be what you are looking for.

If that is acceptable, then this should be helpful. Note that it is easy to switch the axis along which the data is shuffled.

If your pandas data frame is named df, maybe you can:

  1. get the values of the dataframe with values = df.values,
  2. create an np.array from values,
  3. apply the method shown below to shuffle the np.array by row or column,
  4. recreate a new (shuffled) pandas df from the shuffled np.array.

Original array

a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]

Keep row order, shuffle columns within each row

print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
 [22 21 20]
 [31 30 32]
 [40 41 42]]

Keep column order, shuffle rows within each column

print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
 [20 31 42]
 [10 11 12]
 [30 21 22]]

Original array is unchanged

print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]
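
The four steps listed at the start of this answer can be sketched end to end as follows. This is only an illustration under the assumption that all columns share one dtype; the frame contents and the names values, shuffled and new_df are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40],
                   "B": [11, 21, 31, 41],
                   "C": [12, 22, 32, 42]})

# Steps 1-2: df.values is already an np.array.
values = df.values

# Step 3: shuffle within each row (use axis 0 to shuffle within each column).
shuffled = np.apply_along_axis(np.random.permutation, 1, values)

# Step 4: rebuild a DataFrame, reusing the original labels.
new_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
```

Reusing df.index and df.columns keeps the labels in place even though the values under them have moved.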

Answer by Ted Petrou

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

df.apply(lambda x: x.sample(frac=1).values)

   a  b
0  4  2
1  1  6
2  6  5
3  5  3
4  2  4
5  3  1

You must use .values so that you return a numpy array and not a Series; otherwise the returned Series will align to the original DataFrame, changing nothing:

df.apply(lambda x: x.sample(frac=1))

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6
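
The alignment behaviour described in this answer can be checked with a small sketch. It assumes a shared np.random.RandomState so that each column draws its own permutation; the variable names are illustrative, and to_numpy() is used in place of .values with the same effect.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
state = np.random.RandomState(0)  # shared state, so each column gets its own draw

# Returning a Series: pandas matches values to index labels again,
# so every value ends up back beside its original label.
aligned = df.apply(lambda col: col.sample(frac=1, random_state=state))

# Returning a bare array: there are no labels to align on,
# so the shuffled order is kept.
shuffled = df.apply(lambda col: col.sample(frac=1, random_state=state).to_numpy())
```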