pandas: random sampling based on column values

Question

Asked by RnK

I have files (A, B, C, etc.), each with 12,000 data points. I have divided each file into batches of 1000 points and computed a value for each batch. So for each file we now have 12 values, which are loaded into a pandas DataFrame (shown below).

    file    value_1     value_2
0   A           1           43
1   A           1           89
2   A           1           22
3   A           1           87
4   A           1           43
5   A           1           89
6   A           1           22
7   A           1           87
8   A           1           43
9   A           1           89
10  A           1           22
11  A           1           87
12  A           1           83
13  B           0           99
14  B           0           23
15  B           0           29
16  B           0           34
17  B           0           99
18  B           0           23
19  B           0           29
20  B           0           34
21  B           0           99
22  B           0           23
23  B           0           29
24  B           0           34
25  C           1           62
-   -           -           -
-   -           -           -

As the next step, I need to randomly select a file and, for that file, randomly select a sequence of 4 batches for value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select the file. I tried to make it work with np.random.choice(data['file'].unique()), but that doesn't seem correct.

Thanks for the help in advance. I'm pretty new to pandas and python in general.

Answer 1

Answered by Abdou

If I understand what you are trying to get at, the following should be of help:

# Test dataframe
import numpy as np
import pandas as pd


data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                     'value_1': np.repeat([1,0,1],12),
                     'value_2': np.random.randint(20, 100, 36)})
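
# Select a file at random and keep only its rows, renumbered from 0
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)

# Get a random start index from data1, leaving room for 4 consecutive rows
start_ix = np.random.choice(data1.index[:-3])

# Get a sequence of 4 rows starting at the random index from the previous step
# (index data1, not data, so the rows come from the selected file)
print(data1.loc[start_ix:start_ix+3])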
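
The same two steps (pick a file, then pick a run of consecutive batches within it) can be wrapped in a small function. This is only a sketch under a few assumptions: the name `sample_batch_sequence` and its parameters are made up for illustration, rows of each file are assumed to already be in batch order, and NumPy's `default_rng` generator is swapped in for the `np.random.choice` calls used above so the draw can be seeded.

```python
import numpy as np
import pandas as pd


def sample_batch_sequence(df, n_batches=4, file_col="file", rng=None):
    """Pick one file at random, then a random run of consecutive rows from it.

    Hypothetical helper: assumes the rows of each file are already in batch order.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Choose one of the distinct file labels with equal probability.
    files = list(df[file_col].unique())
    chosen = files[int(rng.integers(len(files)))]
    subset = df[df[file_col] == chosen]
    if len(subset) < n_batches:
        raise ValueError(f"file {chosen!r} has fewer than {n_batches} rows")
    # Valid start positions run from 0 to len(subset) - n_batches inclusive.
    start = int(rng.integers(0, len(subset) - n_batches + 1))
    return subset.iloc[start:start + n_batches]


# Same shape as the test dataframe above: 3 files with 12 batches each.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "file": np.repeat(["A", "B", "C"], 12),
    "value_1": np.repeat([1, 0, 1], 12),
    "value_2": rng.integers(20, 100, 36),
})
window = sample_batch_sequence(data, rng=rng)
print(window)
```

Because the window is taken by position with `iloc`, the original row labels are kept, which makes it easy to see where in the full dataframe the sequence came from.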

Answer 2

Answered by Jason

Here's a rather long-winded answer that has a lot of flexibility and uses some random data I generated. I also added a field to the dataframe to denote whether a row has already been used.

Generating Data

import pandas as pd
from string import ascii_lowercase
import random

random.seed(44)

files = [ascii_lowercase[i] for i in range(4)]
value_1 = random.sample(range(1, 10), 8)

files_df = files*len(value_1)
value_1_df = value_1*len(files)
value_1_df.sort()
value_2_df = random.sample(range(100, 200), len(files_df))

df = pd.DataFrame({'file': files_df,
                   'value_1': value_1_df,
                   'value_2': value_2_df,
                   'used': 0})

Randomly Selecting Files

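len_to_run = 3     # number of files to draw; change to run for however long you'd like
batch_to_pull = 4  # number of value_1 entries to pull per file

for i in range(len_to_run):  # not needed if you only want to run once
    # Files that still have unused rows; recomputed on each pass so exhausted files drop out
    updated_files = df.loc[df.used == 0, 'file'].unique()
    file_to_pull = random.choice(list(updated_files))
    print('file ' + file_to_pull)
    for j in range(batch_to_pull):  # pulling 4 values
        updated_value_1 = df.loc[(df.used == 0) & (df.file == file_to_pull), 'value_1'].unique()
        # random.sample needs a real sequence, so convert the NumPy array to a list first
        value_1_to_pull = random.sample(updated_value_1.tolist(), 1)
        print('value_1 ' + str(value_1_to_pull))
        # Mark the pulled row(s) as used so they are not drawn again
        df.loc[(df.file == file_to_pull) & df.value_1.isin(value_1_to_pull), 'used'] = 1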

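Output from the original run (the exact values may differ with the Python version, even with the same seed):

file a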
value_1 [1]
value_1 [7]
value_1 [5]
value_1 [4]
file d
value_1 [3]
value_1 [2]
value_1 [1]
value_1 [5]
file d
value_1 [7]
value_1 [4]
value_1 [6]
value_1 [9]
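
The inner loop above can also be collapsed into a single draw with `DataFrame.sample`, which samples without replacement by default. The following is a sketch, not the answer's own method: the helper name `pull_unused` and the small example frame are assumptions made for illustration, and the frame only mimics the layout generated above (4 files, 8 rows each, a `used` flag).

```python
import pandas as pd


def pull_unused(df, n=4, seed=None):
    """Pick a file that still has unused rows, draw up to n of them, flag them as used.

    Hypothetical helper; it modifies df in place through the 'used' column.
    """
    files = df.loc[df["used"] == 0, "file"].drop_duplicates()
    if files.empty:
        raise ValueError("every row has already been used")
    # Series.sample draws one of the remaining file labels.
    chosen = files.sample(n=1, random_state=seed).iloc[0]
    pool = df[(df["used"] == 0) & (df["file"] == chosen)]
    # DataFrame.sample draws rows without replacement by default.
    picked = pool.sample(n=min(n, len(pool)), random_state=seed)
    df.loc[picked.index, "used"] = 1
    return chosen, picked


# Illustrative frame: files a-d, each paired once with value_1 = 1..8.
df = pd.DataFrame({
    "file": list("abcd") * 8,
    "value_1": sorted(list(range(1, 9)) * 4),
    "value_2": range(100, 132),
    "used": 0,
})
first_file, first_rows = pull_unused(df, seed=1)
second_file, second_rows = pull_unused(df, seed=2)
print(first_file, first_rows["value_1"].tolist())
print(second_file, second_rows["value_1"].tolist())
```

Like the loop above, this draws rows in no particular order, so it does not yield a contiguous sequence of batches; it only guarantees that no row is returned twice.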
