Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/43600066/

Time: 2020-09-14 03:28:18  Source: igfitidea

Delete specific rows from a CSV using pandas

python-3.x csv pandas numpy

Asked by Sarvagya Gupta

I have a CSV file in the format shown below: [image not shown]

I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.

import pandas as pd

df = pd.read_csv(filename, header=None, names=["center", "left", "right", "steering", "throttle", "break", "speed"])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)

However, I get the following error:

df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0

Can you guys help me?

Answer by andrew_reece

Here's a one-line approach, using concat() and sample():

import numpy as np
import pandas as pd

# first, some sample data

# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)

# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)

print(df.shape)
# (100, 6)

The first few rows of sample output:

df.head(6)
           center         left         right  steering  throttle  brake
0   center_72.jpg  left_26.jpg  right_59.jpg         3         3      0
1   center_75.jpg  left_68.jpg  right_26.jpg         0         0      2
2   center_29.jpg   left_8.jpg  right_88.jpg         0         1      0
3   center_22.jpg  left_26.jpg  right_23.jpg         1         0      0
4   center_88.jpg   left_0.jpg  right_56.jpg         4         1      0
5   center_93.jpg  left_18.jpg  right_15.jpg         0         0      0

Now drop all but 10% of rows with steering==0:

newdf = pd.concat([df.loc[df.steering!=0], 
                   df.loc[df.steering==0].sample(frac=0.1)])

With the probability weightings I used in this example, you'll see roughly 50 to 60 entries remaining in newdf, of which about 5 are steering==0 cases.
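For a repeatable variant of the same idea, here is a minimal sketch. The helper name `downsample_zeros`, its `seed` argument, and the toy data are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd


def downsample_zeros(df, column="steering", keep_frac=0.1, seed=None):
    """Keep every row where `column` is non-zero plus a random
    `keep_frac` share of the rows where it is zero (hypothetical helper)."""
    is_zero = df[column] == 0
    kept_zeros = df.loc[is_zero].sample(frac=keep_frac, random_state=seed)
    # sort_index() restores the original row order after concatenation
    return pd.concat([df.loc[~is_zero], kept_zeros]).sort_index()


# Toy data: 50 zero rows followed by 50 non-zero rows
toy = pd.DataFrame({"steering": [0] * 50 + [1, 2, 3, 4, 5] * 10})
reduced = downsample_zeros(toy, keep_frac=0.1, seed=0)

n_zero_kept = int((reduced["steering"] == 0).sum())
print(n_zero_kept)   # 5 zero rows kept (10% of 50)
print(len(reduced))  # 55 rows in total
```

Passing a fixed `seed` through `random_state` makes the same rows survive on every run, which can help when comparing results.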

Answer by MaxU

Sample DataFrame built with @andrew_reece's code:

In [9]: df
Out[9]:
           center         left         right  steering  throttle  brake
0   center_54.jpg  left_75.jpg  right_39.jpg         1         0      0
1   center_20.jpg  left_81.jpg  right_49.jpg         3         1      1
2   center_34.jpg  left_96.jpg  right_11.jpg         0         4      2
3   center_98.jpg  left_87.jpg  right_34.jpg         0         0      0
4   center_67.jpg  left_12.jpg  right_28.jpg         1         1      0
5   center_11.jpg  left_25.jpg  right_94.jpg         2         1      0
6   center_66.jpg  left_27.jpg  right_52.jpg         1         3      3
7   center_18.jpg  left_50.jpg  right_17.jpg         0         0      4
8   center_60.jpg  left_25.jpg  right_28.jpg         2         4      1
9   center_98.jpg  left_97.jpg  right_55.jpg         3         3      0
..            ...          ...           ...       ...       ...    ...
90  center_31.jpg  left_90.jpg  right_43.jpg         0         1      0
91  center_29.jpg   left_7.jpg  right_30.jpg         3         0      0
92  center_37.jpg  left_10.jpg  right_15.jpg         1         0      0
93  center_18.jpg   left_1.jpg  right_83.jpg         3         1      1
94  center_96.jpg  left_20.jpg  right_56.jpg         3         0      0
95  center_37.jpg  left_40.jpg  right_38.jpg         0         3      1
96  center_73.jpg  left_86.jpg  right_71.jpg         0         1      0
97  center_85.jpg  left_31.jpg   right_0.jpg         3         0      4
98  center_34.jpg  left_52.jpg  right_40.jpg         0         0      2
99  center_91.jpg  left_46.jpg  right_17.jpg         0         0      0

[100 rows x 6 columns]

In [10]: df.steering.value_counts()
Out[10]:
0    43    # NOTE: 43 zeros
1    18
2    15
4    12
3    12
Name: steering, dtype: int64

In [11]: df.shape
Out[11]: (100, 6)

Your solution (unchanged):

In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)

In [13]: df.steering.value_counts()
Out[13]:
1    18
2    15
4    12
3    12
0     4        # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64

In [14]: df.shape
Out[14]: (61, 6)

NOTE: make sure that the steering column has a numeric dtype! If it holds strings (object dtype), you would need to change your code as follows:

df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
#  NOTE:                         ^ ^

After that, you can save the modified (reduced) DataFrame to CSV:

df.to_csv('/path/to/filename.csv', index=False)
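As a quick sanity check, the saved file can be read back to confirm that the row labels were not written as an extra column. This is only a sketch: the small DataFrame and the temporary file name are placeholders, not data from the question.

```python
import os
import tempfile

import pandas as pd

# Hypothetical reduced DataFrame standing in for the real result
reduced = pd.DataFrame({"center": ["center_1.jpg", "center_2.jpg"],
                        "steering": [0, 3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "reduced.csv")  # placeholder path
    reduced.to_csv(out_path, index=False)  # index=False omits the row labels
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))  # ['center', 'steering']
print(reloaded.shape)          # (2, 2)
```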

Answer by jakevdp

Using a mask on steering combined with a random number should work:

df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]

This does generate some extra random values, but it's nice and compact.
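One property worth knowing: the mask keeps each zero row independently with probability 0.1, so the number of zero rows kept varies from run to run instead of being exactly 10%. A minimal sketch with a seeded generator follows; the seed, the toy data, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only to make the sketch repeatable

# Toy data: 1000 zero rows and 200 non-zero rows
toy = pd.DataFrame({"steering": [0] * 1000 + [1] * 200})

# One uniform draw per row; only the draws for zero rows affect the result
keep_zero = rng.random(len(toy)) < 0.1
reduced = toy[(toy["steering"] != 0) | keep_zero]

n_zero_kept = int((reduced["steering"] == 0).sum())
n_nonzero_kept = int((reduced["steering"] != 0).sum())
print(n_nonzero_kept)  # 200: every non-zero row survives
print(n_zero_kept)     # close to 100, but not guaranteed to be exactly 100
```

If an exact 10% is required, sample(frac=0.1) on the zero rows is the better fit; the mask is convenient when an approximate share is enough.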

Edit: That said, I tried your example code and it worked as well. My guess is that the error arises because your df.query() statement returns an empty DataFrame, which probably means that the steering column does not contain any zeros, or alternatively that the column is read as strings rather than numbers. Try converting the column to integer before running the above snippet.
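One scenario that would produce such an empty selection, offered as an assumption since the question does not confirm it, is a CSV that contains a header row while being read with header=None. The header text then lands in the data and the column is read as strings. The file contents below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: note the header row at the top
csv_text = (
    "center,left,right,steering,throttle,break,speed\n"
    "c0.jpg,l0.jpg,r0.jpg,0,0.5,0,10.0\n"
    "c1.jpg,l1.jpg,r1.jpg,0.25,0.5,0,12.0\n"
    "c2.jpg,l2.jpg,r2.jpg,0,0.7,0,15.0\n"
)
names = ["center", "left", "right", "steering", "throttle", "break", "speed"]

# header=None treats the header row as data, so "steering" holds strings
df = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
n_before = len(df[df["steering"] == 0])
print(n_before)  # 0: no string equals the integer 0

# Coerce to numbers; the stray header text becomes NaN and is dropped
df["steering"] = pd.to_numeric(df["steering"], errors="coerce")
df = df.dropna(subset=["steering"])
n_after = len(df[df["steering"] == 0])
print(n_after)  # 2
```

Checking df.dtypes right after reading the file is a cheap way to spot this kind of problem before filtering.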
