Original question: http://stackoverflow.com/questions/17995328/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Changing values in pandas dataframe does not work
Asked by idoda
I'm having a problem changing values in a dataframe. I would also like advice on a problem I need to solve and on the proper way to use pandas to solve it. I'd appreciate help on both. I have a file containing information about how well audio files match speakers. The file looks something like this:
     wave_path                 spk_name  spk_example#  score  mark  comments  isUsed
190  122_65_02.04.51.800.wav   idoD      idoD          88     NaN   NaN       False
191  121_110_20.17.27.400.wav  idoD      idoD          87     NaN   NaN       False
192  121_111_00.34.57.300.wav  idoD      idoD          87     NaN   NaN       False
193  103_31_18.59.12.800.wav   idoD      idoD_0        99     HIT   VP        False
194  131_101_02.08.06.500.wav  idoD      idoD_0        96     HIT   VP        False
What I need to do is a somewhat sophisticated kind of counting. I need to group the results by speaker and run a calculation for each speaker. I then proceed with the speaker that gave the best result, but before proceeding I need to mark all the files I used for that calculation as used, i.e. change the isUsed value to True in every row in which they appear (a file can appear more than once). Then I run another iteration: calculate for each speaker, mark the used files, and so on until there are no more speakers left to calculate.
I thought a lot about how to implement this process using pandas. It is quite easy to implement in plain Python, but it would take a lot of looping and data structuring, which I guess would slow the process down significantly. I'm also using this task to learn pandas' capabilities more deeply.
I came out with the following solution. As preparation steps, I'll group by speaker name and set the file name as index by the set_index method. I will then iterate over the groupbyObj and apply the calculation function, which will return the selected speaker and the files to be marked as used.
Then I'll iterate over the files and mark them as used (this would be fast and simple since I set them as indexes beforehand), and so on until I finish calculating.
First, I'm not sure about this solution, so feel free to tell me your thoughts on it. Now, I've tried implementing this, and got into trouble:
First I indexed by file name, no problem here:
In [53]:
    marked_results['isUsed'] = False
    ind_res = marked_results.set_index('wave_path')
    ind_res.head()
Out[53]:
    spk_name    spk_example#    score   mark    comments    isUsed
    wave_path                       
    103_31_18.59.12.800.wav      idoD    idoD    99  HIT     VP  False
    131_101_02.08.06.500.wav     idoD    idoD    99  HIT     VP  False
    144_35_22.46.38.700.wav      idoD    idoD    96  HIT     VP  False
    41_09_17.10.11.700.wav       idoD    idoD    93  HIT     TEST    False
    122_188_03.19.20.400.wav     idoD    idoD    93  NaN     NaN     False
Then I chose a file and checked that I get the entries relevant to that file:
In [54]:
    example_file = ind_res.index[0];
    ind_res.ix[example_file]
Out[54]:
    spk_name    spk_example#    score   mark    comments    isUsed
    wave_path                       
    103_31_18.59.12.800.wav  idoD    idoD    99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_0  99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_1  97  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_2  95  HIT     VP  False
No problems here either. Then I tried to change the isUsed value for that file to True, and that's where I ran into the problem:
In [56]:
    ind_res.ix[example_file]['isUsed'] = True
    ind_res.ix[example_file].isUsed = True
    ind_res.ix[example_file]
Out[56]:
    spk_name    spk_example#    score   mark    comments    isUsed
    wave_path                       
    103_31_18.59.12.800.wav  idoD    idoD    99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_0  99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_1  97  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_2  95  HIT     VP  False
So, you see the problem. Nothing has changed. What am I doing wrong? Should the problem described above be solved using pandas at all?
And also: 1. How can I access a specific group from a groupby object? I thought that, instead of setting the files as the index, I could group by file and then use that groupby object to apply a modifying function to all occurrences of a file. But I didn't find a way to access a specific group, and passing the group name as a parameter, calling apply on all the groups, and then acting on only one of them did not seem "right" to me.
I hope it is not too long... :)
Answered by unutbu
Indexing pandas objects can return two fundamentally different objects: a view or a copy.
If mask is a basic slice, then df.ix[mask] returns a view of df. Views share the same underlying data as the original object (df), so modifying the view also modifies the original object.
If mask is something more complicated, such as an arbitrary sequence of indices, then df.ix[mask] returns a copy of some rows in df. Modifying the copy has no effect on the original.
In your case, since the rows which share the same wave_path occur at arbitrary locations, ind_res.ix[example_file] returns a copy. So
ind_res.ix[example_file]['isUsed'] = True
has no effect on ind_res.
Instead, you could use
ind_res.ix[example_file, 'isUsed'] = True
to modify ind_res. However, see below for a groupby suggestion which I think might be closer to what you really want.
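For readers on a current pandas release: the .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, and .loc is the label-based replacement. The following is a minimal sketch of the same single-step assignment using .loc. The small ind_res frame and the file names in it are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical miniature of ind_res: the index contains a repeated file name.
ind_res = pd.DataFrame(
    {"spk_example": ["idoD", "idoD_0", "idoD"], "isUsed": [False, False, False]},
    index=pd.Index(["a.wav", "a.wav", "b.wav"], name="wave_path"),
)
example_file = ind_res.index[0]

# Row label and column label go into a single indexer, so the assignment is
# applied to ind_res itself rather than to an intermediate object.
ind_res.loc[example_file, "isUsed"] = True

print(ind_res["isUsed"].tolist())
# [True, True, False]
```

Both rows labelled a.wav are updated, which matches the requirement that every occurrence of a file is marked.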
Jeff has already provided a link to the Pandas docs, which state that
The rules about when a view on the data is returned are entirely dependent on NumPy.
Here are the (complicated) rules which describe when a view or copy is returned. Basically, however, the rule is if the index is requesting a regularly spaced slice of the underlying array then a view is returned, otherwise a copy (out of necessity) is returned.
Here is a simple example which uses a basic slice. A view is returned by df.ix, so modifying subdf modifies df as well:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4,3), 
         columns=list('ABC'), index=[0,1,2,3])
subdf = df.ix[0]
print(subdf.values)
# [0 1 2]
subdf.values[0] = 100
print(subdf)
# A    100
# B      1
# C      2
# Name: 0, dtype: int32
print(df)           # df is modified
#      A   B   C
# 0  100   1   2
# 1    3   4   5
# 2    6   7   8
# 3    9  10  11
Here is a simple example which uses "fancy indexing" (arbitrary rows selected). A copy is returned by df.ix, so modifying subdf does not affect df.
df = pd.DataFrame(np.arange(12).reshape(4,3), 
         columns=list('ABC'), index=[0,1,0,3])
subdf = df.ix[0]
print(subdf.values)
# [[0 1 2]
#  [6 7 8]]
subdf.values[0] = 100
print(subdf)
#      A    B    C
# 0  100  100  100
# 0    6    7    8
print(df)          # df is NOT modified
#    A   B   C
# 0  0   1   2
# 1  3   4   5
# 0  6   7   8
# 3  9  10  11
Notice the only difference between the two examples is that in the first, where a view is returned, the index was [0,1,2,3], whereas in the second, where a copy is returned, the index was [0,1,0,3].
Since we are selecting rows where the index is 0, in the first example we can do that with a basic slice. In the second example, the rows where the index equals 0 could appear at arbitrary locations, so a copy has to be returned.
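Since the quoted docs say the view rules come from NumPy, the distinction can be checked directly on a NumPy array. This is only a sketch of the NumPy rule. Whether a particular pandas indexing call hands back a view also depends on the pandas version and its internals, so it should not be read as a guarantee about pandas behaviour.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

basic = arr[0:2]      # basic slice: regularly spaced rows
fancy = arr[[0, 2]]   # fancy indexing: an arbitrary list of rows

print(np.shares_memory(arr, basic))   # True: a view on arr's data
print(np.shares_memory(arr, fancy))   # False: a separate copy

basic[0, 0] = 100     # writes through to arr
fancy[1, 1] = -1      # changes only the copy

print(arr[0, 0], arr[2, 1])
# 100 7
```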
Despite having ranted on about the subtlety of Pandas/NumPy slicing, I really don't think that
ind_res.ix[example_file, 'isUsed'] = True
is what you are ultimately looking for. You probably want to do something more like
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4,3), 
                  columns=list('ABC'))
df['A'] = df['A']%2
print(df)
#    A   B   C
# 0  0   1   2
# 1  1   4   5
# 2  0   7   8
# 3  1  10  11
def calculation(grp):
    grp['C'] = True
    return grp
newdf = df.groupby('A').apply(calculation)
print(newdf)
which yields
   A   B     C
0  0   1  True
1  1   4  True
2  0   7  True
3  1  10  True
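Coming back to the two things the question asked for, accessing one specific group and marking every occurrence of a set of files, here is a rough sketch of one way to do both. The data, the second speaker name danaK, and the choice of which files count as "used" are all invented for illustration; the real calculation would replace the stand-in step.

```python
import pandas as pd

# Hypothetical data shaped like the question's table.
results = pd.DataFrame({
    "wave_path": ["a.wav", "b.wav", "a.wav", "c.wav"],
    "spk_name": ["idoD", "idoD", "danaK", "danaK"],
    "score": [88, 87, 99, 96],
    "isUsed": [False, False, False, False],
})

# Access one specific group by its key, without calling apply on every group.
grouped = results.groupby("spk_name")
idod_rows = grouped.get_group("idoD")

# Stand-in for the real calculation: assume it selects all of this
# speaker's files.
used_files = idod_rows["wave_path"].unique()

# Mark every occurrence of those files, across all speakers, in one assignment.
results.loc[results["wave_path"].isin(used_files), "isUsed"] = True

print(results["isUsed"].tolist())
# [True, True, True, False]
```

Using a boolean mask with isin avoids setting wave_path as the index, and the third row is marked even though it belongs to a different speaker, because it refers to the same file.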

