Original question: http://stackoverflow.com/questions/34389922/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Efficient way to do pandas operation and skip row
Asked by mmee
There must be a simple way to do this, but I'm missing it. First, imagine the situation in Excel:
- I have a column of percent changes. (assume column A)
- In the next column (B), I want to create an indexed series that begins at 1000 and is based on the percent changes. In Excel, I do this by setting B1 to 1000, setting B2 to the formula =(1+A2)*B1, and copying that formula down the column. Simple.
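Written as a recurrence, the Excel steps above amount to the following. The symbols A_n and B_n (row n of columns A and B) are notation introduced here for illustration, not taken from the original post:

```latex
B_1 = 1000, \qquad B_n = (1 + A_n)\,B_{n-1} \quad (n \ge 2)
\;\Longrightarrow\;
B_n = 1000 \prod_{k=2}^{n} (1 + A_k)
```

Unrolled this way, each value is the starting value times a running product of the growth factors, which suggests a cumulative product rather than a row-by-row loop.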
Now, I want to do the same thing with pandas, and the problem is that the following code results in the target array becoming NaN:
import pandas as pd
import numpy as np
df_source = pd.DataFrame(np.random.normal(0,.05,10), index=range(10), columns=['A'])
df_target = pd.DataFrame(index = df_source.index)
df_target.loc[0,"A"] = 1000 # initialize target array to start at 1000
df_target["A"] = (1 + df_source) * df_target["A"].shift(1) # How to skip first row?
The target array becomes NaN because the first row tries to reference a value "off the dataframe", so the whole column returns NaN.
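A minimal sketch of the NaN behaviour, using small plain Series with made-up values rather than the question's DataFrames (the names `target`, `changes`, and `result` are illustrative only). `shift(1)` leaves a NaN in the first row, and any arithmetic with NaN stays NaN:

```python
import numpy as np
import pandas as pd

# Only the first row of the target holds a value, as in the question
target = pd.Series([1000.0, np.nan, np.nan, np.nan, np.nan])
changes = pd.Series([0.0, 0.01, -0.02, 0.03, 0.01])

# shift(1) moves every value down one row; row 0 has nothing above it
shifted = target.shift(1)

# The multiplication runs once over whole columns, so only row 1 sees 1000
result = (1 + changes) * shifted
print(result)
```

Only row 1 gets a number here; every other row multiplies by NaN, because the expression is evaluated in one pass and never feeds a computed row into the next one.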
I realize I could iterate over rows with a loop, skipping the first row, but this is very slow and not practical for the size of datasets or iterations I will be doing.
There must be a way to use pandas/numpy array functions but tell it to skip the first row in the calculation. How to do that? I've tried Boolean indexing but can't get it to work, and maybe there is a way to tell Pandas to skip the NaN results... but the best approach seems to be a qualifier that says "apply this code, starting at the second row."
What am I missing here?
Edit:
Looks like my problem is deeper than I realized. jezrael's answer below solves the NA problem, but I think I am confused about the pandas logic. The code I give above DOES NOT work because it does not work element-wise. For instance, the trivial example:
seriesdf = pd.DataFrame(index = range(10))
seriesdf['A'] = 1
seriesdf.loc[1:, 'A'] = 1 + seriesdf['A'].shift(1)  # originally written with .ix, which newer pandas no longer provides
gives the result
A
0 1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
not an ascending count as I had assumed. So the question is: what is the most efficient way to do this row-by-row calculation on a pandas dataframe? Speed matters in this application, so I would prefer not to iterate through rows.
New python programmer here so trying to figure this out. Answers that show me how to learn/figure stuff like this out for myself are very appreciated. Thank you!
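A small sketch of the behaviour described in the edit, assuming a float Series of ones (the names `s`, `one_pass`, and `counted` are illustrative). The right-hand side is computed once from the existing values, so every row sees a 1 above it; a cumulative function such as `cumsum` is what carries a running result forward:

```python
import pandas as pd

s = pd.Series(1.0, index=range(10))

# The shifted column is built from the original ones before anything is assigned,
# so every row from 1 onward becomes 1 + 1
one_pass = s.copy()
one_pass.loc[1:] = (1 + s.shift(1)).loc[1:]

# cumsum accumulates row by row, giving the ascending count
counted = s.cumsum()

print(pd.DataFrame({"one_pass": one_pass, "counted": counted}))
```

The same idea applies to products: when each row depends on the previous result, a cumulative operation usually replaces the explicit loop.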
Accepted answer by jezrael
IIUC, you can skip the first row of column A of df_source by selecting all rows except the first. This was originally written with ix, which has since been removed from pandas; loc does the same label-based selection here:
df_target.loc[1:, "A"] = df_source.loc[1:, "A"] + 1
print(df_target)
A
0 1000.000000
1 0.988898
2 0.986142
3 1.009979
4 1.005165
5 1.101116
6 0.992312
7 0.962890
8 1.051340
9 1.009750
Or maybe you mean:
import pandas as pd
import numpy as np
df_source = pd.DataFrame(np.random.normal(0,.05,10), index=range(10), columns=['A'])
print(df_source)
A
0 0.039965
1 0.060821
2 -0.079238
3 -0.129932
4 0.002196
5 -0.003721
6 -0.008358
7 0.014104
8 -0.022905
9 0.014793
df_target = pd.DataFrame(index = df_source.index)
#all A set to 1000
df_target["A"] = 1000 # initialize target array to start at 1000
print(df_target)
A
0 1000
1 1000
2 1000
3 1000
4 1000
5 1000
6 1000
7 1000
8 1000
9 1000
df_target["A"] = (1 + df_source["A"].shift(-1))* df_target["A"]
print(df_target)
A
0 1060.820882
1 920.761946
2 870.067878
3 1002.195555
4 996.279287
5 991.641909
6 1014.104402
7 977.094961
8 1014.793488
9 NaN
EDIT:
Maybe you need cumsum:
df_target["B"] = 2
df_target["C"] = df_target["B"].cumsum()
df_target["D"] = df_target["B"] + df_target.index
print(df_target)
A B C D
0 1041.003000 2 2 2
1 1013.817000 2 4 3
2 948.853000 2 6 4
3 1031.692000 2 8 5
4 970.875000 2 10 6
5 1011.095000 2 12 7
6 1053.472000 2 14 8
7 903.765000 2 16 9
8 1010.546000 2 18 10
9 0.010546 2 20 11
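For the indexed series in the question, the multiplicative counterpart of cumsum is cumprod. This is a hedged sketch rather than part of the original answer: it assumes the first row is the base value of 1000 and that percent changes apply from the second row on, as in the Excel description. The seed and the `growth` name are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability
df_source = pd.DataFrame({"A": rng.normal(0, 0.05, 10)})

# Growth factor per row; the first row is the base, so its factor is forced to 1
growth = 1 + df_source["A"]
growth.iloc[0] = 1.0

# Running product of the factors, scaled to start at 1000
df_target = pd.DataFrame({"A": 1000 * growth.cumprod()})
print(df_target)
```

Row 0 is exactly 1000 and each later row equals (1 + A) times the row above it, with no Python-level loop.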
Answer by Quentin
I think I understand your problem. In these cases, I usually find it easier to build a list and add it to the existing dataframe as a new column. You could, of course, make a Series instance first and then do the calculations.
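# Build the series in a plain list, then attach it as a new column
new_series = [0] * len(dataframe["A"])
new_series[0] = 1000  # starting value
# start=1 keeps i aligned with the row being filled, so row 0 is not overwritten
for i, k in enumerate(dataframe["A"].iloc[1:], start=1):
    new_series[i] = (1 + k) * new_series[i - 1]
dataframe["B"] = pd.Series(new_series, index=dataframe.index)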
IIRC, ix is deprecated in pandas in favor of loc and iloc.
After rethinking the problem: you can use lambda expressions as elements in your dataframe.
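# Initiate "B" with a lambda expression, repeated so the column is as long as "A"
step = lambda row: (1 + dataframe["A"].iloc[row]) * dataframe["B"].iloc[row - 1]
dataframe["B"] = [step] * len(dataframe["A"])
b = dataframe.columns.get_loc("B")
dataframe.iloc[0, b] = 1000  # starting value
# Replace each remaining lambda with its value, computed from the row above it
for i in range(1, len(dataframe)):
    dataframe.iloc[i, b] = dataframe.iloc[i, b](row=i)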
I am trying to think of a way to solve this without a for loop, but I can't figure out where to grab a row number from.
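One possible way to avoid per-cell DataFrame indexing while keeping the row-by-row logic is itertools.accumulate from the standard library, which threads the previous result into each step. This is a sketch under assumptions, not the author's method: the sample values are made up, the `initial` argument needs Python 3.8 or later, and the iteration still happens in Python rather than in vectorized code.

```python
from itertools import accumulate

import pandas as pd

# Made-up percent changes; row 0 is the base row and its change is ignored
dataframe = pd.DataFrame({"A": [0.0, 0.10, -0.50, 0.20]})
changes = dataframe["A"].to_numpy()

# accumulate yields the initial value first, then applies the step
# to the previous result and each remaining percent change in turn
values = list(
    accumulate(changes[1:], lambda prev, pct: prev * (1 + pct), initial=1000.0)
)
dataframe["B"] = values
print(dataframe)
```

The step function can be any rule that maps the previous value and the current input to the next value, so the same pattern covers recurrences that have no ready-made cumulative function.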
Hope this helps.