Iteratively adding columns to a dataframe using Pandas

Question

Asked by TaterTots

I have some relatively simple code that I'm struggling to put together. I have a CSV that I've read into a dataframe. The CSV is panel data (i.e., unique company and year observations for each row). I have two columns that I want to perform a function on and then I want to create new variables based on the output of the function.

Here's what I have so far with code:

#Loop through rows in a CSV file
for index, rows in df.iterrows():
    #Start at column 6 and go to the end of the file
    for row in rows[6:]:
        data = perform_function1( row )
        output =  perform_function2(data)    
        df.ix[index, 'new_variable'] = output
        print output

I want this code to iterate starting at column 6 and continue to the end of the file (e.g., I have two columns I want to perform the function on, Column6 and Column7) and then create new columns based on the functions that were performed (e.g., Output6 and Output7). The code above returns the output for Column7, but I can't figure out how to create a variable that captures the outputs from both columns (i.e., a new variable that isn't overwritten by the loop). I searched Stack Overflow and didn't see anything that immediately related to my question (maybe because I'm too big of a noob?). I would really appreciate your help.

Thanks,

TT

P.S. I'm not sure if I've provided enough detail. Please let me know if I need to provide more.

Answer 1

Accepted answer by ASGM

Operating iteratively doesn't take advantage of Pandas' capabilities. Pandas' strength is in applying operations efficiently across the whole dataframe, rather than in iterating row by row. It's great for a task like this where you want to chain a few functions across your data. You should be able to accomplish your whole task in a single line.

df["new_variable"] = df.ix[6:].apply(perform_function1).apply(perform_function2)

perform_function1 will be applied to each row, and perform_function2 will be applied to the results of the first function.
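
The one-liner above relies on .ix, which was removed in pandas 1.0, and DataFrame.apply passes each column (not each row) to the function by default. The following is a minimal sketch of the same chaining idea for current pandas, assuming the data columns start at position 6. The sample data, the two function bodies, and the OutputN naming are hypothetical stand-ins, not the asker's real code.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's two functions.
def perform_function1(value):
    return value * 2

def perform_function2(value):
    return value + 1

# Hypothetical panel data: six leading columns, then the two data columns.
df = pd.DataFrame({
    "company": ["A", "A", "B"],
    "year": [2010, 2011, 2010],
    "c2": [0, 0, 0],
    "c3": [0, 0, 0],
    "c4": [0, 0, 0],
    "c5": [0, 0, 0],
    "Column6": [1.0, 2.0, 3.0],
    "Column7": [10.0, 20.0, 30.0],
})

# Take every column from position 6 onward and chain both functions
# over the elements of each column.
outputs = df.iloc[:, 6:].apply(
    lambda col: col.apply(perform_function1).apply(perform_function2)
)

# Rename the results (Column6 -> Output6, ...) and attach them as new columns.
outputs.columns = [name.replace("Column", "Output") for name in outputs.columns]
df = df.join(outputs)
print(df[["Column6", "Output6", "Column7", "Output7"]])
```

Because the results are joined on as separately named columns, nothing is overwritten from one input column to the next.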

Answer 2

Answered by GeauxEric

If you want to apply a function to certain columns in a dataframe:

# Get the sixth column as a Series (.iloc replaces the removed .ix indexer)
column6 = df.iloc[:, 5]
# perform_function1 is applied to each element of the column
output6 = column6.apply(perform_function1)
df["new_variable"] = output6
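
To cover both columns from the question with this per-column approach, one option is to loop over the column names (not the rows) and write each result to its own output column. This is a sketch only: the sample strings, the two function bodies, and the Output6/Output7 names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's two functions.
def perform_function1(value):
    return value.strip()

def perform_function2(value):
    return len(value)

# Hypothetical sample data with the two columns from the question.
df = pd.DataFrame({
    "Column6": [" abc ", "de"],
    "Column7": ["x", " yz "],
})

# One pass per column; each result gets its own name, so the second
# column does not overwrite the output of the first.
for name in ["Column6", "Column7"]:
    new_name = name.replace("Column", "Output")
    df[new_name] = df[name].apply(perform_function1).apply(perform_function2)

print(df)
```

The loop runs once per column, so its cost does not grow with the number of rows the way iterrows() does.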

Answer 3

Answered by Alexander Huszagh

Pandas is quite slow acting row-by-row: you're much better off using the append, concat, merge, or join functionalities on the whole dataframe.

To give some idea why, let's consider a random DataFrame example:

import numpy as np
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df2 = df.copy()
# operation to concatenate two dataframes
%timeit pd.concat([df2, df])
1000 loops, best of 3: 737 μs per loop
# single row lookup
%timeit df.loc['2013-01-01']
1000 loops, best of 3: 251 μs per loop
# single element operation
%timeit df.loc['2013-01-01', 'A'] = 3
1000 loops, best of 3: 218 μs per loop

Notice how efficiently Pandas handles operations on an entire DataFrame, and how inefficiently it handles operations on single elements?

If we scale this up, the same tendency occurs, only much more pronounced:

df = pd.DataFrame(np.random.randn(200, 300))
# single element operation
%timeit df.loc[1,1] = 3
10000 loops, best of 3: 74.6 μs per loop
df2 = df.copy()
# full dataframe operation
%timeit pd.concat([df2, df])
1000 loops, best of 3: 830 μs per loop

Measured per element, Pandas handles the whole 200x300 DataFrame several thousand times faster than it handles a single element: the concat touches at least 60,000 elements in 830 μs, which is under 14 ns each, against 74.6 μs for one assignment. In short, iterating defeats the whole purpose of using Pandas. If you're accessing a dataframe element-by-element, consider using a dictionary instead.
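
The same comparison can be run outside IPython with the standard timeit module. The sketch below is an illustration under assumptions: the frame size, the doubling operation, and the helper names elementwise and whole_frame are chosen for the example and are not taken from the answer. Timings vary by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# A smaller frame than the 200x300 one above, so the slow path finishes quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 60)))

def elementwise(frame):
    """Double every value by reading and writing one cell at a time."""
    out = frame.copy()
    for row in out.index:
        for col in out.columns:
            out.loc[row, col] = frame.loc[row, col] * 2
    return out

def whole_frame(frame):
    """Double every value with a single whole-frame operation."""
    return frame * 2

slow = timeit.timeit(lambda: elementwise(df), number=1)
fast = timeit.timeit(lambda: whole_frame(df), number=1)
print(f"element by element: {slow:.4f} s, whole frame: {fast:.6f} s")
```

Both helpers produce the same values, so the only difference being measured is how the frame is accessed.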
