Parallel processing in pandas python

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36054321/


parallel processing in pandas python

Tags: python, pandas

Asked by surya

I have 5,000,000 rows in my dataframe. In my code, I am using iterrows(), which is taking too much time. To get the required output, I have to iterate through all the rows. So I wanted to know whether I can parallelize the code in pandas.

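For context, a row-by-row loop of the kind being described looks something like this (the DataFrame and the per-row computation here are hypothetical, not taken from the original question):

import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})

total = 0
for idx, row in df.iterrows():
    # iterrows() yields an (index, Series) pair for every row; building a
    # Series per row is what makes it slow on millions of rows
    total += row['a'] * row['b']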

Answered by jtitusj

Here's a webpage I found that might help: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html

And here's the multiprocessing code found on that page:

import pandas as pd
import multiprocessing as mp

LARGE_FILE = r"D:\my_large_file.txt"  # raw string so the backslash is not treated as an escape
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process a single chunk; here we simply count its rows
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 worker processes

    funclist = []
    for df in reader:
        # submit each chunk to the pool without blocking
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # wait up to 10 seconds per chunk

    pool.close()
    pool.join()

    print("There are %d rows of data" % result)