parallel processing in pandas python
Note: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/36054321/
Asked by surya
I have 5,000,000 rows in my dataframe. In my code, I am using iterrows(), which is taking too much time. To get the required output, I have to iterate over all the rows, so I wanted to know whether I can parallelize the code in pandas.
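(For reference, the pattern being described looks roughly like the sketch below; compute_something is a hypothetical stand-in for the questioner's per-row work. Each pass through iterrows() builds a Series object and makes a Python-level function call, which is what makes it slow at this scale.)

import pandas as pd

def compute_something(row):
    # hypothetical stand-in for the real per-row computation
    return row["value"] * 2

df = pd.DataFrame({"value": range(5000000)})
results = []
for _, row in df.iterrows():  # one Series allocation + Python call per row
    results.append(compute_something(row))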
Answered by jtitusj
Here's a webpage I found that might help: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
And here's the multiprocessing code found on that page:
import pandas as pd
import multiprocessing as mp

LARGE_FILE = r"D:\my_large_file.txt"
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process a single chunk; here we just count its rows
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 worker processes

    # submit each chunk to the pool as it is read
    funclist = []
    for df in reader:
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    # collect the per-chunk results
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # wait up to 10 seconds per chunk

    print("There are %d rows of data" % result)