pandas dataframe in multiple threads
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40939078/
pandas dataframe in multiple threads
Asked by Yasir Azeem
Can someone tell me a way to add data to a pandas DataFrame in Python when multiple threads use a function that appends rows to the same DataFrame?
My code scrapes data from a URL, and I was using df.loc[index]... to add each scraped row to the DataFrame.
I've now started multiple threads, with each URL assigned to its own thread, so in short many pages are being scraped at once...
How do I append those rows to the DataFrame?
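For reference, the single-threaded pattern described above looks roughly like this (a minimal sketch; the URL list, column names, and the way each page is parsed are placeholder assumptions, not from the original post):

import requests
import pandas as pd

df = pd.DataFrame(columns=['url', 'text'])  # placeholder columns

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs

for index, url in enumerate(urls):
    page_text = requests.get(url).text
    # df.loc[index] = ... appends one scraped row at a time; this is the
    # pattern that breaks down once several threads write to the same df
    df.loc[index] = [url, page_text[:100]]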
Accepted answer by exp1orer
Adding rows to a DataFrame one by one is not recommended. I suggest you build your data in lists, combine those lists at the end, and then call the DataFrame constructor only once, on the full data set.
Example:
# help from http://stackoverflow.com/a/28463266/3393459
# and http://stackoverflow.com/a/2846697/3393459
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool with the multiprocessing API
import requests
import pandas as pd

pool = ThreadPool(4)

# called by each thread: fetch one URL and return its row as a dict
def get_web_data(url):
    return {'col1': 'something', 'request_data': requests.get(url).text}

urls = ["http://google.com", "http://yahoo.com"]

# pool.map runs get_web_data on each URL across the worker threads and
# collects the return values into a single list of dicts
results = pool.map(get_web_data, urls)
print(results)

# build the DataFrame once, from the full list of rows
print(pd.DataFrame(results))
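If you are managing threads yourself rather than using a pool, the same idea applies: have each thread append its row dicts to a shared list (appending to a Python list is atomic in CPython), then build the DataFrame once at the end. A minimal sketch, assuming hypothetical URLs and a get_web_data function like the one above:

import threading
import requests
import pandas as pd

results = []  # each thread appends its row dicts here

def get_web_data(url):
    # scrape one page and stash the row; column names are placeholders
    results.append({'col1': 'something', 'request_data': requests.get(url).text})

urls = ["http://google.com", "http://yahoo.com"]
threads = [threading.Thread(target=get_web_data, args=(url,)) for url in urls]

for t in threads:
    t.start()
for t in threads:
    t.join()

# construct the DataFrame once, from all the collected rows
df = pd.DataFrame(results)
print(df)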