Pandas vs SQL Speed

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/44695020/

Time: 2020-09-14 03:51:16  Source: igfitidea

Pandas Vs SQL Speed

Tags: sql, sqlite, pandas

Asked by user1761806

I'm hearing different views on when one should use Pandas vs when to use SQL.

I tried to do the following in Pandas on 19,150,869 rows of data:

for idx, row in df.iterrows():
    tmp = int((int(row['M']) / PeriodGranularity))+1
    row['TimeSlot'] = str(row["D"]+1) + "-" + str(row["H"]) + "-" + str(tmp)

And found it was taking so long I had to abort after 20 minutes.

I performed the following in SQLite:

Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M,cast(strftime('%M',PlayedTimestamp) / 15+1 as int) as TimeSlot from tblMain

and found it took 4 seconds ("19150869 rows returned in 2445ms").

Note: for the Pandas code, I first ran this step to load the data from the database:

sqlStr = "Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M from tblMain"
df = pd.read_sql_query(sqlStr, con)

Is it my coding that's at fault here or is it generally accepted that for certain tasks SQL is a lot faster?

Answered by jezrael

It seems you can use a vectorized solution (PeriodGranularity is some variable):

# The outer parentheses let the expression span several lines.
df['TimeSlot'] = ((df["D"] + 1).astype(str) + "-" +
                  df["H"].astype(str) + "-" +
                  ((df['M'].astype(int) / PeriodGranularity).astype(int) + 1).astype(str))
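
As a rough, self-contained sketch of the vectorized approach above: the sample rows, the helper name add_time_slot, and the value PeriodGranularity = 15 are assumptions for illustration (15 mirrors the "/ 15" in the question's SQL). Floor division is used in place of dividing and casting to int, which gives the same result for non-negative minutes.

```python
import pandas as pd

# Assumed value, mirroring the "/ 15" in the question's SQL query.
PeriodGranularity = 15

# Hypothetical sample rows; column names D, H, M follow the question.
df = pd.DataFrame({
    "D": [0, 3, 6],
    "H": ["00", "13", "23"],
    "M": ["05", "30", "59"],
})


def add_time_slot(frame, granularity):
    """Return a copy of frame with a TimeSlot column built column-wise."""
    out = frame.copy()
    # Whole-column arithmetic: no Python-level loop over rows.
    slot = out["M"].astype(int) // granularity + 1
    out["TimeSlot"] = (
        (out["D"].astype(int) + 1).astype(str)
        + "-" + out["H"].astype(str)
        + "-" + slot.astype(str)
    )
    return out


result = add_time_slot(df, PeriodGranularity)
print(result)
```

Each operation here runs once over a whole column, which is why it scales to millions of rows far better than a per-row loop.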

And to convert a datetime to str, use strftime.
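
A minimal sketch of that conversion, assuming the timestamps are already in a datetime column; the sample values and the variable names are hypothetical stand-ins for the question's PlayedTimestamp data.

```python
import pandas as pd

# Hypothetical timestamps standing in for the PlayedTimestamp column.
ts = pd.Series(pd.to_datetime([
    "2017-06-18 00:05:00",
    "2017-06-21 13:30:00",
]))

# Series.dt.strftime formats every element at once and returns strings.
hours = ts.dt.strftime("%H")
minutes = ts.dt.strftime("%M")
# %w is the weekday as a number with Sunday = 0.
weekday = ts.dt.strftime("%w")

print(pd.DataFrame({"D": weekday, "H": hours, "M": minutes}))
```

The results are strings, so they need astype(int) before any arithmetic such as the minute bucketing.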

DataFrame.iterrows is really slow - check this.

First, some comparison of code for users coming from an SQL background.

Comparing two technologies is really hard, and I am not sure there is a good answer on SO (the question is too broad), but I found this.

Answered by George Gousios

Pandas dataframes are not meant to be iterated this way. This is an excellent source to better understand what should be used for efficiency. In short, numpy vectorization is the way to go whenever possible; otherwise the pandas apply() function is still many times faster than iterrows(). It would be interesting to see the comparison results after refactoring.
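
One way to run such a comparison is sketched below. The random sample data, the row count, the helper names, and PeriodGranularity = 15 are all assumptions for illustration. Timings depend on the machine and the pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

PeriodGranularity = 15  # assumed value, mirroring the question's SQL

# Hypothetical random sample, far smaller than the question's 19 million rows.
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "D": rng.integers(0, 7, n),
    "H": rng.integers(0, 24, n),
    "M": rng.integers(0, 60, n),
})


def slot_label(d, h, m):
    """Build one TimeSlot label from scalar day, hour, and minute values."""
    return f"{d + 1}-{h}-{m // PeriodGranularity + 1}"


def with_iterrows(frame):
    # Row-by-row loop: each row is materialized as a Series.
    return [slot_label(r["D"], r["H"], r["M"]) for _, r in frame.iterrows()]


def with_apply(frame):
    # apply along axis=1 still calls a Python function once per row.
    return frame.apply(
        lambda r: slot_label(r["D"], r["H"], r["M"]), axis=1
    ).tolist()


def with_vectorized(frame):
    # Whole-column operations, no per-row Python function call.
    slot = frame["M"] // PeriodGranularity + 1
    labels = (
        (frame["D"] + 1).astype(str)
        + "-" + frame["H"].astype(str)
        + "-" + slot.astype(str)
    )
    return labels.tolist()


results = {}
timings = {}
for name, func in [
    ("iterrows", with_iterrows),
    ("apply", with_apply),
    ("vectorized", with_vectorized),
]:
    start = time.perf_counter()
    results[name] = func(df)
    timings[name] = time.perf_counter() - start
    print(f"{name}: {timings[name]:.4f} s")
```

All three functions produce the same labels, so only the elapsed times differ; raising n makes the gap easier to see.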