Pandas vs SQL speed

Question

Asked by user1761806

I'm hearing different views on when one should use Pandas vs when to use SQL.

I tried to do the following in Pandas on 19,150,869 rows of data:

for idx, row in df.iterrows():
    tmp = int((int(row['M']) / PeriodGranularity))+1
    row['TimeSlot'] = str(row["D"]+1) + "-" + str(row["H"]) + "-" + str(tmp)

And found it was taking so long I had to abort after 20 minutes.

I performed the following in SQLite:

Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M,cast(strftime('%M',PlayedTimestamp) / 15+1 as int) as TimeSlot from tblMain

and found it took 4 seconds ("19150869 rows returned in 2445ms").

Note: For the Pandas code I ran this in the step before it to get the data from the db:

sqlStr = "Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M from tblMain"
df = pd.read_sql_query(sqlStr, con)

Is it my coding that's at fault here or is it generally accepted that for certain tasks SQL is a lot faster?

Answer 1

Answered by jezrael

It seems you can use a vectorized solution (PeriodGranularity is some variable):

# The outer parentheses let the expression span several lines.
df['TimeSlot'] = ((df["D"] + 1).astype(str) + "-" +
                  df["H"].astype(str) + "-" +
                  (df['M'].astype(int) // PeriodGranularity + 1).astype(str))
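
To make the vectorized expression concrete, here is a minimal sketch on a tiny frame. The sample rows are hypothetical, shaped like the question's read_sql_query result (D as an integer, H and M as strings), and PeriodGranularity = 15 is an assumption taken from the "/ 15" in the question's SQL.

```python
import pandas as pd

# Assumed value, taken from the "/ 15" in the question's SQL.
PeriodGranularity = 15

# Hypothetical rows: D is an integer, H and M are zero-padded strings.
df = pd.DataFrame({"D": [1, 4, 7], "H": ["00", "13", "23"], "M": ["07", "30", "59"]})

# Whole-column operations: no Python-level loop over rows.
df["TimeSlot"] = (
    (df["D"] + 1).astype(str) + "-"
    + df["H"].astype(str) + "-"
    + (df["M"].astype(int) // PeriodGranularity + 1).astype(str)
)

print(df["TimeSlot"].tolist())
# ['2-00-1', '5-13-3', '8-23-4']
```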

And to convert datetime to str, use strftime.
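
As a small illustration of the strftime route, the sketch below formats a datetime column with Series.dt.strftime. The timestamps are made up; only the column name PlayedTimestamp comes from the question.

```python
import pandas as pd

# Hypothetical sample data; PlayedTimestamp is the column name used in the question.
df = pd.DataFrame({
    "PlayedTimestamp": pd.to_datetime(["2017-01-01 13:07:00", "2017-01-02 09:46:00"])
})

# Series.dt.strftime formats every datetime in the column as a string.
df["H"] = df["PlayedTimestamp"].dt.strftime("%H")
df["M"] = df["PlayedTimestamp"].dt.strftime("%M")

print(df["H"].tolist(), df["M"].tolist())
# ['13', '09'] ['07', '46']
```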

DataFrame.iterrows is really slow - check this.
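
A side note on the loop in the question, separate from its speed: each row produced by iterrows is its own Series, so adding a key to it does not add a column to the DataFrame. A minimal sketch with made-up data and a placeholder value:

```python
import pandas as pd

# Hypothetical rows shaped like the question's data.
df = pd.DataFrame({"D": [1, 4], "H": ["00", "13"], "M": ["07", "30"]})

for idx, row in df.iterrows():
    # row is a separate Series built for this iteration;
    # adding a key to it does not add a column to df.
    row["TimeSlot"] = "placeholder"

print("TimeSlot" in df.columns)
# False
```

So even if the loop had finished, df would not have gained a TimeSlot column.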

First, a comparison of code for users coming from an SQL background.

Comparing two technologies is really hard, and I am not sure whether there is a good answer on SO (the question may be too broad), but I found this.

Answer 2

Answered by George Gousios

Pandas dataframes are not meant to be iterated this way. This is an excellent source to better understand what should be used for efficiency. In short, numpy vectorization is the way to go whenever possible; otherwise the pandas apply() function is still many times faster than iterrows(). It would be interesting to see the comparison results after refactoring.
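
For anyone who wants to try that comparison, here is a rough timing sketch, not a rigorous benchmark. It assumes integer D, H and M columns for simplicity (in the question H and M arrive as strings), a sample of 20,000 random rows, and PeriodGranularity = 15; the function names are made up for illustration. Timings will vary by machine and pandas version, so none are shown.

```python
import time

import numpy as np
import pandas as pd

PeriodGranularity = 15  # assumed, from the "/ 15" in the question's SQL
n = 20_000              # small hypothetical sample, far below the question's 19 million rows

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "D": rng.integers(1, 8, n),
    "H": rng.integers(0, 24, n),
    "M": rng.integers(0, 60, n),
})

def slot(row):
    # Build the TimeSlot string for a single row.
    return f"{row['D'] + 1}-{row['H']}-{row['M'] // PeriodGranularity + 1}"

def with_iterrows(df):
    # Python-level loop; a new Series is built for every row.
    return [slot(row) for _, row in df.iterrows()]

def with_apply(df):
    # Still calls slot() once per row, but without the iterrows overhead.
    return df.apply(slot, axis=1).tolist()

def with_vectorized(df):
    # Whole-column operations.
    return ((df["D"] + 1).astype(str) + "-" + df["H"].astype(str) + "-"
            + (df["M"] // PeriodGranularity + 1).astype(str)).tolist()

results = {}
for fn in (with_iterrows, with_apply, with_vectorized):
    start = time.perf_counter()
    results[fn.__name__] = fn(df)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")

# All three approaches should produce identical strings.
assert results["with_iterrows"] == results["with_apply"] == results["with_vectorized"]
```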
