Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/8991709/
Why were pandas merges in python faster than data.table merges in R in 2012?
Asked by Zach
I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).
Why is pandas so much faster than data.table? Is it because of an inherent speed advantage Python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?


Here's the R code and the Python code used to benchmark the various packages.
Accepted answer by Matt Dowle
It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.
Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.
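For readers who want to run that check themselves, here is a minimal profiling sketch, assuming X and Y are the two data.tables from the benchmark and using an arbitrary output file name:

    Rprof("merge_profile.out")            # start R's sampling profiler
    ans <- merge(X, Y, all = FALSE)       # the join being benchmarked
    Rprof(NULL)                           # stop profiling

    # If sortedmatch() dominates the by.total table, the time is going into the
    # preliminary factor-level matching step rather than the join algorithm itself.
    head(summaryRprof("merge_profile.out")$by.total)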
Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn't hooked up yet to replace the levels-to-levels match.
Are pandas merges faster than data.table for regular integer columns? That would be a way to isolate the algorithm itself from the factor issues.
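A rough way to run that isolation test in R (the sizes and column names here are made up for illustration and are not the benchmark code):

    library(data.table)
    set.seed(1)
    N <- 1e7

    # Integer keys skip the factor-level matching step entirely, so the timing
    # below reflects the join algorithm rather than the levels-matching overhead.
    X <- data.table(id = sample(1e6L, N, replace = TRUE), x = rnorm(N), key = "id")
    Y <- data.table(id = sample(1e6L, 1e6L),              y = rnorm(1e6), key = "id")

    system.time(merge(X, Y, all = FALSE))   # inner join on plain integers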
Also, data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id,datetime); ii) fast prevailing join (roll=TRUE), a.k.a. last observation carried forward.
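To make those two aspects concrete, a toy sketch of a keyed rolling join (using an integer time column for simplicity; the data and names are illustrative only):

    library(data.table)

    quotes <- data.table(id = c("A", "A", "B"), time = c(1L, 5L, 2L), bid = c(99.5, 99.7, 49.8))
    trades <- data.table(id = c("A", "B"),      time = c(7L, 3L),     price = c(100, 50))

    setkey(quotes, id, time)   # multi-column ordered key, like (id, datetime)
    setkey(trades, id, time)

    # For each trade, carry the most recent quote forward (prevailing join / LOCF):
    # the trade at time 7 picks up the quote from time 5, and so on.
    quotes[trades, roll = TRUE]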
I'll need some time to confirm, as it's the first I've seen of the comparison to data.table as presented.
UPDATE from data.table v1.8.0 released July 2012
- Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings of which 600,000 are unique is now reduced from 16s to 0.5s, for example.
Also in that release:
- character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.
- New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.
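A quick way to see the effect of chmatch() and %chin% on character vectors; the sizes below are chosen to echo the 1-million/600,000 case in the NEWS item above, and the exact speedup will vary by machine:

    library(data.table)
    set.seed(42)

    u <- sprintf("id%06d", 1:600000)          # 600,000 unique strings
    x <- sample(u, 1e6, replace = TRUE)       # 1 million strings drawn from them

    system.time(m1 <- match(x, u))            # base R
    system.time(m2 <- chmatch(x, u))          # uses R's internal string cache
    identical(m1, m2)                         # same answer, typically much faster

    c("id000001", "nope") %chin% u            # character-optimised %in%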
As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.
But as I wrote originally, above:
data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id,datetime); ii) fast prevailing join (roll=TRUE), a.k.a. last observation carried forward.
So the Pandas equi-join of two character columns is probably still faster than data.table, since it sounds like it hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys, for example, is on the to-do list.
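This keyed, ordered design is also what allows joins without merge(), as the question asks; a toy sketch (not from the benchmark):

    library(data.table)

    X <- data.table(id = 1:5, x = letters[1:5])
    Y <- data.table(id = c(2L, 4L, 6L), y = c("p", "q", "r"))
    setkey(X, id)      # physically reorders X by id; key(X) is now "id"
    setkey(Y, id)

    X[Y]                      # ordered join: all rows of Y, NA where X has no match
    X[Y, nomatch = 0]         # inner join without calling merge()
    merge(X, Y, all = TRUE)   # a full outer join still goes through merge()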
In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.
Answered by Wes McKinney
The reason pandas is faster is because I came up with a better algorithm, which is implemented very carefully using a fast hash table implementation - klib - and in C/Cython to avoid the Python interpreter overhead for the non-vectorizable parts. The algorithm is described in some detail in my presentation: A look inside pandas design and development.
The comparison with data.table is actually a bit interesting because the whole point of R's data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins), pandas' DataFrame contains no pre-computed information that is being used for the merge, so to speak it's a "cold" merge. If I had stored the factorized versions of the join keys, the join would be significantly faster - as factorizing is the biggest bottleneck for this algorithm.
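As a rough illustration of the keyed vs. "cold" distinction on the data.table side (arbitrary sizes and column names; this is not the benchmark from the question):

    library(data.table)
    set.seed(1)
    N <- 1e7
    X <- data.table(id = sample(1e6L, N, replace = TRUE), x = rnorm(N))
    Y <- data.table(id = 1:1000000, y = rnorm(1e6))

    # "Cold" merge: no key, so the ordering/matching work happens inside the call
    system.time(merge(X, Y, by = "id"))

    # Keyed merge: the sort is paid once by setkey(); subsequent joins can use
    # binary search on the already-ordered data
    setkey(X, id)
    setkey(Y, id)
    system.time(merge(X, Y, by = "id"))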
I should also add that the internal design of pandas' DataFrame is much more amenable to these kinds of operations than R's data.frame (which is just a list of arrays internally).
Answered by Merik
This topic is two years old but seems like a probable place for people to land when they search for comparisons of Pandas and data.table.
Since both of these have evolved over time, I want to post a relatively newer comparison (from 2014) here for the interested users: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping
It would be interesting to know if Wes and/or Matt (who, by the way, are creators of Pandas and data.table respectively and have both commented above) have any news to add here as well.
-- UPDATE --
A comment posted below by jangorecki contains a link that I think is very useful: https://github.com/szilard/benchm-databases
This graph depicts the average times of aggregation and join operations for different technologies (lower = faster; comparison last updated in Sept 2016). It was really educational for me.
Going back to the question, R DT key and R DT refer to the keyed/unkeyed flavors of R's data.table and happen to be faster in this benchmark than Python's Pandas (Py pandas).
Answered by jangorecki
There are great answers above, notably from the authors of both tools that the question asks about. Matt's answer explains the case reported in the question: it was caused by a bug, not by the merge algorithm. The bug was fixed the next day, more than 7 years ago already.
In my answer I will provide some up-to-date timings of the merge operation for data.table and pandas. Note that plyr and base R merge are not included.
The timings I am presenting come from the db-benchmark project, a continuously run, reproducible benchmark. It upgrades tools to recent versions and re-runs the benchmark scripts. It covers many other software solutions as well; if you are interested in Spark, Dask and a few others, be sure to check the link.
As of now... (still to be implemented: one more data size and 5 more questions)
We test 2 different data sizes of the LHS table.
For each of those data sizes we run 5 different merge questions.
q1: LHS inner join RHS-small on integer
q2: LHS inner join RHS-medium on integer
q3: LHS outer join RHS-medium on integer
q4: LHS inner join RHS-medium on factor (categorical)
q5: LHS inner join RHS-big on integer
The RHS table comes in 3 different sizes:
- small translates to the size of LHS/1e6
- medium translates to the size of LHS/1e3
- big translates to the size of LHS
In all cases around 90% of rows match between LHS and RHS, and there are no duplicates in the RHS joining column (so no cartesian product).
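A rough, much smaller sketch of that setup in data.table (the real data-generation scripts live in the db-benchmark repository and differ in detail; the names RHS_small/RHS_medium/RHS_big are just for illustration):

    library(data.table)
    set.seed(108)
    N <- 1e7                 # LHS rows; much smaller than the 0.5 GB / 5 GB cases below

    # Sampling without replacement from a ~10% larger id space gives roughly 90%
    # matching rows and guarantees no duplicates in the RHS joining columns.
    pool       <- as.integer(N * 1.1)
    LHS        <- data.table(id = sample(pool, N),       v1 = rnorm(N))
    RHS_small  <- data.table(id = sample(pool, N / 1e6), v2 = rnorm(N / 1e6))
    RHS_medium <- data.table(id = sample(pool, N / 1e3), v2 = rnorm(N / 1e3))
    RHS_big    <- data.table(id = sample(pool, N),       v2 = rnorm(N))

    system.time(LHS[RHS_small, on = "id", nomatch = 0])        # q1-style inner join
    system.time(merge(LHS, RHS_medium, by = "id", all = TRUE)) # q3-style outer join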
As of now (run on 2nd Nov 2019)
pandas 0.25.3 released on 1st Nov 2019
data.table 1.12.7 (92abb70) released on 2nd Nov 2019
Timings below are in seconds, for the two different data sizes of LHS. The pd2dt column stores the ratio of how many times pandas is slower than data.table.
- 0.5 GB LHS data
+----------+------------+--------+-------+
| question | data.table | pandas | pd2dt |
+----------+------------+--------+-------+
| q1       |       0.51 |   3.60 |     7 |
| q2       |       0.50 |   7.37 |    14 |
| q3       |       0.90 |   4.82 |     5 |
| q4       |       0.47 |   5.86 |    12 |
| q5       |       2.55 |  54.10 |    21 |
+----------+------------+--------+-------+
- 5 GB LHS data
+----------+------------+--------+-------+
| question | data.table | pandas | pd2dt |
+----------+------------+--------+-------+
| q1       |       6.32 |   89.0 |    14 |
| q2       |       5.72 |  108.0 |    18 |
| q3       |      11.00 |   56.9 |     5 |
| q4       |       5.57 |   90.1 |    16 |
| q5       |      30.70 |  731.0 |    23 |
+----------+------------+--------+-------+

