In what situations can I use Dask instead of Apache Spark?

Disclaimer: this page is an English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/38882660/
Asked by Hariprasad
I am currently using Pandas and Spark for data analysis. I found Dask provides parallelized NumPy array and Pandas DataFrame.
Pandas is easy and intuitive for doing data analysis in Python, but I have difficulty handling multiple larger dataframes in Pandas due to limited system memory.
Simple Answer:
Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. ... Generally Dask is smaller and lighter weight than Spark.
I learned the details below from http://dask.pydata.org/en/latest/spark.html:
- Dask is lightweight
- Dask is typically used on a single machine, but also runs well on a distributed cluster.
- Dask provides parallel arrays, dataframes, machine learning, and custom algorithms
- Dask has an advantage for Python users because it is itself a Python library, so serialization and debugging when things go wrong happens more smoothly.
- Dask gives up high-level understanding to allow users to express more complex parallel algorithms.
- Dask is lighter weight and is easier to integrate into existing code and hardware.
- If you want a single project that does everything and you're already on Big Data hardware then Spark is a safe bet
- Spark is typically used on small to medium sized cluster but also runs well on a single machine.
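The "parallel arrays" point above can be made concrete with a small dask.array sketch (the array size and chunking are chosen arbitrarily for illustration):

```python
import dask.array as da

# A dask array looks like a NumPy array but is split into chunks
# that can be computed in parallel across cores.
x = da.ones((1000, 1000), chunks=(250, 250))  # 16 chunks of 250x250

# Familiar NumPy-style operations build a lazy task graph.
result = (x + x.T).sum()

# Nothing runs until .compute() is called.
total = result.compute()
print(total)  # 2000000.0
```

Because each chunk is an ordinary NumPy array under the hood, Dask composes with the existing numeric Python stack rather than replacing it.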
I learned more about Dask from the link below: https://www.continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster
- If you're running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.
- Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data
- scales up resiliently and elastically on clusters with hundreds of nodes.
- Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.
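The "scale up on one machine, scale out on a cluster" point is the same code path in Dask: a LocalCluster uses the cores of one machine, and the identical Client API can point at a multi-node scheduler instead. A minimal sketch, with worker counts picked purely for illustration:

```python
from dask.distributed import Client, LocalCluster
import dask.bag as db

# A LocalCluster spreads work over local workers; swapping in a
# remote scheduler address is the only change needed to scale out.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# dask.bag parallelizes work over plain Python objects.
counts = (db.from_sequence(range(100), npartitions=4)
            .map(lambda n: n * n)
            .sum()
            .compute())
print(counts)  # sum of squares 0..99 = 328350

client.close()
cluster.close()
```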
http://dask.pydata.org/en/latest/dataframe-overview.html
Limitations
Dask.DataFrame does not implement the entire Pandas interface. Users expecting this will be disappointed. Notably, dask.dataframe has the following limitations:
- Setting a new index from an unsorted column is expensive
- Many operations, like groupby-apply and join on unsorted columns require setting the index, which as mentioned above, is expensive
- The Pandas API is very large. Dask.dataframe does not attempt to implement many pandas features or any of the more exotic data structures like NDFrames
Thanks to the Dask developers. It seems like a very promising technology.
Overall, I understand that Dask is simpler to use than Spark. Dask is as flexible as Pandas, with more power to compute in parallel across more CPUs.
I understand all the above facts about Dask.
So, roughly how much data (in terabytes) can be processed with Dask?
Answered by MaxU
You may want to read the Dask comparison to Apache Spark:
Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. It was originally optimized for bulk data ingest and querying common in data engineering and business analytics but has since broadened out. Spark is typically used on small to medium sized cluster but also runs well on a single machine.
Dask is a parallel programming library that combines with the Numeric Python ecosystem to provide parallel arrays, dataframes, machine learning, and custom algorithms. It is based on Python and the foundational C/Fortran stack. Dask was originally designed to complement other libraries with parallelism, particular for numeric computing and advanced analytics, but has since broadened out. Dask is typically used on a single machine, but also runs well on a distributed cluster.
Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and instead is intended to be used in conjunction with other libraries, particularly those in the numeric Python ecosystem.
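The "custom algorithms" part that Dask layers on top of arrays and dataframes can be sketched with dask.delayed, which turns plain Python functions into a parallel task graph (the function names here are illustrative):

```python
import dask

# dask.delayed wraps ordinary functions so arbitrary Python code,
# not just arrays/dataframes, can run as a parallel task graph.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Build a small graph: the two inc() calls are independent and can
# run in parallel; add() runs once both finish.
total = add(inc(1), inc(2)).compute()
print(total)  # 5
```

This is the flexibility the comparison refers to: Spark asks you to express work in its high-level operators, while Dask lets you parallelize an existing Python algorithm almost as written.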