What is RDD in Spark
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34433027/
Asked by kittu
Definition says:
RDD is immutable distributed collection of objects
I don't quite understand what that means. Is it like data (partitioned objects) stored on a hard disk? If so, then how come RDDs can hold user-defined classes (such as Java, Scala, or Python classes)?
This link (https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html) mentions:
Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program
I am really confused about what an RDD is in general, and about how it relates to Spark and Hadoop.
Can someone please help?
Accepted answer by Ewan Leith
An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any data source, e.g. text files, a database via JDBC, etc.
The formal definition is:
RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
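As a minimal sketch of the two creation styles quoted in the question, assuming a local setup (in spark-shell the SparkContext sc already exists; the app name, the Person class, and the sample values are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// 1. Load an external dataset; each element of 'lines' is one line of text
val lines = sc.textFile("hdfs://...")

// 2. Distribute a collection of objects from the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4))

// RDDs can also hold user-defined classes: the objects live in the
// executors' JVMs, not as raw bytes tied to one disk
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 32), Person("Bo", 25)))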
Answered by tharindu_DG
An RDD is a logical reference to a dataset which is partitioned across many server machines in the cluster. RDDs are immutable and self-recovering in case of failure.
The dataset could be data loaded externally by the user, e.g. a JSON file, a CSV file, or a text file with no specific structure.
UPDATE: Here is the paper that describes the RDD internals: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Hope this helps.
Answered by Mahesh
Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
RDDs have the following properties –
- Immutability and partitioning: RDDs are composed of a collection of records which are partitioned. A partition is the basic unit of parallelism in an RDD, and each partition is one logical division of the data, which is immutable and created through transformations on existing partitions. Immutability helps to achieve consistency in computations. If needed, users can define their own partitioning criteria based on the keys on which they want to join multiple datasets.
- Coarse-grained operations: operations which are applied to all elements in a dataset, for example a map, filter, or groupBy operation performed on all elements in a partition of an RDD.
- Fault tolerance: since RDDs are created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of these transformations that produces an RDD is called its lineage graph.
For example –
val firstRDD = sc.textFile("hdfs://...")
val secondRDD = firstRDD.filter(someFunction)
val thirdRDD = secondRDD.map(someFunction)
val result = thirdRDD.count()
If we lose some partition of an RDD, we can replay the transformations for that partition from the lineage to achieve the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computations.
- Lazy evaluation: Spark computes RDDs lazily, the first time they are used in an action, so that it can pipeline transformations. In the example above, the RDDs are evaluated only when the count() action is invoked.
- Persistence: users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on disk); see the sketch below.
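A minimal sketch of persistence, reusing thirdRDD from the example above (StorageLevel.MEMORY_ONLY is one of Spark's built-in storage levels):

import org.apache.spark.storage.StorageLevel

// Mark thirdRDD for reuse; thanks to lazy evaluation, nothing runs yet
thirdRDD.persist(StorageLevel.MEMORY_ONLY)

val total = thirdRDD.count()   // first action: computes and caches the partitions
val sample = thirdRDD.take(5)  // later actions are served from the in-memory copy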
These properties of RDDs make them useful for fast computations.
Answered by pgirard
Resilient Distributed Dataset (RDD) is the way Spark represents data. The data can come from various sources:
- Text File
- CSV File
- JSON File
- Database (via JDBC driver)
RDD in relation to Spark
Spark is, essentially, an implementation of the RDD concept.
RDD in relation to Hadoop
The power of Hadoop resides in the fact that it lets users write parallel computations without having to worry about work distribution and fault tolerance. However, Hadoop is inefficient for applications that reuse intermediate results. For example, iterative machine learning algorithms, such as PageRank, K-means clustering, and logistic regression, reuse intermediate results.
RDDs allow intermediate results to be stored in RAM; Hadoop would have to write them to an external stable storage system, which generates disk I/O and serialization overhead. With RDDs, Spark is up to 20x faster than Hadoop for iterative applications.
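As a toy sketch of that reuse pattern (the numbers and the computation are illustrative only; sc is an existing SparkContext):

// Build and cache the dataset once; iterations then read it from RAM
val data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

var acc = 0.0
for (_ <- 1 to 10) {
  // each pass scans the cached partitions instead of stable storage
  acc += data.sum() / data.count()
}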
Further implementation details about Spark
Coarse-grained transformations
The transformations applied to an RDD are coarse-grained. This means that the operations on an RDD are applied to the whole dataset, not to its individual elements. Therefore, operations like map, filter, group, and reduce are allowed, but operations like set(i) and get(i) are not.
The inverse of coarse-grained is fine-grained. An example of a fine-grained storage system is a database.
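A quick sketch of that distinction (assuming an existing SparkContext sc):

val rdd = sc.parallelize(Seq(1, 2, 3, 4))

// Coarse-grained: one operation applied across the whole dataset
val doubled = rdd.map(_ * 2)
val evens = rdd.filter(_ % 2 == 0)

// Fine-grained operations such as rdd.set(i, x) or rdd.get(i) simply
// do not exist in the RDD API; per-element mutation is unsupported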
Fault tolerance
RDDs are fault tolerant; fault tolerance is the property that enables a system to continue working properly in the event of the failure of one of its components.
The fault tolerance of Spark is strongly linked to its coarse-grained nature. The only way to implement fault tolerance in a fine-grained storage system is to replicate its data or log updates across machines. In a coarse-grained system like Spark, however, only the transformations are logged. If a partition of an RDD is lost, the RDD has enough information to recompute it quickly.
Data storage
数据存储
The RDD is "distributed" (separated) into partitions. Each partition can be present in memory or on the disk of a machine. When Spark wants to launch a task on a partition, it sends the task to the machine containing that partition. This is known as "locality-aware scheduling".
Sources: great research papers about Spark: http://spark.apache.org/research.html
These include the paper suggested by Ewan Leith.
Answered by SPR
RDD = Resilient Distributed Dataset
Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed
An RDD is defined as follows (from Learning Spark, O'Reilly): "The ability to always recompute an RDD is actually why RDDs are called 'resilient.' When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user."
This means the data is surely available at all times. Also, Spark can run without Hadoop, and hence the data is NOT replicated. One of the best characteristics of Hadoop 2.0 is 'high availability', achieved with the help of a passive standby NameNode; the same is achieved by RDDs in Spark.
A given RDD (the data) can span various nodes in a Spark cluster (as in a Hadoop-based cluster).
If any node crashes, Spark can recompute the RDD and load the data on some other node, so the data is always available. "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel" (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)
Answered by Saketh
Comparing RDDs with Scala collections, below are a few differences:
- The same as Scala collections, but they run on a cluster
- Lazy in nature, where Scala collections are strict (see the sketch below)
- RDDs are always immutable, i.e., you cannot change the state of the data in the collection
- RDDs are self-recovering, i.e., fault-tolerant
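A minimal sketch of the strict-versus-lazy difference (assuming an existing SparkContext sc):

// Scala collection: strict; the map runs immediately
val local = List(1, 2, 3).map(_ * 2)

// RDD: lazy; this only builds a plan, no work happens yet
val distributed = sc.parallelize(Seq(1, 2, 3)).map(_ * 2)

// ...until an action is invoked; collect() runs the map on the cluster
val materialized = distributed.collect()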
Answered by user2314737
RDDs (Resilient Distributed Datasets) are an abstraction for representing data. Formally, they are a read-only, partitioned collection of records that provides a convenient API.
RDDs provide a performant solution for processing large datasets on cluster computing frameworks such as MapReduce by addressing some key issues:
- data is kept in memory to reduce disk I/O; this is particularly relevant for iterative computations, which no longer have to persist intermediate data to disk
- fault tolerance (resilience) is obtained not by replicating data but by keeping track of all the transformations applied to the initial dataset (the lineage); this way, lost data can always be recomputed from its lineage in case of failure, and avoiding data replication again reduces storage overhead
- lazy evaluation, i.e. computations are carried out only when they are needed
RDDs have two main limitations:
- they are immutable (read-only)
- they only allow coarse-grained transformations (i.e. operations that apply to the entire dataset)
One nice conceptual advantage of RDDs is that they pack together data and code, making it easier to reuse data pipelines.
Sources: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing; An Architecture for Fast and General Data Processing on Large Clusters
Answered by Saad Ahmed
An RDD is a way of representing data in Spark. The source of the data can be JSON, a CSV text file, or some other source. RDDs are fault tolerant, meaning the data is stored across multiple locations (i.e., in distributed form), so if a node fails the data can be recovered. In an RDD, the data is available at all times. However, RDDs are slow and hard to code, and hence considered outdated; they have been superseded by the DataFrame and Dataset concepts.
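For context, a minimal sketch of those newer APIs (assuming Spark 2.x with the spark-sql module available; the app name and sample data are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame: tabular data with named columns and an optimized engine
val df = Seq((1, "a"), (2, "b")).toDF("id", "letter")
df.filter($"id" > 1).show()

// An RDD still sits underneath; df.rdd exposes it
val underlying = df.rdd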
Answered by amarnath pimple
RDD stands for Resilient Distributed Dataset. It is a core part of Spark, and its low-level API; DataFrames and Datasets are built on top of RDDs. RDDs are nothing but row-level data, i.e., data that sits on n executors. RDDs are immutable, meaning you cannot change an RDD, but you can create new RDDs using transformations and actions.


