
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, note the original address, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/47129950/

Date: 2020-10-22 09:29:35  Source: igfitidea

Spark textFile vs wholeTextFiles

scala apache-spark file-io

Asked by Dan

I understand the basic theory of textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file and the value is the content of each file.


Now, from a technical point of view, what's the difference between:


val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions

and

val textFile = sc.wholeTextFiles("my/path/*.csv", 8)
textFile.getNumPartitions

In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?


Answer by Shaido - Reinstate Monica

The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.

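The difference in return shape can be sketched without a Spark cluster at all, using plain Scala collections as stand-ins for the two RDD types (the file names and contents below are invented for illustration):

```scala
// Two hypothetical input files, modeled as a map from path to content.
val files = Map(
  "my/path/a.csv" -> "1,alpha\n2,beta",
  "my/path/b.csv" -> "3,gamma"
)

// textFile-like shape: one element per line (RDD[String]);
// the file each line came from is no longer recorded.
val asTextFile: Seq[String] =
  files.toSeq.sortBy(_._1).flatMap { case (_, content) => content.split("\n").toSeq }

// wholeTextFiles-like shape: one (path, wholeContent) pair
// per file (RDD[(String, String)]).
val asWholeTextFiles: Seq[(String, String)] = files.toSeq.sortBy(_._1)
```

Here asTextFile holds three bare lines with no file identity, while asWholeTextFiles holds two keyed records whose values are the complete file contents.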

When reading uncompressed files with textFile, it will split the data into chunks of 32 MB. This is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.


wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.


Answer by Tzach Zohar

textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values


That's not accurate:


  1. textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (depending on the number of partitions requested, Spark's default number of partitions, and the underlying file system). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of files, and the order of records within the files (order is not "lost").

  2. wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (it might cause bad performance or an OutOfMemoryError, since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration - with multiple files potentially loaded into a single partition.

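Point 2 means per-file processing stays possible downstream: because the file name travels with the content, you can still ask per-file questions. A plain-Scala sketch of that shape (file names and contents are made up for illustration):

```scala
// One (path, content) record per file, as wholeTextFiles would return.
val pairs: Seq[(String, String)] = Seq(
  "data/2020-01.log" -> "start\nok",
  "data/2020-02.log" -> "start\nfail\nretry"
)

// The key lets us compute per-file statistics, e.g. line counts keyed
// by file - impossible with textFile, which discards file identity.
val linesPerFile: Map[String, Int] =
  pairs.map { case (path, content) => path -> content.split("\n").length }.toMap
```

With textFile the same input would collapse into five anonymous lines, and there would be no way to rebuild this per-file grouping.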

Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.


Answer by Sainagaraju Vaduka

As of Spark 2.1.1, the following is the code for textFile.


def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

Internally it uses hadoopFile to read local files, HDFS files, or S3, using URI schemes such as file://, hdfs://, and s3a://.

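The .map(pair => pair._2.toString) step in that listing drops the key produced by TextInputFormat (the byte offset of each line) and keeps only the line text, which is why textFile yields an RDD[String] rather than a pair RDD. A plain-Scala stand-in for that mapping (the offsets here are fabricated for illustration):

```scala
// hadoopFile with TextInputFormat yields (byteOffset, lineText) pairs;
// we fake them with ordinary tuples instead of (LongWritable, Text).
val hadoopPairs: Seq[(Long, String)] = Seq(
  0L  -> "1,alpha",
  8L  -> "2,beta",
  15L -> "3,gamma"
)

// textFile keeps only pair._2, so the offsets are discarded
// and the result is a plain sequence of lines.
val lines: Seq[String] = hadoopPairs.map(pair => pair._2)
```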

Whereas for wholeTextFiles, the signature is as below:


def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope

If we observe, the signatures of the two methods are analogous, but textFile is useful for reading files, whereas wholeTextFiles is used to read directories of small files. It can also be used with larger files, but performance may suffer.
So when you want to deal with large files, textFile is the better option, whereas if you want to deal with a directory of smaller files, wholeTextFiles is better.


Answer by KayV

  1. textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") will create an RDD in which each individual line is an element.

  2. wholeTextFiles() reads a directory of text files and returns a PairRDD. For example, if there are a few files in a directory, the wholeTextFiles() method will create a pair RDD with the file name and path as the key, and the value being the whole file as a string.
