What is the Spark DataFrame method `toPandas` actually doing?

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29226210/

Tags: python, pandas, apache-spark, pyspark

Asked by Napitupulu Jon

I'm a beginner with the Spark DataFrame API.

I use this code to load a tab-separated CSV into a Spark DataFrame:

from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)

Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():

  • Does it store the pandas object in local memory?
  • Is the pandas low-level computation all handled by Spark?
  • Does it expose all pandas DataFrame functionality? (I guess yes)
  • Can I just convert it to pandas and be done with it, without touching the DataFrame API so much?

Accepted answer by Phillip Cloud

Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.

It seems like you might be misunderstanding the use cases of the technologies in play here.

Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.

In your example, the sc.textFile method will simply give you a Spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to, because they are still strings as far as Spark is concerned.

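To make that concrete, here is a minimal sketch. It reuses parts from the question's code and assumes, purely for illustration, that the numeric values sit in the third column: every field produced by split('\t') comes back as a Python string, so summing requires an explicit cast.

first_row = parts.first()
print(type(first_row[2]))                            # <class 'str'>, even if the text looks numeric
total = parts.map(lambda row: float(row[2])).sum()   # cast by hand before Spark can sum it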

Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.

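For instance, a minimal sketch of the pandas route, assuming tail5.csv is tab-separated and that fnames is the same column-name list used in the question (adjust sep, names, and header handling to match your file):

import pandas as pd

df = pd.read_csv('tail5.csv', sep='\t', names=fnames)
print(df.dtypes)   # pandas has already inferred a dtype for each column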

Now to answer your questions:

Does it store the pandas object in local memory?

Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.

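As a small illustration (using the ddf built in the question's code), the object returned by toPandas() is an ordinary pandas DataFrame sitting in the driver's local memory, so the full dataset has to fit there:

pdf = ddf.toPandas()
print(type(pdf))   # <class 'pandas.core.frame.DataFrame'>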

Is the pandas low-level computation all handled by Spark?

No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.

Does it expose all pandas DataFrame functionality?

No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many, many methods and functions in the pandas API that are not in the PySpark API.

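As a concrete sketch (assuming, for illustration only, a numeric-looking column named 'value' with some missing entries), you would convert first and then use the pandas-only method:

import pandas as pd

pdf = ddf.toPandas()
pdf['value'] = pd.to_numeric(pdf['value'], errors='coerce')   # the schema above is all strings
filled = pdf['value'].interpolate()                           # no equivalent on a PySpark Column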

Can I just convert it to pandas and be done with it, without touching the DataFrame API so much?

Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.

Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.

Answer by TheProletariat

Using some Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run the rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).

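A short sketch of that distinction, reusing the parts RDD from the question: transformations like map() stay distributed and lazy, while collect() copies every element back to the master node, so it should only be called on data small enough to fit there.

row_lengths = parts.map(lambda row: len(row))   # planned to run in parallel on the workers
local_list = row_lengths.collect()              # pulls all results into master-node memory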

Similarly with pandas: when you run toPandas(), you copy the data frame from distributed (worker) memory to local (master) memory and lose most of your distributed compute capability. So one possible workflow (that I often use) is to pre-munge your data down to a reasonable size using distributed compute methods, and then convert it to a pandas DataFrame for the rich feature set. Hope that helps.

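One possible shape of that workflow, sketched with hypothetical column names ('country', 'user_id', 'price') purely for illustration: shrink the data with Spark's distributed operations first, then hand the small result to pandas.

small = (ddf.filter(ddf['country'] == 'US')
            .select('user_id', 'price')
            .limit(100000))
pdf = small.toPandas()       # only the reduced data crosses into local (master) memory
summary = pdf.describe()     # from here on, the full pandas feature set is available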