What is the Spark DataFrame method `toPandas` actually doing?

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29226210/

Tags: python, pandas, apache-spark, pyspark

Asked by Napitupulu Jon

I'm a beginner with the Spark DataFrame API.

I use this code to load a tab-separated CSV into a Spark DataFrame:

from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)

Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():

  • Does it store the pandas object in local memory?
  • Is the pandas low-level computation all handled by Spark?
  • Does it expose all pandas DataFrame functionality? (I guess yes)
  • Can I just convert it to pandas and be done with it, without touching the DataFrame API so much?

Accepted answer by Phillip Cloud

Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.

It seems like you might be misunderstanding the use cases of the technologies in play here.

Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.

In your example, the sc.textFile method will simply give you a Spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to, because they are still strings as far as Spark is concerned.

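To make that concrete, here is a minimal sketch. It reuses parts from the question's code and assumes, purely for illustration, that the numeric values sit in the third column: every field produced by split('\t') comes back as a Python string, so summing requires an explicit cast.

first_row = parts.first()
print(type(first_row[2]))                            # <class 'str'>, even if the text looks numeric
total = parts.map(lambda row: float(row[2])).sum()   # cast by hand before Spark can sum it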

Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.

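For instance, a minimal sketch of the pandas route, assuming tail5.csv is tab-separated and that fnames is the same column-name list used in the question (adjust sep, names, and header handling to match your file):

import pandas as pd

df = pd.read_csv('tail5.csv', sep='\t', names=fnames)
print(df.dtypes)   # pandas has already inferred a dtype for each column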

Now to answer your questions:

Does it store the pandas object in local memory?

Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.

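As a small illustration (using the ddf built in the question's code), the object returned by toPandas() is an ordinary pandas DataFrame sitting in the driver's local memory, so the full dataset has to fit there:

pdf = ddf.toPandas()
print(type(pdf))   # <class 'pandas.core.frame.DataFrame'>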

Is the pandas low-level computation all handled by Spark?

No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.

Does it expose all pandas DataFrame functionality?

No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many, many methods and functions in the pandas API that are not in the PySpark API.

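As a concrete sketch (assuming, for illustration only, a numeric-looking column named 'value' with some missing entries), you would convert first and then use the pandas-only method:

import pandas as pd

pdf = ddf.toPandas()
pdf['value'] = pd.to_numeric(pdf['value'], errors='coerce')   # the schema above is all strings
filled = pdf['value'].interpolate()                           # no equivalent on a PySpark Column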

Can I just convert it to pandas and be done with it, without touching the DataFrame API so much?

Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.

Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.

Answer by TheProletariat

Using some Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run the rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).

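A short sketch of that distinction, reusing the parts RDD from the question: transformations like map() stay distributed and lazy, while collect() copies every element back to the master node, so it should only be called on data small enough to fit there.

row_lengths = parts.map(lambda row: len(row))   # planned to run in parallel on the workers
local_list = row_lengths.collect()              # pulls all results into master-node memory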

Similarly with pandas: when you run toPandas(), you copy the data frame from distributed (worker) memory to local (master) memory and lose most of your distributed compute capability. So one possible workflow (that I often use) is to pre-munge your data down to a reasonable size using distributed compute methods, and then convert it to a pandas DataFrame for the rich feature set. Hope that helps.

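One possible shape of that workflow, sketched with hypothetical column names ('country', 'user_id', 'price') purely for illustration: shrink the data with Spark's distributed operations first, then hand the small result to pandas.

small = (ddf.filter(ddf['country'] == 'US')
            .select('user_id', 'price')
            .limit(100000))
pdf = small.toPandas()       # only the reduced data crosses into local (master) memory
summary = pdf.describe()     # from here on, the full pandas feature set is available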