Pandas DataFrame to RDD

Question

Asked by kraster

Can I convert a Pandas DataFrame to RDD?

import pandas as pd

if isinstance(data2, pd.DataFrame):
    print('is DataFrame')
else:
    print('is NOT DataFrame')

is DataFrame

Here is the output when trying to use .rdd

dataRDD = data2.rdd
print dataRDD

AttributeError                            Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
      2 print dataRDD

/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   2148                 return self[name]
   2149             raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150                                  (type(self).__name__, name))
   2151 
   2152     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'rdd'

I would like to build on a Pandas DataFrame rather than sqlContext, as I'm not sure whether all the functions of a Pandas DataFrame are available in Spark. If this is not possible, can anyone provide an example of using a Spark DataFrame?

Answer 1

Answered by zero323

Can I convert a Pandas Dataframe to RDD?

Well, yes, you can do it. Pandas DataFrames

import pandas as pd

pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print(pdDF)

##      k  v
## 0  foo  1
## 1  bar  2

can be converted to Spark DataFrames

spDF = sqlContext.createDataFrame(pdDF)
spDF.show()

## +---+-+
## |  k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+

and after that you can easily access the underlying RDD

spDF.rdd.first()

## Row(k=u'foo', v=1)

Still, I think you have the wrong idea here. A Pandas DataFrame is a local data structure. It is stored and processed locally on the driver. There is no data distribution or parallel processing, and it doesn't use RDDs (hence no rdd attribute). Unlike a Spark DataFrame, it provides random access capabilities.
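
To make the "local, random access" point concrete, here is a minimal sketch using only pandas. It assumes Python 3 and reuses the toy data from above; the cell lookups and the in-place rename are just illustrations of what a local, mutable structure allows, not something taken from the original answer.

```python
import pandas as pd

# Same toy data as in the answer above.
pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))

# Random access: fetch a single cell by position or by label.
first_key = pdDF.iloc[0, 0]        # row position 0, column position 0
second_value = pdDF.loc[1, "v"]    # row label 1, column "v"

# In-place mutation: the existing object is modified directly.
pdDF.loc[1, "v"] = 20
result = pdDF.rename(columns={"v": "value"}, inplace=True)  # returns None

print(first_key, second_value)
print(list(pdDF.columns), result)
```

Everything here happens in the memory of a single Python process, which is why none of it involves an RDD.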

A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be accessed using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access, and it is immutable (no equivalent of Pandas inplace). Every transformation returns a new DataFrame.

If this is not possible, is there anyone that can provide an example of using Spark DF

Not really. It is far too broad a topic for SO. Spark has really good documentation, and Databricks provides some additional resources. For starters, you can check these:
