Spark RDD to DataFrame python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39699107/
Asked by Hyman Daniel
I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and examples where a schema is passed to the sqlContext.createDataFrame(rdd, schema) function.
But I have 38 columns or fields, and this number will increase further. If I manually write out the schema, specifying each field's information, it is going to be a very tedious job.
Is there any other way to specify the schema without knowing the column information beforehand?
Answered by Thiago Baldim
See,
There are two ways to convert an RDD to DF in Spark.
toDF() and createDataFrame(rdd, schema)
I will show you how you can do that dynamically.
toDF()
The toDF() command gives you the way to convert an RDD[Row] to a DataFrame. The point is that the object Row() can receive a **kwargs argument, so there is an easy way to do that.
from pyspark.sql.types import Row

# Here you are going to create a function that turns one record
# into a dict keyed by the stringified column position.
def f(x):
    d = {}
    for i in range(len(x)):
        d[str(i)] = x[i]
    return d

# Now populate that: splat each dict into a Row and convert to a DataFrame.
df = rdd.map(lambda x: Row(**f(x))).toDF()
This way you are going to be able to create a dataframe dynamically.
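For context, here is a minimal end-to-end sketch of that approach; the sample data and the already-running SparkContext sc are assumptions, and f is the function defined above:

# Hypothetical sample data: any RDD of equal-length tuples works.
rdd = sc.parallelize([("Alice", "1"), ("Bob", "2")])
# Each tuple becomes a Row whose field names are the positions "0", "1", ...
df = rdd.map(lambda x: Row(**f(x))).toDF()
df.show()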
createDataFrame(rdd, schema)
The other way to do that is to create a dynamic schema. How?
This way:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField(str(i), StringType(), True) for i in range(32)])
df = sqlContext.createDataFrame(rdd, schema)
This second way is cleaner...
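As a side note, a sketch of one possible variation when the column count (32 here, or the asker's 38) is not known up front is to size the schema from the first record. This assumes the RDD is non-empty, and num_cols is a name introduced for illustration:

# Derive the number of fields from the data itself (assumes a non-empty rdd).
num_cols = len(rdd.first())
schema = StructType([StructField(str(i), StringType(), True) for i in range(num_cols)])
df = sqlContext.createDataFrame(rdd, schema)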
So this is how you can create dataframes dynamically.
Answered by Arun Sharma
Try this and see if it works:
sc = spark.sparkContext
# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(RddName)
schemaPeople.createOrReplaceTempView("RddName")
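For what it's worth, a minimal sketch of how that schema-inference path might look in practice; the Row-based sample data is an assumption and not part of the original answer:

from pyspark.sql import Row

# Hypothetical data: column names and types are inferred from the Row objects.
RddName = spark.sparkContext.parallelize([Row(name="Alice", age=1), Row(name="Bob", age=2)])
schemaPeople = spark.createDataFrame(RddName)
schemaPeople.createOrReplaceTempView("RddName")
spark.sql("SELECT name FROM RddName WHERE age > 1").show()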
Answered by pegah
I liked Arun's answer better, but there is a tiny problem and I could not comment on or edit the answer. sparkContext does not have createDataFrame; sqlContext does (as Thiago mentioned). So:
from pyspark.sql import SQLContext
# assuming the Spark environment is set and sc is spark.sparkContext
sqlContext = SQLContext(sc)
schemaPeople = sqlContext.createDataFrame(RDDName)
schemaPeople.createOrReplaceTempView("RDDName")