Python PySpark: convert a standard list to a data frame

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/48448473/

Date: 2020-08-19 18:42:42  Source: igfitidea

Pyspark convert a standard list to data frame

python, apache-spark, pyspark, pyspark-sql

Asked by seiya

The case is really simple: I need to convert a Python list into a data frame with the following code

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)

df.show()

It failed with the following error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 1 in type <class 'int'>
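
The error arises because a StructType schema describes rows, so each element of the RDD must be row-like (a tuple or Row), not a bare int. A minimal sketch of one way to keep the original schema, assuming the same sc and sqlContext as in the question:

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]

# wrap each bare int in a one-element tuple so it matches the row schema
rdd = sc.parallelize([(x,) for x in my_list])
df = sqlContext.createDataFrame(rdd, schema)
df.show()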

Answered by E. Ducateme

This solution is also an approach that uses less code, avoids serialization to an RDD, and is likely easier to understand:

from pyspark.sql.types import IntegerType

# notice the variable name (more below)
mylist = [1, 2, 3, 4]

# notice the parens after the type name
spark.createDataFrame(mylist, IntegerType()).show()

NOTE: About naming your variable list: the term list is the name of a Python builtin, and as such it is strongly recommended that we avoid using builtin names as the names/labels for our variables, because we end up shadowing things like the list() function. When prototyping something quick and dirty, a number of folks use something like mylist instead.

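To give the single column an explicit name with this approach, DataFrame.toDF can rename it. A small sketch, assuming a SparkSession named spark is available:

from pyspark.sql.types import IntegerType

mylist = [1, 2, 3, 4]

# build the single-column frame, then rename the column to "numbers"
spark.createDataFrame(mylist, IntegerType()).toDF("numbers").show()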

Answered by user15051990

Please see the code below:

    from pyspark.sql import Row

    li = [1, 2, 3, 4]
    rdd1 = sc.parallelize(li)
    # wrap each element in a Row so createDataFrame can build a single column
    row_rdd = rdd1.map(lambda x: Row(x))
    df = sqlContext.createDataFrame(row_rdd, ['numbers'])
    df.show()


+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
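
Related to this approach, the same single-column frame can also be built without an RDD by wrapping each value in a one-element tuple, assuming a SparkSession named spark is available:

li = [1, 2, 3, 4]

# each one-element tuple becomes a row; the column name is given explicitly
df = spark.createDataFrame([(x,) for x in li], ["numbers"])
df.show()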