Python PySpark: convert a standard list to a data frame

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/48448473/

Date: 2020-08-19 18:42:42  Source: igfitidea

Pyspark convert a standard list to data frame

python, apache-spark, pyspark, pyspark-sql

Asked by seiya

The case is really simple: I need to convert a Python list into a data frame with the following code

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)

df.show()

It failed with the following error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 1 in type <class 'int'>
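
The error arises because a StructType schema describes rows, so each element of the RDD must be row-like (a tuple or Row), not a bare int. A minimal sketch of one way to keep the original schema, assuming the same sc and sqlContext as in the question:

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]

# wrap each bare int in a one-element tuple so it matches the row schema
rdd = sc.parallelize([(x,) for x in my_list])
df = sqlContext.createDataFrame(rdd, schema)
df.show()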

Answered by E. Ducateme

This solution is also an approach that uses less code, avoids serialization to an RDD, and is likely easier to understand:

from pyspark.sql.types import IntegerType

# notice the variable name (more below)
mylist = [1, 2, 3, 4]

# notice the parens after the type name
spark.createDataFrame(mylist, IntegerType()).show()

NOTE: About naming your variable list: the term list is the name of a Python builtin, and as such it is strongly recommended that we avoid using builtin names as the names/labels for our variables, because we end up shadowing things like the list() function. When prototyping something quick and dirty, a number of folks use something like mylist instead.

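To give the single column an explicit name with this approach, DataFrame.toDF can rename it. A small sketch, assuming a SparkSession named spark is available:

from pyspark.sql.types import IntegerType

mylist = [1, 2, 3, 4]

# build the single-column frame, then rename the column to "numbers"
spark.createDataFrame(mylist, IntegerType()).toDF("numbers").show()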

Answered by user15051990

Please see the code below:

    from pyspark.sql import Row

    li = [1, 2, 3, 4]
    rdd1 = sc.parallelize(li)
    # wrap each element in a Row so createDataFrame can build a single column
    row_rdd = rdd1.map(lambda x: Row(x))
    df = sqlContext.createDataFrame(row_rdd, ['numbers'])
    df.show()


+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
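
Related to this approach, the same single-column frame can also be built without an RDD by wrapping each value in a one-element tuple, assuming a SparkSession named spark is available:

li = [1, 2, 3, 4]

# each one-element tuple becomes a row; the column name is given explicitly
df = spark.createDataFrame([(x,) for x in li], ["numbers"])
df.show()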