Python: How to assign and use column headers in Spark?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36608559/

Date: 2020-08-19 18:06:45  Source: igfitidea

How to assign and use column headers in Spark?

python, hadoop, apache-spark, pyspark, multiple-columns

Asked by GoldenPlatinum

I am reading a dataset as below.


 f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields, and I want to assign a column header to each field so I can reference them later in my script.


How do I do that in PySpark? Is a DataFrame the way to go here?


PS - Newbie to Spark.


Answered by BushMinusZero

The solution really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and assign column names with toDF, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.


filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
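
If the CSV file's first line already contains the field names, Spark can pick the column names up directly instead of listing them all in toDF; a minimal sketch, assuming Spark 2.0+ and that the file actually has a header row (the path is hypothetical):

filename = "/path/to/file.csv"
# header=True takes column names from the first line; inferSchema=True guesses the types
df = spark.read.csv(filename, header=True, inferSchema=True)
df.printSchema()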

Answered by Ida

Here is how to add column names using a DataFrame:


Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:


f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: [x for x in line.split(',')])

Suppose the data has 3 columns:


data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now, you can specify the column names when converting this RDD to a DataFrame using toDF():


df_withcol = data_rdd.toDF(['height','color','width'])

df_withcol.printSchema()

    root
     |-- height: string (nullable = true)
     |-- color: string (nullable = true)
     |-- width: string (nullable = true)
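
Once the names are assigned, later parts of the script can refer to the columns by name. A small usage sketch (the cast is only illustrative, since the columns above are all strings):

from pyspark.sql.functions import col

df_withcol.select("color", "width").show()
df_withcol.filter(col("height").cast("double") > 1.0).show()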

If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:


df_default = data_rdd.toDF()

df_default.printSchema()

    root
     |-- _1: string (nullable = true)
     |-- _2: string (nullable = true)
     |-- _3: string (nullable = true)
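
Even if a DataFrame was created with the default names, the columns can still be renamed afterwards; a minimal sketch reusing the example names from above:

# rename all columns at once ...
df_renamed = df_default.toDF('height', 'color', 'width')
# ... or rename a single column
df_renamed = df_default.withColumnRenamed('_1', 'height')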

Answered by Vinod Kumar

f = sc.textFile("s3://test/abc.csv")
header = f.first()

header will give you the following (taking 3 column names as an example):

u'col1,col2,col3'

head = str(header).split(",")

head will give you a list:

['col1', 'col2', 'col3']

fDF = f.filter(lambda row: row != header).map(lambda x: str(x).split(",")).toDF(head)
fDF.show()

This will give you the header names as well as the data in a DataFrame, as required.

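Note that every column produced by the textFile/split approach comes in as a string, just like in the printSchema output shown earlier; if typed columns are needed later, they can be cast after naming. A small sketch, assuming the 'col1', 'col2', 'col3' names from this answer and that col1 holds numeric values:

from pyspark.sql.functions import col

# cast the string column 'col1' to double after the names have been assigned
typed_df = fDF.withColumn("col1", col("col1").cast("double"))
typed_df.printSchema()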