Python: How to assign and use column headers in Spark?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36608559/

Date: 2020-08-19 18:06:45  Source: igfitidea

How to assign and use column headers in Spark?

python, hadoop, apache-spark, pyspark, multiple-columns

Asked by GoldenPlatinum

I am reading a dataset as below.


 f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields, and I want to assign a column header to each field so I can reference them later in my script.


How do I do that in PySpark? Is a DataFrame the way to go here?


PS - Newbie to Spark.


Answered by BushMinusZero

The solution really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and assign column names with toDF, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.


filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
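
If the CSV file's first line already contains the field names, Spark can pick the column names up directly instead of listing them all in toDF; a minimal sketch, assuming Spark 2.0+ and that the file actually has a header row (the path is hypothetical):

filename = "/path/to/file.csv"
# header=True takes column names from the first line; inferSchema=True guesses the types
df = spark.read.csv(filename, header=True, inferSchema=True)
df.printSchema()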

Answered by Ida

Here is how to add column names using a DataFrame:


Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:


f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: [x for x in line.split(',')])

Suppose the data has 3 columns:


data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now, you can specify the column names when converting this RDD to a DataFrame using toDF():


df_withcol = data_rdd.toDF(['height','color','width'])

df_withcol.printSchema()

    root
     |-- height: string (nullable = true)
     |-- color: string (nullable = true)
     |-- width: string (nullable = true)
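
Once the names are assigned, later parts of the script can refer to the columns by name. A small usage sketch (the cast is only illustrative, since the columns above are all strings):

from pyspark.sql.functions import col

df_withcol.select("color", "width").show()
df_withcol.filter(col("height").cast("double") > 1.0).show()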

If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:


df_default = data_rdd.toDF()

df_default.printSchema()

    root
     |-- _1: string (nullable = true)
     |-- _2: string (nullable = true)
     |-- _3: string (nullable = true)
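
Even if a DataFrame was created with the default names, the columns can still be renamed afterwards; a minimal sketch reusing the example names from above:

# rename all columns at once ...
df_renamed = df_default.toDF('height', 'color', 'width')
# ... or rename a single column
df_renamed = df_default.withColumnRenamed('_1', 'height')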

Answered by Vinod Kumar

f = sc.textFile("s3://test/abc.csv")
header = f.first()

header will give you the following (taking 3 column names as an example):

u'col1,col2,col3'

head = str(header).split(",")

head will give you a list:

['col1', 'col2', 'col3']

fDF = f.filter(lambda row: row != header).map(lambda x: str(x).split(",")).toDF(head)
fDF.show()

This will give you the header names as well as the data in a DataFrame, as required.

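Note that every column produced by the textFile/split approach comes in as a string, just like in the printSchema output shown earlier; if typed columns are needed later, they can be cast after naming. A small sketch, assuming the 'col1', 'col2', 'col3' names from this answer and that col1 holds numeric values:

from pyspark.sql.functions import col

# cast the string column 'col1' to double after the names have been assigned
typed_df = fDF.withColumn("col1", col("col1").cast("double"))
typed_df.printSchema()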