Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39623858/
Creating a Pandas DataFrame with HDFS file in .csv format
Asked by Raghav Gupta
I'm trying to create a Spark workflow that fetches .csv data from a Hadoop cluster and puts it into a Pandas DataFrame. I'm able to pull the data from HDFS and put it in an RDD, but I am unable to process it into a Pandas DataFrame. The following is my code:
import pandas as pd
import numpy as nm
A=sc.textFile("hdfs://localhost:9000/sales_ord_univ.csv") # this creates the RDD
B=pd.DataFrame(A) # this gives me the following error: pandas.core.common.PandasError: DataFrame constructor not properly called!
I'm pretty sure this error occurs because the RDD is one big list, so I tried splitting the data by ';' (i.e. each new row is a different string), but that didn't seem to help either.
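For context, the reason pd.DataFrame(A) fails is that the constructor receives an RDD object, not tabular data. A minimal sketch of the parsing step that pd.DataFrame cannot do on its own, using hypothetical sample rows standing in for the collected RDD contents (the column names and values here are assumptions, not the real sales_ord_univ.csv schema):

```python
import io
import pandas as pd

# Hypothetical sample: A.collect() on an RDD of CSV text lines would
# return a plain list of strings shaped like this.
lines = ["id;name;amount", "1;foo;10", "2;bar;20"]

# Rejoin the lines and let pandas parse them as semicolon-separated CSV.
df = pd.read_csv(io.StringIO("\n".join(lines)), sep=";")
```

This works for small files, but collecting a large RDD onto the driver just to re-parse it in pandas defeats the point of Spark, which is why the answers below go through a Spark DataFrame instead.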
My overall goal is to use Pandas to change the CSV into JSON and output it into MongoDB. I have done this project using DictReader and PySpark SQL, but wanted to check whether it is possible using Pandas.
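For the CSV-to-JSON-to-MongoDB goal, pandas can produce MongoDB-ready documents directly. A short sketch with a hypothetical DataFrame (column names are assumptions; the insert_many call is commented out since it needs a live MongoDB connection):

```python
import pandas as pd

# Hypothetical DataFrame standing in for the parsed CSV.
df = pd.DataFrame({"id": [1, 2], "name": ["foo", "bar"]})

# One dict per row -- exactly the shape pymongo's insert_many expects.
records = df.to_dict(orient="records")

# With a pymongo collection in hand, this would be:
# collection.insert_many(records)
```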
Any help would be appreciated. Thanks!
Answered by Aeck
I would recommend loading the csv into a Spark DataFrame and then converting it to a Pandas DataFrame.
csvDf = (sqlContext.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .option("mode", "DROPMALFORMED")
         .load("hdfs://localhost:9000/sales_ord_univ.csv"))
B = csvDf.toPandas()
If you are still using a Spark version < 2.0, you have to use read.format("com.databricks.spark.csv")
and include the com.databricks.spark.csv package (e.g. with the --packages
parameter when using the pyspark shell).
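As a concrete example, the Spark < 2.0 setup might look like the following shell invocation (the package coordinate and version shown are an assumption; match the Scala version to your Spark build):

```shell
# Start the pyspark shell with the external spark-csv package loaded.
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```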
Answered by zinking
You need the hdfs Python package (version 2.0.16):
import pandas
from hdfs import Config

zzodClient = Config().get_client('zzod')  # refer to the docs to set up the config
with zzodClient.read(q2Path) as r2Reader:  # q2Path: HDFS path to the csv file
    r2 = pandas.read_csv(r2Reader)