Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39623858/
Creating a Pandas DataFrame with HDFS file in .csv format
Asked by Raghav Gupta
I'm trying to create a Spark workflow that fetches .csv data from a Hadoop cluster and puts it into a Pandas DataFrame. I'm able to pull the data from HDFS and put it in an RDD, but I am unable to process it into a Pandas DataFrame. The following is my code:
import pandas as pd
import numpy as nm
A=sc.textFile("hdfs://localhost:9000/sales_ord_univ.csv") # this creates the RDD
B=pd.DataFrame(A) # this gives me the following error: pandas.core.common.PandasError: DataFrame constructor not properly called!
I'm pretty sure this error occurs because the RDD is one big list, so I tried splitting the data by ';' (i.e. each new row is a different string), but that didn't seem to help either.
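For context, the reason pd.DataFrame(A) fails is that the constructor receives an RDD object, not tabular data. A minimal sketch of the parsing step that pd.DataFrame cannot do on its own, using hypothetical sample rows standing in for the collected RDD contents (the column names and values here are assumptions, not the real sales_ord_univ.csv schema):

```python
import io
import pandas as pd

# Hypothetical sample: A.collect() on an RDD of CSV text lines would
# return a plain list of strings shaped like this.
lines = ["id;name;amount", "1;foo;10", "2;bar;20"]

# Rejoin the lines and let pandas parse them as semicolon-separated CSV.
df = pd.read_csv(io.StringIO("\n".join(lines)), sep=";")
```

This works for small files, but collecting a large RDD onto the driver just to re-parse it in pandas defeats the point of Spark, which is why the answers below go through a Spark DataFrame instead.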
My overall goal is to use Pandas to change the CSV into JSON and output it into MongoDB. I have done this project using DictReader and PySpark SQL, but wanted to check whether it is possible using Pandas.
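For the CSV-to-JSON-to-MongoDB goal, pandas can produce MongoDB-ready documents directly. A short sketch with a hypothetical DataFrame (column names are assumptions; the insert_many call is commented out since it needs a live MongoDB connection):

```python
import pandas as pd

# Hypothetical DataFrame standing in for the parsed CSV.
df = pd.DataFrame({"id": [1, 2], "name": ["foo", "bar"]})

# One dict per row -- exactly the shape pymongo's insert_many expects.
records = df.to_dict(orient="records")

# With a pymongo collection in hand, this would be:
# collection.insert_many(records)
```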
Any help would be appreciated. Thanks!
Answered by Aeck
I would recommend loading the csv into a Spark DataFrame and then converting it to a Pandas DataFrame.
csvDf = (sqlContext.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .option("mode", "DROPMALFORMED")
         .load("hdfs://localhost:9000/sales_ord_univ.csv"))
B = csvDf.toPandas()
If you are still using a Spark version < 2.0, you have to use read.format("com.databricks.spark.csv")
and include the com.databricks.spark.csv package (e.g. with the --packages
parameter when using the pyspark shell).
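As a concrete example, the Spark < 2.0 setup might look like the following shell invocation (the package coordinate and version shown are an assumption; match the Scala version to your Spark build):

```shell
# Start the pyspark shell with the external spark-csv package loaded.
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```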
Answered by zinking
You need the hdfs Python package (version 2.0.16):
import pandas
from hdfs import Config

zzodClient = Config().get_client('zzod')  # refer to the docs to set up the config
with zzodClient.read(q2Path) as r2Reader:  # q2Path: HDFS path to the csv file
    r2 = pandas.read_csv(r2Reader)