Writing a CSV with column names and reading a CSV file generated from a SparkSQL DataFrame in PySpark
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original source: StackOverflow, http://stackoverflow.com/questions/38611418/
Asked by Satya
I have started the shell with the Databricks CSV package:
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a CSV file, did some groupBy operations, and dumped the result to a CSV.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv')  # it has columns and df.columns works fine
type(df)  # <class 'pyspark.sql.dataframe.DataFrame'>

# now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
# it creates a directory my.csv with 2 partitions

# To create a single file I followed the line of code below
# df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv")  # this creates one partition in a directory named after the csv
# but in both cases there is no column information (how to add column names to that csv file???)

# again I am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any columns in it; the 1st data row becomes the column names
Please don't answer with "add a schema to the DataFrame after read_csv" or "mention the column names while reading".
Question 1: While dumping the CSV, is there any way I can add column names to it?
Question 2: Is there a way to create a single CSV file (not a directory again) which can be opened by MS Office or Notepad++?
Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link on how to deal with to_csv into a single file in a clustered environment, that would be a great help.
Answered by Mike Metzger
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
Note that this may not be an issue on your current setup, but on extremely large datasets you can run into memory problems on the driver. This will also take longer (in a cluster scenario), as everything has to be pushed back to a single location.
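If coalescing to a single partition is too heavy for the driver, a rough alternative (just a sketch, assuming the output lands on a local filesystem; 'my_single.csv' below is a hypothetical name) is to let Spark write multiple part files with the header option on and then stitch them together locally, keeping only the first header line:

import glob
import shutil

# Assumption: 'path+my.csv' is a local directory written with header='true',
# so every part-* file starts with the same header row.
part_files = sorted(glob.glob('path+my.csv/part-*'))
with open('my_single.csv', 'w') as out:
    for i, part in enumerate(part_files):
        with open(part) as f:
            header = f.readline()
            if i == 0:
                out.write(header)        # keep the header only once
            shutil.copyfileobj(f, out)   # copy the remaining data rows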
Answered by FrancescoM
Just in case: on Spark 2.1 you can create a single CSV file with the following lines.
dataframe.coalesce(1)  // so that just a single part-* file will be created
  .write.mode(SaveMode.Overwrite)
  .option("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")  // avoid creating the _SUCCESS marker file
  .option("header", "true")  // write the header
  .csv("csvFullPath")
Answered by Satya
With Spark >= 2.0, we can do something like:
df = spark.read.csv('path+filename.csv', sep='ifany', header='true')
df.write.csv('path_filename of csv', header=True)           # yes, still written as a directory of partitions
df.toPandas().to_csv('path_filename of csv', index=False)   # single csv (pandas style)
Answered by Satya
Got the answer for the 1st question: it was a matter of passing one extra parameter, header='true', along with the CSV write statement:
df.write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Alternative for the 2nd question: use df.toPandas().to_csv(). But again, I don't want to use pandas here, so please suggest if there is any other way around it.
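One pandas-free possibility (only a sketch, assuming the output directory is on a local filesystem; 'out_dir' and 'final.csv' are hypothetical names) is to coalesce to one partition and then rename the single part file that Spark produces:

import glob
import shutil

# Write a single part file with a header (column names come from the DataFrame).
df.coalesce(1).write.format('com.databricks.spark.csv').save('out_dir', header='true')

# Spark still creates a directory, so move the lone part-* file to the name we want
# and drop the leftover directory.
part_file = glob.glob('out_dir/part-*')[0]
shutil.move(part_file, 'final.csv')
shutil.rmtree('out_dir')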
Answered by Giorgos Myrianthous
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note, however, that this is an expensive operation and might not be feasible with extremely large datasets.
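As a quick sanity check (a usage sketch; 'output.csv' is the same path as in the example above), the written directory can be read back with the header option so the column names are restored:

# Read the output back; header='true' turns the first line of each file into column names.
df_check = spark.read.option('header', 'true').csv('output.csv')
df_check.printSchema()  # column names should match the original DataFrame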