Writing a CSV with column names and reading a CSV file generated from a SparkSQL DataFrame in PySpark
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original source: StackOverflow, http://stackoverflow.com/questions/38611418/
Asked by Satya
I have started the shell with the Databricks CSV package:
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a CSV file, did some groupBy operations, and dumped the result to a CSV.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv')  # it has columns and df.columns works fine
type(df)  # <class 'pyspark.sql.dataframe.DataFrame'>

# now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
# it creates a directory my.csv with 2 partitions

# To create a single file I followed the line of code below
# df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv")  # this creates one partition in a directory named after the csv
# but in both cases there is no column information (how to add column names to that csv file???)

# again I am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any columns in it; the 1st data row becomes the column names
Please don't answer with "add a schema to the DataFrame after read_csv" or "mention the column names while reading".
Question 1: While dumping the CSV, is there any way I can add column names to it?
Question 2: Is there a way to create a single CSV file (not a directory again) which can be opened by MS Office or Notepad++?
Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link on how to deal with to_csv into a single file in a clustered environment, that would be a great help.
Answered by Mike Metzger
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
Note that this may not be an issue on your current setup, but on extremely large datasets you can run into memory problems on the driver. This will also take longer (in a cluster scenario), as everything has to be pushed back to a single location.
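If coalescing to a single partition is too heavy for the driver, a rough alternative (just a sketch, assuming the output lands on a local filesystem; 'my_single.csv' below is a hypothetical name) is to let Spark write multiple part files with the header option on and then stitch them together locally, keeping only the first header line:

import glob
import shutil

# Assumption: 'path+my.csv' is a local directory written with header='true',
# so every part-* file starts with the same header row.
part_files = sorted(glob.glob('path+my.csv/part-*'))
with open('my_single.csv', 'w') as out:
    for i, part in enumerate(part_files):
        with open(part) as f:
            header = f.readline()
            if i == 0:
                out.write(header)        # keep the header only once
            shutil.copyfileobj(f, out)   # copy the remaining data rows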
Answered by FrancescoM
Just in case: on Spark 2.1 you can create a single CSV file with the following lines.
dataframe.coalesce(1)  // so that just a single part-* file will be created
  .write.mode(SaveMode.Overwrite)
  .option("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")  // avoid creating the _SUCCESS marker file
  .option("header", "true")  // write the header
  .csv("csvFullPath")
Answered by Satya
With Spark >= 2.0, we can do something like:
df = spark.read.csv('path+filename.csv', sep='ifany', header='true')
df.write.csv('path_filename of csv', header=True)           # yes, still written as a directory of partitions
df.toPandas().to_csv('path_filename of csv', index=False)   # single csv (pandas style)
Answered by Satya
Got the answer for the 1st question: it was a matter of passing one extra parameter, header='true', along with the CSV write statement:
df.write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Alternative for the 2nd question: use df.toPandas().to_csv(). But again, I don't want to use pandas here, so please suggest if there is any other way around it.
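One pandas-free possibility (only a sketch, assuming the output directory is on a local filesystem; 'out_dir' and 'final.csv' are hypothetical names) is to coalesce to one partition and then rename the single part file that Spark produces:

import glob
import shutil

# Write a single part file with a header (column names come from the DataFrame).
df.coalesce(1).write.format('com.databricks.spark.csv').save('out_dir', header='true')

# Spark still creates a directory, so move the lone part-* file to the name we want
# and drop the leftover directory.
part_file = glob.glob('out_dir/part-*')[0]
shutil.move(part_file, 'final.csv')
shutil.rmtree('out_dir')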
Answered by Giorgos Myrianthous
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note, however, that this is an expensive operation and might not be feasible with extremely large datasets.
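As a quick sanity check (a usage sketch; 'output.csv' is the same path as in the example above), the written directory can be read back with the header option so the column names are restored:

# Read the output back; header='true' turns the first line of each file into column names.
df_check = spark.read.option('header', 'true').csv('output.csv')
df_check.printSchema()  # column names should match the original DataFrame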