Convert csv to parquet file using python

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/50604133/
Asked by inquisitiveProgrammer
I am trying to convert a .csv file to a .parquet file.
The csv file (Temp.csv) has the following format:
1,Jon,Doe,Denver
I am using the following Python code to convert it to Parquet:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    schema = StructType([
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", StringType(), True),
        StructField("col4", StringType(), True)])

    dirname = os.path.dirname(os.path.abspath(__file__))
    csvfilename = os.path.join(dirname, 'Temp.csv')
    rdd = sc.textFile(csvfilename).map(lambda line: line.split(","))
    df = sqlContext.createDataFrame(rdd, schema)

    parquetfilename = os.path.join(dirname, 'output.parquet')
    df.write.mode('overwrite').parquet(parquetfilename)
The result is only a folder named output.parquet, and not the parquet file that I'm looking for, followed by the following error on the console.
I have also tried running the following code, which runs into a similar issue.
from pyspark.sql import SparkSession
import os

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
dirname = os.path.dirname(os.path.abspath(__file__))
csvfilename = os.path.join(dirname, 'Temp.csv')
df = spark.read.csv(csvfilename)

# Displays the content of the DataFrame to stdout
df.show()

parquetfilename = os.path.join(dirname, 'output.parquet')
df.write.mode('overwrite').parquet(parquetfilename)
What is the best way to do this? I am using Windows and Python 2.7.
Answered by Uwe L. Korn
Using the packages pyarrow and pandas, you can convert CSVs to Parquet without using a JVM in the background:
import pandas as pd

# read the CSV into a DataFrame, then write it back out as Parquet
df = pd.read_csv('example.csv')
df.to_parquet('output.parquet')
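As a quick sanity check, you can read the file back with pandas; a minimal sketch (the path follows the example above):

import pandas as pd

# read the Parquet file written above back into a DataFrame
df = pd.read_parquet('output.parquet')
print(df.head())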
One limitation that you will run into is that pyarrow is only available for Python 3.5+ on Windows. Either use Linux/OSX to run the code with Python 2, or upgrade your Windows setup to Python 3.6.
Answered by Amol More
import boto3
import pandas as pd
import pyarrow as pa
from s3fs import S3FileSystem
import pyarrow.parquet as pq

# read the CSV object from S3 into a pandas DataFrame
s3 = boto3.client('s3', region_name='us-east-2')
obj = s3.get_object(Bucket='ssiworkoutput', Key='file_Folder/File_Name.csv')
df = pd.read_csv(obj['Body'])

# convert the DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# S3 path to write the Parquet dataset to
output_file = "s3://ssiworkoutput/file/output.parquet"
s3 = S3FileSystem()

# write as a Parquet dataset partitioned by the Year and Month columns
# (these columns must exist in the CSV)
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['Year', 'Month'],
                    filesystem=s3)

print("File converted from CSV to parquet completed")
Answered by Powers
There are a few different ways to convert a CSV file to Parquet with Python.
Uwe L. Korn's Pandas approach works perfectly well.
Here's a PySpark snippet that works in a Spark environment:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("parquet_example") \
    .getOrCreate()

df = spark.read.csv('data/us_presidents.csv', header=True)
df.repartition(1).write.mode('overwrite').parquet('tmp/pyspark_us_presidents')
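Note that Spark always writes Parquet output as a directory; repartition(1) just ensures that tmp/pyspark_us_presidents contains a single part-*.parquet file. A minimal sketch of reading it back (same session and path as above):

# Spark reads the whole output directory back as one DataFrame
df = spark.read.parquet('tmp/pyspark_us_presidents')
df.show()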
You can also use Koalas in a Spark environment:
import databricks.koalas as ks
df = ks.read_csv('data/us_presidents.csv')
df.to_parquet('tmp/koala_us_presidents')
Read this blog post for more information.
Answered by ishwar
You can also write it out as a Parquet file using Spark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()

parquetDF = spark.read.csv("data.csv")
parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")
I hope this helps.
Answered by taras
You can convert csv to parquet using pyarrow only, without pandas. This can be useful when you need to minimize your code dependencies (e.g. with AWS Lambda).
import pyarrow.csv as pv
import pyarrow.parquet as pq

filename = 'example.csv'  # example path; replace with your CSV file
table = pv.read_csv(filename)
pq.write_table(table, filename.replace('csv', 'parquet'))
Refer to the pyarrow docs to fine-tune the read_csv and write_table functions.
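For example, here is a hypothetical sketch of passing explicit read and parse options for a headerless, semicolon-delimited CSV (the delimiter, column names, and file paths are only illustrative):

import pyarrow.csv as pv
import pyarrow.parquet as pq

# hypothetical example: a semicolon-delimited CSV without a header row
read_options = pv.ReadOptions(column_names=['id', 'first_name', 'last_name', 'city'])
parse_options = pv.ParseOptions(delimiter=';')

table = pv.read_csv('example.csv', read_options=read_options, parse_options=parse_options)
pq.write_table(table, 'example.parquet', compression='snappy')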