Convert csv to parquet file using python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50604133/

Tags: python, csv, parquet

Question by inquisitiveProgrammer

I am trying to convert a .csv file to a .parquet file.
The CSV file (Temp.csv) has the following format:

1,Jon,Doe,Denver

I am using the following Python code to convert it into Parquet:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    schema = StructType([
            StructField("col1", IntegerType(), True),
            StructField("col2", StringType(), True),
            StructField("col3", StringType(), True),
            StructField("col4", StringType(), True)])
    dirname = os.path.dirname(os.path.abspath(__file__))
    csvfilename = os.path.join(dirname,'Temp.csv')    
    rdd = sc.textFile(csvfilename).map(lambda line: line.split(","))
    df = sqlContext.createDataFrame(rdd, schema)
    parquetfilename = os.path.join(dirname,'output.parquet')    
    df.write.mode('overwrite').parquet(parquetfilename)

The result is only a folder named output.parquet, not the Parquet file I'm looking for, followed by the following error on the console.

[Image: CSV to Parquet Error]

I have also tried running the following code, which runs into a similar issue:

from pyspark.sql import SparkSession
import os

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
dirname = os.path.dirname(os.path.abspath(__file__))
csvfilename = os.path.join(dirname,'Temp.csv')    
df = spark.read.csv(csvfilename)

# Displays the content of the DataFrame to stdout
df.show()
parquetfilename = os.path.join(dirname,'output.parquet')    
df.write.mode('overwrite').parquet(parquetfilename)

What is the best way to do this? I am using Windows with Python 2.7.

Answer by Uwe L. Korn

Using the packages pyarrow and pandas, you can convert CSVs to Parquet without using a JVM in the background:

import pandas as pd
df = pd.read_csv('example.csv')
df.to_parquet('output.parquet')
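
As a quick check, you can read the file back (a minimal sketch, assuming the output.parquet written by the snippet above):

import pandas as pd

# pandas reads Parquet through pyarrow (or fastparquet) under the hood
df = pd.read_parquet('output.parquet')
print(df.head())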

One limitation you will run into is that pyarrow is only available for Python 3.5+ on Windows. Either use Linux/OSX to run the code with Python 2, or upgrade your Windows setup to Python 3.6.

Answer by Amol More

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem

# Read the CSV from S3 into a pandas DataFrame
s3 = boto3.client('s3', region_name='us-east-2')
obj = s3.get_object(Bucket='ssiworkoutput', Key='file_Folder/File_Name.csv')
df = pd.read_csv(obj['Body'])

# Convert the DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# Write a partitioned Parquet dataset back to S3; the partition
# columns ('Year', 'Month') must exist in the CSV
output_file = "s3://ssiworkoutput/file/output.parquet"
fs = S3FileSystem()

pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['Year', 'Month'],
                    filesystem=fs)

print("File converted from CSV to Parquet")

Answer by Powers

There are a few different ways to convert a CSV file to Parquet with Python.

Uwe L. Korn's Pandas approach works perfectly well.

Here's a PySpark snippet that works in a Spark environment:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .master("local") \
  .appName("parquet_example") \
  .getOrCreate()

df = spark.read.csv('data/us_presidents.csv', header = True)
df.repartition(1).write.mode('overwrite').parquet('tmp/pyspark_us_presidents')
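
To verify the output, you can read it back with the same Spark session (a quick check, not part of the original answer):

df = spark.read.parquet('tmp/pyspark_us_presidents')
df.show()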

You can also use Koalas in a Spark environment:

import databricks.koalas as ks

df = ks.read_csv('data/us_presidents.csv')
df.to_parquet('tmp/koala_us_presidents')

Read this blog post for more information.

Answer by ishwar

You can write it as a Parquet file using Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()

parquetDF = spark.read.csv("data.csv")

# coalesce(1) writes a single part-*.parquet file inside the "Parquet" output directory
parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")

I hope this helps.

Answer by taras

You can convert CSV to Parquet using pyarrow only, without pandas. This can be useful when you need to minimize your code dependencies (e.g. with AWS Lambda):

import pyarrow.csv as pv
import pyarrow.parquet as pq

filename = 'Temp.csv'  # path to the input CSV, defined here so the snippet runs

table = pv.read_csv(filename)
pq.write_table(table, filename.replace('csv', 'parquet'))

Refer to the pyarrow docs to fine-tune the read_csv and write_table functions.

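For example, here is a hedged sketch of tuning both calls; the delimiter, column names, and compression below are illustrative, not from the original answer:

import pyarrow.csv as pv
import pyarrow.parquet as pq

# Parse a semicolon-delimited file that has no header row
read_options = pv.ReadOptions(column_names=['id', 'first_name', 'last_name', 'city'])
parse_options = pv.ParseOptions(delimiter=';')

table = pv.read_csv('example.csv', read_options=read_options, parse_options=parse_options)
pq.write_table(table, 'example.parquet', compression='snappy')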