Convert csv to parquet file using python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50604133/

Tags: python, csv, parquet

Question by inquisitiveProgrammer

I am trying to convert a .csv file to a .parquet file.
The CSV file (Temp.csv) has the following format:

1,Jon,Doe,Denver

I am using the following Python code to convert it into Parquet:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    schema = StructType([
            StructField("col1", IntegerType(), True),
            StructField("col2", StringType(), True),
            StructField("col3", StringType(), True),
            StructField("col4", StringType(), True)])
    dirname = os.path.dirname(os.path.abspath(__file__))
    csvfilename = os.path.join(dirname,'Temp.csv')    
    rdd = sc.textFile(csvfilename).map(lambda line: line.split(","))
    df = sqlContext.createDataFrame(rdd, schema)
    parquetfilename = os.path.join(dirname,'output.parquet')    
    df.write.mode('overwrite').parquet(parquetfilename)

The result is only a folder named output.parquet, not the Parquet file I'm looking for, followed by the following error on the console.

[Image: CSV to Parquet Error]

I have also tried running the following code, which runs into a similar issue:

from pyspark.sql import SparkSession
import os

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
dirname = os.path.dirname(os.path.abspath(__file__))
csvfilename = os.path.join(dirname,'Temp.csv')    
df = spark.read.csv(csvfilename)

# Displays the content of the DataFrame to stdout
df.show()
parquetfilename = os.path.join(dirname,'output.parquet')    
df.write.mode('overwrite').parquet(parquetfilename)

What is the best way to do this? I am using Windows with Python 2.7.

Answer by Uwe L. Korn

Using the packages pyarrow and pandas, you can convert CSVs to Parquet without using a JVM in the background:

import pandas as pd
df = pd.read_csv('example.csv')
df.to_parquet('output.parquet')
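
As a quick check, you can read the file back (a minimal sketch, assuming the output.parquet written by the snippet above):

import pandas as pd

# pandas reads Parquet through pyarrow (or fastparquet) under the hood
df = pd.read_parquet('output.parquet')
print(df.head())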

One limitation you will run into is that pyarrow is only available for Python 3.5+ on Windows. Either use Linux/OSX to run the code with Python 2, or upgrade your Windows setup to Python 3.6.

Answer by Amol More

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem

# Read the CSV from S3 into a pandas DataFrame
s3 = boto3.client('s3', region_name='us-east-2')
obj = s3.get_object(Bucket='ssiworkoutput', Key='file_Folder/File_Name.csv')
df = pd.read_csv(obj['Body'])

# Convert the DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# Write a partitioned Parquet dataset back to S3; the partition
# columns ('Year', 'Month') must exist in the CSV
output_file = "s3://ssiworkoutput/file/output.parquet"
fs = S3FileSystem()

pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['Year', 'Month'],
                    filesystem=fs)

print("File converted from CSV to Parquet")

Answer by Powers

There are a few different ways to convert a CSV file to Parquet with Python.

Uwe L. Korn's Pandas approach works perfectly well.

Here's a PySpark snippet that works in a Spark environment:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .master("local") \
  .appName("parquet_example") \
  .getOrCreate()

df = spark.read.csv('data/us_presidents.csv', header = True)
df.repartition(1).write.mode('overwrite').parquet('tmp/pyspark_us_presidents')
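
To verify the output, you can read it back with the same Spark session (a quick check, not part of the original answer):

df = spark.read.parquet('tmp/pyspark_us_presidents')
df.show()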

You can also use Koalas in a Spark environment:

import databricks.koalas as ks

df = ks.read_csv('data/us_presidents.csv')
df.to_parquet('tmp/koala_us_presidents')

Read this blog post for more information.

Answer by ishwar

You can write it as a Parquet file using Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()

parquetDF = spark.read.csv("data.csv")

# coalesce(1) writes a single part-*.parquet file inside the "Parquet" output directory
parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")

I hope this helps.

Answer by taras

You can convert CSV to Parquet using pyarrow only, without pandas. This can be useful when you need to minimize your code dependencies (e.g. with AWS Lambda):

import pyarrow.csv as pv
import pyarrow.parquet as pq

filename = 'Temp.csv'  # path to the input CSV, defined here so the snippet runs

table = pv.read_csv(filename)
pq.write_table(table, filename.replace('csv', 'parquet'))

Refer to the pyarrow docs to fine-tune the read_csv and write_table functions.

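For example, here is a hedged sketch of tuning both calls; the delimiter, column names, and compression below are illustrative, not from the original answer:

import pyarrow.csv as pv
import pyarrow.parquet as pq

# Parse a semicolon-delimited file that has no header row
read_options = pv.ReadOptions(column_names=['id', 'first_name', 'last_name', 'city'])
parse_options = pv.ParseOptions(delimiter=';')

table = pv.read_csv('example.csv', read_options=read_options, parse_options=parse_options)
pq.write_table(table, 'example.parquet', compression='snappy')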