Java: How to convert a CSV file to Parquet
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/26124417/
How to convert a csv file to parquet
Asked by author243
I'm new to Big Data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
Answered by Milad Khajavi
Read the CSV files into a DataFrame in Apache Spark with the spark-csv package. After loading the data into the DataFrame, save the DataFrame to a Parquet file.
// Read the CSV files into a DataFrame using the spark-csv package (Spark 1.x API)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // infer column types from the data
  .option("mode", "DROPMALFORMED") // drop rows that fail to parse
  .load("/home/myuser/data/log/*.csv")

// Write the DataFrame out as Parquet (replaced by df.write.parquet in Spark 2.x)
df.saveAsParquetFile("/home/myuser/data.parquet")
Answered by ostrokach
You can use Apache Drill, as described in Convert a CSV File to Apache Parquet With Drill.
In brief:
Start Apache Drill:
$ cd /opt/drill/bin
$ sqlline -u jdbc:drill:zk=local
Create the Parquet file:
-- Set default table format to parquet
ALTER SESSION SET `store.format`='parquet';

-- Create a parquet table containing all data from the CSV table
CREATE TABLE dfs.tmp.`/stats/airport_data/` AS
SELECT
  CAST(SUBSTR(columns[0],1,4) AS INT) `YEAR`,
  CAST(SUBSTR(columns[0],5,2) AS INT) `MONTH`,
  columns[1] as `AIRLINE`,
  columns[2] as `IATA_CODE`,
  columns[3] as `AIRLINE_2`,
  columns[4] as `IATA_CODE_2`,
  columns[5] as `GEO_SUMMARY`,
  columns[6] as `GEO_REGION`,
  columns[7] as `ACTIVITY_CODE`,
  columns[8] as `PRICE_CODE`,
  columns[9] as `TERMINAL`,
  columns[10] as `BOARDING_AREA`,
  CAST(columns[11] AS DOUBLE) as `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`;
Try selecting data from the new Parquet file:
-- Select data from parquet table
SELECT * FROM dfs.tmp.`/stats/airport_data/*`
You can change the dfs.tmp location by going to http://localhost:8047/storage/dfs (source: CSV and Parquet).
Answered by Madhu Kiran Seelam
The following code is an example using Spark 2.0. Reading with an explicit schema is much faster than using the inferSchema option, and Spark 2.0 converts to a Parquet file much more efficiently than Spark 1.6.
import org.apache.spark.sql.types._

// Define the schema explicitly instead of relying on inferSchema
val schema = StructType(Array(
  StructField("timestamp", StringType, true),
  StructField("site", StringType, true),
  StructField("requests", LongType, true)))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
df.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")
Answered by Hemant Kumar
1) You can create an external Hive table:
create external table emp(name string,job_title string,department string,salary_per_year int)
row format delimited
fields terminated by ','
location '.. hdfs location of csv file '
2) Create another Hive table that will store the Parquet file:
create external table emp_par(name string,job_title string,department string,salary_per_year int)
row format delimited
stored as PARQUET
location 'hdfs location where you want to save the parquet file'
Insert the data from table one into table two:
insert overwrite table emp_par select * from emp
Answered by ostrokach
I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!
Install dependencies
Using pip:
pip install pandas pyarrow
or using conda:
conda install pandas pyarrow -c conda-forge
Convert CSV to Parquet in chunks
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
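As a quick sanity check (a minimal sketch, assuming the same output path as in the script above), you can read the result back with PyArrow:

import pyarrow.parquet as pq

# Read the Parquet file back and inspect its schema and row count
table = pq.read_table('/path/to/my.parquet')
print(table.schema)
print(table.num_rows)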
Answered by Shuli Hakim
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys

sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)

# Define the schema of the five CSV columns explicitly
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4", StringType(), True),
    StructField("col5", StringType(), True)])

# Read the CSV as an RDD of rows split on commas, build a DataFrame
# with the schema, and write it out as Parquet
rdd = sc.textFile('/input.csv').map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet('/output.parquet')
Answered by Pranav Gupta
[For Python]
Pandas now has direct support for it.
Just read the CSV file into a pandas DataFrame using read_csv and write that DataFrame to a Parquet file using to_parquet, as in the sketch below.
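A minimal sketch of this approach (the file paths here are hypothetical, and to_parquet needs a Parquet engine such as pyarrow or fastparquet installed):

import pandas as pd

# Read the CSV into a DataFrame
df = pd.read_csv('/path/to/my.csv')

# Write the DataFrame out as a Parquet file (delegates to pyarrow or fastparquet)
df.to_parquet('/path/to/my.parquet')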