Java: How to convert a CSV file to Parquet
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/26124417/
How to convert a csv file to parquet
Asked by author243
I'm new to Big Data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
Answered by Milad Khajavi
Read the CSV files into a DataFrame in Apache Spark with the spark-csv package. After loading the data into the DataFrame, save the DataFrame to a Parquet file.
// Read the CSV files into a DataFrame using the spark-csv package (Spark 1.x API)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // infer column types from the data
  .option("mode", "DROPMALFORMED") // drop rows that fail to parse
  .load("/home/myuser/data/log/*.csv")

// Write the DataFrame out as Parquet (replaced by df.write.parquet in Spark 2.x)
df.saveAsParquetFile("/home/myuser/data.parquet")
Answered by ostrokach
You can use Apache Drill, as described in Convert a CSV File to Apache Parquet With Drill.
In brief:
Start Apache Drill:
$ cd /opt/drill/bin
$ sqlline -u jdbc:drill:zk=local
Create the Parquet file:
-- Set default table format to parquet
ALTER SESSION SET `store.format`='parquet';

-- Create a parquet table containing all data from the CSV table
CREATE TABLE dfs.tmp.`/stats/airport_data/` AS
SELECT
  CAST(SUBSTR(columns[0],1,4) AS INT) `YEAR`,
  CAST(SUBSTR(columns[0],5,2) AS INT) `MONTH`,
  columns[1] as `AIRLINE`,
  columns[2] as `IATA_CODE`,
  columns[3] as `AIRLINE_2`,
  columns[4] as `IATA_CODE_2`,
  columns[5] as `GEO_SUMMARY`,
  columns[6] as `GEO_REGION`,
  columns[7] as `ACTIVITY_CODE`,
  columns[8] as `PRICE_CODE`,
  columns[9] as `TERMINAL`,
  columns[10] as `BOARDING_AREA`,
  CAST(columns[11] AS DOUBLE) as `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`;
Try selecting data from the new Parquet file:
-- Select data from parquet table
SELECT * FROM dfs.tmp.`/stats/airport_data/*`
You can change the dfs.tmp location by going to http://localhost:8047/storage/dfs (source: CSV and Parquet).
Answered by Madhu Kiran Seelam
The following code is an example using Spark 2.0. Reading with an explicit schema is much faster than using the inferSchema option, and Spark 2.0 converts to a Parquet file much more efficiently than Spark 1.6.
import org.apache.spark.sql.types._

// Define the schema explicitly instead of relying on inferSchema
val schema = StructType(Array(
  StructField("timestamp", StringType, true),
  StructField("site", StringType, true),
  StructField("requests", LongType, true)))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
df.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")
Answered by Hemant Kumar
1) You can create an external Hive table:
create external table emp(name string,job_title string,department string,salary_per_year int)
row format delimited
fields terminated by ','
location '.. hdfs location of csv file '
2) Create another Hive table that will store the Parquet file:
create external table emp_par(name string,job_title string,department string,salary_per_year int)
row format delimited
stored as PARQUET
location 'hdfs location where you want to save the parquet file'
Insert the data from table one into table two:
insert overwrite table emp_par select * from emp
Answered by ostrokach
I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!
Install dependencies
Using pip:
pip install pandas pyarrow
or using conda:
conda install pandas pyarrow -c conda-forge
Convert CSV to Parquet in chunks
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
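As a quick sanity check (a minimal sketch, assuming the same output path as in the script above), you can read the result back with PyArrow:

import pyarrow.parquet as pq

# Read the Parquet file back and inspect its schema and row count
table = pq.read_table('/path/to/my.parquet')
print(table.schema)
print(table.num_rows)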
Answered by Shuli Hakim
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys

sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)

# Define the schema of the five CSV columns explicitly
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4", StringType(), True),
    StructField("col5", StringType(), True)])

# Read the CSV as an RDD of rows split on commas, build a DataFrame
# with the schema, and write it out as Parquet
rdd = sc.textFile('/input.csv').map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet('/output.parquet')
Answered by Pranav Gupta
[For Python]
Pandas now has direct support for it.
Just read the CSV file into a pandas DataFrame using read_csv and write that DataFrame to a Parquet file using to_parquet, as in the sketch below.
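A minimal sketch of this approach (the file paths here are hypothetical, and to_parquet needs a Parquet engine such as pyarrow or fastparquet installed):

import pandas as pd

# Read the CSV into a DataFrame
df = pd.read_csv('/path/to/my.csv')

# Write the DataFrame out as a Parquet file (delegates to pyarrow or fastparquet)
df.to_parquet('/path/to/my.parquet')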