How to Convert Pandas Data Frame Schema

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, link to the original post, and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/53233613/

Date: 2020-09-14 06:08:44  Source: igfitidea


Tags: python, pandas, schema, parquet, pyarrow

Asked by Hamed

I am reading a CSV file with pandas.read_csv, and it detects the schema automatically, which looks like this:


Column1: string
Column2: string
Column3: string
Column4: int64
Column5: double
Column6: double
__index_level_0__: int64

Then, I try to write it out as a Parquet table with pyarrow.parquet.write_table. However, I want to use the following schema for the new Parquet file:


Column1: string
Column2: string
Column3: string
Column4: string
Column5: string
Column6: string
__index_level_0__: int64

But I get an error saying "Table schema does not match schema used to create file". Here is the code I used to convert the CSV file to a Parquet file, borrowed from here:


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/output.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        # parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_schema = pa.schema([
            ('c1', pa.string()),
            ('c2', pa.string()),
            ('c3', pa.string()),
            ('c4', pa.string()),
            ('c5', pa.string()),
            ('c6', pa.string())
        ])
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()

Answered by G. Anderson

df = df.astype(str) will convert all of the data in a pandas dataframe to strings, yielding object dtypes, using the built-in astype() method.


You can also change the type of a single column, for example df['Column4'] = df['Column4'].astype(str).

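A minimal, self-contained illustration of both conversions (column names follow the question; the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'Column4': [1, 2, 3], 'Column5': [0.5, 1.5, 2.5]})
print(df.dtypes)              # Column4: int64, Column5: float64

# Convert every column: all dtypes become object (Python str values)
df_all = df.astype(str)
print(df_all.dtypes)

# Or convert a single column, leaving the rest untouched
df['Column4'] = df['Column4'].astype(str)
print(df['Column4'].iloc[0])  # '1'
```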

All you need to do is change the type of your dataframe, or a subset of its columns, before calling parquet_writer.write_table(table). Altogether, your code would look like this:


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/output.parquet'
chunksize = 100_000

def convert(df):
    # Cast the columns the target schema expects as strings
    for col in ('Column4', 'Column5', 'Column6'):
        df[col] = df[col].astype(str)
    return df

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    converted = convert(chunk)
    if i == 0:
        # Derive the Parquet schema from the first (already converted) chunk
        parquet_schema = pa.Table.from_pandas(df=converted).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write the converted CSV chunk to the Parquet file
    table = pa.Table.from_pandas(converted, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
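As a side note (not part of the original answer), pandas can also be told to parse every column as a string up front via read_csv's dtype parameter, which makes the conversion step unnecessary; a minimal sketch using an in-memory CSV:

```python
import io
import pandas as pd

csv_data = "Column4,Column5\n1,0.5\n2,1.5\n"

# dtype=str makes read_csv keep every value as a string (object dtype)
df = pd.read_csv(io.StringIO(csv_data), dtype=str)
print(df.dtypes)
```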