How to Convert Pandas Data Frame Schema
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/53233613/
Asked by Hamed
I am reading a CSV file with pandas.read_csv, and it detects the schema automatically, which looks like this:
Column1: string
Column2: string
Column3: string
Column4: int64
Column5: double
Column6: double
__index_level_0__: int64
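For reference, a listing in this form is what pyarrow reports when it builds an Arrow table from the pandas dataframe; a minimal sketch to print it, using the same input path as the code below, could be:

import pandas as pd
import pyarrow as pa

# Let pandas infer the column dtypes from the CSV
df = pd.read_csv('C:/input.csv', encoding="ISO-8859-1")

# Converting to an Arrow table shows the schema pyarrow derives from those dtypes,
# including the __index_level_0__ field it may add for the pandas index
print(pa.Table.from_pandas(df).schema)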
Then, I am trying to write it out as a Parquet table with pyarrow.parquet.write_table. However, I want to use the following schema for the new Parquet file:
Column1: string
Column2: string
Column3: string
Column4: string
Column5: string
Column6: string
__index_level_0__: int64
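Expressed in pyarrow terms, the desired schema above would look roughly like the following sketch (column names are taken from the listing; whether the index field is needed depends on how the table is built):

import pyarrow as pa

# Sketch of the target schema: every data column as a string,
# plus the int64 index field that from_pandas may add for the pandas index
target_schema = pa.schema([
    ('Column1', pa.string()),
    ('Column2', pa.string()),
    ('Column3', pa.string()),
    ('Column4', pa.string()),
    ('Column5', pa.string()),
    ('Column6', pa.string()),
    ('__index_level_0__', pa.int64()),
])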
But I get an error saying "Table schema does not match schema used to create file". Here is the piece of code I used to convert the CSV file to a Parquet file, borrowed from here:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/putput.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        # parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_schema = pa.schema([
            ('c1', pa.string()),
            ('c2', pa.string()),
            ('c3', pa.string()),
            ('c4', pa.string()),
            ('c5', pa.string()),
            ('c6', pa.string())
        ])
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
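For context, ParquetWriter raises this error whenever the schema of the table passed to write_table differs from the schema the writer was opened with. A minimal, self-contained sketch (toy data and file name, not from the question) that should reproduce the same message:

import pyarrow as pa
import pyarrow.parquet as pq

# Writer opened with an all-string schema
writer_schema = pa.schema([('Column4', pa.string())])
writer = pq.ParquetWriter('demo.parquet', writer_schema)

# A table whose column is still int64 does not match the writer's schema
table = pa.table({'Column4': pa.array([1, 2, 3], type=pa.int64())})
writer.write_table(table)  # ValueError: Table schema does not match schema used to create file
writer.close()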
Answered by G. Anderson
df=df.astype(str) will convert all of the data in a pandas dataframe to strings, with object dtypes, using the built-in astype() method.
You can also change the type of a single column, for example df['Column4'] = df['Column4'].astype(str).
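As a quick illustration of what astype(str) does to the dtypes, here is a small made-up frame:

import pandas as pd

# Made-up frame with mixed dtypes
df = pd.DataFrame({'Column4': [1, 2, 3], 'Column5': [1.5, 2.5, 3.5]})
print(df.dtypes)   # Column4: int64, Column5: float64

# Convert every column to str; the resulting dtypes become 'object'
df = df.astype(str)
print(df.dtypes)   # Column4: object, Column5: object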
All you need to do is change the type of your dataframe, or a subset of its columns, before calling parquet_writer.write_table(table). Altogether, your code would look like this:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/putput.parquet'
chunksize = 100_000

def convert(df):
    df['Column4'] = df['Column4'].astype(str)
    return df

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Derive the Parquet schema from the already-converted first chunk
        converted = convert(chunk)
        parquet_schema = pa.Table.from_pandas(df=converted).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Convert the chunk and write it to the parquet file
    converted = convert(chunk)
    table = pa.Table.from_pandas(converted, parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
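To sanity-check the result, one option (a sketch, reusing the output path from the code above) is to read the schema back from the finished file and confirm that Column4 now comes out as a string type:

import pyarrow.parquet as pq

# Read only the schema of the written Parquet file
print(pq.read_schema('C:/putput.parquet'))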