Original URL: http://stackoverflow.com/questions/52889647/
Warning: these answers are provided under the CC BY-SA 4.0 license. You are free to use and share them, but you must attribute them to the original authors (not me): StackOverflow
How to read an ORC file stored locally in Python Pandas?
Asked by Della
Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?
The filename is someFile.snappy.orc
I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark, it still throws an error.
Answered by Rafal Janik
I haven't been able to find any great options; there are a few dead projects trying to wrap the Java reader. However, pyarrow does have an ORC reader that doesn't require you to use pyspark. It's a bit limited, but it works.
import pandas as pd
import pyarrow.orc as orc

# ORC is a binary format, so the file must be opened in binary mode
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
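As a side note, newer pandas releases (1.0 and later) wrap this pyarrow reader directly, so the manual file handling can be skipped; a minimal sketch, assuming pyarrow is installed:

import pandas as pd

# pd.read_orc delegates to pyarrow's ORC reader under the hood
df = pd.read_orc('someFile.snappy.orc')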
Answered by Duy Tran
In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark data frame and then convert it to a pandas data frame:
import findspark

# findspark.init() must run before importing pyspark,
# so that the local Spark installation can be found on the path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
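The result is an ordinary pandas DataFrame, so a quick sanity check works as usual (the columns are whatever the ORC file defines):

# Inspect the converted frame
print(df_pandas.shape)
print(df_pandas.dtypes)
print(df_pandas.head())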
Answered by Andrea
ORC, like AVRO and PARQUET, is a format specifically designed for massive storage. You can think of these files as being "like a CSV": they all contain data, but each with its own particular structure (different from CSV, or JSON of course!).
Reading an ORC file with pyspark should be easy, as long as your environment provides Hive support. To answer your question: I'm not sure you can read it in a local environment without Hive; I've never done it (you can run a quick test with the following code):
Loads ORC files, returning the result as a DataFrame.
Note: Currently ORC support is only available together with Hive support.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
Hive is a data warehouse system that lets you query data on HDFS (a distributed file system) through MapReduce, much like a traditional relational database (queries are SQL-like, but Hive does not support 100% of the standard SQL features!).
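For illustration, here is a minimal sketch of enabling Hive support when building a Spark session; 'my_table' is a hypothetical table name used purely as an example:

from pyspark.sql import SparkSession

# enableHiveSupport() switches on Hive integration for this session
spark = (SparkSession.builder
         .appName('HiveTest')
         .enableHiveSupport()
         .getOrCreate())

# With Hive support enabled, SQL-like queries can be run against registered tables
df = spark.sql("SELECT * FROM my_table LIMIT 10")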
Edit: try the following to create a new Spark session. I don't mean to be rude, but I suggest you follow one of the many PySpark tutorials in order to understand the basics of this "world". Everything will become much clearer.
import findspark
findspark.init()  # locate the Spark installation before importing pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()
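With the session in place, the original read can then be retried (assuming your Spark build ships with ORC support):

df = spark.read.orc('someFile.snappy.orc')
df.show(5)  # print the first rows to verify the contents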