How to load an XML file into Hive

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/20852166/

Date: 2020-09-06 11:58:07  Source: igfitidea

Tags: xml, hadoop, hive

Asked by backtrack

I'm working with Hive tables and have run into the following problem. I have more than 1 billion XML files in HDFS. Each XML file has 4 different sections, and I want to split every file and load each section into its own table.

Example:

            <?xml version='1.0' encoding='iso-8859-1'?>

            <section1>
                <id> 1233222 </id>
                // contains many XML tags
            </section1>

            <section2>
                // contains many XML tags
            </section2>

            <section3>
                // contains many XML tags
            </section3>

            <section4>
                // contains many XML tags
            </section4>

            </xml>

I have four tables:

        section1Table

        id       section1    // fields 

        section2Table

        id       section2

        section3Table 

        id       section3

        section4Table

        id       section4

Now I want to split the data and load it into each table.

How can I achieve this? Can anyone help me?

Thanks

UPDATE

I have tried the following:

CREATE EXTERNAL TABLE test(name STRING) LOCATION '/user/sornalingam/zipped/output/Tagged/t1';

SELECT xpath(name, '//section1') FROM test LIMIT 1;

But I got the following error:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"name":"<?xml version='1.0' encoding='iso-8859-1'?>"}

Answered by Vidya

You have several options:

  • Load the XML into a Hive table with a string column, one document per row (e.g. CREATE TABLE xmlfiles (id int, xmlfile string)). Then use an XPath UDF to work on the XML.
  • Since you know the XPaths of what you want (e.g. //section1), follow the instructions in the second half of this tutorial to ingest directly into Hive via XPath.
  • Map your XML to Avro as described here, because a SerDe exists for seamless Avro-to-Hive mapping.
  • Use XPath to store your data in a regular text file in HDFS and then ingest that into Hive.

It depends on your level of experience and comfort with these approaches.

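The last option above (using XPath to write plain text that Hive can ingest) can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes each file is well-formed with a single root element wrapping the four sections, and that the id is stored at section1/id. The function name split_sections is made up for illustration. It relies on the limited XPath support in Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Section names taken from the question's example.
SECTIONS = ("section1", "section2", "section3", "section4")


def split_sections(xml_text):
    """Return {section_name: "id<TAB>section_xml"} for one document.

    Assumption: one root element wraps the four sections, and the
    document id is found at section1/id.
    """
    root = ET.fromstring(xml_text)
    doc_id = root.findtext("section1/id", default="").strip()
    rows = {}
    for name in SECTIONS:
        node = root.find(name)
        if node is None:
            # Section missing from this document: emit no row for it.
            continue
        # Serialize the section, then collapse whitespace runs so the
        # whole section fits on a single line of a text file.
        one_line = " ".join(ET.tostring(node, encoding="unicode").split())
        rows[name] = f"{doc_id}\t{one_line}"
    return rows
```

Each value is a tab-separated line (the id, then the section's XML on one line) that could be appended to a per-table text file in HDFS and exposed through a table with two matching string columns.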

Answered by Sweety

Use this:

CREATE EXTERNAL TABLE test(name STRING) LOCATION '/user/sornalingam/zipped/output/Tagged/t1'
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="1");

Then use the xpath function.
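Hive's xpath functions are evaluated one row at a time, and with the default text storage each input line becomes its own row. That is consistent with the error in the question, where the failing row holds only the XML declaration. One possible preparation step, sketched here in Python, is to collapse each document onto a single line before loading it. The sketch assumes the declaration sits on its own line, as in the question's example, and that joining lines with a space does not matter for the data. The name flatten_xml is hypothetical.

```python
def flatten_xml(xml_text):
    """Collapse one XML document onto a single line.

    Drops a <?xml ...?> declaration that sits on its own line and joins
    the remaining non-empty lines with single spaces, so a
    newline-delimited text table sees the document as one row.
    """
    lines = [line.strip() for line in xml_text.splitlines()]
    kept = [line for line in lines if line and not line.startswith("<?xml")]
    return " ".join(kept)
```

With one document per line, an expression such as xpath(name, '//section1') would be applied to the whole document rather than to a single fragment of it.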