Import XML files to PostgreSQL

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19007884/

Time: 2020-09-06 14:11:20  Source: igfitidea

xml, bash, postgresql

Asked by Tomas Greif

I have a lot of XML files that I would like to import into the table xml_data:

create table xml_data(result xml);

To do this I have a simple bash script with a loop:

#!/bin/sh
# Attempt: load each XML file with \copy (this imports every line as its own row)
for f in /folder/with/xml/files/*.xml
do
  psql -d mydb -h myhost -U usr -c "\copy xml_data from '$f'"
done

However, this tries to import each line of every file as a separate row, which leads to this error:

ERROR:  invalid XML content
CONTEXT:  COPY address_results, line 1, column result: "<?xml version="1.0" encoding="UTF-8"?>"

I understand why it fails, but I cannot figure out how to make \copy import the whole file at once into a single row.

Accepted answer by Erwin Brandstetter

I would try a different approach: read the XML file directly into a variable inside a plpgsql function and proceed from there. That should be a lot faster and a lot more robust.

CREATE OR REPLACE FUNCTION f_sync_from_xml()
  RETURNS boolean AS
$BODY$
DECLARE
    myxml    xml;
    datafile text := 'path/to/my_file.xml';
BEGIN
   myxml := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.

   CREATE TEMP TABLE tmp AS
   SELECT (xpath('//some_id/text()', x))[1]::text AS id
   FROM   unnest(xpath('/xml/path/to/datum', myxml)) x;
   ...

You need superuser privileges, and the file must be local to the DB server, in an accessible directory.
The original answer links to a complete code example with more explanation.
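
Where superuser rights or a server-local file are not an option, a client-side variant may be worth trying. The following is only a sketch and not the method from the answer above: it relies on psql's backtick substitution in \set and on :'name' quoted interpolation, so the file is read on the client and inserted as a single xml value. The function names build_import_script and import_all, and the path example.xml, are made up for illustration; the table xml_data and the connection options are taken from the question.

```shell
#!/bin/sh
# build_import_script (hypothetical helper): print a two-line psql script
# that loads one whole file into one row of xml_data.
build_import_script() {
  # Line 1: psql runs the backquoted command on the client and stores its
  # output in the psql variable "content".
  printf '\\set content `cat "%s"`\n' "$1"
  # Line 2: :'content' is expanded by psql as a properly quoted SQL literal.
  printf '%s\n' "INSERT INTO xml_data(result) VALUES (:'content'::xml);"
}

# import_all (hypothetical helper): feed one generated script per file to psql.
# Paths containing double quotes or backquotes would break the generated line.
import_all() {
  for f in "$1"/*.xml; do
    [ -e "$f" ] || continue
    build_import_script "$f" | psql -d mydb -h myhost -U usr
  done
}

# Show the script that would be sent for one (assumed) file:
build_import_script /folder/with/xml/files/example.xml
```

A call such as import_all /folder/with/xml/files would then run one INSERT per file. The script is piped on standard input because psql's -c option accepts only plain SQL or a single backslash command, so variable interpolation would not happen there.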

Answer by Stefan Steiger

Necromancing: For those that need a working example:

DO $$
   DECLARE myxml xml;
BEGIN

myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'));

DROP TABLE IF EXISTS mytable;
CREATE TEMP TABLE mytable AS 

SELECT 
     (xpath('//ID/text()', x))[1]::text AS id
    ,(xpath('//Name/text()', x))[1]::text AS Name 
    ,(xpath('//RFC/text()', x))[1]::text AS RFC
    ,(xpath('//Text/text()', x))[1]::text AS Text
    ,(xpath('//Desc/text()', x))[1]::text AS Desc
FROM unnest(xpath('//record', myxml)) x
;

END$$;


SELECT * FROM mytable;

Or with less noise

SELECT 
     (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
    ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
    ,myTempTable.myXmlColumn as myXmlElement
FROM unnest(
    xpath
    (    '//record'
        ,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'))
    )
) AS myTempTable(myXmlColumn)
;

With this example XML file (MyData.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
    <record>
        <ID>1</ID>
        <Name>A</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Address record</Text>
        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
    </record>
    <record>
        <ID>2</ID>
        <Name>NS</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Name server record</Text>
        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
    </record>
</data-set>

Note:
MyData.xml needs to be in the PG_Data directory (the parent directory of the pg_stat directory),
e.g. /var/lib/postgresql/9.3/main/MyData.xml
This requires PostgreSQL 9.1+

You can also achieve this without a file, like this:

SELECT 
     (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
    ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
    ,myTempTable.myXmlColumn as myXmlElement 
    -- Source: https://en.wikipedia.org/wiki/List_of_DNS_record_types
FROM unnest(xpath('//record', 
 CAST('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
    <record>
        <ID>1</ID>
        <Name>A</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Address record</Text>
        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
    </record>
    <record>
        <ID>2</ID>
        <Name>NS</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Name server record</Text>
        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
    </record>
</data-set>
' AS xml)   
)) AS myTempTable(myXmlColumn)
;

Note that unlike in MS-SQL, xpath text() returns NULL on a NULL value, and not an empty string.
If for whatever reason you need to explicitly check for the existence of NULL, you can use [not(@xsi:nil="true")], to which you need to pass an array of namespaces, because otherwise, you get an error (however, you can omit all namespaces but xsi).

SELECT 
     (xpath('//xmlEncodeTest[1]/text()', myTempTable.myXmlColumn))[1]::text AS c1

    ,(
    xpath('//xmlEncodeTest[1][not(@xsi:nil="true")]/text()', myTempTable.myXmlColumn
    ,
    ARRAY[
        -- ARRAY['xmlns','http://www.w3.org/1999/xhtml'], -- defaultns
        ARRAY['xsi','http://www.w3.org/2001/XMLSchema-instance'],
        ARRAY['xsd','http://www.w3.org/2001/XMLSchema'],        
        ARRAY['svg','http://www.w3.org/2000/svg'],
        ARRAY['xsl','http://www.w3.org/1999/XSL/Transform']
    ]
    )
    )[1]::text AS c22


    ,(xpath('//nixda[1]/text()', myTempTable.myXmlColumn))[1]::text AS c2 
    --,myTempTable.myXmlColumn as myXmlElement
    ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e
    ,xmlexists('//nixda[1]' PASSING BY REF myTempTable.myXmlColumn) AS c2e
    ,xmlexists('//xmlEncodeTestAbc[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1ea
FROM unnest(xpath('//row', 
     CAST('<?xml version="1.0" encoding="utf-8"?>
    <table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <row>
        <xmlEncodeTest xsi:nil="true" />
        <nixda>noob</nixda>
      </row>
    </table>
    ' AS xml)   
    )
) AS myTempTable(myXmlColumn)
;

You can also check if a field is contained in an XML-text, by doing

 ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e

for example when you pass an XML-value to a stored-procedure/function for CRUD. (see above)

Also, note that the correct way to pass a null value in XML is <elementName xsi:nil="true" /> and not <elementName /> or nothing. There is no correct way to pass NULL in attributes (you can only omit the attribute, but then it becomes difficult and slow to infer the number of columns and their names in a large dataset).

e.g.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table>
    <row column1="a" column2="3" />
    <row column1="b" column2="4" column3="true" />
</table>

(This is more compact, but very bad if you need to import it, especially from XML files with multiple GB of data; see a wonderful example of that in the Stack Overflow data dump.)

SELECT 
     myTempTable.myXmlColumn
    ,(xpath('//@column1', myTempTable.myXmlColumn))[1]::text AS c1
    ,(xpath('//@column2', myTempTable.myXmlColumn))[1]::text AS c2
    ,(xpath('//@column3', myTempTable.myXmlColumn))[1]::text AS c3
    ,xmlexists('//@column3' PASSING BY REF myTempTable.myXmlColumn) AS c3e
    ,case when (xpath('//@column3', myTempTable.myXmlColumn))[1]::text is null then 1 else 0 end AS is_null 
FROM unnest(xpath('//row', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table>
    <row column1="a" column2="3" />
    <row column1="b" column2="4" column3="true" />
</table>'
))  AS myTempTable(myXmlColumn) 

Answer by Victoria Stuart

Extending @stefan-steiger's excellent answer, here is an example that extracts XML elements from child nodes that contain multiple siblings (e.g., multiple <synonym> elements for a particular <synonyms> parent node).

I encountered this issue with my data and searched quite a bit for a solution; his answer was the most helpful to me.

Example data file, hmdb_metabolites_test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<hmdb>
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <synonyms>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
  </synonyms>
</metabolite>
<metabolite>
  <accession>HMDB0000002</accession>
  <name>1,3-Diaminopropane</name>
  <synonyms>
    <synonym>1,3-Propanediamine</synonym>
    <synonym>1,3-Propylenediamine</synonym>
    <synonym>Propane-1,3-diamine</synonym>
    <synonym>1,3-diamino-N-Propane</synonym>
  </synonyms>
</metabolite>
<metabolite>
  <accession>HMDB0000005</accession>
  <name>2-Ketobutyric acid</name>
  <synonyms>
    <synonym>2-Ketobutanoic acid</synonym>
    <synonym>2-Oxobutyric acid</synonym>
    <synonym>3-Methyl pyruvic acid</synonym>
    <synonym>alpha-Ketobutyrate</synonym>
  </synonyms>
</metabolite>
</hmdb>

Aside: the original XML file had a URL in the document element

<hmdb xmlns="http://www.hmdb.ca">

that prevented xpath from parsing the data. The script will run (without error messages), but the relation/table is empty:

[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/sql/hmdb_test.sql
DO
 accession | name | synonym 
-----------+------+---------

Since the source file is 3.4GB, I decided to edit that line using sed:

sed -i '2s/.*hmdb xmlns.*/<hmdb>/' hmdb_metabolites.xml

[Adding the 2 (which instructs sed to edit "line 2") also, coincidentally in this instance, doubled the sed command execution speed.]
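
To make the effect of the line address concrete, here is a small sketch run against a stand-in file. The four-line sample is invented for illustration and is not the real 3.4GB data; the sketch writes to a new file instead of using -i so that it behaves the same with GNU and BSD sed.

```shell
#!/bin/sh
# Build a tiny stand-in for the real file (contents are illustrative only).
demo=$(mktemp)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
              '<hmdb xmlns="http://www.hmdb.ca">' \
              '<metabolite/>' \
              '</hmdb>' > "$demo"

# "2s/.../.../" attempts the substitution on line 2 only; every other line
# is copied through unchanged without the pattern being tried.
sed '2s/.*hmdb xmlns.*/<hmdb>/' "$demo" > "${demo}.fixed"

# Print line 2 of the result.
sed -n '2p' "${demo}.fixed"   # prints: <hmdb>
```

Note that sed still reads and writes the whole file; the address only limits where the substitution is attempted.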



My postgres data folder (PSQL: SHOW data_directory;) is

/mnt/Vancouver/Programming/RDB/postgres/postgres/data

so, using sudo, I needed to copy my XML data file there and chown it for use in PostgreSQL:

sudo chown postgres:postgres /mnt/Vancouver/Programming/RDB/postgres/postgres/data/hmdb_metabolites_test.xml


Script (hmdb_test.sql):

DO $$DECLARE myxml xml;

BEGIN

myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('hmdb_metabolites_test.xml'), 'UTF8'));

DROP TABLE IF EXISTS mytable;

-- CREATE TEMP TABLE mytable AS 
CREATE TABLE mytable AS 
SELECT 
    (xpath('//accession/text()', x))[1]::text AS accession
    ,(xpath('//name/text()', x))[1]::text AS name 
    -- The "synonym" child/subnode has many sibling elements, so we need to
    -- "unnest" them, otherwise we only retrieve the first synonym per record:
    ,unnest(xpath('//synonym/text()', x))::text AS synonym
FROM unnest(xpath('//metabolite', myxml)) x
;

END$$;

-- select * from mytable limit 5;
SELECT * FROM mytable;


Execution, output (in PSQL):

[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/hmdb_test.sql

  accession  |        name        |                         synonym                          
-------------+--------------------+----------------------------------------------------------
 HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid
 HMDB0000001 | 1-Methylhistidine  | 1-Methylhistidine
 HMDB0000001 | 1-Methylhistidine  | Pi-methylhistidine
 HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate
 HMDB0000002 | 1,3-Diaminopropane | 1,3-Propanediamine
 HMDB0000002 | 1,3-Diaminopropane | 1,3-Propylenediamine
 HMDB0000002 | 1,3-Diaminopropane | Propane-1,3-diamine
 HMDB0000002 | 1,3-Diaminopropane | 1,3-diamino-N-Propane
 HMDB0000005 | 2-Ketobutyric acid | 2-Ketobutanoic acid
 HMDB0000005 | 2-Ketobutyric acid | 2-Oxobutyric acid
 HMDB0000005 | 2-Ketobutyric acid | 3-Methyl pyruvic acid
 HMDB0000005 | 2-Ketobutyric acid | alpha-Ketobutyrate

[hmdb_test]#

Answer by Tomas Greif

I've used tr to replace all newlines with spaces. This creates an XML file with only one line, which I can easily import into one row using \copy.

Obviously, this is not a good idea if you have multi-line values in your XML. Fortunately, that is not my case.

To import all XML files in a folder you can use this bash script:

#!/bin/sh
# Flatten each XML file to a single line, then import it as one row.
for f in /folder/with/xml/files/*.xml
do
  tr '\n' ' ' < "$f" > temp.xml
  psql -d database -h localhost -U usr -c '\copy xml_data from temp.xml'
done
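
A slightly more defensive sketch of the same idea follows. It rests on two assumptions that are not part of the answer above: in COPY's default text format a tab is the column delimiter and a backslash starts an escape sequence, so both are neutralised before the import; and a temporary file from mktemp is used instead of a fixed temp.xml. The function names flatten_xml and import_dir are made up for illustration.

```shell
#!/bin/sh
# flatten_xml (hypothetical helper): write a one-line copy of an XML file to stdout.
flatten_xml() {
  # Newlines, carriage returns and tabs become spaces; a single trailing
  # newline is added back; backslashes are doubled for COPY's text format.
  { tr '\n\r\t' '   ' < "$1"; echo; } | sed 's/\\/\\\\/g'
}

# import_dir (hypothetical helper): import every *.xml file in a directory,
# one row per file. Connection options are the placeholders from the answer.
import_dir() {
  tmp=$(mktemp) || return 1
  for f in "$1"/*.xml; do
    [ -e "$f" ] || continue            # skip when the glob matches nothing
    flatten_xml "$f" > "$tmp"
    psql -d database -h localhost -U usr -c "\copy xml_data from '$tmp'"
  done
  rm -f "$tmp"
}
```

Calling import_dir /folder/with/xml/files would process the whole folder. As with the original script, whitespace inside multi-line values is altered, so this only suits data where that does not matter.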