R: convert XML data to data frame

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/33446888/

xml r dataframe

Asked by mapleleaf

For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things and searched the internet for ideas, but without success. Here is my code so far:

library(XML)
url <- 'http://www.ggobi.org/book/data/olive.xml'
doc <- xmlParse(url)
root <- xmlRoot(doc)

dataFrame <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
data.frame(t(dataFrame),row.names=NULL)

The output I get looks like one giant vector of numbers. I am trying to organize the data into a data frame, but I do not know how to adjust my code to get there.

Answered by hrbrmstr

It may not be as verbose as the XML package, but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws, which is a really recent addition to R core.

library(xml2)

pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")

# get all the <record>s
recs <- xml_find_all(pg, "//record")

# extract and clean all the columns
vals <- trimws(xml_text(recs))

# extract and clean (if needed) the area names
labs <- trimws(xml_attr(recs, "label"))

# mine the column names from the two variable descriptions
# this XPath construct lets us grab either the <categ…> or <real…> tags
# and then grabs the 'name' attribute of them
cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or
                                                      self::realvariable]"), "name")

# this converts each set of <record> columns to a data frame
# after first converting each row to numeric and assigning
# names to each column (making it easier to do the matrix to data frame conv)
dat <- do.call(rbind, lapply(strsplit(vals, "\ +"),
                                 function(x) {
                                   data.frame(rbind(setNames(as.numeric(x),cols)))
                                 }))

# then assign the area name column to the data frame
dat$area_name <- labs

head(dat)
##   region area palmitic palmitoleic stearic oleic linoleic linolenic
## 1      1    1     1075          75     226  7823      672        NA
## 2      1    1     1088          73     224  7709      781        31
## 3      1    1      911          54     246  8113      549        31
## 4      1    1      966          57     240  7952      619        50
## 5      1    1     1051          67     259  7771      672        50
## 6      1    1      911          49     268  7924      678        51
##   arachidic eicosenoic    area_name
## 1        60         29 North-Apulia
## 2        61         29 North-Apulia
## 3        63         29 North-Apulia
## 4        78         35 North-Apulia
## 5        80         46 North-Apulia
## 6        70         44 North-Apulia
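
For readers who want to see the same steps outside R, here is a minimal sketch in Python using only the standard library. It is not the answer's method, just an illustration of the idea: read the column names from the variable descriptions, split each record's text on whitespace, and attach the label. The inline document is a hypothetical miniature of olive.xml; the root element name `ggobidata`, the sample values, and the helper names `to_number` and `records_to_rows` are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the olive.xml layout. Element names follow the
# XPath expressions used above; the root name and the values are assumptions.
SAMPLE = """
<ggobidata>
  <data>
    <variables>
      <categoricalvariable name="region"/>
      <categoricalvariable name="area"/>
      <realvariable name="palmitic"/>
      <realvariable name="linolenic"/>
    </variables>
    <records>
      <record label="North-Apulia"> 1 1 1075 na </record>
      <record label="North-Apulia"> 1 1 1088 31 </record>
    </records>
  </data>
</ggobidata>
"""


def to_number(token):
    """Return a float, or None for the 'na' placeholder."""
    return None if token.lower() == "na" else float(token)


def records_to_rows(xml_text):
    root = ET.fromstring(xml_text)
    # Column names come from the 'name' attribute of each variable description.
    cols = [v.get("name") for v in root.find("data/variables")]
    rows = []
    for rec in root.iter("record"):
        # Split on any run of whitespace so stray spaces do not shift columns.
        values = [to_number(tok) for tok in rec.text.split()]
        row = dict(zip(cols, values))
        row["area_name"] = rec.get("label").strip()
        rows.append(row)
    return rows


rows = records_to_rows(SAMPLE)
```

Each element of `rows` is a plain dict keyed by column name, so the list could be handed to any table constructor that accepts a list of records.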

UPDATE

I'd probably do the last bit this way now:

library(tidyverse)

strsplit(vals, "[[:space:]]+") %>% 
  # as_tibble() replaces the deprecated as_data_frame(); columns stay character here
  map_df(~as_tibble(as.list(setNames(., cols)))) %>% 
  mutate(area_name=labs)

Answered by Parfait

Great answers above! For future readers: any time you face a complex XML document that needs importing into R, consider restructuring it with XSLT (a special-purpose declarative programming language that transforms XML content for various end uses). Then simply use the xmlToDataFrame() function from R's XML package.

Unfortunately, R does not have a dedicated XSLT package on CRAN that works across all operating systems. The listed SXLT seems to be a Linux package that cannot be used on Windows. See the unanswered SO questions here and here. I understand @hrbrmstr (above) maintains a GitHub XSLT project. Nonetheless, nearly all general-purpose languages maintain XSLT processors, including Java, C#, Python, PHP, Perl, and VB.

Below is the open-source Python route. Because the XML document is pretty nuanced, two XSLTs are used (of course, XSLT gurus could combine them into one, but try as I might I couldn't get that to work).

FIRST XSLT (using a recursive template)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- Identity Transform -->    
<xsl:template match="node()|@*">
    <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="record/text()" name="tokenize">        
    <xsl:param name="text" select="."/>
    <xsl:param name="separator" select="' '"/>
    <xsl:choose>            
        <xsl:when test="not(contains($text, $separator))">                
            <data>
                <xsl:value-of select="normalize-space($text)"/>
            </data>              
        </xsl:when>
        <xsl:otherwise>
            <data>                  
                <xsl:value-of select="normalize-space(substring-before($text, $separator))"/>                  
            </data>                  
            <xsl:call-template name="tokenize">
                <xsl:with-param name="text" select="substring-after($text, $separator)"/>
            </xsl:call-template>                
        </xsl:otherwise>            
    </xsl:choose>        
</xsl:template>     

<!-- Drop the descriptive metadata nodes; only the records are kept -->
<xsl:template match="description|variables|categoricalvariable|realvariable">        
</xsl:template> 

</xsl:stylesheet>

SECOND XSLT


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Copy the <records> element and process its children -->    
    <xsl:template match="records">
        <xsl:copy>
           <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="record">
        <record>
            <area_name><xsl:value-of select="@label"/></area_name>
            <area><xsl:value-of select="data[1]"/></area>
            <region><xsl:value-of select="data[2]"/></region>
            <palmitic><xsl:value-of select="data[3]"/></palmitic>
            <palmitoleic><xsl:value-of select="data[4]"/></palmitoleic>
            <stearic><xsl:value-of select="data[5]"/></stearic>
            <oleic><xsl:value-of select="data[6]"/></oleic>
            <linoleic><xsl:value-of select="data[7]"/></linoleic>
            <linolenic><xsl:value-of select="data[8]"/></linolenic>
            <arachidic><xsl:value-of select="data[9]"/></arachidic>
            <eicosenoic><xsl:value-of select="data[10]"/></eicosenoic>
        </record>
   </xsl:template>         

</xsl:stylesheet>

Python (using lxml module)

import os

import lxml.etree as ET

cd = os.path.dirname(os.path.abspath(__file__))

# FIRST TRANSFORMATION
dom = ET.parse('http://www.ggobi.org/book/data/olive.xml')
xslt = ET.parse(os.path.join(cd, 'Olive.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)

xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()    

# SECOND TRANSFORMATION
dom = ET.parse(os.path.join(cd, 'Olive_py.xml'))
xslt = ET.parse(os.path.join(cd, 'Olive2.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)    

xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()

R


library(XML)

# LOADING TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("Olive_py.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
View(xmldf)

Output


area_name     area  region  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
North-Apulia  1     1       1075      75           226      7823   672       na                    60
North-Apulia  1     1       1088      73           224      7709   781       31         61         29
North-Apulia  1     1       911       54           246      8113   549       31         63         29
North-Apulia  1     1       966       57           240      7952   619       50         78         35
North-Apulia  1     1       1051      67           259      7771   672       50         80         46
   ...

(Slight cleanup of the very first record is needed: an extra space follows "na" in the XML document, so arachidic and eicosenoic were shifted forward.)
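
The shift described above comes from splitting on a single space: two adjacent spaces produce an empty token. A minimal Python sketch of the difference follows. The record string is an assumption that reuses the first row's values from the output, with a doubled space placed after "na" to mimic the stray space.

```python
import re

# Hypothetical text of the first <record>; note the two spaces after "na".
record_text = "1 1 1075 75 226 7823 672 na  60 29"

# Splitting on one space, as the recursive template does, yields an empty
# token where the doubled space sits, pushing later values one slot along.
single_space_tokens = record_text.split(" ")

# Splitting on runs of whitespace keeps the ten values in their own slots.
tokens = re.split(r"\s+", record_text.strip())

print(len(single_space_tokens), len(tokens))
```

Normalizing whitespace before tokenizing (or splitting on runs of whitespace, as the R answers on this page do) avoids the manual cleanup.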

Answered by Rich Scriven

Here's what I came up with. It matches the olive oil csv file that is also available on the same page. They show X as the first column name, but I don't see it in the XML, so I just added it manually.

It will probably be best to break it up into sections, then assemble the final data frame once we've got all the parts. We can also use the [.XML* shortcuts for XPath, and the other [[ convenience accessor functions.

library(XML)
url <- "http://www.ggobi.org/book/data/olive.xml"

## parse the xml document and get the top-level XML node
doc <- xmlParse(url)
top <- xmlRoot(doc)

## create the data frame
df <- cbind(
    ## get all the labels for the first column (groups)
    X = unlist(doc["//record//@label"], use.names = FALSE), 
    read.table(
        ## get all the records as a character vector
        text = xmlValue(top[["data"]][["records"]]), 
        ## get the column names from 'variables'
        col.names = xmlSApply(top[["data"]][["variables"]], xmlGetAttr, "name"), 
        ## assign the NA values to 'na' in the records
        na.strings = "na"
    )
)

## result
head(df)
#              X region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
# 1 North-Apulia      1    1     1075          75     226  7823      672        NA        60         29
# 2 North-Apulia      1    1     1088          73     224  7709      781        31        61         29
# 3 North-Apulia      1    1      911          54     246  8113      549        31        63         29
# 4 North-Apulia      1    1      966          57     240  7952      619        50        78         35
# 5 North-Apulia      1    1     1051          67     259  7771      672        50        80         46
# 6 North-Apulia      1    1      911          49     268  7924      678        51        70         44

## clean up
free(doc); rm(doc, top); gc()

Answered by Holger Brandl

For me the canonical answer is


doc<-xmlParse("Olive_py.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))

which is somewhat buried in @Parfait's answer.

However, this will fail if some of the nodes have multiple child nodes of the same type. In such cases an extractor function solves the problem:

example data


<?xml version="1.0" encoding="UTF-8"?>
<testrun duration="25740" footerText="Generated by IntelliJ IDEA on 11/20/19, 9:21 PM" name="All in foo">
    <suite duration="274" locationUrl="java:suite://com.foo.bar.LoadBla" name="LoadBla"
           status="passed">
        <test duration="274" locationUrl="java:test://com.foo.bar.LoadBla/testReadWrite"
              name="LoadBla.testReadWrite" status="passed">
            <output type="stdout">ispsum ..</output>
        </test>
    </suite>
    <suite duration="9298" locationUrl="java:suite://com.foo.bar.TestFooSearch" name="TestFooSearch"
           status="passed">
        <test duration="7207" locationUrl="java:test://com.foo.bar.TestFooSearch/TestFooSearch"
              name="TestFooSearch.TestFooSearch" status="passed">
            <output type="stdout"/>
        </test>
        <test duration="2091" locationUrl="java:test://com.foo.bar.TestFooSearch/testSameSearch"
              name="TestFooSearch.testSameSearch" status="passed"/>
    </suite>
</testrun>

code


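require(XML)
require(tidyr)
require(dplyr)

# path to a file holding the example data above (placeholder file name)
xmlFile <- "testrun.xml"

node2df <- function(node){
    # (Optionally) read out properties of some optional child node
    outputNodes = getNodeSet(node, "output")
    stdout = if (length(outputNodes) > 0) xmlValue(outputNodes[[1]]) else NA_character_

    # Turn a named vector into a two-column (name, value) data frame;
    # tibble() and base setNames() replace the deprecated data_frame() and
    # set_names(), which none of the loaded packages export
    vec_as_df <- function(namedVec, row_name="name", value_name="value"){
        tibble(name = names(namedVec), value = namedVec) %>% setNames(c(row_name, value_name))
    }

    # Extract all node attributes, one column per attribute
    node %>%
        xmlAttrs %>%
        vec_as_df %>%
        pivot_wider(names_from = name, values_from = value) %>%
        mutate(stdout = stdout)
}

# Apply the extractor to every <test> node and stack the one-row results
testResults = xmlParse(xmlFile) %>%
    getNodeSet("/testrun/suite/test", fun = node2df) %>%
    bind_rows()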
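
The extractor idea is not specific to R. As a rough illustration only, the sketch below does the same thing with Python's standard library: one row per test node, built from its attributes plus the text of an optional child. The sample is a trimmed copy of the example data above with some attributes dropped for brevity, and the helper name `node_to_row` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example data above (some attributes omitted).
SAMPLE = """<testrun name="All in foo">
  <suite name="LoadBla" status="passed">
    <test duration="274" name="LoadBla.testReadWrite" status="passed">
      <output type="stdout">ispsum ..</output>
    </test>
  </suite>
  <suite name="TestFooSearch" status="passed">
    <test duration="7207" name="TestFooSearch.TestFooSearch" status="passed">
      <output type="stdout"/>
    </test>
    <test duration="2091" name="TestFooSearch.testSameSearch" status="passed"/>
  </suite>
</testrun>"""


def node_to_row(node):
    """Build one row per <test>: its attributes plus the first <output> text, if any."""
    row = dict(node.attrib)
    output = node.find("output")
    row["stdout"] = output.text if output is not None else None
    return row


root = ET.fromstring(SAMPLE)
rows = [node_to_row(test) for test in root.findall("suite/test")]
```

Because each row is built explicitly, a missing or repeated child node never changes the set of columns, which is the property the extractor function is meant to provide.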