How to transform XML data into a data.frame?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/2067098/

Date: 2020-09-06 12:55:49  Source: igfitidea

xml r dataframe

Asked by larus

I'm trying to learn R's XML package. I'm trying to create a data.frame from the books.xml sample XML data file. Here's what I get:

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)

None of these xpathSApply calls gets me even close to what I intend. How should one proceed toward a well-formed data.frame?

Answered by Shane

Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that this will actually be fairly tricky because the data isn't well structured to begin with.
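
For contrast, here is a minimal sketch of the case where xmlToDataFrame() tends to work well: a flat document in which every record has the same simple child elements. The inline XML below is a made-up sample (not the real books.xml), and the variable names are only illustrative.

```r
library(XML)

# Hypothetical flat sample: each <book> has the same simple children,
# no attributes and no repeated elements.
flat_xml <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title><year>2005</year><price>30.00</price></book>",
  "<book><title>Learning XML</title><year>2003</year><price>39.95</price></book>",
  "</bookstore>"
)

# Parse the string itself rather than treating it as a file name or URL.
doc <- xmlParse(flat_xml, asText = TRUE)

# Each child of the root becomes a row, each grandchild a column.
flat_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(flat_df)
```

The real books.xml differs from this sample in that it has attributes and repeated author elements, which is what makes the direct route awkward.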

I would recommend working with this function:

xmlToList(books)

One problem is that there are multiple authors per book, so you will need to decide how to handle that when you're structuring your data frame.
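
To see the multiple-author problem concretely, it can help to inspect what xmlToList() returns. The sketch below uses a small made-up inline sample shaped like books.xml (an assumption, so that it runs without network access) and counts the author elements per book.

```r
library(XML)

# Hypothetical inline sample shaped like books.xml (no whitespace between tags).
sample_xml <- paste0(
  '<bookstore>',
  '<book category="WEB"><title lang="en">Learning XML</title>',
  '<author>Erik T. Ray</author><year>2003</year><price>39.95</price></book>',
  '<book category="WEB"><title lang="en">XQuery Kick Start</title>',
  '<author>James McGovern</author><author>Per Bothner</author>',
  '<year>2003</year><price>49.99</price></book>',
  '</bookstore>'
)

doc <- xmlParse(sample_xml, asText = TRUE)
book_list <- xmlToList(xmlRoot(doc))

# Count the author elements in each book; a count above 1 marks a jagged record.
author_counts <- vapply(book_list, function(b) sum(names(b) == "author"), integer(1))
print(author_counts)

# Look at the nested structure of the multi-author book.
str(book_list[[2]])
```

Books with more than one author carry several list elements that all share the name "author", which is why the records do not line up into equal-width rows.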

Once you have decided what to do with the multiple-authors issue, it's fairly straightforward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
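
As one possible way to handle the authors, the sketch below collapses all authors of a book into a single string and then uses the lapply plus do.call("rbind", ...) route in base R. The list is hand-built to mimic the shape of the xmlToList() output (an assumption, so the example runs without the XML package), and book_to_row is a hypothetical helper name.

```r
# Hypothetical list shaped like the xmlToList() output for two books.
book_list <- list(
  book = list(title = list(text = "Learning XML", .attrs = c(lang = "en")),
              author = "Erik T. Ray", year = "2003", price = "39.95",
              .attrs = c(category = "WEB")),
  book = list(title = list(text = "XQuery Kick Start", .attrs = c(lang = "en")),
              author = "James McGovern", author = "Per Bothner",
              year = "2003", price = "49.99",
              .attrs = c(category = "WEB"))
)

# Turn one book into a one-row data frame, collapsing all authors into one string.
book_to_row <- function(b) {
  authors <- unlist(b[names(b) == "author"], use.names = FALSE)
  data.frame(
    title    = b$title$text,
    authors  = paste(authors, collapse = "; "),
    year     = as.integer(b$year),
    price    = as.numeric(b$price),
    category = unname(b$.attrs["category"]),
    stringsAsFactors = FALSE
  )
}

# Every row has the same columns, so plain rbind is enough here.
books_df <- do.call("rbind", lapply(unname(book_list), book_to_row))
print(books_df)
```

Collapsing into one column keeps one row per book; other choices, such as one row per book-author pair, would need a different helper.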

Here's a complete example (excluding author):

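library(XML)
library(plyr)
# Same sample file as in the question
books <- "http://www.w3schools.com/XQuery/books.xml"
# One row per book, dropping the (possibly repeated) author elements
ldply(xmlToList(books), function(x) data.frame(x[names(x) != "author"]))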

   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB

Here's what it looks like with author included. You need to use ldply in this instance since the list is "jagged"; lapply can't handle that properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr automatically does it for you?]:

ldply(xmlToList(books), data.frame)

   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
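
The lapply with rbind.fill alternative mentioned above could look roughly like the sketch below. It uses a simplified, hand-built list rather than the real xmlToList() output (an assumption, to keep the example small), with one record that has an extra author element.

```r
library(plyr)

# Hypothetical simplified records; the second one has two author elements.
book_list <- list(
  list(title = "Learning XML", author = "Erik T. Ray", year = "2003"),
  list(title = "XQuery Kick Start", author = "James McGovern",
       author = "Per Bothner", year = "2003")
)

# data.frame() makes the repeated name unique (author, author.1),
# so the per-book data frames end up with different columns.
rows <- lapply(book_list, data.frame, stringsAsFactors = FALSE)

# rbind.fill() binds the rows and fills columns missing from a record with NA.
jagged_df <- rbind.fill(rows)
print(jagged_df)
```

Plain do.call("rbind", rows) would fail here because the column sets differ, which is the "jagged" problem described above.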