How do I convert XML data to a data.frame?

Question

Asked by larus

I'm trying to learn R's XML package by creating a data.frame from the books.xml sample XML data file. Here's what I have so far:

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)

None of these xpathApply/xpathSApply calls gets me close to what I want. How should I proceed toward a well-formed data.frame?

Answer 1

Answered by Shane

Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that will actually be fairly tricky here because the data isn't well structured to begin with.

I would recommend working with this function:

xmlToList(books)

One problem is that some books have multiple authors, so you will need to decide how to handle that when you're structuring your data frame.

Once you have decided what to do about the multiple-authors issue, it's fairly straightforward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
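
As one possible way to settle the multiple-authors decision, the authors of each book could be collapsed into a single string, after which every book yields a one-row data.frame and base R's lapply plus do.call("rbind", ...) is enough. This is only a sketch: the inline XML is a shortened stand-in for books.xml, and book_to_row, the "; " separator, and the column names are hypothetical choices, not part of the answer above.

```r
library(XML)

# Shortened inline stand-in for books.xml (assumption: same element layout)
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: turn one <book> node into a one-row data.frame,
# joining however many <author> children it has into one string.
book_to_row <- function(node) {
  data.frame(
    category = xmlGetAttr(node, "category"),
    title    = xpathSApply(node, "./title", xmlValue),
    authors  = paste(xpathSApply(node, "./author", xmlValue), collapse = "; "),
    year     = as.integer(xpathSApply(node, "./year", xmlValue)),
    price    = as.numeric(xpathSApply(node, "./price", xmlValue)),
    stringsAsFactors = FALSE
  )
}

# Every row has the same columns, so a plain rbind is sufficient
rows <- lapply(getNodeSet(doc, "//book"), book_to_row)
books_df <- do.call("rbind", rows)
books_df
```

Collapsing keeps one row per book with a fixed set of columns; the trade-off is that individual authors have to be split back out of the string if they are needed separately.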

Here's a complete example (excluding author):

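library(XML)
library(plyr)
# Same sample file as in the question
books <- "http://www.w3schools.com/XQuery/books.xml"
# Drop the author entries from each book, then bind the rows
ldply(xmlToList(books), function(x) data.frame(x[names(x) != "author"]))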

   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB

Here's what it looks like with author included. You need to use ldply in this instance because the list is "jagged", and lapply with a plain rbind can't handle that properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr does it for you automatically?]:

ldply(xmlToList(books), data.frame)

   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
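
For comparison, the lapply with rbind.fill route mentioned in the bracketed note might look roughly like the sketch below. The inline XML is a shortened stand-in for books.xml (an assumption made so the example runs without network access), and the variable names are illustrative only.

```r
library(XML)
library(plyr)

# Shortened inline stand-in for books.xml (assumption: same element layout)
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

book_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Each book becomes a one-row data.frame; the second book gets an extra
# author.1 column, so the frames are "jagged".
frames <- lapply(book_list, data.frame)

# rbind.fill pads the columns a frame lacks with NA instead of failing
books_df <- rbind.fill(frames)
books_df
```

Unlike ldply, this version does not add the .id column, since rbind.fill does not record the list names.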
