pandas: extracting XML into a DataFrame with parent attributes as column headers

Question

Asked by jessi

I have thousands of XML files to process. They share a similar format, but the parent names and the number of parents differ. Through books, Google, tutorials, and just trying out code, I've been able to pull out all of this data. See, for example: "Parsing xml to pandas data frame throws memory error" and "Dynamic search through xml attributes using lxml and xpath in python".

However, I realized that I was extracting the data poorly, with a child "Time" repeated for each parent.

Here is what I am trying to get.

Time   blah   abc
1200   100   2
1300   30    4
1400   70    2

Here is what I know how to get, but my current method is clunky (I'll show it below the example XML).

    child      Time   grandchild
0     blah     1200    100
1     blah     1300    30
...
n-2   abc      1200    2
n-1   abc      1300    4
n     abc      1400    2

Example XML format

<outer>
   <inner>
      <parent name = "blah" id = "1"> 
         <child Time = "1200"> 
            <grandchild>100</grandchild>  
         </child>
         <child Time = "1300">
            <grandchild>30</grandchild>
         </child>
         <child Time = "1400">
            <grandchild>70</grandchild>
         </child>
      </parent>
      <parent name = "abc" id = "2"> 
         <child Time = "1200">   
            <grandchild>2</grandchild> 
         </child>
         <child Time = "1300">
            <grandchild>4</grandchild>
         </child>
         <child Time = "1400">
            <grandchild>2</grandchild>
         </child>
      </parent>      
      <parent name = "1234" id = "7734"> 
         <other> 12 </other>
      </parent> 
   </inner>
</outer>

Here is how I can get my output:

from lxml import etree
from pandas import DataFrame

# root is assumed to be the parsed document, e.g. etree.parse(file_name).getroot()
dTime = []
dparent = []
dgrandchild = []
# Adjust the XPath depth to the real files; the example XML above would need '/*/*/parent/child'
for child in root.xpath('/*/*/*/parent/child'):
    dparent.append(child.getparent().attrib['name'])
    # Iterate over the attributes of this child (only Time in the example)
    for attrib in child.attrib:
        dTime.append(child.attrib[attrib])
    # grandchild elements are the children of each child (Time) element
    for subfield in child.getchildren():
        dgrandchild.append(subfield.text)
df = DataFrame({'Parent': dparent, 'Time': dTime, 'grandchild': dgrandchild})

I could just take this output and re-shape it, but that seems inefficient and a very clunky approach.

I think I need something of the flavor:

# this does not work: it yields one row per (parent, Time) pair instead of one row per Time
data = []
for elem in root.xpath('/*/*/*/parent/child'):
    elem_data = {}
    for attrib in elem.attrib:
        elem_data['Time'] = elem.attrib[attrib]
    for child in elem.getchildren():
        elem_data[elem.getparent().attrib['name']] = child.text
        data.append(elem_data)
ndata = DataFrame(data)

Answer 1

Accepted answer by Andy Hayden

I recommend just parsing to a DataFrame first, similar to what you are already doing (see below for my implementation), and then reshaping it to your requirements.

Then you're looking for a pivot:

In [11]: df
Out[11]:
  child  Time  grandchild
0  blah  1200         100
1  blah  1300          30
2   abc  1200           2
3   abc  1300           4
4   abc  1400           2

In [12]: df.pivot(index='Time', columns='child', values='grandchild')
Out[12]:
child  abc  blah
Time
1200     2   100
1300     4    30
1400     2   NaN
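As a minimal, self-contained sketch of the pivot step, the long-format frame from `In [11]` can be rebuilt by hand and reshaped. The variable name `wide` is only illustrative. Keyword arguments are used because newer pandas releases expect `index`, `columns` and `values` to be passed by name.

```python
import pandas as pd

# Long-format data copied from the In [11] frame above:
# one row per (child, Time) pair.
df = pd.DataFrame({
    "child": ["blah", "blah", "abc", "abc", "abc"],
    "Time": [1200, 1300, 1200, 1300, 1400],
    "grandchild": [100, 30, 2, 4, 2],
})

# Time becomes the index, each distinct child name becomes a column,
# and the grandchild values fill the cells.
wide = df.pivot(index="Time", columns="child", values="grandchild")
print(wide)
```

A (Time, child) combination that is missing from the long frame, such as blah at 1400 here, shows up as NaN in the wide frame.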

I recommend first parsing from a file and pulling the things you want out into a list of tuples:

from lxml import etree
root = etree.parse(file_name).getroot()  # parse() returns an ElementTree; getroot() gives the <outer> element

parents = root.getchildren()[0].getchildren()

In [21]: elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                      for p in parents
                      for c in p
                      for gc in c]

In [22]: elems
Out[22]:
[('blah', 1200, 100),
 ('blah', 1300, 30),
 ('blah', 1400, 70),
 ('abc', 1200, 2),
 ('abc', 1300, 4),
 ('abc', 1400, 2)]
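The extraction can also be tried without any files on disk. The sketch below is an assumption-laden variant, not the code above: it swaps lxml for the standard library's `xml.etree.ElementTree` so that it has no extra dependency (element iteration, `.get` and `.text` behave the same way in lxml), it uses a shortened copy of the example XML, and the names `XML` and `extract_rows` are hypothetical. It looks elements up by tag name, so a parent without `child` elements (like `1234`) is skipped explicitly rather than by accident.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document from the question.
XML = """<outer>
   <inner>
      <parent name="blah" id="1">
         <child Time="1200"><grandchild>100</grandchild></child>
         <child Time="1300"><grandchild>30</grandchild></child>
      </parent>
      <parent name="1234" id="7734">
         <other> 12 </other>
      </parent>
   </inner>
</outer>"""


def extract_rows(root):
    """Return (parent name, Time, grandchild value) tuples."""
    rows = []
    for parent in root.iter("parent"):
        # findall("child") ignores other tags such as <other>.
        for child in parent.findall("child"):
            for gc in child.findall("grandchild"):
                rows.append(
                    (parent.get("name"), int(child.get("Time")), int(gc.text))
                )
    return rows


root = ET.fromstring(XML)
rows = extract_rows(root)
print(rows)
```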

For multiple files you could just whack it into an even longer list comprehension, which shouldn't be too slow unless you have a huge number of XML files (here files is the list of XML file paths):

elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
            for f in files
            for p in etree.parse(f).getroot().getchildren()[0].getchildren()
            for c in p
            for gc in c]

Put them in a DataFrame:

In [23]: pd.DataFrame(elems, columns=['child', 'Time', 'grandchild'])
Out[23]:
  child  Time grandchild
0  blah  1200        100
1  blah  1300         30
2  blah  1400         70
3   abc  1200          2
4   abc  1300          4
5   abc  1400          2

Then do the pivot. :)
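Putting the steps together for several inputs might look like the sketch below. It is one possible arrangement under stated assumptions, not the answer's exact code: it uses the standard library's `xml.etree.ElementTree` in place of lxml, stands in two in-memory documents for real file paths, and the helper names `rows_from_source` and `build_table` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd


def rows_from_source(source):
    """Yield (parent name, Time, grandchild value) for one XML file or file object."""
    root = ET.parse(source).getroot()
    for parent in root.iter("parent"):
        for child in parent.findall("child"):
            for gc in child.findall("grandchild"):
                yield parent.get("name"), int(child.get("Time")), int(gc.text)


def build_table(sources):
    """Collect rows from all sources, then pivot so each parent name is a column."""
    rows = [row for src in sources for row in rows_from_source(src)]
    df = pd.DataFrame(rows, columns=["child", "Time", "grandchild"])
    return df.pivot(index="Time", columns="child", values="grandchild")


# Two small in-memory documents stand in for real files (assumption for the demo).
doc_a = io.StringIO(
    '<outer><inner><parent name="blah" id="1">'
    '<child Time="1200"><grandchild>100</grandchild></child>'
    '<child Time="1300"><grandchild>30</grandchild></child>'
    "</parent></inner></outer>"
)
doc_b = io.StringIO(
    '<outer><inner><parent name="abc" id="2">'
    '<child Time="1200"><grandchild>2</grandchild></child>'
    '<child Time="1300"><grandchild>4</grandchild></child>'
    "</parent></inner></outer>"
)

table = build_table([doc_a, doc_b])
print(table)
```

One caveat: if the same parent name and Time occur in more than one file, `pivot` raises an error about duplicate entries. In that case `pivot_table` with an explicit aggregation function is an option.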
