Python: Extracting XML to DataFrame (Pandas)

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/50774222/

Date: 2020-09-14 05:40:35  Source: igfitidea

Tags: python, xml, pandas, dataframe

Asked by jabba

I have an XML file that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>

What I'm trying to do is extract the Id, Text and CreationDate columns into a pandas DataFrame, and I've tried the following:

import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)

root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
    ID = row.find('Id')
    text = row.find('Text')
    date = row.find('CreationDate')
    print(ID, text, date)
    df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)

print(df_xml)

But the output is: None None None

Could you please tell me how to fix this? Thanks.

Accepted answer by Parfait

As advised in this solution by gold member and Python/pandas/numpy guru @unutbu:

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

Therefore, consider parsing your XML data into a separate list, then pass that list into the DataFrame constructor in one call outside of any loop. In fact, you can build the nested list with a list comprehension and pass it directly into the constructor:

import xml.etree.ElementTree as et
import pandas as pd

path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']

root = et.parse(path)
rows = root.findall('.//row')

# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')] 
            for row in rows]

df_xml = pd.DataFrame(xml_data, columns=dfcols)

print(df_xml)

#   ID   Text             CreationDate
# 0  1  (...)  2011-08-30T21:15:28.063
# 1  2  (...)  2011-08-30T21:24:56.573
# 2  3  (...)                     None
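
As a follow-up to the listing above, here is a minimal sketch of the same build-once pattern that also converts the extracted strings to proper dtypes. It parses an inline copy of the question's sample data with `fromstring` so that it runs without a file. The names `XML`, `wanted` and `records` are made up for illustration, and the dtype conversion is an optional extra rather than part of the original answer.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Inline copy of the sample data from the question (assumption: same layout).
XML = """<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""

root = ET.fromstring(XML)
wanted = ['Id', 'Text', 'CreationDate']

# Build every record first; row.get() yields None for a missing attribute.
records = [{name: row.get(name) for name in wanted} for row in root.iter('row')]

# Construct the DataFrame once, outside of any loop.
df = pd.DataFrame(records, columns=wanted)

# XML attribute values are always strings, so convert where it helps.
df['Id'] = df['Id'].astype(int)
df['CreationDate'] = pd.to_datetime(df['CreationDate'])  # None becomes NaT

print(df.dtypes)
```

The third row has no CreationDate attribute, so after conversion that cell should hold NaT instead of None.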

Answer by Prany

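Just a minor change in your code: Id, Text and CreationDate are attributes of each row element, not child elements, so read them with row.get() instead of row.find():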

ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')
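
To see why the change matters, here is a small sketch comparing the two calls on a shortened, inline version of the question's data (the `XML` string and variable names are only for illustration). `find()` looks for a child element with the given tag, while `get()` reads an attribute of the element itself.

```python
import xml.etree.ElementTree as ET

# Shortened sample modeled on the question's file (assumption: same layout).
XML = """<comments>
<row Id="1" Text="(...)" CreationDate="2011-08-30T21:15:28.063" />
<row Id="3" Text="(...)" />
</comments>"""

root = ET.fromstring(XML)
first, second = root.findall('.//row')

# find() searches for a child element named 'Id'; <row> has no children.
found = first.find('Id')

# get() reads the attribute named 'Id' from the element itself.
attr = first.get('Id')

# get() returns None, or a supplied default, when the attribute is absent.
missing = second.get('CreationDate')
missing_default = second.get('CreationDate', 'n/a')

print(found, attr, missing, missing_default)
```

This is why the original loop printed None None None: every `find()` call was searching for child elements that do not exist.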

Answer by bcosta12

Based on @Parfait's solution, I wrote a version that takes the columns as a parameter and returns a pandas DataFrame.

test.xml:

<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>

xml_to_pandas.py:

'''XML to pandas DataFrame converter.'''

# cElementTree was removed in Python 3.9; ElementTree is the drop-in replacement.
import xml.etree.ElementTree as et
import pandas as pd


def xml_to_pandas(root, columns, row_name):
  '''get xml.etree root, the columns and return Pandas DataFrame'''
  df = None
  try:

    rows = root.findall('.//{}'.format(row_name))

    xml_data = [[row.get(c) for c in columns] for row in rows]  # NESTED LIST

    df = pd.DataFrame(xml_data, columns=columns)
  except Exception as e:
    print('[xml_to_pandas] Exception: {}.'.format(e))

  return df


path = 'test.xml'
row_name = 'row'
# Attribute names are case-sensitive: the XML uses 'Id', not 'ID'.
columns = ['Id', 'Text', 'CreationDate']

root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
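
One thing to watch with a column-driven helper like this is that attribute names are case-sensitive. The sketch below uses a hypothetical `rows_to_frame` function that mirrors `xml_to_pandas` without the try/except, and reads the test data from an inline string instead of a file. It shows that a column name that does not match an attribute exactly produces a column of None values and no error.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def rows_to_frame(root, columns, row_name):
    """Hypothetical helper: one DataFrame row per <row_name> element."""
    rows = root.findall('.//{}'.format(row_name))
    data = [[row.get(c) for c in columns] for row in rows]
    return pd.DataFrame(data, columns=columns)


# Inline copy of test.xml (assumption: same content as the file above).
XML = """<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>"""

root = ET.fromstring(XML)

# Column names that match the attribute names exactly.
good = rows_to_frame(root, ['Id', 'Text', 'CreationDate'], 'row')

# 'ID' does not match the 'Id' attribute, so that column is all None.
bad = rows_to_frame(root, ['ID', 'Text'], 'row')

print(good)
print(bad)
```

If silent None columns are a concern, checking each requested column against `row.attrib` before building the DataFrame is one option.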
