pandas 如何打开此 XML 文件以在 Python 中创建数据框？

Question

提问by SarahB87

Does anyone have a suggestion for the best way to open the xml data on the site below to put it in a dataframe (I prefer working with pandas) in python? The file is on the "Data - XML (sdmx/zip)" link on this site:

有没有人有关于在下面的站点上打开 xml 数据的最佳方法的建议，以将其放入 python 中的数据帧（我更喜欢使用 Pandas）？该文件位于此站点的“数据 - XML (sdmx/zip)”链接上：

http://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html

I've tried using the following by copying from http://timhomelab.blogspot.com/2014/01/how-to-read-xml-file-into-dataframe.html, and it seems I'm getting close:

我已经尝试通过从http://timhomelab.blogspot.com/2014/01/how-to-read-xml-file-into-dataframe.html复制来使用以下内容，看来我已经接近了：

from lxml import objectify
import pandas as pd

path = 'feds200628.xml'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('id', 'name'))

for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Still, I don't know enough about xml to get me the rest of the way.

尽管如此，我对 xml 的了解还不够多，无法让我完成剩下的工作。

Any help would be awesome - I don't even needit to be in a dataframe, I just need to figure out how to parse this content in python somehow.

任何帮助都会很棒 - 我什至不需要它在数据框中，我只需要弄清楚如何以某种方式在 python 中解析这些内容。

Answer 1

回答by unutbu

XML is a tree-like structure, while a Pandas DataFrame is a 2D table-like structure. So there is no automatic way to convert between the two. You have to understand the XML structure and know how you want to map its data onto a 2D table. Thus, every XML-to-DataFrame problem is different.

XML 是一种树状结构，而 Pandas DataFrame 是一种 2D 表状结构。所以没有自动的方法在两者之间进行转换。您必须了解 XML 结构并知道如何将其数据映射到 2D 表。因此，每个 XML-to-DataFrame 问题都是不同的。

Your XML has 2 DataSets, each containing a number of Series. Each Series contains a number of Obs elements.

您的 XML 有 2 个数据集，每个包含多个系列。每个系列包含许多 Obs 元素。

Each Series has a NAME attribute, and each Obs has OBS_STATUS, TIME_PERIOD and OBS_VALUE attributes. So perhaps it would be reasonable to create a table with NAME, OBS_STATUS, TIME_PERIOD, and OBS_VALUE columns.

每个Series都有一个NAME属性，每个Obs都有OBS_STATUS、TIME_PERIOD和OBS_VALUE属性。所以也许创建一个包含 NAME、OBS_STATUS、TIME_PERIOD 和 OBS_VALUE 列的表是合理的。

I found pulling the desired data out of the XML a bit complicated, which makes me doubtful that I've found the best way to do it. But here is one way (PS. Thomas Maloney's idea of starting with the 2D table-like XLS data should be way simpler):

我发现从 XML 中提取所需的数据有点复杂，这让我怀疑我是否找到了最好的方法。但这是一种方法（PS。Thomas Maloney 从类似 2D 表格的 XLS 数据开始的想法应该更简单）：

import lxml.etree as ET
import pandas as pd

path = 'feds200628.xml'

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    http://stackoverflow.com/a/7171543/190597 (unutbu)
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

data = list()
obs_keys = ['OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE']
columns = ['NAME'] + obs_keys

def process_obs(elem, name):
    dct = elem.attrib
    # print(dct)
    data.append([name] + [dct[key] for key in obs_keys])

def process_series(elem):
    dct = elem.attrib
    # print(dct)
    context = ET.iterwalk(
        elem, events=('end', ),
        tag='{http://www.federalreserve.gov/structure/compact/common}Obs'
        )
    fast_iter(context, process_obs, dct['SERIES_NAME'])

def process_dataset(elem):
    nsmap = elem.nsmap
    # print(nsmap)
    context = ET.iterwalk(
        elem, events=('end', ),
        tag='{{{prefix}}}Series'.format(prefix=elem.nsmap['kf'])
        )
    fast_iter(context, process_series)

with open(path, 'rb') as f:
    context = ET.iterparse(
        f, events=('end', ),
        tag='{http://www.federalreserve.gov/structure/compact/common}DataSet'
        )
    fast_iter(context, process_dataset)
    df = pd.DataFrame(data, columns=columns)

yields

产量

            NAME OBS_STATUS TIME_PERIOD   OBS_VALUE
0        SVENY01          A  1961-06-14      2.9825
1        SVENY01          A  1961-06-15      2.9941
2        SVENY01          A  1961-06-16      3.0012
3        SVENY01          A  1961-06-19      2.9949
4        SVENY01          A  1961-06-20      2.9833
5        SVENY01          A  1961-06-21      2.9993
6        SVENY01          A  1961-06-22      2.9837
...
1029410     TAU2          A  2014-09-19  3.72896779
1029411     TAU2          A  2014-09-22  3.12836171
1029412     TAU2          A  2014-09-23  3.20146575
1029413     TAU2          A  2014-09-24  3.29972110

Answer 2

回答by Thomas Maloney

I would export the XLS formatted fileto a CSV file (using a freely available program like Gnumeric or LibreOffice, or if you have it, Excel), and then read the CSV file into pandas. I know this is not exactly an answer to your final question, but parsing XML is an overly complicated solution to what you're trying to do.

我会将XLS 格式的文件导出为 CSV 文件（使用免费提供的程序，如 Gnumeric 或 LibreOffice，或者如果您有 Excel），然后将 CSV 文件读入 Pandas。我知道这不完全是您最后一个问题的答案，但解析 XML 是您尝试做的事情的一个过于复杂的解决方案。

Regarding parsing XML in Python, the lxml library is my favorite library to use. I find using the XPath query language together with an lxml parser to be the best route.

关于在 Python 中解析 XML，lxml 库是我最喜欢使用的库。我发现将 XPath 查询语言与 lxml 解析器一起使用是最佳途径。

Answer 3

回答by jrovegno

This code it work to transform to df this type of Excel XML file:

这段代码可以将这种类型的 Excel XML 文件转换为 df：

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('feds200628.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][10:], columns=excelHandler.tables[0][9])
print df1.head()

I can't make comment (low reputation), but the answer of this question about "How to open Excel XML file programmatically" (with python and pandas) it should work.

我无法发表评论（低声誉），但是关于“如何以编程方式打开 Excel XML 文件”（使用 python 和Pandas）这个问题的答案应该可以工作。

pandas 如何打开此 XML 文件以在 Python 中创建数据框？

提问by SarahB87

回答by unutbu

回答by Thomas Maloney

回答by jrovegno

相关推荐

最近更新

标签

pandas 如何打开此 XML 文件以在 Python 中创建数据框？

提问by SarahB87

回答by unutbu

回答by Thomas Maloney

回答by jrovegno

相关推荐

Pandas DataFrameGroupBy 导出到 Excel

pandas 熊猫合并返回空数据框

仅使用一行交换 Pandas 数据框中选定行的列值的正确语法是什么？

如何使用 pandas.date_range() 在指定的开始日期和结束日期之间获取具有 n 个指定周期（相等）的时间序列

相关推荐

最近更新

标签