XML to CSV in Python
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20714038/
Asked by plnnr
I'm having a lot of trouble converting an XML file to a CSV in Python. I've looked at many forums, tried both lxml and xmlutils.xml2csv, but I can't get it to work. It's GPS data from a Garmin GPS device.
Here's what my XML file looks like, shortened of course:
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
<name>2013-12-03T21:08:56Z</name>
<trkseg>
<trkpt lat="45.4852855" lon="-122.6347885">
<ele>0.0000000</ele>
<time>2013-12-03T21:08:56Z</time>
</trkpt>
<trkpt lat="45.4852961" lon="-122.6347926">
<ele>0.0000000</ele>
<time>2013-12-03T21:09:00Z</time>
</trkpt>
<trkpt lat="45.4852982" lon="-122.6347897">
<ele>0.2000000</ele>
<time>2013-12-03T21:09:01Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
There are several trk tags in my massive XML file, but I can manage to separate them out -- they represent different "segments" or trips on the GPS device. All I want is a CSV file that plots something like this:
LAT LON TIME ELE
45.4... -122.6... 2013-12... 0.00...
... ... ... ...
Here's the code I have so far:
## Call libraries
import csv
from xmlutils.xml2csv import xml2csv
inputs = "myfile.xml"
output = "myfile.csv"
converter = xml2csv(inputs, output)
converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")
Here is an alternative attempt. It merely outputs a CSV file with no data, just the headers lat and lon.
import csv
import lxml.etree
x = '''
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
<name>2013-12-03T21:08:56Z</name>
<trkseg>
<trkpt lat="45.4852855" lon="-122.6347885">
<ele>0.0000000</ele>
<time>2013-12-03T21:08:56Z</time>
</trkpt>
<trkpt lat="45.4852961" lon="-122.6347926">
<ele>0.0000000</ele>
<time>2013-12-03T21:09:00Z</time>
</trkpt>
<trkpt lat="45.4852982" lon="-122.6347897">
<ele>0.2000000</ele>
<time>2013-12-03T21:09:01Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
'''
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('lat', 'lon'))
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('trkpt'):
        row = trkpt.get('lat'), trkpt.get('lon')
        writer.writerow(row)
How do I do this? Please realize I'm a novice, so a more comprehensive explanation would be super awesome!
Accepted answer by Lukas Graf
This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.
The namespaces used in the document are defined at the top:
xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"
So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which doesn't have a short form after the xmlns, is called the default namespace, and it applies to all elements in the document that don't explicitly use a namespace prefix, so it applies to your <trkpt /> elements as well.
Therefore you would need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements.
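To see why the unqualified tag name matches nothing, it can help to look at how the parser stores tag names. The following is a minimal sketch, not part of the original answer. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (the calls used here behave the same in both), and the GPX sample is a cut-down version of the data in the question.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree accepts the same calls

# Cut-down sample based on the question's data (bytes, so the XML declaration is accepted).
GPX = b"""<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885"><ele>0.0000000</ele></trkpt>
    <trkpt lat="45.4852961" lon="-122.6347926"><ele>0.0000000</ele></trkpt>
  </trkseg></trk>
</gpx>"""

NS = "http://www.topografix.com/GPX/1/1"
root = ET.fromstring(GPX)

# The parser stores every tag with its namespace URI in braces.
root_tag = root.tag

# A bare tag name matches nothing, which is why the CSV only got its header row.
unqualified = list(root.iter("trkpt"))

# The namespace-qualified name matches both track points.
qualified = list(root.iter("{%s}trkpt" % NS))

print(root_tag)
print(len(unqualified), len(qualified))
```

Printing root.tag on an unfamiliar document is a quick way to find out which namespace, if any, its elements live in.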
In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon). Because the time and ele elements also use the default namespace, you'll have to use the {namespace}element syntax again to select those nodes.
So you could use something like this:
import csv
import lxml.etree

# x is the XML string defined in the question above.
NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    root = lxml.etree.fromstring(x)
    # Select elements by their namespace-qualified name.
    for trkpt in root.iter('{%s}trkpt' % NS):
        lat = trkpt.get('lat')  # attributes of <trkpt>
        lon = trkpt.get('lon')
        ele = trkpt.find('{%s}ele' % NS).text  # text of child elements
        time = trkpt.find('{%s}time' % NS).text
        row = lat, lon, ele, time
        writer.writerow(row)
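Since the question mentions several trk elements in one file, here is one possible variation that walks every track and tolerates track points without an ele or time child. It is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, the prefix "gpx" in the namespace mapping is an arbitrary name chosen here, the helper names gpx_to_rows and write_csv are hypothetical, and the second track in the sample (named "second trip", with no ele) is invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; "gpx" is our own label, not something defined in the file.
NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def gpx_to_rows(xml_bytes):
    """Yield (track, lat, lon, ele, time) for every track point in the document."""
    root = ET.fromstring(xml_bytes)
    for trk in root.iterfind("gpx:trk", NS):
        name = trk.findtext("gpx:name", default="", namespaces=NS)
        for trkpt in trk.iterfind("gpx:trkseg/gpx:trkpt", NS):
            yield (
                name,
                trkpt.get("lat"),
                trkpt.get("lon"),
                # findtext returns the default instead of failing on a missing child.
                trkpt.findtext("gpx:ele", default="", namespaces=NS),
                trkpt.findtext("gpx:time", default="", namespaces=NS),
            )


def write_csv(xml_bytes, out):
    """Write a header row plus one row per track point to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(("track", "lat", "lon", "ele", "time"))
    writer.writerows(gpx_to_rows(xml_bytes))


# Hypothetical two-track sample; the second point deliberately has no <ele>.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
<trk>
<name>2013-12-03T21:08:56Z</name>
<trkseg>
<trkpt lat="45.4852855" lon="-122.6347885">
<ele>0.0000000</ele>
<time>2013-12-03T21:08:56Z</time>
</trkpt>
</trkseg>
</trk>
<trk>
<name>second trip</name>
<trkseg>
<trkpt lat="45.4852982" lon="-122.6347897">
<time>2013-12-03T21:09:01Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>"""

buffer = io.StringIO()
write_csv(SAMPLE, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To work on real files, the same write_csv function could be given the bytes read from the GPX file (opened in "rb" mode) and a file opened with open("output.csv", "w", newline="").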
For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces. Also see GPS eXchange Format for some details on the .gpx format.
Answer by Nikhil VJ
Apologies for using ready-made tools here, but this did the job with your data:
- Convert XML to JSON : http://convertjson.com/xml-to-json.htm
- Take that JSON and convert JSON to CSV : https://konklone.io/json/
It worked like a charm with your data.
ele,time,_lat,_lon
0.0000000,2013-12-03T21:08:56Z,45.4852855,-122.6347885
0.0000000,2013-12-03T21:09:00Z,45.4852961,-122.6347926
0.2000000,2013-12-03T21:09:01Z,45.4852982,-122.6347897
So for coding, I reckon XML > JSON > CSV may be a good approach. You may find the relevant scripts pointed to in those links.
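As a rough illustration of the XML > JSON > CSV idea in Python itself, the sketch below turns each track point into a flat, JSON-like record and then writes those records as CSV. It does not reproduce how the linked websites work internally. The helper names (local_name, trkpt_to_record, xml_to_records, records_to_csv) are hypothetical, and prefixing attribute names with an underscore is an assumption made only to mirror the column names in the output shown above.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' part from a tag such as '{uri}trkpt'."""
    return tag.rsplit("}", 1)[-1]


def trkpt_to_record(trkpt):
    """Flatten one <trkpt>: child element text first, then attributes as '_name'."""
    record = {local_name(child.tag): (child.text or "").strip() for child in trkpt}
    record.update({"_" + key: value for key, value in trkpt.attrib.items()})
    return record


def xml_to_records(xml_bytes):
    """Return a list of JSON-serialisable dicts, one per track point."""
    root = ET.fromstring(xml_bytes)
    return [trkpt_to_record(el) for el in root.iter() if local_name(el.tag) == "trkpt"]


def records_to_csv(records, out):
    """Write records as CSV, using the keys in first-seen order as columns."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


# One track point taken from the question's data.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
<trk><trkseg>
<trkpt lat="45.4852855" lon="-122.6347885">
<ele>0.0000000</ele>
<time>2013-12-03T21:08:56Z</time>
</trkpt>
</trkseg></trk>
</gpx>"""

# XML -> JSON text -> CSV text
json_text = json.dumps(xml_to_records(SAMPLE))
buffer = io.StringIO()
records_to_csv(json.loads(json_text), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The JSON step is not strictly needed when everything happens in one script, but keeping the intermediate records JSON-serialisable makes it easy to inspect or store them between the two stages.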

