Easy way to get data between tags of xml or html files in python?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/2097921/

Time: 2020-11-03 23:46:18  Source: igfitidea

Tags: python, html, xml

Asked by Recursion

I am using Python and need to find and retrieve all character data between tags:

<tag>I need this stuff</tag>

I then want to output the found data to another file. I am just looking for a very easy and efficient way to do this.

If you can, please post a quick code snippet that shows the ease of use, because I am having a bit of trouble understanding the parsers.

Answered by ghostdog74

Without external modules, e.g.:

>>> myhtml = """ <tag>I need this stuff</tag>
... blah blah
... <tag>I need this stuff too
... </tag>
... blah blah """
>>> for item in myhtml.split("</tag>"):
...     if "<tag>" in item:
...         # keep everything after the opening tag, minus stray whitespace
...         print(item[item.find("<tag>") + len("<tag>"):].strip())
...
I need this stuff
I need this stuff too
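
The same idea can also be sketched with the standard library's `html.parser` module, which still needs no external modules but does not rely on exact string matches of the tags. This is only a minimal sketch: the class name `TagTextCollector` and its attributes are made up for illustration, and text from nested elements is simply merged into the enclosing `<tag>`.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the character data found inside every element named tag_name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the element currently being read
        self.results = []   # one entry per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # the outermost matching element just closed: store its text
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

collector = TagTextCollector("tag")
collector.feed(myhtml)
collector.close()
found = collector.results
print(found)
```

The resulting list can then be written to another file, for example by joining the entries with newlines and passing the string to `write()`.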

Answered by Andrew Hare

Beautiful Soup is a wonderful HTML/XML parser for Python:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
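
As a rough illustration of the "simple methods" mentioned above, here is a minimal sketch for the question's `<tag>` example. It assumes the `bs4` package (Beautiful Soup 4) is installed; the sample string and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

# "html.parser" selects Python's built-in parser, so only bs4 itself is needed
soup = BeautifulSoup(myhtml, "html.parser")

# find_all() returns every <tag> element; get_text() gives its character data
found = [element.get_text(strip=True) for element in soup.find_all("tag")]
print(found)
```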

Answered by Aiden Bell

I quite like parsing into an element tree and then using element.text and element.tail.

It also has XPath-like searching:

>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
<Element html at b7d3f1ec>
>>> p = tree.find("body/p")     # Finds first occurrence of tag p in body
>>> p
<Element p at 8416e0c>
>>> p.text
"Some text in the Paragraph"
>>> links = list(p.iter("a"))   # Returns list of all links (getiterator() was removed in Python 3.9)
>>> links
[<Element a at b7d4f9ec>, <Element a at b7d4fb0c>]
>>> for i in links:             # Iterates through all found links
...     i.attrib["target"] = "blank"
>>> tree.write("output.xhtml")
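
The session above shows `text` but not `tail`, so here is a small sketch of how the two differ. The XML snippet is a made-up example parsed from a string rather than from `index.xhtml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet; any well-formed XML behaves the same way
snippet = "<p>Some text <a>a link</a> and a tail</p>"
p = ET.fromstring(snippet)
a = p.find("a")

text_before = p.text    # text between <p> and its first child element
link_text = a.text      # text inside <a>...</a>
text_after = a.tail     # text between </a> and the next tag (here </p>)

# itertext() walks the element and yields every text and tail piece in order
all_text = "".join(p.itertext())
print(all_text)
```

In other words, `text` belongs to the element itself, while `tail` holds whatever follows its closing tag, so collecting all character data means reading both (or using `itertext()`).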

Answered by Shravya K

This is how I am doing it:

    (myhtml.split('<tag>')[1]).split('</tag>')[0]

Tell me if it worked!
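
Note that the one-liner returns only the first match and raises an IndexError when the string contains no `<tag>`. A possible variant that collects every occurrence is sketched below with the standard `re` module; this swaps in a regular expression for the split approach, and it only suits simple, non-nested tags without attributes.

```python
import re

myhtml = """ <tag>I need this stuff</tag>
blah blah
<tag>I need this stuff too
</tag>
blah blah """

# Non-greedy match between the tags; DOTALL lets "." cross line breaks
pattern = re.compile(r"<tag>(.*?)</tag>", re.DOTALL)

found = [match.strip() for match in pattern.findall(myhtml)]
print(found)
```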

Answered by torger

Use XPath and lxml:

from lxml import etree

# etree.HTML() expects a string, so read the file contents first
with open("pageToParse.html", "r") as pageInMemory:
    parsedPage = etree.HTML(pageInMemory.read())

# Collect every text node found inside any <tag> element
yourListOfText = parsedPage.xpath("//tag//text()")

with open("savedFile", "w") as saveFile:
    saveFile.writelines(yourListOfText)

This is faster than Beautiful Soup.

If you want to test out your XPath expressions, I find Firefox's XPather extremely helpful.

Further Notes:

Answered by E.G. Cortes

def value_tag(s):
    # Return the text between the first '>' and the next '<' in s
    i = s.index('>')
    s = s[i+1:]
    i = s.index('<')
    s = s[:i]
    return s