Original question: http://stackoverflow.com/questions/19455158/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
What is the best way to parse HTML in Google Apps Script?
Asked by copperhead
var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code gives a parse error when used. However, if I replace the XmlService class with the deprecated Xml class and set the lenient flag, it parses the HTML properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the lack of CDATA sections in the JavaScript part of the HTML; the parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using a regex, it still complains because the <br> tags aren't closed. Is there a clean way of parsing HTML into a DOM tree?
Answered by Justin Bicknell
I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, and then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution may not work if the old Xml.parse is completely removed from Google Apps Script.
Answered by Eric Koleda
Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note, however, that the Xml service is deprecated, and the newer XmlService doesn't have this functionality.
Answered by Ivan de Leon
For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad at parsing HTML, as there are all sorts of weird cases that can trip it up, but if you're confident about the HTML you're accessing, this can sometimes be the simplest way.
Here's an example that fetches the contents of the page's <title> tag:
var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the first capture group (the part of the pattern in parentheses)
var value = result ? result[1] : 'No title found';
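As a variation on the snippet above, here is a minimal sketch that wraps the same idea in a reusable function. The name extractTitle and the sample HTML are assumptions made for illustration, not part of the original answer. The pattern is non-greedy and tolerates line breaks and attributes on the tag, which the (.*) version would not.

```javascript
// Hypothetical helper (not from the original answer): returns the text of
// the first <title> element in an HTML string, or null if there is none.
function extractTitle(html) {
  // [^>]* tolerates attributes on the tag; [\s\S]*? is non-greedy and also
  // matches newlines, so the capture stops at the first closing tag.
  var match = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// In Apps Script the input would come from
// UrlFetchApp.fetch(contestURL).getContentText(); a made-up literal is used here.
var sample = '<html><head><title>\n  Contest Page </title></head><body></body></html>';
var title = extractTitle(sample);
```

Returning null instead of a placeholder string leaves it to the caller to decide what a missing title means.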
Answered by Yves R
I found that the best way to parse HTML in Google Apps Script is to avoid using XmlService.parse or Xml.parse. XmlService.parse doesn't work well with malformed HTML from certain websites.
Here is a basic example of how you can parse any website easily without using XmlService.parse or Xml.parse. In this example, I am retrieving a list of presidents from "wikipedia.org/wiki/President_of_the_United_States" with a regular JavaScript document.getElementsByTagName() call, and pasting the values into my Google spreadsheet.
1- Create a new Google Sheet;
2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:
// Adds a custom menu to the spreadsheet when it is opened.
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu("Parse Menu")
    .addItem("Parse", "parserMenuItem")
    .addToUi();
}

// Shows test.html in a sidebar.
function parserMenuItem() {
  var sideBar = HtmlService.createHtmlOutputFromFile("test");
  SpreadsheetApp.getUi().showSidebar(sideBar);
}

// Fetches the page at the given URL and returns its HTML as a string.
function getUrlData(url) {
  return UrlFetchApp.fetch(url).getContentText();
}

// Writes each item of data into column A of the first sheet, one per row.
function writeToSpreadSheet(data) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = 1;
  for (var i = 0; i < data.length; i++) {
    sheet.getRange(row, 1).setValue(data[i]);
    row = row + 1;
  }
}
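The loop in writeToSpreadSheet writes one cell per call. A possible variation, not part of the original answer, is to write the whole column with a single Range.setValues() call, which expects a two-dimensional array of rows. The helper names toColumn and writeColumn below are assumptions for illustration; writeColumn only runs inside Apps Script, so it is defined here but not called.

```javascript
// Hypothetical helper: turns ["a", "b"] into [["a"], ["b"]], the
// rows-by-columns shape that Range.setValues() expects for a single column.
function toColumn(data) {
  return data.map(function (value) {
    return [value];
  });
}

// Sketch of a batched writer. SpreadsheetApp exists only inside Apps Script,
// so this function is defined but not invoked in this example.
function writeColumn(data) {
  if (data.length === 0) return;
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheets()[0];
  // Rows 1..data.length of column A, filled in one call.
  sheet.getRange(1, 1, data.length, 1).setValues(toColumn(data));
}

var rows = toColumn(["George Washington", "John Adams"]);
```

Batching tends to matter more as the list grows, since each separate setValue call is a round trip to the spreadsheet.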
3- Add an HTML file to your Apps Script project. Open the Script Editor, choose File > New > Html File, and name it 'test'. Then copy the following code into your test.html:
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <input id="mButon" type="button" value="Click here to get list" onclick="parse()">
    <div hidden id="mOutput"></div>
  </body>
  <script>
    // Load the remote page as soon as the sidebar opens.
    window.onload = onOpen;

    function onOpen() {
      var url = "https://en.wikipedia.org/wiki/President_of_the_United_States";
      google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url);
      document.getElementById("mButon").style.visibility = "visible";
    }

    // Put the fetched HTML into the hidden div so the browser builds DOM nodes from it.
    function writeHtmlOutput(x) {
      document.getElementById('mOutput').innerHTML = x;
    }

    // Collect the title attribute of every <area> element and send the list to the sheet.
    function parse() {
      var list = document.getElementsByTagName("area");
      var data = [];
      for (var i = 0; i < list.length; i++) {
        var x = list[i];
        data.push(x.getAttribute("title"));
      }
      google.script.run.writeToSpreadSheet(data);
    }
  </script>
</html>
4- Save your gs and html files and go back to your spreadsheet. Reload the spreadsheet. Click on "Parse Menu" > "Parse". Then click on "Click here to get list" in the sidebar.
Answered by Jindřich Širůček
I know it is not exactly what OP asked, but I found this question when I was looking for some html parsing options - so it might be useful for others as well.
There is an easy-to-use library for text parsing. It's useful if you want to get only one piece of information from the HTML (or XML) code.
It works like this:
function getData() {
  var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
  var fromText = '<span class="e-f-ih" title="';
  var toText = '">';
  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser
    .data(content)
    .from(fromText)
    .to(toText)
    .build();
  Logger.log(scraped);
  return scraped;
}
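For readers who would rather not add a library, the from/to idea can be approximated in a few lines of plain JavaScript. The sketch below is not the Parser library's code; the function name between and the sample HTML are made up for illustration.

```javascript
// Hypothetical stand-in for the from/to idea. This is NOT the Parser
// library's implementation, only an indexOf-based sketch of the same idea.
function between(content, fromText, toText) {
  var start = content.indexOf(fromText);
  if (start === -1) return null;           // start marker not found
  start += fromText.length;                // skip past the start marker
  var end = content.indexOf(toText, start);
  if (end === -1) return null;             // end marker not found
  return content.substring(start, end);
}

// Made-up sample; in Apps Script the content would come from
// UrlFetchApp.fetch(url).getContentText().
var html = '<div><span class="e-f-ih" title="1,234 users">1,234 users</span></div>';
var scraped = between(html, '<span class="e-f-ih" title="', '">');
```

Like the regex approach, this depends on the exact markup around the value, so it breaks if the page changes its class names or attribute order.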
Answered by Zig Mandel
Natively there is no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.
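To illustrate what making HTML conform involves for the two problems named in the question (script blocks and unclosed <br> tags), here is a rough pre-cleaning sketch. The function name preclean is an assumption, and the approach covers only those two cases: real pages usually have other issues, such as unescaped ampersands in text or unquoted attributes, so XmlService.parse may still reject the result.

```javascript
// Hypothetical pre-cleaning step. It covers only the two problems named in
// the question and is not a general HTML-to-XML converter.
function preclean(html) {
  return html
    // Drop <script> blocks, whose unescaped "&" and "<" break XML parsing.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "")
    // Self-close <br> tags, keeping any attributes: <br> becomes <br/>.
    .replace(/<br\b([^>]*?)\s*\/?>/gi, "<br$1/>");
}

// In Apps Script the cleaned string could then be tried with
// XmlService.parse(cleaned); here only the string transformation is shown.
var cleaned = preclean('<p>a<br>b<script>if (x && y) {}</script><br /></p>');
```

If the remaining markup is still not well-formed, the parser will raise a similar error at the next offending construct.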