How to parse an HTML string in Google Apps Script without using XmlService?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must comply with the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/33893143/
Asked by user3347814
I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it.
The main idea is to use:
var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
var doc = XmlService.parse(html);
And then get and work with the elements. However, the XmlService.parse() method does not work for some pages. For example, if I try:
function test(){
  var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
  var parse = XmlService.parse(html);
}
I get the following error:
Error on line 225: The entity name must immediately follow the '&' in the entity reference. (line 3, file "")
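This error is typical of raw HTML: a bare "&" (for example in an unescaped query string) is illegal in strict XML, where it must begin an entity reference. A quick diagnostic sketch in plain JavaScript that locates such offending characters (the function name is mine; this only finds the problem, it does not fix it):

```javascript
// Find bare ampersands that do not start a well-formed entity
// reference (&name; &#123; or &#xAB;) -- exactly what makes a
// strict XML parser raise "The entity name must immediately
// follow the '&' in the entity reference".
function findBareAmpersands(html) {
  var positions = [];
  var re = /&(?![a-zA-Z][a-zA-Z0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    positions.push(match.index);
  }
  return positions;
}
```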
I've tried to use string.replace() to eliminate the characters that apparently cause the error, but it does not work; all sorts of other errors appear. The following code, for example:
function test(){
  var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
  var regExp = new RegExp("&", "gi");
  html = html.replace(regExp,"");
  var parse = XmlService.parse(html);
}
Gives me the following error:
Error on line 358: The content of elements must consist of well-formed character data or markup. (line 6, file "")
I believe this is a problem with the XmlService.parse() method.
I've read in these threads: Google App Script parse table from messed html and What is the best way to parse html in google apps script that one can use a deprecated method called xml.parse(), which does accept a second parameter that allows parsing HTML. However, as I've mentioned, it is deprecated and I cannot find any documentation on it anywhere. xml.parse() seems to parse the string, but I have trouble working with the elements due to the lack of documentation. It is also not the safest long-term solution, because it could be deactivated at any time.
So, I want to know how do I parse this HTML in Google Apps Script?
I also tried:
function test(){
  var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
  var htmlOutput = HtmlService.createHtmlOutput(html).getContent();
  var parse = XmlService.parse(htmlOutput);
}
But it does not work, I get this error:
Malformed HTML content:
I thought about using an open-source library to parse the HTML, but I could not find any.
My ultimate goal is to get some information from a set of pages, such as the price, link, and name of the products. I've managed to do this using a series of RegExes:
var ss = SpreadsheetApp.getActiveSpreadsheet();
var linksSheet = ss.getSheetByName("Links");
var resultadosSheet = ss.getSheetByName("Resultados");

function scrapyLoco(){
  var links = linksSheet.getRange(1, 1, linksSheet.getLastRow(), 1).getValues();
  var arrayGrandao = [];
  for (var row = 0, len = links.length; row < len; row++){
    var link = links[row];
    var arrayDeResultados = pegarAsCoisas(link[0]);
    Logger.log(arrayDeResultados);
    arrayGrandao.push(arrayDeResultados);
  }
  resultadosSheet.getRange(2, 1, arrayGrandao.length, arrayGrandao[0].length).setValues(arrayGrandao);
}

function pegarAsCoisas(linkDoProduto) {
  var resultadoArray = [];
  var html = UrlFetchApp.fetch(linkDoProduto).getContentText();
  var regExp = new RegExp("<h1([^]*)h1>", "gi");
  var h1Html = regExp.exec(html);
  var h1Parse = XmlService.parse(h1Html[0]);
  var h1Output = h1Parse.getRootElement().getText();
  h1Output = h1Output.replace(/(\r\n|\n|\r|(^( )*))/gm,"");
  regExp = new RegExp("Ref.: ([^(])*", "gi");
  var codeHtml = regExp.exec(html);
  var codeOutput = codeHtml[0].replace("Ref.: ","").replace(" ","");
  regExp = new RegExp("margin-top: 5px; margin-bottom: 5px; padding: 5px; background-color: #699D15; color: #fff; text-align: center;([^]*)/div>", "gi");
  var descriptionHtml = regExp.exec(html);
  var regExp = new RegExp("<p([^]*)p>", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var regExp = new RegExp("^[^.]*", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var descriptionOutput = descriptionHtml[0].replace("<p>","");
  descriptionOutput = descriptionOutput+".";
  regExp = new RegExp("ecom(.+?)Main.png", "gi");
  var imageHtml = regExp.exec(html);
  var comecoDaURL = "https://www.nespresso.com/";
  var imageOutput = comecoDaURL+imageHtml[0];
  var regExp = new RegExp("nes_l-float nes_big-price nes_big-price-with-out([^]*)p>", "gi");
  var precoHtml = regExp.exec(html);
  var regExp = new RegExp("[0-9]*,", "gi");
  precoHtml = regExp.exec(precoHtml);
  var precoOutput = "BRL "+precoHtml[0].replace(",","");
  resultadoArray = [codeOutput, h1Output, descriptionOutput,
    "Home & Garden > Kitchen & Dining > Kitchen Appliances > Coffee Makers & Espresso Machines",
    "Máquina", linkDoProduto, imageOutput, "new", "in stock", precoOutput,
    "", "", "", "Nespresso", codeOutput];
  return resultadoArray;
}
But this is very time-consuming to program, it is very hard to change dynamically, and it is not very reliable.
I need a way to parse this HTML and easily access its elements. It's actually not an add-on, just a simple Google Apps Script.
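The regex approach above boils down to running anchored patterns over the raw markup. A minimal, testable sketch of that idea in plain JavaScript (the sample markup, class names, and helper names below are made up for illustration, not taken from the Nespresso page):

```javascript
// Hypothetical sample markup standing in for one fetched product page.
var sampleHtml =
  '<h1 class="title"> Espresso Machine </h1>' +
  '<p class="price">BRL 449,00</p>';

// Grab the text of the first <h1> element, as the h1 regex above does.
function extractH1(html) {
  var match = /<h1[^>]*>([^]*?)<\/h1>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Grab the numeric part of the price, as the price regexes above do.
function extractPrice(html) {
  var match = /class="price">[^0-9]*([0-9]+(?:,[0-9]+)?)/i.exec(html);
  return match ? match[1] : null;
}
```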
Accepted answer by Fabian Thommen
I have done this in vanilla JS. It's not real HTML parsing, just trying to get some content out of a string (URL):
function getLKKBTC() {
  var url = 'https://www.lykke.com/exchange';
  var html = UrlFetchApp.fetch(url).getContentText();
  var searchstring = '<td class="ask_BTCLKK">';
  var index = html.search(searchstring);
  if (index >= 0) {
    var pos = index + searchstring.length;
    var rate = html.substring(pos, pos + 6);
    rate = parseFloat(rate);
    rate = 1 / rate;
    return parseFloat(rate);
  }
  throw "Failed to fetch/parse data from " + url;
}
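The marker-and-offset idea in this answer can be factored into a small reusable helper. A sketch in plain JavaScript (the helper names are mine; the marker in the test mirrors the one used above):

```javascript
// Return up to `length` characters immediately following `marker`
// in `html`, or null if the marker is absent.
function textAfter(html, marker, length) {
  var index = html.indexOf(marker);
  if (index < 0) return null;
  var pos = index + marker.length;
  return html.substring(pos, pos + length);
}

// Same shape as getLKKBTC() above: read a rate and invert it.
function inverseRateAfter(html, marker) {
  var rate = parseFloat(textAfter(html, marker, 6));
  return isNaN(rate) ? null : 1 / rate;
}
```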
Answered by asciian
I made cheeriogs for your problem. It works on GAS as cheerio, which has a jQuery-like API. You can do it like this:
const content = UrlFetchApp.fetch('https://example.co/').getContentText();
const $ = Cheerio.load(content);
Logger.log($('p .blah').first().text()); // blah blah blah ...
See also https://github.com/asciian/cheeriogs
Answered by Sujay Phadke
This has been discussed before. See here: What is the best way to parse html in google apps script
Unlike the deprecated Xml service, XmlService is not very forgiving of malformed HTML. The trick in the answer by Justin Bicknell does the job. Even though the Xml service has been deprecated, it still continues to work.
Answered by Eric Koleda
Please be aware that certain websites may not permit automated scraping of their content, so please consult their terms of service before using Apps Script to extract it.
The XmlService only works against valid XML documents, and most HTML (especially HTML5) is not valid XML. A previous version of the XmlService, simply called Xml, allowed for "lenient" parsing, which would allow it to parse HTML as well. This service was sunset in 2013, but for the time being it still functions. The reference docs are no longer available, but this old tutorial shows its usage.
Another alternative is to use a service like Kimono, which handles the scraping and parsing parts and provides a simple API you can call via UrlFetchApp to retrieve the structured data.
Answered by user3347814
I've found a very neat alternative for scraping with Google Apps Script. It is called PhantomJS Cloud. One can use UrlFetchApp to access the API. This allows executing jQuery code on the pages, which makes life so much simpler.
Answered by Eric Dauenhauer
Could you use JavaScript to parse the HTML? If your Google Apps Script retrieved the HTML as a string and then returned it to a JavaScript function, it seems like you could parse it just fine outside of the Google Apps Script. Any tags you want to scrape, you could send to a dedicated Apps Script function that would save the content.
You could probably accomplish this more easily with jQuery.
Answered by vchrizz
Maybe not the cleanest approach, but simple string processing does the job too, without XmlService:
var url = 'https://somewebsite.com/?q=00:11:22:33:44:55';
var html = UrlFetchApp.fetch(url).getContentText();
// we want only the link text displayed from here:
//<td><a href="/company/ubiquiti-networks-inc">Ubiquiti Networks Inc.</a></td>
var string1 = html.split('<td><a href="/company/')[1]; // all after '<td><a href="/company/'
var string2 = string1.split('</a></td>')[0]; // all before '</a></td>'
var string3 = string2.split('>')[1]; // all after '>'
Logger.log('link text: '+string3); // string3 => "Ubiquiti Networks Inc."
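The three chained split() calls above can be wrapped into a single helper so the markers are named once. A plain-JavaScript sketch (the helper names are mine; the sample row in the test is the one from the comment above):

```javascript
// Return the substring of `html` strictly between `before` and `after`,
// or null if either marker is missing.
function between(html, before, after) {
  var start = html.indexOf(before);
  if (start < 0) return null;
  start += before.length;
  var end = html.indexOf(after, start);
  if (end < 0) return null;
  return html.substring(start, end);
}

// Mirrors the chained splits above: pull the link text out of the cell.
function linkText(html) {
  var anchor = between(html, '<td><a href="/company/', '</a></td>');
  if (anchor === null) return null;
  // Drop the rest of the opening tag, keeping only the text after '>'.
  return anchor.split('>')[1];
}
```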