Original question: http://stackoverflow.com/questions/18368058/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How can I parse Javascript variables using python?
Asked by Alex Ketay
The problem: A website I am trying to gather data from uses Javascript to produce a graph. I'd like to be able to pull the data that is being used in the graph, but I am not sure where to start. For example, the data might be as follows:
var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];
This is pricing data (Date, Price, Volume). I've found another question here, "Parsing variable data out of a js tag using python", which suggests that I use JSON and BeautifulSoup, but I am unsure how to apply it to this particular problem because the formatting is slightly different. In fact, in this problem the code looks more like Python than any type of JSON dictionary format.
I suppose I could read it in as a string, and then use XPATH and some funky string editing to convert it, but this seems like too much work for something that is already formatted as a Javascript variable.
So, what can I do here to pull this type of organized data from this variable while using python? (I am most familiar with python and BS4)
Accepted answer by Alex Ketay
Okay, so there are a few ways to do it, but I ended up simply using a regular expression to find everything between line1= and ;
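import re

#Read page data as a string (sock is assumed to be an already-open connection to the page)
pageData = sock.read()
#set p as regular expression; DOTALL lets . match the newlines inside the array,
#and the non-greedy .*? stops at the first ; after line1=
p = re.compile(r'(?<=line1=)(.*?)(?=;)', re.DOTALL)
#find all instances of regular expression in pageData
parsed = p.findall(pageData)
#evaluate list as python code => turn into list in python
#(eval runs whatever the page contains, so only use it on trusted input)
newParsed = eval(parsed[0])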
Regex is nice when you have good coding, but is this method better (EDIT: or worse!) than any of the other answers here?
EDIT: I ultimately used the following:
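import json
import re

#Read page data as a string (sock is assumed to be an already-open connection to the page)
pageData = sock.read()
#set p as regular expression; DOTALL lets . match the newlines inside the array,
#and the non-greedy .*? stops at the first ; after line1=
p = re.compile(r'(?<=line1=)(.*?)(?=;)', re.DOTALL)
#find all instances of regular expression in pageData
parsed = p.findall(pageData)
#load as JSON instead of using evaluate to prevent risky execution of unknown code
newParsed = json.loads(parsed[0])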
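Since the extracted rows are (date, price, volume) triples, a natural next step is to convert them into typed values. The following is a minimal sketch rather than part of the original answer: the helper name `parse_rows` and the sample `page_data` string are hypothetical, and it assumes every date uses the RFC 2822 style shown in the question and every volume looks like "N sold".

```python
import json
import re
from email.utils import parsedate_to_datetime

# Hypothetical stand-in for the page source read from the site.
page_data = '''<script>
var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];
</script>'''


def parse_rows(text):
    """Extract the line1 array and convert each row to (datetime, float, int)."""
    match = re.search(r'(?<=line1=)(.*?)(?=;)', text, re.DOTALL)
    if match is None:
        raise ValueError("line1 not found")
    rows = json.loads(match.group(1))
    return [
        # Assumes the volume field starts with an integer, e.g. "2 sold".
        (parsedate_to_datetime(date), float(price), int(volume.split()[0]))
        for date, price, volume in rows
    ]


rows = parse_rows(page_data)
for when, price, sold in rows:
    print(when.isoformat(), price, sold)
```

If the real page formats dates or volumes differently, the conversion inside the list comprehension is the only part that would need to change.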
Answer by abarnert
If your format really is just one or more var foo = [JSON array or object literal]; statements, you can just write a dotall regex to extract them, then parse each one as JSON. For example:
>>> import json, re
>>> j = '''var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];'''
>>> values = re.findall(r'var.*?=\s*(.*?);\s*$', j, re.DOTALL | re.MULTILINE)
>>> for value in values:
...     print(json.loads(value))  # output wrapped here for readability
[['Wed, 12 Jun 2013 01:00:00 +0000', 22.4916114807, '2 sold'],
['Fri, 14 Jun 2013 01:00:00 +0000', 27.4950008392, '2 sold'],
['Sun, 16 Jun 2013 01:00:00 +0000', 19.5499992371, '1 sold'],
['Tue, 18 Jun 2013 01:00:00 +0000', 17.25, '1 sold'],
['Sun, 23 Jun 2013 01:00:00 +0000', 15.5420341492, '2 sold'],
['Thu, 27 Jun 2013 01:00:00 +0000', 8.79045295715, '3 sold'],
['Fri, 28 Jun 2013 01:00:00 +0000', 10, '1 sold']]
Of course this makes a few assumptions:
- A semicolon at the end of the line must be an actual statement separator, not the middle of a string. This should be safe because JS doesn't have Python-style multiline strings.
- The code actually does have semicolons at the end of each statement, even though they're optional in JS. Most JS code has those semicolons, but it obviously isn't guaranteed.
- The array and object literals really are JSON-compatible. This definitely isn't guaranteed; for example, JS can use single-quoted strings, but JSON can't. But it does work for your example.
- Your format really is this well-defined. For example, if there might be a statement like
var line2 = [[1]] + line1;
in the middle of your code, it's going to cause problems.
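The two semicolon assumptions above can be relaxed by letting the JSON decoder find the end of the literal itself instead of searching for a `;`. This is only a sketch under stated assumptions, not the approach from the answer: the helper name `extract_json_var` and the sample `page` text are hypothetical, and the literal must still be valid JSON. It relies on `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows it.

```python
import json
import re

# Hypothetical page source: the first statement has no trailing semicolon
# and contains a ';' inside a string, which would confuse a ';'-based regex.
page = '''
var line1 = [["Wed, 12 Jun 2013 01:00:00 +0000", 22.49, "2 sold; 1 pending"]]
var other = {"a": 1};
'''


def extract_json_var(source, name):
    """Return the JSON value assigned to `var <name>` in source."""
    # Consume the whitespace after '=' as well, because raw_decode does not
    # skip leading whitespace on its own.
    start = re.search(r'\bvar\s+' + re.escape(name) + r'\s*=\s*', source)
    if start is None:
        raise KeyError(name)
    # raw_decode returns the parsed value and the index where parsing stopped;
    # any text after the value (such as ';' or more code) is ignored.
    value, _end = json.JSONDecoder().raw_decode(source, start.end())
    return value


line1 = extract_json_var(page, "line1")
other = extract_json_var(page, "other")
print(line1)
print(other)
```

A statement such as var line2 = [[1]] + line1; would still be misread (only the [[1]] part would be returned), so the last assumption in the list continues to apply.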
Note that if the data might contain JavaScript literals that aren't all valid JSON, but are all valid Python literals (which isn't likely, but isn't impossible, either), you can use ast.literal_eval on them instead of json.loads. But I wouldn't do that unless you know this is the case.
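As a small illustration of that fallback, here is a sketch that tries json.loads first and only falls back to ast.literal_eval when the text is not valid JSON. The sample string `js_literal` is hypothetical; it uses single-quoted strings, which are valid JavaScript and valid Python but not valid JSON.

```python
import ast
import json

# Hypothetical JavaScript literal with single-quoted strings.
js_literal = "[['Wed, 12 Jun 2013 01:00:00 +0000', 22.49, '2 sold']]"

try:
    parsed = json.loads(js_literal)
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    # literal_eval only accepts Python literals (strings, numbers, lists,
    # dicts, and so on), so unlike eval() it cannot run arbitrary code.
    parsed = ast.literal_eval(js_literal)

print(parsed)
```

Keep in mind that JavaScript's true, false and null are not Python literals, so ast.literal_eval rejects data that contains them.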
Answer by Paul S.
The following makes a few assumptions, such as knowing how the page is formatted, but one way of getting your example into memory in Python is like this:
# example data
data = 'foo bar foo bar foo bar foo bar\r\nfoo bar foo bar foo bar foo bar \r\nvar line1=\r\n[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],\r\n["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],\r\n["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],\r\n["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],\r\n["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],\r\n["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],\r\n["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];\r\nfoo bar foo bar foo bar foo bar\r\nfoo bar foo bar foo bar foo bar'
# find your variable's start and end
x = data.find('line1=') + 6
y = data.find(';', x)
# so you can get just the relevant bit
interesting = data[x:y].strip()
# most dangerous step! don't do this on unknown sources
parsed = eval(interesting)
# maybe you'd want to use JSON instead, if the data has the right syntax
from json import loads as JSON
parsed = JSON(interesting)
# now parsed is your data
Answer by IPDGino
Assuming you have a Python variable holding a JavaScript line/block as a string, like "var line1 = [[a,b,c], [d,e,f]];", you could use the following few lines of code.
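>>> code = """var line1 = [['a','b','c'], ['d','e','f'], ['g','h','i']];"""
>>> # strip() removes any of the characters 'v', 'a', 'r', ' ' and ';' from both ends,
>>> # so this only works when the variable name does not start with one of them
>>> python_readable_code = code.strip("var ;")
>>> exec(python_readable_code)
>>> print(line1)
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]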
exec() will run the code that is formatted as a string. In this case it will set the variable line1 to a list of lists.
And then you could use something like this:
for row in line1:
    print(row[0], row[1], row[2])
    # Or do something else with those values, like save them to a file
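For the "save them to a file" idea, one possible sketch uses the standard csv module. The function name `write_rows`, the column names and the sample data are hypothetical; an in-memory buffer stands in for a real file so the example is self-contained.

```python
import csv
import io

# Hypothetical sample in the same shape as the parsed line1 data.
line1 = [
    ["Wed, 12 Jun 2013 01:00:00 +0000", 22.4916114807, "2 sold"],
    ["Fri, 28 Jun 2013 01:00:00 +0000", 10, "1 sold"],
]


def write_rows(rows, fileobj):
    """Write a header line plus one CSV line per (date, price, volume) row."""
    writer = csv.writer(fileobj)
    writer.writerow(["date", "price", "volume"])
    writer.writerows(rows)


# For a real file, open("prices.csv", "w", newline="") could be passed
# instead of the buffer ("prices.csv" is a made-up file name).
buffer = io.StringIO()
write_rows(line1, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes the date field automatically because it contains a comma, so the file can be read back with csv.reader without extra handling.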