Python: Parse the JavaScript returned from BeautifulSoup

Notice: This page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/21069294/

Date: 2020-08-18 21:58:34  Source: igfitidea

javascript, python, beautifulsoup, html-parsing

Asked by Wade

I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to automatically print the menu each day.)

I initially approached this using BeautifulSoup but it turns out that most of the data is loaded in JavaScript and I'm not sure BeautifulSoup can handle it. If you view source you'll see the relevant data stored in bootstrapData['menuMonthWeeks'].

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())

This is an easy way to fetch the page source and review it.

My question is: what is the easiest way to extract this data so that I can do something with it? Literally, all I want is a string something like:

Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries

I've thought about using webkit to process the page and get the HTML (i.e. what a browser does) but that seems unnecessarily complex. I'd rather simply find something that can parse the bootstrapData['menuMonthWeeks'] data.

Accepted answer by user94559

Something like PhantomJS may be more robust, but here's some basic Python code to extract the full menu:

import json
import re
import urllib2

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

print menu

After that, you'll want to search through the menu for the date you're interested in.
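
For readers on Python 3, where urllib2 no longer exists, the extraction step above might look like the following sketch. It reuses the answer's regular expression unchanged; the function names and the sample fragment are made up for illustration, and decoding the page as UTF-8 is an assumption.

```python
import json
import re
from urllib.request import urlopen

# Pattern from the answer above: capture everything between the assignment
# and the last semicolon on that line.
MENU_RE = re.compile(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);")


def extract_menu_weeks(html):
    """Return the decoded menuMonthWeeks value, or None if it is absent."""
    match = MENU_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def fetch_menu_weeks(url):
    """Download the page and extract the menu (assumes UTF-8 encoding)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8")
    return extract_menu_weeks(html)


# Offline check against a made-up fragment shaped like the page source.
sample = """<script>
bootstrapData['menuMonthWeeks'] = [{"days": [{"date": "2014-01-13"}]}];
</script>"""
print(extract_menu_weeks(sample))
```

Keeping the parsing separate from the download makes it possible to test the regular expression against a saved copy of the page without touching the network.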

EDIT: Some overkill on my part:

import itertools
import json
import re
import urllib2

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

days = itertools.chain.from_iterable(menu['days'] for menu in menus)

day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None)

if day:
    print '\n'.join(item['food']['description'] for item in day['menu_items'])
else:
    print 'Day not found.'
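
The same lookup can be wrapped in a small function that returns the single comma-separated string the question asks for. This is only a Python 3 sketch: the keys (days, date, menu_items, food, description) are the ones used in the code above, while the function name and the sample data are invented for illustration.

```python
from itertools import chain


def menu_for_date(menus, date):
    """Return the day's item descriptions joined by ', ', or None."""
    days = chain.from_iterable(week["days"] for week in menus)
    day = next((d for d in days if d["date"] == date), None)
    if day is None:
        return None
    return ", ".join(item["food"]["description"] for item in day["menu_items"])


# Made-up sample using the same keys as the code above.
sample_menus = [
    {"days": [
        {"date": "2014-01-13", "menu_items": [
            {"food": {"description": "Southwest Cheese Omelet"}},
            {"food": {"description": "Potato Wedges"}},
        ]},
    ]},
]
print(menu_for_date(sample_menus, "2014-01-13"))
```

To look up the current day instead of a fixed one, datetime.date.today().isoformat() produces a string in the same YYYY-MM-DD form.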

Answer by Martijn Pieters

All you need is a little string slicing:

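import json
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Assumes the second <script> tag on the page holds the bootstrap data.
script = soup.findAll('script')[1].string
# Keep only the text between the assignment and the final semicolon.
data = script.split("bootstrapData['menuMonthWeeks'] = ", 1)[-1].rsplit(';', 1)[0]
data = json.loads(data)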

JSON is, after all, a subset of JavaScript.
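
Because the assigned value is valid JSON, the standard library's json.JSONDecoder.raw_decode can parse it in place and stop where the literal ends, so there is no need to guess which semicolon closes the statement. The following Python 3 sketch assumes the right-hand side is strict JSON; the function name and sample script text are hypothetical.

```python
import json

MARKER = "bootstrapData['menuMonthWeeks']"


def decode_assigned_json(script_text, marker=MARKER):
    """Decode the JSON literal assigned to `marker` inside JavaScript text."""
    start = script_text.index(marker) + len(marker)
    # Move past the '=' sign and any whitespace before the literal.
    start = script_text.index("=", start) + 1
    while script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value and reports where it ended,
    # so trailing JavaScript after the literal is ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value


# Made-up script text; the data and the second statement contain '=' and ';'.
sample = "bootstrapData['menuMonthWeeks'] = [{\"name\": \"a=b; c\"}]; var x = 1;"
print(decode_assigned_json(sample))
```

If the marker is missing, str.index raises ValueError, which a caller may want to catch.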

Answer by Guy Gavriely

Without BeautifulSoup, one simple way is:

import urllib2
import json
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
    if "bootstrapData['menuMonthWeeks']" in line:
        # Split on the first '=' only, in case the JSON itself contains one.
        data = json.loads(line.split("=", 1)[1].strip().rstrip(';'))
        print data[0]["last_updated"]

Output:

2013-11-11T11:18:13.636

For a more generic approach, see "JavaScript parser in Python".

Answer by alvas

If you prefer not to use json (which is not recommended), you can try the following:

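import urllib2
import re

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
# Assumes the assignment sits on the 61st line of the page (index 60); fragile.
data = urllib2.urlopen(url).readlines()[60].partition('=')[2].strip()

foodlist = []

# Walk every double-quoted string and keep the ones that follow a "name" key.
prev = 'name'
for i in re.findall('"([^"]*)"', data):
    if "The Harvest Bar (THB)" in i or i == "description" or i == "start_date":
        prev = i
        continue
    if prev == 'name':
        # Drop the "THB - " prefix from Harvest Bar items.
        if i.startswith("THB - "):
            i = i[6:]
        foodlist.append(i)
    prev = i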

I guess this is what you'll ultimately need:

Orange Chicken Bowl
Roasted Veggie Pesto Pizza
Cheese Sandwich & Yogurt Tube
Steamed Peas
Peaches
Southwest Cheese Omelet
Potato Wedges
Cheesy Pesto Bread
Ham Deli Sandwich
Red Pepper Sticks
Strawberries
Hamburger
Cheeseburger
Potato Wedges
Chicken Minestrone Soup
Veggie Deli Sandwich
Baked Beans
Green Beans
Fruit Cocktail
Cheese Pizza
Pepperoni Pizza
Diced Chicken w/ Cornbread
Turkey Deli Sandwich
Celery Sticks
Blueberries
Cowboy Mac
BYO Asian Salad
Sunbutter Sandwich
Stir Fry Vegetables
Pineapple Tidbits
Enchilada Blanco
Sausage & Black Olive Pizza
Cheese Sandwich & Yogurt Tube
Southwest Black Beans
Red Pepper Sticks
Applesauce
BBQ Roasted Chicken.
Hummus Cup w/  Pita bread
Ham Deli Sandwich
Mashed potatoes w/ gravy
Celery Sticks
Kiwi
Popcorn Chicken Bowl
Tuna Salad w/  Pita Bread
Veggie Deli Sandwich
Corn Niblets
Blueberries
Cheese Pizza
Pepperoni Pizza
BYO Chef Salad
BYO Vegetarian Chef Salad
Turkey Deli Sandwich
Steamed Cauliflower
Banana, Whole
Bosco Sticks
Chicken Egg Roll & Chow Mein Noodles
Sunbutter Sandwich
California Blend Vegetables
Fresh Pears
Baked Mac & Cheese
Italian Dunker
Ham Deli Sandwich
Red Pepper Sticks
Pineapple Tidbits
Hamburger
Cheeseburger
Baked Fries
BYO Taco Salad
Veggie Deli Sandwich
Baked Beans
Coleslaw
Fresh Grapes
Cheese Pizza
Pepperoni Pizza
Diced Chicken w/ Cornbread
Turkey Deli Sandwich
Steamed Cauliflower
Fruit Cocktail
French Dip w/ Au Jus
Baked Fries
Turkey Noodle Soup
Sunbutter Sandwich
Green Beans
Warm Cinnamon Apples
Rotisserie Chicken
Mashed potatoes w/ gravy
Bacon Cheeseburger Pizza
Cheese Sandwich & Yogurt Tube
Steamed Peas
Apple Wedges
Turkey Chili 
Cornbread Muffins
BYO Chef Salad
BYO Vegetarian Chef Salad
Ham Deli Sandwich
Celery Sticks
Fresh Pears
Beef, Bean & Red Chili Burrito
Popcorn Chicken & Breadstick
Veggie Deli Sandwich
California Blend Vegetables
Strawberries
Cheese Pizza
Pepperoni Pizza
Hummus Cup w/  Pita bread
Turkey Deli Sandwich
Green Beans
Orange Wedges
Bosco Sticks
Cheesy Bean Soft Taco Roll Up
Sunbutter Sandwich
Pinto Bean Cup
Baby Carrots
Blueberries

With json:

import urllib2
import json
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
    if "bootstrapData['menuMonthWeeks']" in line:
        data = json.loads(line.split("=")[1].strip('\n;'))
        print data[0]["name"]
        # Stop only after the matching line has been handled.
        break

Answer by chad

I realize this is about four years later, but nutrislice (at least now) has an API you can get JSON from directly. Your kid's lunch from a couple of days ago: http://dcsd.nutrislice.com/menu/api/digest/school/meadow-view/menu-type/lunch/date/2018/03/14/
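
A Python 3 sketch of building that URL for an arbitrary day follows. The path layout is copied from the example link above; treating the school, menu type and date as substitutable segments is an assumption, as are the function names, and nothing is assumed here about the structure of the JSON that comes back.

```python
import datetime
import json
from urllib.request import urlopen

# URL layout copied from the example link above; treating the school,
# menu type and date as substitutable path segments is an assumption.
DIGEST_URL = (
    "http://dcsd.nutrislice.com/menu/api/digest/school/{school}"
    "/menu-type/{menu_type}/date/{day:%Y/%m/%d}/"
)


def digest_url(day, school="meadow-view", menu_type="lunch"):
    """Build the digest URL for a datetime.date."""
    return DIGEST_URL.format(school=school, menu_type=menu_type, day=day)


def fetch_digest(day):
    """Fetch and decode the JSON digest (response structure not assumed)."""
    with urlopen(digest_url(day)) as response:
        return json.load(response)


print(digest_url(datetime.date(2018, 3, 14)))
```

Calling fetch_digest(datetime.date.today()) would request the current day's digest, provided the endpoint still exists in this form.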