Bob blog


Python Crawler (4): Using selenium and a Headless Browser

May 11, 2020 - by Bo, 0 comments, 2503 views

python crawler spider

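When you only need to scrape static pages, fetching them with requests and parsing the HTML is convenient. If the page calls an API while it loads (you can see this in the Network tab of the browser's dev tools), you can also send requests to that API directly.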
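
As a rough illustration of the "parse a static page" step, here is a minimal sketch that pulls the link texts out of an HTML string. To keep it self-contained it uses the standard library's html.parser rather than a third-party parser, and the sample markup is made up for the example. In a real crawler the HTML would come from something like requests.get(url).text.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the visible text of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._in_link:
            self.texts[-1] += data


def extract_link_texts(html):
    """Return the non-empty link texts found in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    return [text.strip() for text in parser.texts if text.strip()]


# Hypothetical static page; a real one would be downloaded first.
SAMPLE_HTML = """
<ul>
  <li><a href="/a/1">First article</a></li>
  <li><a href="/a/2">Second article</a></li>
</ul>
"""

print(extract_link_texts(SAMPLE_HTML))
```

This only works when the links are already present in the HTML the server returns, which is exactly the static-page case.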

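That stops being enough once content is loaded asynchronously via AJAX or the page has to execute JavaScript. This is where selenium comes in. selenium is an essential tool for anyone doing web automation testing: it drives a real browser and simulates user actions such as clicking and scrolling. (I won't go into how selenium works or how to use it here; I plan to write a separate series on that.) Because it drives a real browser, the page is loaded with all of its elements and the effects of any JavaScript show up in the page, so we can locate the information we need by element attributes or text.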

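However, a regular browser such as Chrome also renders the page on screen. A crawler doesn't care what the page looks like, so there is no reason to spend time and resources on that. A headless browser skips the UI, which saves time and lowers resource usage, and that makes it practical to run several crawler processes in parallel.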

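A few years ago the most widely used headless browser was PhantomJS, but selenium no longer recommends it. Chrome or Firefox in headless mode work too. The sections below look at PhantomJS and headless Chrome in turn.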

PhantomJS

It can be downloaded here: https://phantomjs.org/download.html , although it is no longer being updated.

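Say we want to get the three most-clicked articles on NetEase News. The code looks something like this, with comments explaining each step.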

from selenium import webdriver  # import the selenium package
from selenium.webdriver.common.by import By

# Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier.
driver = webdriver.PhantomJS("./phantomjs")  # create the driver, pointing at the PhantomJS executable
driver.get("https://news.163.com/rank/")  # open the page
# Titles of the three most-clicked articles, located with XPath
article_title_elements = driver.find_elements(By.XPATH, "//div[@class='area areabg1']/div[2]//div[@class='tabContents active']//td[@class='red']/a")
# Click counts of the three most-clicked articles, located with XPath
article_view_elements = driver.find_elements(By.XPATH, "//div[@class='area areabg1']/div[2]//div[@class='tabContents active']//td[@class='red']/following-sibling::td")
# Print each title followed by its click count
for title_element, view_element in zip(article_title_elements, article_view_elements):
    print(title_element.text)
    print(view_element.text)
driver.quit()  # shut down the driver
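
To make it easier to see what the two XPath expressions are aiming at, here is a small offline sketch. The table fragment is a hypothetical simplification of the ranking page, not its real markup, and it uses the standard library's ElementTree instead of a browser. ElementTree does not support the following-sibling axis, so the sibling cell is picked out by walking each row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment of a ranking table (an assumption,
# not the real page source). The top entries use class="red".
SAMPLE = """
<div class="tabContents active">
  <table>
    <tr><td class="red"><a href="#">First headline</a></td><td>12345</td></tr>
    <tr><td class="red"><a href="#">Second headline</a></td><td>6789</td></tr>
    <tr><td class="gray"><a href="#">Other headline</a></td><td>100</td></tr>
  </table>
</div>
"""


def ranked_rows(root):
    """Return (title, clicks) pairs for rows whose first cell has class 'red'."""
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        if len(cells) < 2 or cells[0].get("class") != "red":
            continue
        link = cells[0].find("a")  # same target as //td[@class='red']/a
        if link is not None:
            # cells[1] plays the role of following-sibling::td here
            rows.append((link.text, cells[1].text))
    return rows


root = ET.fromstring(SAMPLE.strip())
for title, clicks in ranked_rows(root):
    print(title, clicks)
```

One thing worth keeping in mind: following-sibling::td matches every later td in the same row, not just the next one. If a row has more than one cell after the title, the title list and the click-count list will have different lengths, and pairing them by position will no longer line up.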

The script prints each title followed by its click count.

Headless Chrome

When you use PhantomJS you will see this warning: "Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead". In other words, selenium recommends headless Chrome or Firefox as the replacement.

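First download chromedriver and, as before, make it available through your PATH environment variable. You can download it here: "https://sites.google.com/a/chromium.org/chromedriver". If that site is unreachable, it is also available from the Taobao mirror: "http://npm.taobao.org/mirrors/chromedriver/".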
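
A quick way to check that the driver really is reachable through PATH before starting selenium is sketched below. The helper name find_driver is made up for this example; shutil.which is the standard-library call that does the lookup.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of a driver executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_driver()
if path:
    print("chromedriver found at", path)
else:
    print("chromedriver was not found on PATH")
```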

With headless Chrome only the code that starts the driver changes, because it has to request headless mode. Everything else can stay the same.

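from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome without a visible window
chrome_options.add_argument('--disable-gpu')
# Use options=; the older chrome_options= keyword is deprecated and was removed in Selenium 4
driver = webdriver.Chrome(options=chrome_options)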
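
Since only the driver setup differs between browsers, one possible way to organize the script is to keep the scraping logic in a function that accepts any driver. The sketch below does that. The function names are made up for this example, the Firefox variant assumes geckodriver is installed and on PATH, and the XPath strings assume the page structure used in the PhantomJS example. selenium is imported inside the factory functions so the scraping function can be exercised without a browser.

```python
TITLE_XPATH = ("//div[@class='area areabg1']/div[2]"
               "//div[@class='tabContents active']//td[@class='red']/a")
VIEWS_XPATH = ("//div[@class='area areabg1']/div[2]"
               "//div[@class='tabContents active']//td[@class='red']"
               "/following-sibling::td")


def make_headless_chrome():
    """Start Chrome in headless mode (needs chromedriver on PATH)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    return webdriver.Chrome(options=options)


def make_headless_firefox():
    """Start Firefox in headless mode (assumes geckodriver on PATH)."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def top_articles(driver, url):
    """Open url and return a list of (title, click count) pairs."""
    driver.get(url)
    # "xpath" is the string value of selenium's By.XPATH constant.
    titles = driver.find_elements("xpath", TITLE_XPATH)
    views = driver.find_elements("xpath", VIEWS_XPATH)
    return [(t.text, v.text) for t, v in zip(titles, views)]


def main(make_driver=make_headless_chrome):
    driver = make_driver()
    try:
        for title, views in top_articles(driver, "https://news.163.com/rank/"):
            print(title, views)
    finally:
        driver.quit()  # always release the browser, even if scraping fails
```

Calling main() runs the crawl with headless Chrome, and main(make_headless_firefox) does the same with Firefox. The try/finally makes sure the browser process is shut down even when a lookup raises.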
