Simulated Login · Python Web Scraping

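There are two ways to log in with Scrapy:

1. Send requests with an existing cookie. This fits when:
   (1) the cookie takes a long time to expire, which is common on loosely maintained sites;
   (2) all the data can be fetched before the cookie expires;
   (3) another program supplies the cookie, for example selenium logs in and saves the cookie locally, and Scrapy reads that local cookie before sending requests.
2. Find the login URL and send a POST request so that the session cookie is stored. Example: logging in to GitHub.

**1. Log in by sending the cookie directly**

(1) Create the crawler project

```shell
> scrapy startproject git
> cd git
> scrapy genspider git1 github.com
```

(2) Configure `settings.py`

```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```

(3) Log in to GitHub manually and copy the cookie

![](https://img.kancloud.cn/7a/ae/7aae9e676664575c584101b6874be222_1303x445.jpg)

(4) Override the `start_requests` method

```python
import scrapy


class Git1Spider(scrapy.Spider):
    name = 'git1'
    allowed_domains = ['github.com']
    # Note: the requested URL should be https://github.com/<your GitHub username>
    start_urls = ['https://github.com/your-github-username']

    def parse(self, response):
        # Before login the page title is "GitHub · GitHub";
        # after a successful login it is "<username> · GitHub".
        # Seeing "<username> · GitHub" printed means the login worked.
        print(response.xpath('/html/head/title/text()').extract_first())

    def start_requests(self):
        """Override the default so the first request carries the cookie."""
        url = self.start_urls[0]
        cookie = '_ga=GA1.2.534025100(cookie is too long, the rest is omitted here)...3D'
        # 1. Convert the cookie string to a dict.
        #    Split on the first '=' only, because cookie values may contain '=',
        #    and strip the space that follows each ';'.
        cookies = {}
        for pair in cookie.split(';'):
            if '=' in pair:
                key, value = pair.strip().split('=', 1)
                cookies[key] = value
        # 2. Send the request with the cookies attached
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            cookies=cookies
        )
```

**2. Find the login URL and send a POST request with the required parameters**

The analysis of the login flow is omitted here; only the Scrapy code for sending the POST request is shown below.

```python
import scrapy


class Git2Spider(scrapy.Spider):
    name = 'git2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # 1. Extract every parameter the login needs
        #    (left empty here because the analysis is omitted)
        post_data = {}
        # 2. Submit the request to the login URL.
        #    A POST request can be sent with scrapy.FormRequest
        #    or with scrapy.Request(url, method='POST')
        yield scrapy.FormRequest(
            url='https://github.com/session',  # address the GitHub form is submitted to
            callback=self.login_github,        # parse function called after login
            formdata=post_data                 # parameters required for the login
        )

    def login_github(self, response):
        # Minimal placeholder callback: print the title of the page returned
        # after the POST so the result of the login can be checked.
        print(response.xpath('/html/head/title/text()').extract_first())
```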
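
The POST example leaves `post_data` empty because the form analysis is skipped. As a rough sketch of how that dictionary might be filled, the snippet below collects every named `<input>` from the first `<form>` of a page, using only the standard library, and then overwrites the credential fields. The names `LoginFormParser`, `build_post_data` and `SAMPLE_HTML` are hypothetical, and the field names in the sample (`authenticity_token`, `login`, `password`, `commit`) are assumptions for illustration; they have not been checked against the real GitHub login page.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect name/value pairs of <input> elements inside the first <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False   # True while we are inside the first form
        self.done = False      # True once the first form has been closed
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attr = dict(attrs)
            name = attr.get("name")
            if name:
                # Inputs without a value attribute are stored as empty strings
                self.fields[name] = attr.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True


def build_post_data(html, username, password):
    """Return the form fields with the credentials filled in.

    Assumes the credential inputs are named "login" and "password".
    """
    parser = LoginFormParser()
    parser.feed(html)
    post_data = dict(parser.fields)
    post_data["login"] = username
    post_data["password"] = password
    return post_data


# Hypothetical login page used only to exercise the helper
SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

if __name__ == "__main__":
    print(build_post_data(SAMPLE_HTML, "user", "secret"))
```

Inside a spider, the same helper could be called as `build_post_data(response.text, ...)` in `parse` before yielding the `FormRequest`. Scrapy also offers `scrapy.FormRequest.from_response(response, formdata={...}, callback=...)`, which pre-fills the form fields found in the response and may be the simpler choice when the login form is plain HTML.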