Create a CrawlSpider spider file:
scrapy genspider -t crawl <spider_file_name> <domain_to_crawl>

scrapy genspider -t crawl read https://www.dushu.com/book/1206.html

LinkExtractor (link extractor): through it, the Spider knows which links to extract from the crawled pages, and each extracted link automatically generates a Request object.
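To see what the link extractor matches on its own, it can be tried out in scrapy shell; this is an illustrative sketch, not part of the original walkthrough, using the same allow pattern as the rule further below:

# run inside: scrapy shell https://www.dushu.com/book/1206_1.html
from scrapy.linkextractors import LinkExtractor

# allow is a regex tested against each URL found on the page;
# only the pagination links /book/1206_2.html, /book/1206_3.html, ... are kept
extractor = LinkExtractor(allow=r'/book/1206_\d+\.html')
for link in extractor.extract_links(response):  # `response` is provided by scrapy shell
    print(link.url)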
The spider (read.py):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_readbook_41.items import ScarpyReadbook41Item  # adjust to your project's package name


class ReadSpider(CrawlSpider):
    name = 'read'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1206_1.html']

    # LinkExtractor: the Spider uses it to decide which links to extract from the crawled
    # pages; each extracted link automatically generates a Request object
    rules = (
        Rule(LinkExtractor(allow=r'/book/1206_\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        name_list = response.xpath('//div[@class="book-info"]//img/@alt')
        src_list = response.xpath('//div[@class="book-info"]//img/@data-original')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            src = src_list[i].extract()
            book = ScarpyReadbook41Item(name=name, src=src)
            yield book
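parse_item instantiates ScarpyReadbook41Item, whose definition is not shown in the post; a minimal items.py matching the two fields used above would look like the sketch below (the field names are taken from the spider, the rest is assumed):

# items.py -- minimal sketch inferred from the fields used in parse_item
import scrapy


class ScarpyReadbook41Item(scrapy.Item):
    name = scrapy.Field()  # book title, taken from the <img> alt attribute
    src = scrapy.Field()   # cover image URL, taken from the data-original attribute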
Enable the pipeline and write the items to a file (pipelines.py):

class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
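The post says to enable the pipeline but does not show the settings change; registering it in settings.py normally looks like the sketch below, where the dotted path is an assumption that must match the actual project package (guessed here as scrapy_readbook_41):

# settings.py -- assumed registration; the dotted path must match your project's package name
ITEM_PIPELINES = {
    'scrapy_readbook_41.pipelines.ScarpyReadbook41Pipeline': 300,  # 300 is the usual priority value
}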
After running the spider, the first page's data was missing: start_urls has to include _1, otherwise the first page is never read.

start_urls = ['https://www.dushu.com/book/1206_1.html']
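With the pipeline enabled and start_urls corrected, the spider is run with the usual Scrapy command, and the results end up in the books.json file opened by the pipeline:

scrapy crawl read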