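Getting Started with Scrapy

Contents
1. Scrapy Overview
2. Setting Up Scrapy in PyCharm
3. The Four Steps of Using Scrapy
4. Scrapy Starter Example
4.1. Define the Target
4.2. Build the Spider
4.3. Store the Data
4.4. Run the Spider

1. Scrapy Overview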
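Scrapy is an application framework written in Python for crawling websites and extracting structured data. It is mainly used for data mining, information processing, data storage, and automated testing. With Scrapy, a working crawler needs only a small amount of code.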
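The Scrapy architecture has five main components:
Scrapy Engine: the core of the framework. It handles communication and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: a priority queue of page URLs. It receives the requests sent by the engine, orders and schedules them, and hands them back when the engine asks for them.
Downloader: downloads every Request sent by the engine and returns the resulting Response to the engine, which passes it on to the Spider.
Spider: the user-defined crawler that extracts information (Items) from specific pages. It processes every Response, extracts data, and submits the URLs that need following to the engine, which sends them back to the Scheduler.
Item Pipeline: receives the Items produced by the Spider and post-processes them (detailed analysis, filtering, persistent storage, and so on).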
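The division of labour among the five components can be pictured with a toy, single-threaded loop. This is only a sketch of the data flow, not Scrapy's implementation: the names `fake_download`, `toy_spider`, `toy_pipeline`, `run_engine` and the `example.test` URLs are invented for illustration, and real Scrapy runs the downloader asynchronously.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (data on the page, links to follow).
FAKE_PAGES = {
    "http://example.test/1": ("item-1", ["http://example.test/2"]),
    "http://example.test/2": ("item-2", []),
}


def fake_download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    return FAKE_PAGES[url]


def toy_spider(response):
    """Spider: yield extracted items and the URLs that should be followed."""
    data, links = response
    yield {"name": data}  # an item (dict)
    for link in links:
        yield link  # a follow-up request (str)


def toy_pipeline(item, store):
    """Item pipeline: post-process and persist the item."""
    store.append(item)
    return item


def run_engine(start_urls):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)
    store = []
    while scheduler:
        url = scheduler.popleft()
        response = fake_download(url)
        for result in toy_spider(response):
            if isinstance(result, dict):
                toy_pipeline(result, store)
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return store


items = run_engine(["http://example.test/1"])
print(items)
```

The engine never parses or stores anything itself; it only routes requests, responses, and items between the other four components.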
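Other components:
Downloader Middlewares: components that let you customize and extend the download behaviour.
Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider.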
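As a rough illustration of what a downloader middleware looks like, the sketch below sets a User-Agent header on each outgoing request. The class name `RandomUserAgentMiddleware`, the agent strings, and the `FakeRequest` stand-in are assumptions made for this example; only the `process_request(self, request, spider)` hook is Scrapy's interface.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a User-Agent per request."""

    # Illustrative values only; not taken from the original project.
    USER_AGENTS = ["Mozilla/5.0 (demo-agent-a)", "Mozilla/5.0 (demo-agent-b)"]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to continue processing this request.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
result = RandomUserAgentMiddleware().process_request(request, spider=None)
print(request.headers["User-Agent"])
```

In a real project, such a class would typically live in middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.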
Official documentation: https://docs.scrapy.org
Tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html
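2. Setting Up Scrapy in PyCharm
1) Create a new crawler project named ScrapyDemo.
2) Install the required modules in the Terminal.
Scrapy is built on Twisted, an asynchronous networking framework, which is what lets the crawler download pages quickly.
pip install scrapy
pip install twisted
If the installation fails with:
ERROR: Failed building wheel for twisted
error: Microsoft Visual C++ 14.0 or greater is required
then download the matching whl file and install it manually.
Python extension package (whl) downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#
Use Ctrl+F to find the whl file you need, then download the version that matches your Python installation.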
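Which file "matches" is determined by the tags in the wheel's filename (Python version, ABI, platform). A minimal sketch for comparing them with the running interpreter, assuming CPython; `parse_wheel_tags`, `current_python_tag`, and `current_bits` are hypothetical helpers written for this example, and the filename is the one used later in this section.

```python
import struct
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into (name, version, python, abi, platform)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Layout: name-version[-build]-python-abi-platform; the last three are tags.
    return parts[0], parts[1], parts[-3], parts[-2], parts[-1]


def current_python_tag():
    """CPython tag of the running interpreter, e.g. 'cp38' for Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def current_bits():
    """Pointer size of the running interpreter: 32 or 64."""
    return struct.calcsize("P") * 8


name, version, py_tag, abi_tag, plat_tag = parse_wheel_tags(
    "Twisted-20.3.0-cp38-cp38m-win_amd64.whl"
)
print(
    f"wheel targets {py_tag}/{plat_tag}; "
    f"this interpreter is {current_python_tag()}, {current_bits()}-bit"
)
```

If the Python tag or platform differs from what the interpreter reports, pip will refuse the file. Running `pip debug --verbose` also prints the tags pip accepts on the current machine.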
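Install it with:
pip install <absolute path to the whl file>
For example:
pip install F:\PyWhl\Twisted-20.3.0-cp38-cp38m-win_amd64.whl
3) Create the crawler project ScrapyDemo from the Terminal:
scrapy startproject ScrapyDemo
This generates the project directory structure.
4) In the spiders folder, create the core spider file SpiderDemo.py.
Final project structure, with notes:
ScrapyDemo/                  crawler project
├── ScrapyDemo/              project package
│   ├── spiders/             spider files
│   │   ├── __init__.py
│   │   └── SpiderDemo.py    custom core spider file
│   ├── __init__.py
│   ├── items.py             item definitions (the target data)
│   ├── middlewares.py       middlewares, proxies
│   ├── pipelines.py         pipelines that process the scraped data
│   └── settings.py          crawler settings
└── scrapy.cfg               project configuration file

3. The Four Steps of Using Scrapy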
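1) Define the target
Identify the website to crawl.
Define the item fields to scrape in items.py.
Each field is declared as: field_name = scrapy.Field()
2) Build the spider
Write the core spider file, spiders/SpiderDemo.py.
3) Store the data
Design pipelines to store the scraped content (settings.py, pipelines.py).
4) Run the spider
Option 1: run from the Terminal (when using cmd, first switch to the project root directory):
scrapy crawl dangdang
Here dangdang is the spider name. To change location in cmd:
Switch drive: F:
Change directory: cd A/B/...
Option 2: run a file from PyCharm.
Create a run file, run.py, in the project directory, then right-click it and choose Run.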
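Because `scrapy crawl` has to be started from the project root (the directory holding scrapy.cfg), a run script can locate that directory instead of relying on the current working directory. The following is a sketch under that assumption; `find_project_root` is a hypothetical helper, and the temporary directory tree only mimics the layout shown in section 2.

```python
import tempfile
from pathlib import Path


def find_project_root(start):
    """Walk upwards from `start` until a directory with scrapy.cfg is found."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / "scrapy.cfg").is_file():
            return candidate
    return None


# Demonstration on a throwaway tree that mimics the project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ScrapyDemo"
    spiders = root / "ScrapyDemo" / "spiders"
    spiders.mkdir(parents=True)
    (root / "scrapy.cfg").write_text("[settings]\ndefault = ScrapyDemo.settings\n")
    found = find_project_root(spiders)
    found_ok = found == root.resolve()
    print(found_ok)
```

A run script could call `os.chdir()` on the returned path before launching the crawl, so that it works no matter where it is started from.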
4. Scrapy Starter Example
4.1. Define the Target
1) Scrape phone listings from Dangdang: https://category.dangdang.com/cid4004279.html
2) Define the item fields to scrape in items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# 1) Define the target
# 1.2) Define the item fields to scrape
class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Product name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()

4.2. Build the Spider
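The spider in this section walks through Dangdang's listing pages, whose URLs differ only by a `pg{n}-` prefix from page 2 onwards (see the pagination notes in SpiderDemo.py). A small sketch of that URL pattern; `page_url` is a hypothetical helper that is not part of the original project, and the ten-page range simply mirrors the spider's own cutoff.

```python
BASE = "https://category.dangdang.com"
CATEGORY = "cid4004279"


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return f"{BASE}/{CATEGORY}.html"
    return f"{BASE}/pg{page}-{CATEGORY}.html"


urls = [page_url(p) for p in range(1, 11)]
print(urls[0])
print(urls[1])
```

Keeping the pattern in one function makes it easy to check the generated URLs before sending any requests.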
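SpiderDemo.py
# Starter example

# 1) Define the target
# 1.1) Scrape phone listings from Dangdang: https://category.dangdang.com/cid4004279.html

# 2) Build the spider
import scrapy
from scrapy.http import Response
from ..items import ScrapyDemoItem


class SpiderDemo(scrapy.Spider):
    # Spider name: the value used when running the spider
    name = "dangdang"
    # Domains the spider is allowed to visit
    allowed_domains = ["category.dangdang.com"]
    # Start URLs: the first pages requested
    start_urls = ["https://category.dangdang.com/cid4004279.html"]
    # Pagination analysis
    # Page 1: https://category.dangdang.com/cid4004279.html
    # Page 2: https://category.dangdang.com/pg2-cid4004279.html
    # Page 3: https://category.dangdang.com/pg3-cid4004279.html
    # ......
    page = 1

    # Handle the response to each request
    def parse(self, response: Response):
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # Product name
            name = li.xpath('.//img/@alt').extract_first()
            print(name)
            # Product price
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            print(price)
            # Wrap the data in an item and yield it; Scrapy passes each
            # yielded item to the pipelines, then resumes this generator
            demo = ScrapyDemoItem(name=name, price=price)
            yield demo

        # Every page is parsed the same way, so request the next page
        # and let parse() handle it again
        if self.page < 10:
            self.page += 1
            url = rf"https://category.dangdang.com/pg{str(self.page)}-cid4004279.html"
            # scrapy.Request is Scrapy's request object; yielding it
            # schedules the request with parse() as the callback
            yield scrapy.Request(url=url, callback=self.parse)

Supplement: attributes and methods of the Response object
1) Get the response body as a string: response.text
2) Get the response body as bytes: response.body
3) Parse the response content: response.xpath()

4.3. Store the Data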
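The storage pipeline in this section appends each item to a CSV file. When the file is reopened in append mode for every item, the header row should be written only once, otherwise it repeats before every data row. Below is a sketch of that pattern outside Scrapy; `append_row`, the temporary file, and the sample rows are illustrative assumptions rather than project code or real scraped data.

```python
import csv
import os
import tempfile

FIELDS = ["name", "price"]


def append_row(path, row):
    """Append one row; write the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scrapy_test.csv")
    # Made-up sample values, used only to exercise the function.
    append_row(path, {"name": "sample phone A", "price": "1999.00"})
    append_row(path, {"name": "sample phone B", "price": "2999.00"})
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(rows)
```

An alternative design is to open the file once when the spider starts and close it when the spider finishes, which avoids reopening the file for every item.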
settings.py
# Scrapy settings for ScrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# 3) Store the data
# 3.1) Configure the crawler, then enable and register the pipelines

# Crawler project name
BOT_NAME = "ScrapyDemo"

SPIDER_MODULES = ["ScrapyDemo.spiders"]
NEWSPIDER_MODULE = "ScrapyDemo.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "ScrapyDemo (+http://www.yourdomain.com)"
# User-Agent setting
USER_AGENT = "Mozilla/5.0"

# Obey robots.txt rules
# Whether to obey robots.txt. The generated project template sets this to
# True; it is set to False here to avoid some crawl restrictions.
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay in seconds, used to limit the crawl rate
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Whether cookies are enabled (default: True)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}
# Request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "ScrapyDemo.middlewares.ScrapydemoSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "ScrapyDemo.middlewares.ScrapydemoDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "ScrapyDemo.pipelines.ScrapydemoPipeline": 300,
#}

# Project pipelines
ITEM_PIPELINES = {
    # Several pipelines can be registered. The number is the priority,
    # conventionally in the 0-1000 range; lower values run first.
    # Each dotted path must start with this project's package name
    # (ScrapyDemo); a path under a package that does not exist, such as
    # scrapy_dangdang, raises ModuleNotFoundError when the crawl starts.
    # Print each scraped item
    "ScrapyDemo.pipelines.ScrapyDemoPipeline": 300,
    # Save the data
    "ScrapyDemo.pipelines.ScrapyDemoSinkPiepline": 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

# Log level (default: DEBUG) and the path of the log file
LOG_LEVEL = "INFO"
# LOG_FILE = "spider.log"

pipelines.py
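# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# 3) Store the data
# 3.2) Use pipelines to store the data
# A pipeline only runs if it is enabled in settings.py

import os
import csv


# Print each scraped item
class ScrapyDemoPipeline:
    # Each item is handed to the pipeline, which prints it
    def process_item(self, item, spider):
        print(item)
        return item


# Save the data
class ScrapyDemoSinkPiepline:
    # item is the ScrapyDemoItem yielded by the spider (dict-like)
    def process_item(self, item, spider):
        path = r"C:\Users\cc\Desktop\scrapy_test.csv"
        # Write the header only when the file is new or empty, so that it
        # is not repeated before every row
        need_header = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="", encoding="utf-8") as csvfile:
            # Column headers
            fields = ["name", "price"]
            writer = csv.DictWriter(csvfile, fieldnames=fields)
            if need_header:
                writer.writeheader()
            # Write the row
            writer.writerow(ItemAdapter(item).asdict())
        # Return the item so that any later pipeline still receives it
        return item

4.4. Run the Spider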
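run.py
# 4) Run the spider
from scrapy import cmdline

cmdline.execute("scrapy crawl dangdang".split())

With the other files left unchanged, running this example reported the following errors:
ERROR: Twisted-20.3.0-cp38-cp38m-win_amd64.whl is not a supported wheel on this platform
builtins.ModuleNotFoundError: No module named 'scrapy_dangdang'
My first guess was a Twisted version compatibility problem, but the two messages have separate causes. The first comes from pip and means the wheel's tags (cp38, win_amd64) do not match the Python interpreter in use, so a wheel built for the installed Python version and architecture is needed. The second means Scrapy tried to import a package named scrapy_dangdang, which this project does not contain. It appears when ITEM_PIPELINES in settings.py points at scrapy_dangdang.pipelines; the entries need the project's own package name, for example ScrapyDemo.pipelines.ScrapyDemoPipeline.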
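A ModuleNotFoundError of this kind can be caught before starting a crawl by checking that every dotted path in a setting is importable. The sketch below does this with the standard library only; `check_dotted_path` is a hypothetical helper, and the two candidate paths are stand-ins chosen so the example runs anywhere (in the project they would be the keys of ITEM_PIPELINES).

```python
import importlib


def check_dotted_path(path):
    """Return None if 'pkg.module.Name' is importable, else an error string."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        return f"{path}: not a dotted path"
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass
        return f"{path}: {exc}"
    if not hasattr(module, attr):
        return f"{path}: module '{module_name}' has no attribute '{attr}'"
    return None


# Stand-in paths for illustration only.
candidates = {
    "csv.DictWriter": 300,  # importable
    "scrapy_dangdang_missing_pkg.pipelines.Demo": 301,  # hypothetical, missing
}
problems = [msg for msg in map(check_dotted_path, candidates) if msg]
print(problems)
```

Run from the project root, a check like this reports a misspelled package or class name immediately instead of failing partway through crawler start-up.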