

news 2025/10/21 8:05:32

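Python Web Crawling for Beginners (Hands-On)

1. Analyzing the target site

2. Crawling the blog

Getting the links of all h2 titles on the blog

Identify the target and inspect the page source

Implementation

Getting the links of all h2 titles on the blog: fetch the home page with requests, parse it with BeautifulSoup, and print the link and text of every h2 node whose class is entry-title.

```python
import requests
from bs4 import BeautifulSoup

url = "http://www.crazyant.net"

# Send the request and fetch the full page content
r = requests.get(url)
if r.status_code != 200:
    raise Exception("request failed")  # abort on a non-200 response
html_doc = r.text

# Parse the HTML and extract the information we need
soup = BeautifulSoup(html_doc, "html.parser")

# Every post title is an <h2 class="entry-title"> wrapping an <a>
h2_nodes = soup.find_all("h2", class_="entry-title")

for h2_node in h2_nodes:
    link = h2_node.find("a")
    print(link["href"], link.get_text())
```

Crawling all blog posts via their titles

Crawling all blog posts: start from the root URL, record the title of each page, and queue every link that matches the post URL pattern.

```python
import re

import requests
from bs4 import BeautifulSoup

from utils import url_manager

root_url = "http://www.crazyant.net"

# Match URLs that start with http://www.crazyant.net/ followed by digits
# and end with .html
pattern = r"^http://www\.crazyant\.net/\d+\.html$"

# Add root_url to the set of URLs to crawl
urls = url_manager.UrlManager()
urls.add_new_url(root_url)

# Fetch every page and save its URL and title to a file
fout = open("craw_all_pages.txt", "w", encoding="utf-8")
while urls.has_new_url():
    curr_url = urls.get_url()
    r = requests.get(curr_url, timeout=2)
    if r.status_code != 200:
        print("request failed", curr_url)
        continue
    soup = BeautifulSoup(r.text, "html.parser")
    title = soup.title.string  # page title
    fout.write("%s\t%s\n" % (curr_url, title))  # write to the file
    fout.flush()  # flush the buffer so the line reaches the file immediately
    print("success: %s, %s, %d" % (curr_url, title, len(urls.new_urls)))

    # Collect all links on the page and queue the ones that match
    links = soup.find_all("a")
    for link in links:
        href = link.get("href")
        if href is None:
            continue
        # re.match returns a match object, or None when there is no match
        if re.match(pattern, href):
            urls.add_new_url(href)
fout.close()
```

Results

Each crawled page is written to craw_all_pages.txt as one line holding the URL and the page title separated by a tab, and a progress line with the number of URLs still queued is printed to the console.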
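The crawl script imports url_manager from a local utils package that the post never shows. The following is only a minimal sketch of what such a class might look like, assuming nothing beyond the four members the script actually uses (add_new_url, has_new_url, get_url, and the new_urls collection). The old_urls set and the deduplication behavior are assumptions made for illustration, not the author's code.

```python
class UrlManager:
    """Hypothetical stand-in for utils.url_manager.UrlManager.

    The original module is not shown in the post; this sketch only
    covers the members the crawl script calls.
    """

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out (assumed bookkeeping)

    def add_new_url(self, url):
        # Skip empty values and anything queued or handed out before
        if not url:
            return
        if url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def has_new_url(self):
        # True while at least one URL is still waiting
        return len(self.new_urls) > 0

    def get_url(self):
        # Take an arbitrary pending URL and remember it as visited
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Saved as utils/url_manager.py next to the script, a class like this would satisfy the import. Tracking handed-out URLs in a second set is what would keep the while loop from revisiting pages that link to each other.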

