Python Experiment 2


Published: 2023-12-11 13:04


Python web crawler:

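Write a crawler in Python that scrapes the information for every novel on the main listing page of Qidian (起点中文网) and saves it to an Excel spreadsheet.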

Analysis:

Link to the Qidian main listing page:

https://www.qidian.com/all/page1/

The crawl covers page1 through page5.
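The five listing URLs differ only in the trailing page number, so they can be generated rather than typed out. The following is a minimal sketch, assuming the `/all/page<N>/` pattern shown above holds for every page in the range; `BASE_URL` and `build_page_urls` are hypothetical names used only for illustration.

```python
# URL pattern taken from the link above; {} is replaced by the page number
BASE_URL = "https://www.qidian.com/all/page{}/"


def build_page_urls(first_page, last_page):
    """Return the listing URLs for first_page..last_page, inclusive."""
    return [BASE_URL.format(n) for n in range(first_page, last_page + 1)]


urls = build_page_urls(1, 5)
for url in urls:
    print(url)
```

Keeping the range in one helper makes it easy to widen or narrow the crawl later without touching the scraping logic.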

Viewing the page source shows that the novel information sits under the <ul class="all-img-list cf"> tag.

Markup for a single novel:

Implementation approach:

Iterate over the five pages and use XPath to select everything under each page's <ul class="all-img-list cf">.

Then iterate over each <li> inside the <ul> (one <li> per novel) and extract the relevant fields.
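The two steps above can be tried offline before touching the live site. The sketch below runs the same kind of XPath queries against a small inline fragment. The fragment is hypothetical: it only mimics the nesting the expressions expect (a second `<div>` holding an `<h2>` and several `<p>` elements), and the real page's markup may differ. `SAMPLE_HTML` and `parse_listing` are illustrative names, not part of the original script.

```python
from lxml import etree

# Hypothetical markup that only mimics the nesting the XPath expressions
# expect; the real page's attributes and content may differ.
SAMPLE_HTML = """
<ul class="all-img-list cf">
  <li>
    <div><a href="//www.example.com/book/1/">cover</a></div>
    <div>
      <h2><a href="//www.example.com/book/1/">Sample Title</a></h2>
      <p><a>Sample Author</a><a>Fantasy</a><a>Epic</a><span>Ongoing</span></p>
      <p> A short introduction. </p>
    </div>
  </li>
</ul>
"""


def parse_listing(html_text):
    """Extract one dict per <li> under the listing <ul>."""
    selector = etree.HTML(html_text)
    books = []
    # Each <li> is one novel; the inner paths are relative to that <li>
    for li in selector.xpath('//ul[@class="all-img-list cf"]/li'):
        books.append({
            "title": li.xpath('div[2]/h2/a/text()')[0],
            "link": li.xpath('div[2]/h2/a/@href')[0],
            "author": li.xpath('div[2]/p[1]/a[1]/text()')[0],
            "intro": li.xpath('div[2]/p[2]/text()')[0].strip(),
        })
    return books


books = parse_listing(SAMPLE_HTML)
print(books)
```

Note that every `[0]` assumes the query matched at least one node; if the site's layout changes, these lookups raise `IndexError`, which is a quick signal that the paths need updating.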

Code:

from lxml import etree
import requests
import xlwt

# Rows collected from all pages; each row describes one novel
all_info_list = []
index = 0


def get(url, index):
    """Scrape one listing page; index is the number of novels already processed."""
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Select every <li> (one per novel) under the listing <ul>
    infos = selector.xpath('//ul[@class="all-img-list cf"]/li')
    for info in infos:
        title = info.xpath('div[2]/h2/a/text()')[0]
        # Link to the novel's detail page
        link = info.xpath('div[2]/h2/a/@href')[0]
        # The href has no scheme, so prepend one
        link = 'https:' + link
        author = info.xpath('div[2]/p[1]/a[1]/text()')[0]
        type1 = info.xpath('div[2]/p[1]/a[2]/text()')[0]
        type2 = info.xpath('div[2]/p[1]/a[3]/text()')[0]
        tag = type1 + '.' + type2
        complete = info.xpath('div[2]/p[1]/span/text()')[0]
        introduce = info.xpath('div[2]/p[2]/text()')[0].strip()

        # Fetch the word count from the novel's detail page
        html1 = requests.get(link)
        selector1 = etree.HTML(html1.text)
        # '字' is the unit suffix ("characters")
        word = selector1.xpath('//p[@class="count"]/em/text()')[0] + '字'

        # Most recently updated chapter
        new = info.xpath('div[2]/p[3]/span/a/text()')[0].strip()
        info_list = [title, link, author, tag, complete, introduce, word, new]
        all_info_list.append(info_list)
        # Progress bar; assumes 100 novels in total (5 pages of 20)
        index += 1
        print("\r", end="")
        print("Progress: {}%: [".format(index), "=" * (index - 1), ">", " " * (100 - index), "]", end="")


if __name__ == '__main__':
    # Listing page URLs, page1 through page5
    urls = ['https://www.qidian.com/all/page' + str(i) for i in range(1, 6)]
    for url in urls:
        get(url, index)
        index += 20  # running count: each page lists 20 novels
    # Excel header row
    header = ['title', 'link', 'author', 'tag', 'complete', 'introduce', 'word', 'new']
    # Create the workbook and sheet
    book = xlwt.Workbook(encoding='utf-8')
    sheet = book.add_sheet('Sheet1')
    for h in range(len(header)):
        # Write the header row
        sheet.write(0, h, header[h])

    # Write the scraped data, one novel per row, starting at row 1
    for i, row in enumerate(all_info_list, start=1):
        for j, data in enumerate(row):
            sheet.write(i, j, data)
    # Save the file
    book.save('xiaoshuo.xls')
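The script builds the detail-page URL by prepending `'https:'` to the scraped href, which works when the href starts with `//` (a scheme-less, protocol-relative reference). A slightly more general option is the standard library's `urljoin`, which resolves such references against the page they were found on and also handles hrefs that are already absolute. The sketch below uses a hypothetical href on `www.example.com` purely for illustration; it is not taken from the real site.

```python
from urllib.parse import urljoin

# URL of the listing page the href was scraped from (from the article)
PAGE_URL = "https://www.qidian.com/all/page1/"

# Hypothetical protocol-relative href, used only for illustration
href = "//www.example.com/book/1/"

# urljoin borrows the scheme from PAGE_URL for a scheme-less reference
absolute = urljoin(PAGE_URL, href)
print(absolute)

# An href that is already absolute is returned unchanged
unchanged = urljoin(PAGE_URL, "https://www.example.com/book/2/")
print(unchanged)
```

Whether this is worth the extra import depends on how uniform the site's links are; for hrefs that always start with `//`, the string concatenation in the script gives the same result.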

Screenshot of the run:

The resulting Excel spreadsheet:

101 rows in total (one header row plus 100 novels, 20 from each of the five pages).
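The row count follows from the sheet layout: row 0 holds the header and each novel takes one row after it. The sketch below reproduces that layout without any Excel library, using placeholder data in place of scraped rows. `to_cells` is a hypothetical helper; each tuple it yields is in the `(row, column, value)` order that the script passes to `sheet.write`.

```python
def to_cells(header, rows):
    """Yield (row_index, col_index, value) for the header and every data row."""
    # Header occupies row 0
    for col, name in enumerate(header):
        yield 0, col, name
    # Data rows start at row 1
    for row_idx, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            yield row_idx, col, value


header = ["title", "link"]
# Placeholder data standing in for 5 pages of 20 novels each
rows = [["title {}".format(n), "link {}".format(n)] for n in range(100)]

cells = list(to_cells(header, rows))
row_count = max(r for r, _, _ in cells) + 1
print(row_count)
```

Separating the layout from the writing step also makes it straightforward to check the output shape before saving a file.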
