Python 3.5
At a friend's request, I wrote a scraper for him that exports each page's data into a separate Excel file.
Target site: http://www.hs-bianma.com/hs_chapter_01.htm
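Note that the chapter number in the URL is zero-padded ("01"); when generating the chapter URLs in a loop, `str.zfill` keeps that format. A quick sketch (the padded format is inferred from the example URL above):

```python
# Build chapter URLs with zero-padded numbers, matching the '01' in the sample URL
base = 'http://www.hs-bianma.com/hs_chapter_{}.htm'
urls = [base.format(str(n).zfill(2)) for n in (1, 9, 10)]
print(urls[0])  # http://www.hs-bianma.com/hs_chapter_01.htm
print(urls[2])  # http://www.hs-bianma.com/hs_chapter_10.htm
```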
From my inspection, the pages store their data in plain <th>/<td> table cells, which is about as simple as it gets. Since Python has excellent modules for handling web pages, the whole script took about 30 minutes to write.
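The extraction itself can be tried offline on a toy fragment first. The HTML below is invented to mimic the site's table layout, not taken from it:

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking a <th>/<td> table like the site's
html = """
<table>
  <tr><th>Code</th><th>Description</th></tr>
  <tr><td>0101.21</td><td>Purebred breeding horses</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if installed
headers = [th.text.strip() for th in soup.find_all('th')]
cells = [td.text.strip() for td in soup.find_all('td')]
print(headers)  # ['Code', 'Description']
print(cells)    # ['0101.21', 'Purebred breeding horses']
```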
The site: (screenshot not preserved)

The page source: (screenshot not preserved)
Required modules: BeautifulSoup (bs4), urllib.request (standard library), xlwt
Without further ado, the code:
from bs4 import BeautifulSoup
from urllib import request
import xlwt

# Fetch data for chapters 1 through 98
value = 1
while value <= 98:
    value0 = str(value).zfill(2)  # zero-pad to match the site's URLs, e.g. '01'
    url = 'http://www.hs-bianma.com/hs_chapter_' + value0 + '.htm'
    # url = 'http://www.hs-bianma.com/hs_chapter_01.htm'
    # (swap in a different URL here to collect other data)
    response = request.urlopen(url)
    html = response.read().decode('utf-8')
    bs = BeautifulSoup(html, 'lxml')

    # Headers: the <th> cells
    data_list_title = []
    for data in bs.find_all('th'):
        data_list_title.append(data.text.strip())

    # Body: the <td> cells, split from a flat list into 16-column rows
    data_list_content = []
    for data in bs.find_all('td'):
        data_list_content.append(data.text.strip())
    new_list = [data_list_content[i:i + 16]
                for i in range(0, len(data_list_content), 16)]

    # Write to an Excel workbook
    book = xlwt.Workbook()
    sheet1 = book.add_sheet('sheet1', cell_overwrite_ok=True)

    # Header row
    ii = 0
    for head in data_list_title:
        sheet1.write(0, ii, head)
        ii += 1

    # Data rows
    i = 1
    for row in new_list:
        j = 0
        for data in row:
            sheet1.write(i, j, data)
            j += 1
        i += 1

    # Save one file per chapter
    book.save('sum' + value0 + '.xls')
    value += 1
    print(value0 + ' written!')
print('All done')
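The slicing step that turns the flat list of <td> texts into 16-column rows can be checked on its own with dummy data:

```python
# Dummy stand-in for the flat list of scraped <td> texts
data_list_content = [str(n) for n in range(32)]  # 32 cells = two 16-cell rows

new_list = [data_list_content[i:i + 16]
            for i in range(0, len(data_list_content), 16)]

print(len(new_list))   # 2 rows
print(new_list[1][0])  # '16' (first cell of the second row)
```

If a page's cell count isn't a multiple of 16, the last row simply comes out short; a slice past the end of a list is not an error in Python.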
