Python | The Beautiful Soup Module
1. Introduction
Beautiful Soup is a Python library for parsing HTML and XML. It is well suited to web scraping: it builds a very human-friendly parse tree and is far more pleasant to work with than the old SGMLParser.
2. Installation
Install Beautiful Soup:
pip install beautifulsoup4
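To confirm the install worked, you can import the package and print its version (a quick check, not part of the original post); installing lxml as an optional faster parser is also common:

python -c "import bs4; print(bs4.__version__)"
pip install lxml   # optional: a faster parser that Beautiful Soup can use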
3. Examples
The following HTML source is used throughout this section for Beautiful Soup to parse:
<html>
 <head><title>The Dormouse's story</title></head>
 <body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters
   <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
   <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
   <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
   and they lived at the bottom of a well.</p>
  <p class="story">...</p>
 </body>
</html>
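The snippets below assume this markup is stored in a string named html_doc (the variable name the original code uses), for example:

html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""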
Pretty-print the source:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
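Note: BeautifulSoup(html_doc) lets the library choose a parser on its own. Newer releases of Beautiful Soup 4 emit a warning when no parser is named, so in current code it is common to pass one explicitly:

soup = BeautifulSoup(html_doc, "html.parser")   # stdlib parser, no extra install
# soup = BeautifulSoup(html_doc, "lxml")        # faster, requires: pip install lxml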
Try out some of the built-in navigation shortcuts:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
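Beyond dotted access, the same soup object supports attribute lookups and keyword filters. A couple of extra probes in the same style (not from the original post):

import re   # needed for the regex filter below

soup.a['href']
# 'http://example.com/elsie'

soup.find_all('a', class_='sister')
# all three <a> tags, since each one carries class="sister"

soup.find_all(href=re.compile('tillie'))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]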
Find all the hyperlinks:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
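The same loop can also collect the link text together with the URL, for example into a list of (text, href) pairs (a small extension, not in the original post):

links = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]
print(links)
# [('Elsie', 'http://example.com/elsie'),
#  ('Lacie', 'http://example.com/lacie'),
#  ('Tillie', 'http://example.com/tillie')]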
Extract all of the text content:

print(soup.get_text())
# The Dormouse's story
# Once upon a time there were three little sisters
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
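get_text() keeps the document's original whitespace. If you only want the non-empty text fragments, the stripped_strings generator is handy:

for text in soup.stripped_strings:
    print(text)
# prints each text fragment with surrounding whitespace removed,
# one per line (titles, the story sentence, the link texts, and so on)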
4. A Practical Example
The task is to extract each movie's title and the link to its torrent file:
#!/usr/bin/python
# _*_ coding=utf-8 _*_
# Filename: spider_btscg.py
import re
import urllib2
from bs4 import BeautifulSoup

host = "http://www.btscg.com"
pat_torrent = re.compile(r'(/uploads/.*?\.torrent)')   # links to .torrent files
pat_track = re.compile(r'(tracker.*?\.html)')          # links to the movie detail pages

# fetch the front page; the site is GBK-encoded, so re-encode it as UTF-8
html_source = urllib2.urlopen(host).read().decode('gbk', 'ignore').encode('utf-8', 'ignore')
soup = BeautifulSoup(html_source)


def find_torrent_url(link):
    for url in link.find_all('a'):
        if pat_track.search(str(url['href'])):
            print "---------------------------------------------------------"
            print url.string                                        # the movie title
        if pat_torrent.search(str(url['href'])):
            print "\033[36m %s \033[0m" % (host + url['href'])      # torrent URL, in cyan


if __name__ == '__main__':
    link = soup.find('div', {"id": "portal_block_8_content"})
    find_torrent_url(link)
Notes
The idea is simple: open the target page with urllib2 (taking care of the encoding), feed the HTML source to BeautifulSoup, locate the div id of the movie section, walk through all of the hyperlinks inside it, and finally use the re module to match the torrent links.
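The script above targets Python 2 (urllib2, print statements). For reference, here is a minimal Python 3 sketch of the same approach; it reuses the div id and URL patterns from the original script, and whether the site still serves this page structure is not verified:

#!/usr/bin/env python3
# spider_btscg_py3.py -- Python 3 sketch of the same idea
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

HOST = "http://www.btscg.com"
PAT_TORRENT = re.compile(r'(/uploads/.*?\.torrent)')
PAT_TRACK = re.compile(r'(tracker.*?\.html)')


def find_torrent_url(block):
    """Print the title of every tracker link and highlight .torrent URLs."""
    for a in block.find_all('a'):
        href = a.get('href', '')
        if PAT_TRACK.search(href):
            print("-" * 57)
            print(a.string)                          # the movie title
        if PAT_TORRENT.search(href):
            print("\033[36m %s \033[0m" % (HOST + href))


if __name__ == '__main__':
    # the page is GBK-encoded, as in the original script
    html_source = urlopen(HOST).read().decode('gbk', 'ignore')
    soup = BeautifulSoup(html_source, 'html.parser')
    block = soup.find('div', {"id": "portal_block_8_content"})
    if block is not None:
        find_torrent_url(block)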