Python | The Beautiful Soup Module

1. Introduction

Beautiful Soup is a Python library for parsing HTML and XML. It is well suited to web scraping: it builds a friendly parse tree that is far more convenient to work with than the old SGMLParser.

2. Installation

Install Beautiful Soup with pip:

pip install beautifulsoup4
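Beautiful Soup needs a parser backend; the stdlib html.parser works out of the box, while lxml (installed separately) is faster. A minimal sketch to confirm the install:

```python
from bs4 import BeautifulSoup

# The second argument picks the parser backend; 'html.parser'
# ships with Python, so no extra install is needed.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)  # hello
```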

3. Examples

The following HTML source is what we will have Beautiful Soup parse:

<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>

Pretty-print the parsed document:

from bs4 import BeautifulSoup
# html_doc holds the HTML source shown above
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>

Try out some of the built-in navigation attributes:

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
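The same lookups can also be written as CSS selectors with select() and select_one(), which is often more compact when a query combines tag, class, and id. A sketch against a fragment of the document above:

```python
from bs4 import BeautifulSoup

html_doc = """<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
print([a["id"] for a in soup.select("a.sister")])  # ['link1', 'link2']
# select_one() returns the first match, or None
print(soup.select_one("#link2").string)  # Lacie
```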

Extract every hyperlink:

for link in soup.find_all('a'):
    print(link.get('href'))
#http://example.com/elsie
#http://example.com/lacie
#http://example.com/tillie
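To keep each link's visible text together with its URL rather than just printing the URLs, a list comprehension over find_all works. A sketch, reusing a fragment of the document above:

```python
from bs4 import BeautifulSoup

html_doc = """<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's anchor text with its href attribute
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)  # [('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie')]
```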

Extract all of the text content:

print (soup.get_text())
#The Dormouse's story
#Once upon a time there were three little sisters
#Elsie,
#Lacie and
#Tillie;
#and they lived at the bottom of a well.
#...
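get_text() concatenates every text node, so the whitespace can be messy; its separator and strip arguments (or the stripped_strings generator) give tidier output. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one </p><p> two  </p>", "html.parser")

# Join the text nodes with a custom separator, trimming each one
print(soup.get_text(" | ", strip=True))  # one | two
# stripped_strings yields each trimmed text node one at a time
print(list(soup.stripped_strings))  # ['one', 'two']
```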

4. A Practical Example

The task: extract each movie's title and the link to its torrent file.
(screenshot: python-bs4-1.png)

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Filename: spider_btscg.py
import re
import urllib.request
from bs4 import BeautifulSoup

host = "http://www.btscg.com"
pat_torrent = re.compile(r'(/uploads/.*?\.torrent)')
pat_track = re.compile(r'(tracker.*?\.html)')
# The site is GBK-encoded; decode it explicitly
html_source = urllib.request.urlopen(host).read().decode('gbk', 'ignore')
soup = BeautifulSoup(html_source, 'html.parser')

def find_torrent_url(link):
    for url in link.find_all('a'):
        # Links to tracker pages carry the movie title as anchor text
        if pat_track.search(str(url['href'])):
            print("---------------------------------------------------------")
            print(url.string)
        # Links into /uploads/ are the torrent files themselves
        if pat_torrent.search(str(url['href'])):
            print("\033[36m %s \033[0m" % (host + url['href']))

if __name__ == '__main__':
    link = soup.find('div', {"id": "portal_block_8_content"})
    find_torrent_url(link)

Notes

The idea is simple: open the target page with urllib (mind the page encoding), parse the HTML with BeautifulSoup, locate the movie section by its div id, walk all the hyperlinks inside it, and finally match the torrent links with the re module.
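That pipeline can be exercised without hitting the network by feeding a small inline snippet through the same regex-plus-BeautifulSoup steps. A sketch; the div id matches the spider above, but the file names are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page
html_source = """<div id="portal_block_8_content">
<a href="tracker-movie1.html">Movie One</a>
<a href="/uploads/movie1.torrent">download</a>
</div>"""
pat_torrent = re.compile(r'(/uploads/.*?\.torrent)')

soup = BeautifulSoup(html_source, "html.parser")
block = soup.find("div", {"id": "portal_block_8_content"})
# Keep only the hrefs that look like torrent files
torrents = [a["href"] for a in block.find_all("a") if pat_torrent.search(a["href"])]
print(torrents)  # ['/uploads/movie1.torrent']
```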