Python | The Beautiful Soup Module

1. Introduction

Beautiful Soup is a Python library for parsing HTML and XML. It is well suited to web scraping: it builds a friendly parse tree that is far more convenient to work with than the old SGMLParser.

2. Installation

Install Beautiful Soup with pip:

pip install beautifulsoup4
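Beautiful Soup needs a parser backend; the stdlib html.parser works out of the box, while lxml (installed separately) is faster. A minimal sketch to confirm the install:

```python
from bs4 import BeautifulSoup

# The second argument picks the parser backend; 'html.parser'
# ships with Python, so no extra install is needed.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)  # hello
```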

3. Examples

The following HTML source is what we will have Beautiful Soup parse:

<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>

Pretty-print the parsed document:

from bs4 import BeautifulSoup
# html_doc holds the HTML source shown above
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>

Try out some of the built-in navigation attributes:

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
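The same lookups can also be written as CSS selectors with select() and select_one(), which is often more compact when a query combines tag, class, and id. A sketch against a fragment of the document above:

```python
from bs4 import BeautifulSoup

html_doc = """<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
print([a["id"] for a in soup.select("a.sister")])  # ['link1', 'link2']
# select_one() returns the first match, or None
print(soup.select_one("#link2").string)  # Lacie
```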

Extract every hyperlink:

for link in soup.find_all('a'):
    print(link.get('href'))
#http://example.com/elsie
#http://example.com/lacie
#http://example.com/tillie
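To keep each link's visible text together with its URL rather than just printing the URLs, a list comprehension over find_all works. A sketch, reusing a fragment of the document above:

```python
from bs4 import BeautifulSoup

html_doc = """<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's anchor text with its href attribute
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)  # [('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie')]
```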

Extract all of the text content:

print (soup.get_text())
#The Dormouse's story
#Once upon a time there were three little sisters
#Elsie,
#Lacie and
#Tillie;
#and they lived at the bottom of a well.
#...
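get_text() concatenates every text node, so the whitespace can be messy; its separator and strip arguments (or the stripped_strings generator) give tidier output. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one </p><p> two  </p>", "html.parser")

# Join the text nodes with a custom separator, trimming each one
print(soup.get_text(" | ", strip=True))  # one | two
# stripped_strings yields each trimmed text node one at a time
print(list(soup.stripped_strings))  # ['one', 'two']
```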

4. A Practical Example

The task: extract each movie's title and the link to its torrent file.
(screenshot: python-bs4-1.png)

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Filename: spider_btscg.py
import re
import urllib.request
from bs4 import BeautifulSoup

host = "http://www.btscg.com"
pat_torrent = re.compile(r'(/uploads/.*?\.torrent)')
pat_track = re.compile(r'(tracker.*?\.html)')
# The site is GBK-encoded; decode it explicitly
html_source = urllib.request.urlopen(host).read().decode('gbk', 'ignore')
soup = BeautifulSoup(html_source, 'html.parser')

def find_torrent_url(link):
    for url in link.find_all('a'):
        # Links to tracker pages carry the movie title as anchor text
        if pat_track.search(str(url['href'])):
            print("---------------------------------------------------------")
            print(url.string)
        # Links into /uploads/ are the torrent files themselves
        if pat_torrent.search(str(url['href'])):
            print("\033[36m %s \033[0m" % (host + url['href']))

if __name__ == '__main__':
    link = soup.find('div', {"id": "portal_block_8_content"})
    find_torrent_url(link)

Notes

The idea is simple: open the target page with urllib (mind the page encoding), parse the HTML with BeautifulSoup, locate the movie section by its div id, walk all the hyperlinks inside it, and finally match the torrent links with the re module.
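That pipeline can be exercised without hitting the network by feeding a small inline snippet through the same regex-plus-BeautifulSoup steps. A sketch; the div id matches the spider above, but the file names are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page
html_source = """<div id="portal_block_8_content">
<a href="tracker-movie1.html">Movie One</a>
<a href="/uploads/movie1.torrent">download</a>
</div>"""
pat_torrent = re.compile(r'(/uploads/.*?\.torrent)')

soup = BeautifulSoup(html_source, "html.parser")
block = soup.find("div", {"id": "portal_block_8_content"})
# Keep only the hrefs that look like torrent files
torrents = [a["href"] for a in block.find_all("a") if pat_torrent.search(a["href"])]
print(torrents)  # ['/uploads/movie1.torrent']
```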