Python Web Crawler (BeautifulSoup)

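Web Crawlers

A web crawler (also called a web spider or web robot, and more often a "Web scutter" in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.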

Installing Dependencies

The dependencies can be installed directly with pip or easy_install.

1. Install MySQL-python

pip install MySQL-python    # or: easy_install MySQL-python

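2. Install BeautifulSoup
Beautiful Soup 3 is no longer being developed. Beautiful Soup 4 is recommended for current projects, and existing Beautiful Soup 3 code should be ported to BS4.

apt-get install python-bs4    # Debian or Ubuntu
pip install beautifulsoup4    # or: easy_install beautifulsoup4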

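Installing a Parser

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of these is lxml. Depending on your operating system, lxml can be installed with one of the following commands:

apt-get install python-lxml
pip install lxml    # or: easy_install lxml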

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. It can be installed with one of the following commands:

apt-get install python-html5lib
pip install html5lib    # or: easy_install html5lib

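The following table lists the main parsers along with their advantages and disadvantages:

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses pages the way a browser does; produces HTML5 output	Slow; requires an external Python package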

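lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.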
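
The choice of parser can also be made at run time. The following is a minimal sketch, not part of Beautiful Soup itself: the helper names is_installed and pick_parser are hypothetical, and ranking html5lib ahead of the standard-library parser is an assumption based on the table above. It only checks which parser modules can be imported and returns the name to pass as the second argument of BeautifulSoup.

```python
def is_installed(module_name):
    """Return True if the named module can be imported (hypothetical helper)."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


def pick_parser():
    """Return the name of the preferred parser that is available (hypothetical helper)."""
    # Assumed preference order: lxml for speed, then html5lib,
    # then the parser that ships with Python.
    for module_name in ('lxml', 'html5lib'):
        if is_installed(module_name):
            return module_name
    return 'html.parser'


parser = pick_parser()
print(parser)
```

With Beautiful Soup installed, the result would be used as BeautifulSoup(markup, parser). Passing the parser name explicitly keeps the parse result the same across machines that have different parsers installed.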

Example: Scraping with BeautifulSoup

#!/usr/bin/python
# -*- coding: utf-8 -*-

#---------------------------------------------------
#   Program: Qiushibaike joke crawler
#   Author: Jason Hu
#   Date: 2016-06-01
#   Language: Python
#   Notes: the page to crawl is configurable; the scraped content is saved to a MySQL database
#---------------------------------------------------

import urllib2
from datetime import datetime
from bs4 import BeautifulSoup
import MySQLdb as db
from warnings import filterwarnings

#filterwarnings('ignore', category = db.Warning)
filterwarnings("ignore", "Table '.*' already exists")
filterwarnings("ignore", "Can't create database '.*'; database exists")

page = 1  # page number to fetch
url = 'http://www.qiushibaike.com/hot/page/' + str(page)  # URL of that page
# Some sites block crawlers, so present a browser User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'
}

try:
    request = urllib2.Request(url, headers = headers)
    response = urllib2.urlopen(request)
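    # Keep the raw page in its own name; `content` is used for the joke text below
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser', from_encoding='UTF-8')
    # The jokes live in the main column; each one is a div with this exact class string
    div_article = soup.find('div', attrs={'class' : 'col1'})
    article_list = div_article.find_all('div', attrs={'class': 'article block untagged mb15'})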

    # Connect to MySQL; charset='utf8' lets the connection accept the Unicode text from Beautiful Soup
    conn = db.connect(host='localhost', user='root', passwd='', charset='utf8', unix_socket='/opt/local/var/run/mysql55/mysqld.sock')
    cur = conn.cursor()
    cur.execute('create database if not exists python default charset utf8 collate utf8_general_ci;')
    conn.select_db('python')
    # Create the table before truncating it, so the first run does not fail on a missing table
    sql = '''
            create table if not exists joke(
                `id` int not null auto_increment,
                `author` varchar(30),
                `photo` text,
                `love_nums` int default 0,
                `comment_nums` int default 0,
                `content` text,
                `create_time` datetime,
                `update_time` datetime,
                primary key(`id`)
            )engine=InnoDB default charset=utf8;
        '''
    cur.execute(sql)
    # Clear rows left over from previous runs
    cur.execute('truncate table joke;')

    for article in article_list:
        author_tag = article.find('div', attrs={'class' : 'author clearfix'})
        author_img = author_tag.img.attrs['src']
        author = author_tag.h2.text

        content_tag = article.select('div.content')
        content = content_tag[0].get_text().strip()

        comment_tag = article.find('div', attrs={'class': 'stats'})
        love_nums = comment_tag.find('span', attrs={'class': 'stats-vote'}).find('i').text
        comment_nums = comment_tag.find('span', attrs={'class': 'stats-comments'}).find('i').text

        create_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        update_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

        joke = [author, author_img, love_nums, comment_nums, content, create_time, update_time]
        # Parameterized insert: the driver escapes each value
        cur.execute('insert into joke(`author`, `photo`, `love_nums`, `comment_nums`, `content`, '
                    '`create_time`, `update_time`) values(%s, %s, %s, %s, %s, %s, %s)', joke)
    conn.commit()
    cur.close()
    conn.close()
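except urllib2.URLError as e:
    # Network or HTTP failure while fetching the page
    print(e)
except db.Error as e:
    # Undo uncommitted inserts; conn does not exist if db.connect() itself failed
    if 'conn' in globals():
        conn.rollback()
    print(e)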
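
The script above fetches the single page selected by page. A minimal sketch of preparing requests for a range of pages might look like the following. The helper name build_requests and the constants BASE_URL and HEADERS are hypothetical; the URL prefix and User-Agent string are the ones used in the script. The import is written to work on both Python 2 (urllib2) and Python 3 (urllib.request), and nothing is sent over the network here.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2

# Same URL prefix and browser User-Agent as the script above
BASE_URL = 'http://www.qiushibaike.com/hot/page/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'
}


def build_requests(first_page, last_page):
    """Return one Request per page number from first_page to last_page inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError('expected 1 <= first_page <= last_page')
    return [Request(BASE_URL + str(page), headers=HEADERS)
            for page in range(first_page, last_page + 1)]


page_requests = build_requests(1, 3)
for request in page_requests:
    print(request.get_full_url())
```

Each Request could then be passed to urlopen in place of the single request in the script, ideally with a short pause between pages so the site is not hit too quickly.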

References