BeautifulSoup Basics

These notes mainly record the usages that came up in my own projects, for future reference.

Getting a BeautifulSoup instance

To use BeautifulSoup, you first construct an instance of the BeautifulSoup class.

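First argument: the HTML text to parse, either a str or an open file handle
features: the HTML parser to use (for the differences between parsers, see the "Installing a parser" section of the official documentation)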

from bs4 import BeautifulSoup as Soup

soup = Soup(open("sample.html"), features="lxml")
# soup = Soup('<h1>my title</h1>', features="lxml")
soup.find_all('table') # the instance gives access to the full set of search methods
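
As a minimal sketch of the two input forms described above, the snippet below builds one soup from a str and one from an open file handle. The temporary `sample.html` file and its contents are made up for illustration, and `html.parser` (the parser bundled with Python) is used instead of `lxml` only so the example runs without extra dependencies.

```python
import os
import tempfile

from bs4 import BeautifulSoup as Soup

# Hypothetical sample markup, not taken from any real project
html = "<html><body><h1>my title</h1><table></table></body></html>"

# 1) Build the soup directly from a str
soup_from_str = Soup(html, features="html.parser")

# 2) Build the soup from an open file handle.
#    The file is written to a temporary directory first so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# A with-block closes the handle once parsing is done
with open(path, encoding="utf-8") as f:
    soup_from_file = Soup(f, features="html.parser")

title_from_str = soup_from_str.h1.text
title_from_file = soup_from_file.h1.text
print(title_from_str, title_from_file)
```

Opening the file in a `with` block avoids leaving the handle open, which a bare `open(...)` passed straight to the constructor would do.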

Finding the Tag you need

The most commonly used methods are:

soup.table # equivalent to soup.find('table')
soup.find('table') # the first table in the document, of type <class 'bs4.element.Tag'>
soup.find_all('table') # all tables
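
The following sketch shows how these three lookups behave on a small document with two tables. The markup and the `id` values are invented for the example, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup as Soup

# Hypothetical markup with two tables
html = (
    '<table id="a"><tr><td>1</td></tr></table>'
    '<table id="b"><tr><td>2</td></tr></table>'
)
soup = Soup(html, features="html.parser")

first = soup.find("table")           # first matching Tag
same_first = soup.table              # attribute access is shorthand for find()
all_tables = soup.find_all("table")  # list-like collection of every match
missing = soup.find("video")         # find() returns None when nothing matches

table_ids = [t["id"] for t in all_tables]
print(first["id"], table_ids, missing)
```

Because `find()` returns `None` for a missing tag, it is worth checking the result before accessing attributes on it.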

Getting the text inside a Tag

Suppose my HTML is <b>this is sample</b> and what I want is 'this is sample'.

You can get it through the .text or .string attribute of the Tag (or of the soup itself).

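For the difference between text and string, see the article "Python BeautifulSoup 中.text与.string的区别" (The difference between .text and .string in Python BeautifulSoup).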

>>> s = Soup('<b>this is sample</b>')
>>> type(s.b)
<class 'bs4.element.Tag'>
>>> s.b.text
'this is sample'
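
To make the difference between .text and .string concrete, here is a small sketch. The markup is made up, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup as Soup

# A tag whose only child is a single string: .text and .string agree
simple = Soup("<b>this is sample</b>", features="html.parser")
simple_text = simple.b.text
simple_string = simple.b.string

# A tag with mixed children (a string plus a nested tag)
mixed = Soup("<p>hello <b>world</b></p>", features="html.parser")
mixed_text = mixed.p.text      # concatenates every string under the tag
mixed_string = mixed.p.string  # None, because <p> has more than one child

print(repr(simple_text), repr(simple_string))
print(repr(mixed_text), repr(mixed_string))
```

In short, .text always yields a str, while .string can be None, so .text is usually the safer choice when the structure of the tag is not known in advance.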

Getting the text including the Tag itself

What if what I want is the whole of <b>this is sample</b>, tag included?

In an interactive session such as IPython, you can simply print() the Tag object.

>>> s = Soup('<b>this is sample</b>')
>>> print(s.b)
<b>this is sample</b>

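In a script, where you need the markup as a string value rather than printed output, you can call the Tag's built-in __repr__() method.

Strictly speaking, print() calls an object's __str__() method rather than __repr__(). For a bs4 Tag both return the same markup, so str(s.b) gives the same result and is the more conventional spelling.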

>>> s = Soup('<b>this is sample</b>')
>>> s.b.__repr__()
'<b>this is sample</b>'
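
As a sketch of the alternatives to calling __repr__() directly, the snippet below obtains the same markup with str() and also shows decode_contents(), which returns only the markup inside a tag. The wrapping <p> element is added purely for illustration, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup as Soup

s = Soup("<p><b>this is sample</b></p>", features="html.parser")

outer_markup = str(s.b)                 # the tag together with its contents
via_repr = repr(s.b)                    # same string as s.b.__repr__()
inner_markup = s.p.decode_contents()    # markup inside <p>, without the <p> tag itself

print(outer_markup)
print(inner_markup)
```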

Reference

BeautifulSoup 4.2.0 documentation