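Beautiful Soup is a Python library for parsing web pages. It is quick to work with, supports several parsers, and is feature-rich. This tutorial covers advanced use of Beautiful Soup in detail, including node selectors, CSS selectors, and the method selectors of Beautiful Soup 4, and it serves as a foundation course for learning web scraping.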
Learning objectives
- Learn how to use relational selection
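1. Relational selection
When selecting nodes, you sometimes cannot reach the node element you want in a single step, for example the second a node in the sample document. In that case, first select some node element, then use it as the base for selecting its child, parent, or sibling nodes. The following sections describe how to select each of these.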
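The idea of picking a base node first can be sketched as follows. This is a minimal illustration that assumes the bs4 package is installed; the markup is a hypothetical sample, not the tutorial's original document, and the built-in "html.parser" is chosen only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = (
    '<p class="story">Names: <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute access such as soup.a returns only the first matching node.
first_a = soup.a

# Use the first a node as the base: its next sibling is the text ", ",
# and the sibling after that is the second a node.
second_a = first_a.next_sibling.next_sibling

print(first_a["id"])   # link1
print(second_a["id"])  # link2
```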
1. Child nodes
- Format: soup.tag.contents
- Return value: list
- Example:
    html = '''
    hello
    - foo
    - bar
    - jay
    - foo
    - bar
    '''
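As a rough sketch of how contents behaves, the snippet below reads the direct children of a p node. It assumes bs4 is installed and uses a hypothetical one-line document rather than the tutorial's original markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = (
    '<p class="story">Names: <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# contents holds the direct children only, as a plain Python list.
children = soup.p.contents

print(type(children).__name__)  # list
print(len(children))            # 7: four text nodes and three a nodes
print(children[1]["id"])        # link1
```

Because the result is an ordinary list, it supports indexing, slicing, and len() directly.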
- Format: soup.tag.children
- Return value: iterator
- Example:
Output, with the a elements at indexes 1, 3, and 5 not shown:
    0 once upon a time there were three little sisters; and their names were
    1
    2 ,
    3
    4 and
    5
    6 ; and they lived at the bottom of a well.
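A small sketch of iterating over children, assuming bs4 is installed and using a hypothetical sample document. The exact iterator type returned by children may differ between Beautiful Soup versions, so the example only relies on looping over it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

kinds = []
for index, child in enumerate(soup.p.children):
    # Each direct child is either a Tag or a text node.
    kind = "tag" if isinstance(child, Tag) else "text"
    kinds.append(kind)
    print(index, kind, str(child))

print(kinds)  # ['text', 'tag', 'text', 'tag']
```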
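2. Descendant nodes
The attributes above return only the direct children of the p node. To get all of the descendants of the p node, use the descendants attribute.
- Format: soup.p.descendants
- Return value: generator
- Example: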
Output, with the a elements at indexes 1, 5, and 8 not shown:
    0 once upon a time there were three little sisters; and their names were
    1
    2 elsie
    3 elsie
    4 ,
    5
    6 lacie
    7 and
    8
    9 tillie
    10 ; and they lived at the bottom of a well.
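The difference between direct children and descendants can be sketched like this. The markup is a hypothetical sample with one nested span, and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = '<p>Names: <a id="link1"><span>Elsie</span></a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# descendants walks the whole subtree in document order:
# each tag is followed by everything nested inside it.
labels = [
    node.name if isinstance(node, Tag) else str(node)
    for node in soup.p.descendants
]

print(labels)                # ['Names: ', 'a', 'span', 'Elsie', ', ', 'a', 'Lacie']
print(len(soup.p.contents))  # 4 direct children
print(len(labels))           # 7 descendants
```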
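3. Parent node
The sections above selected child and descendant nodes. Next, use the parent attribute to get the parent of a node element.
- Format: soup.tag.parent
- Return value: node element
- Example: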
Output (the p node that contains the a node; its a elements are not shown):
    once upon a time there were three little sisters; and their names were , and ; and they lived at the bottom of a well.
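A minimal sketch of parent, assuming bs4 is installed and using a hypothetical sample document:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = '<p class="story">Names: <a id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# parent returns the single node that directly contains the a node.
parent = soup.a.parent

print(parent.name)         # p
print(parent["class"])     # ['story']

# The parent of a top-level tag is the BeautifulSoup object itself.
print(soup.p.parent.name)  # [document]
```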
4. Ancestor nodes
To get the ancestor nodes, use the parents attribute.
- Format: soup.tag.parents
- Return value: generator
- Example:
Output:
    [(0, once upon a time there were three little sisters; and their names were , and ; and they lived at the bottom of a well.),
     (1, once upon a time there were three little sisters; and their names were , and ; and they lived at the bottom of a well. ...),
     (2, the dormouse's story once upon a time there were three little sisters; and their names were , and ; and they lived at the bottom of a well. ...),
     (3, the dormouse's story once upon a time there were three little sisters; and their names were , and ; and they lived at the bottom of a well. ...)]
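Printing whole ancestor nodes is verbose, so the sketch below collects only their names. It assumes bs4 is installed and uses a hypothetical sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# parents yields every ancestor, starting with the closest one and
# ending with the BeautifulSoup object that represents the document.
names = [ancestor.name for ancestor in soup.a.parents]

print(names)  # ['p', 'body', 'html', '[document]']
```

The four entries correspond to the four tuples in the enumerated output above: the p node, then body, html, and the document itself.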
5. Sibling nodes
The sections above covered how to get child and parent nodes. To get nodes at the same level, use the sibling attributes.
- Get the next node
  - Format: soup.tag.next_sibling
  - Return value: node element
  - Example:
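A minimal sketch of next_sibling, assuming bs4 is installed and using a hypothetical sample document. It highlights two details that are easy to miss: the next sibling is often a text node rather than a tag, and the last child has no next sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# The node right after the first a node is the text between the two links.
after_first = soup.a.next_sibling
print(repr(str(after_first)))        # ', '
print(isinstance(after_first, Tag))  # False

# The last child of p has nothing after it, so next_sibling is None.
last_a = soup.find("a", id="link2")
print(last_a.next_sibling)           # None
```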
- Get all following nodes
  - Format: soup.tag.next_siblings
  - Return value: generator
  - Example:
      # Get all nodes that follow the a node
      print(soup.a.next_siblings)
      # Get the type
      print(type(soup.a.next_siblings))
      # Get all of the content
      print(list(enumerate(soup.a.next_siblings)))
      # Output
      # [(0, ',\n'), (1, ), (2, ' and\n'), (3, ), (4, ';\nand they lived at the bottom of a well.')]
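Since next_siblings mixes text nodes and tags, a common follow-up is to keep only the tags. The sketch below shows one way to do that; it assumes bs4 is installed and uses a hypothetical sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = (
    '<p>Names: <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Materialize the siblings so they can be counted and filtered.
following = list(soup.a.next_siblings)
print(len(following))  # 5: ', ', a, ' and ', a, '.'

# Keep only the tags and read their id attributes.
following_ids = [node["id"] for node in following if isinstance(node, Tag)]
print(following_ids)   # ['link2', 'link3']
```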
- Get the previous node
  - Format: soup.tag.previous_sibling
  - Return value: node element
  - Example:
      # Get the node before the a node
      print(soup.a.previous_sibling)
      # Get the type
      print(type(soup.a.previous_sibling))
      # Output
      # once upon a time there were three little sisters; and their names were
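A minimal sketch of previous_sibling, assuming bs4 is installed and using a hypothetical sample document. Stepping back twice from the second a node skips the text node in between and lands on the first a node.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# The node right before the first a node is the leading text.
before_first = soup.a.previous_sibling
print(repr(str(before_first)))  # 'Names: '

# From the second a node, step back over the ", " text to the first a node.
second_a = soup.find("a", id="link2")
previous_a = second_a.previous_sibling.previous_sibling
print(previous_a["id"])         # link1

# p is the only top-level node here, so it has no previous sibling.
print(soup.p.previous_sibling)  # None
```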
- Get all previous nodes
  - Format: soup.tag.previous_siblings
  - Return value: generator
  - Example:
      # Get all nodes before the a node
      print(soup.a.previous_siblings)
      # Get the type
      print(type(soup.a.previous_siblings))
      # Get all of the content
      print(list(enumerate(soup.a.previous_siblings)))
      # Output
      # [(0, 'once upon a time there were three little sisters; and their names were\n')]
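One detail worth illustrating is the order in which previous_siblings yields nodes: nearest first, moving backward through the document. The sketch below assumes bs4 is installed and uses a hypothetical sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup (an assumption, not the tutorial's original document).
html = (
    '<p>Names: <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

third_a = soup.find("a", id="link3")

# Tags are labeled by id, text nodes by their text.
labels = [
    node["id"] if isinstance(node, Tag) else str(node)
    for node in third_a.previous_siblings
]

# The closest sibling comes first, the start of the paragraph last.
print(labels)  # [' and ', 'link2', ', ', 'link1', 'Names: ']
```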
2. Summary
Node selectors, relational selection:
- Child nodes
  - soup.tag.contents
  - soup.tag.children
- Descendant nodes
  - soup.tag.descendants
- Parent node
  - soup.tag.parent
- Ancestor nodes
  - soup.tag.parents
- Sibling nodes
  - soup.tag.next_sibling
  - soup.tag.next_siblings
  - soup.tag.previous_sibling
  - soup.tag.previous_siblings