Common Operations with the Python BeautifulSoup Library

闪电

May 31, 2024


Python

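1. Creating a BeautifulSoup object

First, import the BeautifulSoup class and create a BeautifulSoup object. You will usually also specify a parser, such as 'lxml', 'html.parser', or 'html5lib'.

from bs4 import BeautifulSoup  # BeautifulSoup must be imported from the bs4 package; it is a third-party package, not part of the Python standard library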

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

2. Finding elements

Finding by tag name

title_tag = soup.title
print(title_tag)  # Output: <title>The Dormouse's story</title>

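Finding with the find() method

Signature: find(name, attrs, recursive, string, **kwargs)

Parameters:

name: a filter on the tag name, given as a string or a callable (regular expressions and lists are also accepted).
attrs: a dictionary of filters on attribute values.
recursive: Boolean, default True. When True, all descendants are searched; when False, only direct children are considered.
string: a string or callable matched against the text content of a tag.
**kwargs: any other keyword argument is treated as a filter on the attribute with that name.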
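The parameters above are easier to compare side by side. The following is a minimal sketch, assuming a small made-up HTML fragment (the `box` and `intro` names are illustrative and do not come from the article); it uses 'html.parser' so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to exercise each parameter.
html = '<div id="box"><p class="intro">Hello</p><span><p>Nested</p></span></div>'
soup = BeautifulSoup(html, "html.parser")

# **kwargs: id="box" is treated as a filter on the id attribute.
box = soup.find("div", id="box")

# attrs: a dictionary of attribute filters.
intro = box.find("p", attrs={"class": "intro"})

# string: matches text; with no tag name given, the text node itself is returned.
nested_text = box.find(string=re.compile("Nest"))

# name as a callable: receives each tag and returns True for a match.
no_class = box.find(lambda tag: tag.name == "p" and not tag.has_attr("class"))

# recursive=False restricts the search to direct children.
direct_span = box.find("span", recursive=False)  # <span> is a direct child of box
top_level_p = soup.find("p", recursive=False)    # None: <p> is not a direct child of the document

print(intro, nested_text, no_class, direct_span, top_level_p, sep="\n")
```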

Example

Below is a simple example that uses the find method:

from bs4 import BeautifulSoup

# Suppose we have the following HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first <title> tag
title_tag = soup.find('title')
print(title_tag)  # Output: <title>The Dormouse's story</title>

# Find the first <a> tag whose class is "sister"
first_sister_tag = soup.find('a', {'class': 'sister'})
print(first_sister_tag)  # Output: the first <a> tag (the Elsie link)

# Find the text node that equals "Lacie", then take its parent <a> tag
lacie_tag = soup.find(string="Lacie").parent
print(lacie_tag)  # Output: the <a> tag containing "Lacie"
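One thing the example above does not show: find() returns None when nothing matches, so chaining .parent directly raises AttributeError on a miss. A minimal sketch of a guard follows; the helper name `parent_of_text` and the shortened HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document.
html = '<p class="story"><a id="link2" href="http://example.com/lacie">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")


def parent_of_text(soup, text):
    """Return the tag that directly contains `text`, or None if no text node equals it."""
    node = soup.find(string=text)  # exact match against a whole text node
    return node.parent if node is not None else None


found = parent_of_text(soup, "Lacie")
missing = parent_of_text(soup, "Alice")

print(found)    # the <a> tag that wraps "Lacie"
print(missing)  # None
```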

Finding all matching elements with find_all()

paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p)  # prints every <p> tag
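find_all() accepts the same filters as find(), plus a limit argument. The sketch below, on a trimmed copy of the sample markup, shows a few variations that may be useful; treat it as an illustration rather than a complete tour.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the sample document.
html = """
<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <a class="sister"> tag.
names = [a.get_text() for a in soup.find_all("a", class_="sister")]

first_two = soup.find_all("a", limit=2)  # stop after two matches
mixed = soup.find_all(["a", "p"])        # a list matches any of the given tag names
nothing = soup.find_all("table")         # no match gives an empty result, not None

print(names)
print(len(first_two), len(mixed), len(nothing))
```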

Finding with CSS selectors using select()

title_tag = soup.select('head > title')[0]
print(title_tag)  # Output: <title>The Dormouse's story</title>
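Indexing with [0] raises IndexError when the selector matches nothing. select_one() returns the first match or None instead. A short sketch with a few common selector forms, assuming a reduced copy of the sample markup:

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical copy of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("head > title")      # first match, or None
sisters = soup.select("p.story > a.sister")  # every match, as a list
by_id = soup.select_one("#link2")            # id selector
by_attr = soup.select('a[href$="elsie"]')    # attribute value ends with "elsie"
missing = soup.select_one("table")           # None, no exception

print(title.string)
print([a.get_text() for a in sisters])
```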

3. Getting element attributes

link = soup.find('a', {'id': 'link1'})
print(link['href'])  # Output: http://example.com/elsie
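Square-bracket access raises KeyError when the attribute is absent. The sketch below shows the dictionary-style alternatives; the extra class value "big" is an assumption added to illustrate multi-valued attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class values.
html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", id="link1")

href = link["href"]                               # raises KeyError if the attribute is missing
title = link.get("title")                         # None when the attribute is missing
title_or_default = link.get("title", "untitled")  # fallback value instead of None
classes = link["class"]                           # class is multi-valued, so this is a list
all_attrs = link.attrs                            # every attribute as a dictionary

print(href, title, title_or_default, classes)
```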

4. Getting element content

title_text = soup.title.string
print(title_text)  # Output: The Dormouse's story

# Alternatively, get_text() returns the text of an element and all of its descendants
body_text = soup.body.get_text()
print(body_text)  # prints all text inside the <body> tag
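The difference between .string and get_text() matters once a tag has more than one child: .string then returns None. A small sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that mixes text and tags.
html = '<p class="story">Once upon a time: <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

single = soup.a.string             # the <a> tag has exactly one text child
multiple = p.string                # None: <p> has several children
joined = p.get_text()              # all descendant text, concatenated
pieces = list(p.stripped_strings)  # each text fragment with surrounding whitespace removed

print(single)
print(multiple)
print(joined)
print(pieces)
```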

5. Nested selection

head_title = soup.head.title.string
print(head_title)  # Output: The Dormouse's story

6. Iterating over child and sibling nodes

for child in soup.p.children:
    print(child)  # each direct child node of the first <p> tag

for sibling in soup.p.next_siblings:
    print(sibling)  # each node that follows the first <p> tag at the same level
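Sibling and child iteration also yields the whitespace text nodes between tags, which often surprises people. One possible way to keep only tags is sketched below, on a hypothetical three-paragraph body.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the tags.
html = """<body>
<p class="title">Title</p>
<p class="story">First</p>
<p class="story">Second</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

raw = list(first_p.next_siblings)  # includes the "\n" text nodes between tags
tags_only = [s for s in first_p.next_siblings if isinstance(s, Tag)]
next_tag = first_p.find_next_sibling("p")        # first following <p>, skipping text nodes
all_following = first_p.find_next_siblings("p")  # every following <p>

print(len(raw), len(tags_only))
print(next_tag.get_text())
```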

7. Finding elements by class_ and id

div_tag = soup.find('div', class_='info')  # assumes a div with class 'info' exists
element = soup.find(id='unique-id')  # assumes an element with id 'unique-id' exists
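Since the sample document has no such div, here is a self-contained sketch with made-up markup that contains one. It also shows that class_ matches any single class of a multi-class element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names and id are illustrative only.
html = """
<div class="info card" id="unique-id">Details</div>
<div class="card">Other</div>
"""
soup = BeautifulSoup(html, "html.parser")

info = soup.find("div", class_="info")         # matches although the div has two classes
by_id = soup.find(id="unique-id")              # the same element, found by id
cards = soup.find_all(class_="card")           # both divs carry the "card" class
absent = soup.find("div", class_="missing")    # None when nothing matches

print(info.get_text(), by_id.get_text(), len(cards), absent)
```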

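Note that the code above is illustrative only and assumes the HTML document contains specific tags and attributes. In practice, adjust the selectors and methods to match the actual HTML structure.

Also, make sure the beautifulsoup4 library and the parser you want (such as lxml) are installed before using BeautifulSoup. If they are not, install them with pip:

pip install beautifulsoup4 lxml

The right parser depends on your needs and environment. html.parser ships with Python and is often enough; lxml is faster, and html5lib is the most lenient with malformed markup because it parses pages the way a browser does.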
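To see how parsers differ on your own machine, one option is to feed the same invalid fragment to each of them. This is a rough sketch: it assumes lxml and html5lib may or may not be installed and records None for any parser that is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid: a closing </p> with no opening <p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

With html.parser the stray </p> is simply dropped; the other parsers, when present, wrap the fragment in a fuller document structure.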
