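Contents
1. Using XPath
1.1 Common XPath rules
Expression | Description |
---|---|
nodename | Selects all child nodes of the named node |
/ | Selects direct children of the current node |
// | Selects descendants of the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
Example: //title[@lang="eng"] selects every title node whose lang attribute equals "eng".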
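As a minimal sketch of the rules in the table, the snippet below runs a few expressions against a small, made-up bookstore document. The markup and the variable names are illustrative assumptions, not part of the original tutorial.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration
doc = etree.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title></book>'
    '<book><title lang="fr">Le Petit Prince</title></book>'
    '</bookstore>'
)

# // searches all descendants; [@lang="eng"] filters on the attribute value
titles = [t.text for t in doc.xpath('//title[@lang="eng"]')]
print(titles)

# .. steps from the matched title up to its parent node
parents = [p.tag for p in doc.xpath('//title[@lang="eng"]/..')]
print(parents)

# @ selects attribute values instead of elements
langs = doc.xpath('//title/@lang')
print(langs)
```

Each call to xpath() returns a list, even when only one node matches.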
1.2 Installation
pip3 install lxml
1.3 A first example
from lxml import etree
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
Contents of test.html, used by the file-based examples that follow:
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
Parsing directly from a file:
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
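The sample markup leaves the last li node unclosed. The sketch below, which uses a shortened made-up fragment, suggests how lxml's HTML parser repairs this kind of input: it closes the open tag and wraps the fragment in html and body nodes.

```python
from lxml import etree

# Shortened, hypothetical fragment with a missing </li> on the last item
broken = '<ul><li>one</li><li>two</ul>'

html = etree.HTML(broken)
fixed = etree.tostring(html).decode('utf-8')
print(fixed)
```

Because of this repair step, XPath expressions that run against the parsed tree see a complete document even when the source was a fragment.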
1.4 All nodes
- Select all nodes (//*)
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//*")
print(result)
- Select all li nodes (//li)
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li")
print(result)
1.5 Child nodes
- Select the direct child a nodes of li nodes
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a")
print(result)
- Select the descendant a nodes of the ul node
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//ul//a")
print(result)
1.6 Parent nodes
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
# Use .. to step up to the parent, then read its class attribute
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
# The parent:: axis is equivalent to ..
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
1.7 Attribute matching
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)
1.8 Getting text
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
To get all the text under a specific node, including text inside its descendants, use //text() instead of /text().
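The difference between the direct text of a node and the text of its descendants is easy to miss. The following sketch uses a one-item fragment (an assumption made for brevity) to compare the two forms.

```python
from lxml import etree

# Hypothetical one-item fragment; the li holds no text of its own
text = '<ul><li class="item-0"><a href="link1.html">first item</a></li></ul>'
html = etree.HTML(text)

# /text() returns only text nodes that are direct children of li
direct = html.xpath('//li[@class="item-0"]/text()')
# //text() also descends into the nested a node
nested = html.xpath('//li[@class="item-0"]//text()')
# Stepping into the a node first gives the same text explicitly
via_child = html.xpath('//li[@class="item-0"]/a/text()')

print(direct, nested, via_child)
```

With real pages, /text() on a container often returns only whitespace and line breaks, which is why stepping into the child node or using //text() is usually the better choice.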
1.9 Getting attributes
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a/@href")
print(result)
1.10 Matching attributes with multiple values
from lxml import etree
text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
# An exact match on "li" returns nothing, because class holds two values
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
from lxml import etree
text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
# contains() matches when "li" appears anywhere in the class value
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
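One caveat worth illustrating: contains() is a substring test, so contains(@class, "li") would also match a class such as "list". The sketch below compares it with a stricter, commonly used XPath 1.0 idiom that pads the class value with spaces. The two-item fragment is a made-up example.

```python
from lxml import etree

# Hypothetical fragment: the second item has class "list", not "li"
html = etree.HTML(
    '<ul><li class="li li-first">a</li><li class="list">b</li></ul>'
)

# Substring test: matches both items
loose = html.xpath('//li[contains(@class, "li")]/text()')

# Whole-token test: pad with spaces so only the exact class name matches
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)

print(loose, strict)
```

Whether the stricter form is needed depends on how the class names in the target page overlap.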
1.11 Matching multiple attributes
from lxml import etree
text = """
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath(
    '//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
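Operator | Description |
---|---|
or | Or |
and | And |
mod | Remainder of division |
| | Union of two node-sets |
+ | Addition |
- | Subtraction |
* | Multiplication |
div | Division |
= | Equal |
!= | Not equal |
< | Less than |
<= | Less than or equal to |
> | Greater than |
>= | Greater than or equal to |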
1.12 Selecting by position
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
# Select the first node (XPath positions start at 1)
result = html.xpath("//li[1]/a/text()")
print(result)
# Select the last node
result = html.xpath("//li[last()]/a/text()")
print(result)
# Select nodes within a position range (the first two)
result = html.xpath("//li[position() < 3]/a/text()")
print(result)
# Select the third node from the end
result = html.xpath("//li[last() - 2]/a/text()")
print(result)
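The positional predicates are easier to follow with visible results. This sketch repeats them on a five-item list built inline (a made-up fragment, so no test.html is needed).

```python
from lxml import etree

# Hypothetical five-item list
html = etree.HTML(
    '<ul><li>one</li><li>two</li><li>three</li><li>four</li><li>five</li></ul>'
)

first = html.xpath('//li[1]/text()')                    # position 1 is the first node
last = html.xpath('//li[last()]/text()')                # last() is the final position
first_two = html.xpath('//li[position() < 3]/text()')   # positions 1 and 2
third_from_end = html.xpath('//li[last() - 2]/text()')  # 5 - 2 = position 3

print(first, last, first_two, third_from_end)
```

Note that XPath counts from 1, unlike Python list indexes.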
1.13 Selecting along node axes
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
# ancestor axis: all ancestor nodes
result = html.xpath("//li[1]/ancestor::*")
print(result)
# ancestor axis restricted to div nodes
result = html.xpath("//li[1]/ancestor::div")
print(result)
# attribute axis: all attribute values
result = html.xpath("//li[1]/attribute::*")
print(result)
# child axis: direct children, filtered by href here
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# descendant axis: all descendants, restricted to span nodes here
result = html.xpath("//li[1]/descendant::span")
print(result)
# following axis: all nodes after the current node, taking the second one here
result = html.xpath("//li[1]/following::*[2]")
print(result)
# following-sibling axis: all siblings after the current node
result = html.xpath("//li[1]/following-sibling::*")
print(result)
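To make the axis results concrete, the sketch below runs a few of them on a small inline fragment and prints tag names and attribute values rather than element objects. The fragment and its class names are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical fragment; etree.HTML wraps it in html and body nodes
html = etree.HTML(
    '<div><ul>'
    '<li class="a"><a href="x.html">x</a></li>'
    '<li class="b">y</li>'
    '<li class="c">z</li>'
    '</ul></div>'
)

# ancestor axis: every node above the first li
ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]
# attribute axis: attribute values of the first li
attrs = html.xpath('//li[1]/attribute::*')
# child axis combined with an attribute step
hrefs = html.xpath('//li[1]/child::a/@href')
# following-sibling axis: the li nodes after the first one
siblings = [e.get('class') for e in html.xpath('//li[1]/following-sibling::*')]

print(ancestors, attrs, hrefs, siblings)
```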
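2. Using Beautiful Soup
2.1 Parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, moderate speed, tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or Python 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, tolerant of malformed documents | Requires a C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast, the only parser that supports XML | Requires a C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance, parses documents the way a browser does, produces HTML5 output | Slow, requires an external Python dependency |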
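The parsers differ in how they repair invalid markup. As a rough sketch, the snippet below feeds the same invalid fragment to two of them; it assumes both beautifulsoup4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Invalid fragment: an unclosed <a> followed by a stray </p>
markup = "<a></p>"

# The built-in parser keeps the fragment as a fragment
with_builtin = BeautifulSoup(markup, "html.parser")
# lxml builds a full document around it
with_lxml = BeautifulSoup(markup, "lxml")

print(with_builtin)
print(with_lxml)
```

Because the resulting trees differ, code that walks the tree may behave differently after a parser change, so it helps to name the parser explicitly.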
2.2 Installation
pip3 install beautifulsoup4
pip3 install lxml
2.3 Basic usage
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>Test</title>
</head>
<body>
<p class="title" name="name"><b>Test</b></p>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
# Print the document with standard indentation
print(soup.prettify())
print(soup.title.string)
2.4 Node selectors
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test</title></head>
<body>
<p class="title" name="name"><b>Test</b></p>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)
2.5 Extracting information
Getting the node name (name)
print(soup.title.name)
Getting attributes (attrs)
print(soup.p.attrs)
print(soup.p.attrs["name"])
Getting the content (string)
print(soup.p.string)
Nested selection
print(soup.head.title)
print(soup.head.title.string)
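The fragments above rely on a soup object created earlier. As a self-contained sketch, the snippet below shows the same accessors on a single made-up p node and uses html.parser so that only beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical one-node document
soup = BeautifulSoup('<p class="title big" name="intro"><b>Test</b></p>', "html.parser")
p = soup.p

tag_name = p.name            # the tag name, not the name attribute
name_attr = p.attrs["name"]  # plain attributes come back as strings
classes = p["class"]         # class is multi-valued, so it comes back as a list
content = p.string           # a lone child tag is followed down to its string
missing = p.get("id")        # get() returns None instead of raising KeyError

print(tag_name, name_attr, classes, content, missing)
```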
2.6 Relational selection
Child and descendant nodes
- Direct children (contents)
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test</title></head>
<body>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
# contents returns the direct children as a list
print(soup.p.contents)
- Direct children (children)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# children returns an iterator rather than a list
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
- Descendants (descendants)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# descendants is a generator over every nested node
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
Parent and ancestor nodes
- Parent node (parent)
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.a.parent)
- Ancestor nodes (parents)
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
Sibling nodes
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
Test
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>
ABC
<a href="http://example.com/B" class="abc" id="link2">B</a>
abc
<a href="http://example.com/C" class="abc" id="link3">C </a>
123
</p>
"""
soup = BeautifulSoup(html, "lxml")
print("next sibling", soup.a.next_sibling)
print("prev sibling", soup.a.previous_sibling)
print("next siblings", list(enumerate(soup.a.next_siblings)))
print("prev siblings", list(enumerate(soup.a.previous_siblings)))
Extracting information
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
Test
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a><a href="http://example.com/B" class="abc" id="link2">B</a><a href="http://example.com/C" class="abc" id="link3">C </a>
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print(soup.a.parents)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs["class"])
2.7 Method selectors
find_all
find_all(name, attrs, recursive, text, **kwargs)
name
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>
Hello
</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="ul"))
print(type(soup.find_all(name="ul")[0]))
for ul in soup.find_all(name="ul"):
    print(ul.find_all(name="li"))
    for li in ul.find_all(name="li"):
        print(li.string)
attrs
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>
Hello
</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(attrs={"id": "list-1"}))
print(soup.find_all(attrs={"class": "element"}))
# Equivalent to
print(soup.find_all(id="list-1"))
print(soup.find_all(class_="element"))
text
import re
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-body">
<a>link1</a>
<a>link2</a>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
# text= is the older name of this argument; Beautiful Soup 4.4.0 and later also accept string=
print(soup.find_all(text=re.compile("link")))
find
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>
Hello
</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find(name="ul"))
print(type(soup.find(name="ul")))
print(soup.find(class_="list"))
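find_parents
- Returns all ancestor nodes
find_parent
- Returns the direct parent node
find_next_siblings
- Returns all following siblings
find_next_sibling
- Returns the first following sibling
find_previous_siblings
- Returns all preceding siblings
find_previous_sibling
- Returns the first preceding sibling
find_all_next
- Returns all matching nodes after the current node
find_next
- Returns the first matching node after the current node
find_all_previous
- Returns all matching nodes before the current node
find_previous
- Returns the first matching node before the current node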
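The navigation methods listed above take the same filters as find_all. The sketch below tries several of them starting from the middle item of a short list; the markup is a made-up example and html.parser is used so that only beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three list items inside a div
html = '<div id="box"><ul id="list-1"><li>A</li><li>B</li><li>C</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

# Start from the second li
second = soup.find_all("li")[1]

parent_id = second.find_parent("div")["id"]                       # nearest matching ancestor
after = [li.string for li in second.find_next_siblings("li")]     # siblings that follow
before = second.find_previous_sibling("li").string                # closest preceding sibling
next_li = second.find_next("li").string                           # next match in document order

print(parent_id, after, before, next_li)
```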
2.8 CSS selectors
Example
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.select(".panel .panel-body .list"))
print(soup.select("ul li"))
print(soup.select("#list-2 .element"))
print(type(soup.select("ul")[0]))
Nested selection
soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
    print(ul.select("li"))
Getting attributes
- Square brackets
- .attrs[]
soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
    print(ul["id"])
    # Equivalent to
    print(ul.attrs["id"])
Getting text
- .get_text()
- .string
soup = BeautifulSoup(html, "lxml")
for li in soup.select("li"):
    print(f"text: {li.get_text()}")
    # Equivalent here, because each li holds a single text node
    print(f"string: {li.string}")
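The two text accessors agree only when a node has a single string inside it. The following sketch, on a made-up two-item list, shows where they diverge.

```python
from bs4 import BeautifulSoup

# Hypothetical list: the second item mixes a child tag with plain text
soup = BeautifulSoup(
    "<ul><li>plain</li><li><b>bold</b> and text</li></ul>", "html.parser"
)
items = soup.select("li")

simple_string = items[0].string     # single text node, so string works
simple_text = items[0].get_text()
mixed_string = items[1].string      # several children, so string is None
mixed_text = items[1].get_text()    # get_text() concatenates all nested text

print(simple_string, simple_text, mixed_string, mixed_text)
```

For scraped pages where the inner structure is not known in advance, get_text() is usually the safer of the two.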
3. Using pyquery
3.1 Installation
pip3 install pyquery
3.2 Initialization
Initializing from a string
from pyquery import PyQuery
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = PyQuery(html)
print(doc("li"))
Initializing from a URL
import requests
from pyquery import PyQuery
url = "https://www.bilibili.com/"
doc = PyQuery(url=url)
# Equivalent to
doc = PyQuery(requests.get(url).text)
print(doc("title"))
Initializing from a file
from pyquery import PyQuery
doc = PyQuery(filename="test.html")
print(doc("li"))
3.3 Basic CSS selectors
from pyquery import PyQuery
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = PyQuery(html)
print(doc("#container .list li"))
print(type(doc("#container .list li")))
for item in doc("#container .list li").items():
    print(item.text())
3.4 Finding nodes
Direct children
from pyquery import PyQuery
doc = PyQuery(html)
items = doc(".list")
lis = items.children()
print(type(lis))
print(lis)
Descendant nodes
from pyquery import PyQuery
doc = PyQuery(html)
items = doc(".list")
print(type(items))
print(items)
lis = items.find("li")
print(type(lis))
print(lis)
Direct parent node
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
items = doc(".list")
container = items.parent()
print(type(container))
print(container)
Ancestor nodes
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
items = doc(".list")
parents = items.parents()
print(type(parents))
print(parents)
Sibling nodes (has issues)
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
lis = doc(".list .item-0.active")
for item in lis.siblings().items():
    print(item.text())
3.5 Iterating over nodes
Getting a single node
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
print(str(li))
Getting multiple nodes
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
lis = doc("li").items()
print(type(lis))
for li in lis:
    print(li, type(li))
Getting information
- Information
- Attributes
- Text
Getting attributes (attr())
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
a = doc(".item-0.active a")
print(a, type(a))
print(a.attr("href"))
# Equivalent to
print(a.attr.href)
a = doc("a")
print(a, type(a))
print(a.attr("href"))
for item in a.items():
    print(item.attr("href"))
Getting text
Getting the inner text (.text())
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
a = doc(".item-0.active a")
print(a)
print(a.text())
Getting the inner HTML (.html())
- Non-greedy
- Returns the inner HTML of the first matched node only
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
print(li.html())
A common misunderstanding about .html()
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
a = doc(".item-0.active a")
# The printed output looks the same, but the two values are not equal
print(li.html())
print(a)
# li.html() is a str
print(type(li.html()))
# a is a pyquery.pyquery.PyQuery object
print(type(a))
3.6 Node manipulation
addClass and removeClass
- Change a node's class attribute dynamically
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
li.remove_class("active")
print(li)
li.add_class("active")
print(li)
attr, text, and html
from pyquery import PyQuery
html = """
<ul class="warp">
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
li.attr("name", "link")
print(li)
li.text("changed item")
print(li)
li.html("<span>changed item</span>")
print(li)
remove
from pyquery import PyQuery
html = """
<ul class="warp">
ABC
<p>abc</p>
</ul>
"""
doc = PyQuery(html)
warp = doc(".warp")
# Drop the nested p so that only the outer text remains
warp.find("p").remove()
print(warp.text())
3.7 Pseudo-class selectors
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
# First li node
li = doc("li:first-child")
print(li)
# Last li node
li = doc("li:last-child")
print(li)
# Second li node
li = doc("li:nth-child(2)")
print(li)
# li nodes with an index greater than 2 (indexes start at 0)
li = doc("li:gt(2)")
print(li)
# li nodes at even positions
li = doc("li:nth-child(2n)")
print(li)
# li nodes containing the text "second"
li = doc("li:contains(second)")
print(li)
4. Using parsel
4.1 Installation
pip3 install parsel
4.2 Initialization
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
items1 = selector.css(".item-0")
print(len(items1), type(items1), items1)
items2 = selector.xpath("//li[contains(@class, 'item-0')]")
print(len(items2), type(items2), items2)
4.3 Extracting text
Extracting a single text value (get())
- Returns only the first matched result
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
items = selector.css(".item-0")
print(f"type of items: {type(items)}")
for item in items:
    print(f"type of item: {type(item)}")
    text = item.xpath(".//text()").get()
    print(text)
Extracting multiple text values (getall())
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
result = selector.xpath('//li[contains(@class, "item-0")]//text()').getall()
print(result)
result = selector.css(".item-0 *::text").getall()
print(result)
4.4 Extracting attributes
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
result = selector.css(".item-0.active a::attr(href)").get()
print(result)
result = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get()
print(result)
4.5 Extracting with regular expressions
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
# Capture what follows "link" up to the closing quote of the href value
result = selector.css(".item-0").re('link(.*?)"')
print(result)
result = selector.css(".item-0 *::text").re(".*item")
print(result)
# Get the first match only
result = selector.css(".item-0").re_first(">(.*?item)")
print(result)
Original article: https://blog.csdn.net/BlackOrnate/article/details/134735339