Python web page parser: using the third-party lxml library and XPath

This article covers another third-party library, lxml, for parsing web pages. lxml can parse both HTML and XML documents, it copes with large documents, and it is comparatively fast.

To use lxml well you need to know XPath, because XPath expressions are the main way to locate elements in an lxml tree. This chapter therefore concentrates on XPath syntax.

1. Import the lxml library and create an object

# -*- coding: UTF-8 -*-

# Import etree from lxml
from lxml import etree

# Start from the page source that the downloader has already fetched.
# Here we simply reuse the official sample document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the html_doc string; etree.HTML returns the root element of the lxml tree
html = etree.HTML(html_doc)
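
The sample document above never closes its body or html tags, yet etree.HTML still accepts it. The following is a minimal sketch of that behaviour, plus the XML side mentioned in the introduction. The fragment and the books/book XML are made-up illustrations, not part of the original example.

```python
from lxml import etree

# A deliberately incomplete HTML fragment (hypothetical input):
# no <html> or <body> wrapper, and neither <p> is closed.
fragment = "<p class='story'>Once upon a time<p class='story'>..."

# The HTML parser is lenient: it repairs the markup and returns the root element.
root = etree.HTML(fragment)
print(root.tag)  # html

# Serialize the repaired tree to inspect what the parser built.
repaired = etree.tostring(root, encoding="unicode")
print(repaired)

# XML is parsed with etree.fromstring, which is strict about well-formedness.
xml_root = etree.fromstring("<books><book id='1'>lxml</book></books>")
book_ids = xml_root.xpath("/books/book/@id")
print(book_ids)  # ['1']
```

Printing the serialized tree is a quick way to check how the parser interpreted broken markup before you start writing XPath expressions against it.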

2. Extract page elements with XPath syntax

Selecting elements by node path

# xpath() selects elements by following the tag path
print(html.xpath('/html/body/p'))
# [<Element p at 0x2ebc908>, <Element p at 0x2ebc8c8>, <Element p at 0x2eb9a48>]
print(html.xpath('/html'))
# [<Element html at 0x34bc948>]
# Find every a node anywhere in the document
print(html.xpath('//a'))
# Select the html node at the document root
print(html.xpath('/html'))
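
The difference between `/`, `//` and a leading `.` is easier to see when xpath() is called on an element rather than on the root. The sketch below uses a small made-up document with two div elements (an assumption for illustration only, not the sample page above).

```python
from lxml import etree

# Hypothetical document with two sibling containers.
doc = etree.HTML("""
<html><body>
<div id="a"><p>one</p><p>two</p></div>
<div id="b"><p>three</p></div>
</body></html>
""")

# Absolute path from the document root: every p directly under /html/body/div.
all_p = doc.xpath('/html/body/div/p')
print(len(all_p))  # 3

# xpath() always returns a list for node selections, so index into it.
first_div = doc.xpath('//div[@id="a"]')[0]

# A leading "." makes the expression relative to the element it is called on.
children = first_div.xpath('./p')       # p children of this div only
descendants = first_div.xpath('.//p')   # p descendants of this div only
print(len(children), len(descendants))  # 2 2

# A leading "//" searches the whole document, even when called on an element.
whole_doc = first_div.xpath('//p')
print(len(whole_doc))  # 3
```

When looping over result elements and querying inside each one, `.//` is usually the form you want; a bare `//` silently ignores the element you started from.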

Selecting elements by filtering

'''
Select elements by a single attribute
'''
# a tags anywhere in the document whose class attribute is "bro"
# (the sample document has none, so this prints [])
print(html.xpath('//a[@class="bro"]'))

# a tags anywhere in the document whose id attribute is "link3"
print(html.xpath('//a[@id="link3"]'))

'''
Select elements by several attributes
'''
# a tags whose class contains "sister" and whose id contains "link1"
print(html.xpath('//a[contains(@class,"sister") and contains(@id,"link1")]'))

# a tags whose class contains "bro" or whose id contains "link1"
print(html.xpath('//a[contains(@class,"bro") or contains(@id,"link1")]'))

# last() selects each a tag that is the last a child of its parent
print(html.xpath('//a[last()]'))
# The index 1 selects each a tag that is the first a child of its parent
print(html.xpath('//a[1]'))
# position() filter: the first two a tags under each parent
print(html.xpath('//a[position() < 3]'))

'''
Select several elements with a computed expression
'''
# position() filter: the first and the third a tag
# Operators available in expressions: >, <, =, >=, <=, +, -, and, or
print(html.xpath('//a[position() = 1 or position() = 3]'))
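
In the sample page all three a tags share one parent, so `//a[1]` and `//a[last()]` each return a single element. Positional predicates are evaluated per parent, though, which matters as soon as links are spread over several containers. The sketch below uses a made-up document to show the difference and how parentheses give a document-wide position.

```python
from lxml import etree

# Hypothetical document: two paragraphs with two links each.
doc = etree.HTML("""
<html><body>
<p><a id="a1">one</a><a id="a2">two</a></p>
<p><a id="a3">three</a><a id="a4">four</a></p>
</body></html>
""")

# Without parentheses the position is counted among siblings of each parent.
first_per_parent = doc.xpath('//a[1]/@id')
last_per_parent = doc.xpath('//a[last()]/@id')
print(first_per_parent)  # ['a1', 'a3']
print(last_per_parent)   # ['a2', 'a4']

# Parentheses build the full node set first, then apply the position to it.
first_overall = doc.xpath('(//a)[1]/@id')
last_overall = doc.xpath('(//a)[last()]/@id')
first_two_overall = doc.xpath('(//a)[position() < 3]/@id')
print(first_overall)      # ['a1']
print(last_overall)       # ['a4']
print(first_two_overall)  # ['a1', 'a2']
```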

Getting element attributes and text

'''
Use @ to get attribute values and text() to get a tag's text
'''
# Get an attribute value
print(html.xpath('//a[position() = 1]/@class'))
# ['sister']
# Get the tag's text
print(html.xpath('//a[position() = 1]/text()'))
# ['Elsie']
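
One thing to watch with text() is that it returns only the text nodes that are direct children of the element. If a link contains nested markup, part of its text is missed. The sketch below, on a made-up snippet where one link wraps part of its label in a b tag, compares text() with the XPath string() function and pairs each id with its full text.

```python
from lxml import etree

# Hypothetical snippet: the second link has nested markup inside it.
doc = etree.HTML("""
<p class="story">Sisters:
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2"><b>La</b>cie</a>
</p>
""")

# @id on a node set yields one attribute value per matching element.
ids = doc.xpath('//a/@id')
print(ids)  # ['link1', 'link2']

# text() sees only the direct text children, so the nested "La" is missing.
direct_text = doc.xpath('//a[@id="link2"]/text()')
print(direct_text)  # ['cie']

# string() concatenates all descendant text and returns a single string.
full_text = doc.xpath('string(//a[@id="link2"])')
print(full_text)  # Lacie

# Pair every link's id with its full text, evaluated relative to each element.
links = [(a.get('id'), a.xpath('string(.)')) for a in doc.xpath('//a')]
print(links)  # [('link1', 'Elsie'), ('link2', 'Lacie')]
```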
