Python web scraping: lxml and XPath

Scraping the movies currently showing on Douban: Douban Movies (豆瓣电影)

import requests
from lxml import etree

header = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.132 Safari/537.36 '
}
url = 'https://movie.douban.com/cinema/nowplaying/foshan/'
req = requests.get(url, headers=header)
txt = req.text
html = etree.HTML(txt)
ul = html.xpath('//ul[@class="lists"]')[0]
# Alternatively:
# ul = html.xpath('//div[@id="nowplaying"]//ul[@class="lists"]')[0]
# Collect the movie titles into lis
lis = ul.xpath('./li/@data-title')
for li in lis:
    print(li)

Result: the title of each movie currently showing is printed on its own line.

A few points to note:

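First: xpath() always returns a list, so you have to index into it to pick out the element you want. If you pass the list itself to etree.tostring(), it fails with a "cannot be serialized" error.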

Error: TypeError: Type 'list' cannot be serialized.

import requests
from lxml import etree

header = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.132 Safari/537.36 '
}
url = 'https://movie.douban.com/cinema/nowplaying/foshan/'
req = requests.get(url, headers=header)
txt = req.text
html = etree.HTML(txt)
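# Wrong: xpath() returns a list, which tostring() cannot serialize
ul = html.xpath('//ul[@class="lists"]')
# Correct: take the first element of the list
# ul = html.xpath('//ul[@class="lists"]')[0]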
print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
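
The same behavior can be reproduced offline, without requesting Douban. The following is a minimal sketch: the SAMPLE string is a made-up HTML fragment (an assumption, not the real page markup), and the variable names are illustrative only. It shows that xpath() returns a list, that serializing the list raises TypeError, and that indexing first avoids the error.

```python
from lxml import etree

# Hypothetical HTML fragment standing in for the Douban page.
SAMPLE = '<div><ul class="lists"><li data-title="Movie A"></li></ul></div>'

html = etree.HTML(SAMPLE)
matches = html.xpath('//ul[@class="lists"]')

# xpath() returns a list even when only one element matches.
is_list = isinstance(matches, list)

# Passing the list itself to tostring() raises TypeError.
try:
    etree.tostring(matches)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Indexing picks out a single element, which serializes without error.
ul = matches[0]
serialized = etree.tostring(ul, encoding='utf-8').decode('utf-8')

print(error_message)
print(serialized)
```

If the expression might match nothing, matches[0] raises IndexError, so checking that the list is non-empty before indexing is a reasonable precaution.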

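Second: when you call xpath() again on an element to get its descendants, put a dot in front of //, which means "search under the current element". Without the dot, the expression leaves the current element and searches the whole page for every matching tag.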

import requests
from lxml import etree

header = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.132 Safari/537.36 '
}
url = 'https://movie.douban.com/cinema/nowplaying/foshan/'
req = requests.get(url, headers=header)
txt = req.text
html = etree.HTML(txt)
ul = html.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('//li/@data-title')  # every matching title on the page, including upcoming movies
# lis = ul.xpath('.//li/@data-title')  # only the titles under this ul, excluding upcoming movies
for li in lis:
    print(li)

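Result: with '//li/@data-title' the output also contains the upcoming movies; with './/li/@data-title' it contains only the movies currently showing.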
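
The difference between // and .// can also be checked on a small fragment. This is a sketch under assumptions: the SAMPLE markup below is invented for illustration (two lists with made-up titles) and only loosely mirrors the "now playing" and "upcoming" sections of the real page.

```python
from lxml import etree

# Hypothetical fragment: one list of current movies, one of upcoming movies.
SAMPLE = '''
<div id="nowplaying">
  <ul class="lists">
    <li data-title="Now Playing A"></li>
    <li data-title="Now Playing B"></li>
  </ul>
</div>
<div id="upcoming">
  <ul class="lists">
    <li data-title="Upcoming C"></li>
  </ul>
</div>
'''

html = etree.HTML(SAMPLE)
# The first matching ul is the "now playing" list.
ul = html.xpath('//ul[@class="lists"]')[0]

# No leading dot: the search starts from the document root,
# even though xpath() is called on ul.
whole_page = ul.xpath('//li/@data-title')

# Leading dot: the search is limited to descendants of ul.
under_ul = ul.xpath('.//li/@data-title')

print(whole_page)
print(under_ul)
```

Here whole_page picks up all three titles, while under_ul holds only the two titles inside the first list.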

Summary: I'm still learning and will keep updating this post...