PyQuery Usage Explained

PyQuery is a powerful and flexible HTML parsing library. If regular expressions feel too cumbersome to write, if BeautifulSoup's syntax is hard to remember, and if you already know jQuery's syntax, PyQuery is an excellent choice.

1. Initialization

There are three ways to initialize a PyQuery object: pass in a string, a URL, or a file.

String initialization

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)    # create the PyQuery object
    # Select with a CSS selector: prefix an id with #, a class with .,
    # and use the bare name for a tag.
    print(doc('li'))

URL initialization

You can also pass in a URL directly. The program requests the URL automatically, fetches the HTML, and parses it so that you can query it.

    from pyquery import PyQuery as pq
    doc = pq(url='http://www.baidu.com')    # the URL is requested automatically
    print(doc('head'))                      # prints the head tag

File initialization

    from pyquery import PyQuery as pq
    # Pass the file name and path; the file is located and read automatically.
    doc = pq(filename='D://demo.html')
    print(doc('li'))

2. Basic CSS selectors

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('#container .list li'))

This selects the li tags inside an element with class list, which is itself inside the element with id container. The spaces only express nesting: each match must be a descendant of the previous part, but not necessarily a direct child.

Finding elements

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

Child elements

    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')    # select the element with class list
    print(type(items))
    print(items)
    # find() searches inside items for li tags. The result is also a PyQuery
    # object, so you can keep calling find() on it and drill down level by level.
    lis = items.find('li')
    print(type(lis))
    print(lis)

You can also use .children() to look up direct children only:

    lis = items.children()
    print(type(lis))
    print(lis)
    lis = items.children('.active')    # only the direct children with class active
    print(lis)

Parent element

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    container = items.parent()    # .parent() returns the direct parent element
    print(type(container))
    print(container)

Ancestor nodes

    parents = items.parents()          # .parents() returns all ancestor nodes
    parent = items.parents('.wrap')    # a selector can be passed in to filter them
    print(parent)

With the html above, no ancestor has the class wrap, so the filtered result is empty. It matches once the markup is wrapped in <div class="wrap">, as in the html of the next example.

Sibling elements

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    # A space means "inside" (descendant); no space means both classes
    # belong to the same element.
    li = doc('.list .item-0.active')
    # .siblings() returns the elements at the same level, excluding the element itself.
    print(li.siblings())

3. Traversal

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    lis = doc('li').items()    # .items() returns a generator
    print(type(lis))
    for li in lis:
        print(li)

4. Getting information

Getting attributes

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.attr('href'))    # read the href attribute by name
    print(a.attr.href)       # attribute-style access gives the same value

The href attribute of an a tag specifies the URL of the link target. When the user follows the link, the browser tries to retrieve and display the document at that URL, or runs the JavaScript it contains.

Result:

    <a href="link3.html"><span class="bold">third item</span></a>
    link3.html
    link3.html

Getting text

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.text())    # .text() returns the text content; note that it is a method call

Getting HTML

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())    # .html() returns the HTML inside the element

5. DOM operations

addClass, removeClass

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.removeClass('active')    # remove the class
    print(li)
    li.addClass('active')       # add it back
    print(li)

attr, css

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name', 'link')         # add an attribute
    print(li)
    li.css('font-size', '14px')     # add a CSS style
    print(li)

Result:

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

remove

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()    # find the p tag, then delete it
    print(wrap.text())

Result (the exact whitespace between the two pieces of text in the first line may vary by PyQuery version):

    Hello, World This is a paragraph.
    Hello, World

Other DOM methods

http://pyquery.readthedocs.io/en/latest/api.html

6. Pseudo-class selectors

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('li:first-child')         # the first li
    print(li)
    li = doc('li:last-child')          # the last li
    print(li)
    li = doc('li:nth-child(2)')        # the second li
    print(li)
    li = doc('li:gt(2)')               # li tags with a 0-based index greater than 2
    print(li)
    li = doc('li:nth-child(2n)')       # the even-numbered li tags
    print(li)
    li = doc('li:contains(second)')    # li tags whose text contains "second"
    print(li)

Result:

    <li class="item-0">first item</li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>

For more CSS selectors, see http://www.w3school.com.cn/css/index.asp

Official documentation: http://pyquery.readthedocs.io/