python - Strip HTML tags from an XPath @attribute
I'm trying to extract text from a webpage using lxml and XPath. There are two bits I need.
The main text body:
import requests
import lxml.html
page = requests.get(url)
pageopen = lxml.html.fromstring(page.content)  # parse the response body, not the Response object
body_one = pageopen.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')
which is working fine.
The second body of text (which is revealed after a mouse click) I have managed to get using:
pageopen.xpath('/html/body//div/div/div//div//span/@data-description')
but the text returned still has HTML junk in it.
Appending /text() to the above expression returns an empty list.
I've spent hours reading the lxml documentation, but it's all Greek to me.
How do I strip the HTML tags from an XPath @attribute?
Regarding "but text returned still has html junk in it":
if you mean that the string is itself HTML, use the technique you already understand for extracting text from HTML:
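# xpath() returns a list of attribute strings, so take one item before parsing it
descriptionhtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')
descriptionbody = lxml.html.fromstring(descriptionhtml[0])
# text_content() also collects text inside nested tags, which 'text()' alone would miss
descriptiontext = descriptionbody.text_content()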
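The same idea can be tried without a network request. Below is a minimal sketch, assuming the attribute value holds an HTML fragment: each attribute string is parsed again with lxml.html.fromstring and its text pulled out with text_content(). SAMPLE_PAGE, its markup, and the helper name attribute_text are made up for illustration and are not from the original page.

```python
import lxml.html

# Hypothetical page: each span's data-description attribute holds escaped HTML.
SAMPLE_PAGE = """
<html><body>
  <div><span data-description="&lt;p&gt;Hidden &lt;b&gt;details&lt;/b&gt; here.&lt;/p&gt;">More</span></div>
  <div><span data-description="&lt;p&gt;Second &lt;i&gt;item&lt;/i&gt;&lt;/p&gt;">More</span></div>
</body></html>
"""


def attribute_text(page_html, attr_xpath):
    """Return the tag-free text of every attribute value matched by attr_xpath."""
    tree = lxml.html.fromstring(page_html)
    texts = []
    for value in tree.xpath(attr_xpath):
        value = str(value)
        if not value.strip():
            continue  # fromstring raises on empty input, so skip blank attributes
        fragment = lxml.html.fromstring(value)  # parse the attribute as HTML
        texts.append(fragment.text_content().strip())
    return texts


descriptions = attribute_text(SAMPLE_PAGE, '//span/@data-description')
print(descriptions)
```

Looping over the whole result list rather than indexing a single item handles pages with several matching spans. This also suggests why /text() came back empty: an attribute value is a plain string with no child text nodes for XPath to select.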