Python - strip HTML tags from an XPath @attribute


I'm trying to extract text from a webpage using lxml and XPath. There are two bits I need.

The main text body:

import requests
import lxml.html

page = requests.get(url)
pageopen = lxml.html.fromstring(page.content)  # parse the response body, not the Response object
body_one = pageopen.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')

which is working fine.

The second body of text (which is revealed after a mouse click) I have managed to get using:

pageopen.xpath('/html/body//div/div/div//div//span/@data-description') 

but the text returned still has HTML junk in it.

Using the /text() function on the above statement returns an empty list.

I've spent hours reading the lxml documentation, but it's all Greek to me.

How do I strip the HTML tags from an XPath @attribute?

"but the text returned still has HTML junk in it"

If you mean that the string is itself HTML, use the technique you already understand for extracting text from HTML:

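descriptionhtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')
# xpath() returns a list of attribute strings; parse the first match as HTML
descriptionbody = lxml.html.fromstring(descriptionhtml[0])
# './/text()' collects text from nested tags too, not just the top level
descriptiontext = descriptionbody.xpath('.//text()')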
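
To make the idea concrete, here is a minimal self-contained sketch. The sample markup, the simplified `//span/@data-description` path, and the helper name `attribute_html_to_text` are assumptions made up for illustration, since the real page is not shown in the question. It parses each attribute value as its own HTML fragment and uses `text_content()` to drop the tags.

```python
import lxml.html

# Hypothetical page markup standing in for the real site: the description is
# stored as escaped HTML inside a data-description attribute.
SAMPLE_PAGE = """
<html><body>
  <div><span data-description="&lt;p&gt;First &lt;b&gt;bold&lt;/b&gt; line.&lt;/p&gt;&lt;p&gt;Second line.&lt;/p&gt;">more</span></div>
  <div><span data-description="">empty</span></div>
</body></html>
"""


def attribute_html_to_text(fragment):
    """Return the text of an HTML fragment string with the tags removed."""
    if not fragment.strip():
        # lxml raises a ParserError on empty input, so short-circuit here.
        return ""
    return lxml.html.fromstring(fragment).text_content()


pageopen = lxml.html.fromstring(SAMPLE_PAGE)
# xpath() on an attribute returns a list of strings, one per matching element.
descriptions = pageopen.xpath('//span/@data-description')
texts = [attribute_html_to_text(d) for d in descriptions]
```

Note that `text_content()` concatenates the text of adjacent elements with no separator, so if paragraph breaks matter you may prefer to collect `.//text()` and join the pieces yourself.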
