python - strip html tags from an xpath @attribute -

- January 15, 2010

I'm trying to extract text from a webpage using lxml and XPath. There are two bits I need.

The main text body:

import requests
import lxml.html
page = requests.get(url)
# Parse the response body, not the Response object itself
pageopen = lxml.html.fromstring(page.content)
body_one = pageopen.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')

which is working fine.

The second body of text (which is revealed after a mouse click) I have managed to get using:

pageopen.xpath('/html/body//div/div/div//div//span/@data-description')

but the text returned still has HTML junk in it.

Appending /text() to the above expression returns an empty list.

I've spent hours reading the lxml documentation, but it's all Greek to me.

How do I strip HTML tags from an XPath @attribute?

"but the text returned still has HTML junk in it"

If you mean that the string is itself HTML, then use the same technique you already understand for extracting text from HTML:

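descriptionhtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')
# xpath() returns a list of attribute strings, so parse each one as its own HTML fragment
descriptionbody = [lxml.html.fromstring(html) for html in descriptionhtml]
# text_content() returns all the text inside the fragment, with the tags removed
descriptiontext = [body.text_content() for body in descriptionbody]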
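As a rough, self-contained illustration of the idea above, the sketch below runs against a made-up page instead of a live URL. SAMPLE_PAGE, its markup, and the helper strip_tags are assumptions invented for this example and are not part of the original question. It also shows why appending /text() gives an empty list: in XPath an attribute node has no child text nodes, so there is nothing for text() to select.

```python
import lxml.html

# Hypothetical sample page: the span's data-description attribute holds HTML markup.
SAMPLE_PAGE = """
<html><body>
  <div><span data-description="&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;!&lt;/p&gt;">click me</span></div>
</body></html>
"""


def strip_tags(fragment):
    """Return the text of an HTML fragment string with all tags removed."""
    # lxml refuses to parse an empty document, so handle blank attributes first.
    if not fragment.strip():
        return ""
    return lxml.html.fromstring(fragment).text_content()


pageopen = lxml.html.fromstring(SAMPLE_PAGE)

# An attribute node has no child text nodes, so this selects nothing.
empty = pageopen.xpath('//span/@data-description/text()')

# xpath() returns a list of attribute values, one string per matching span.
descriptions = pageopen.xpath('//span/@data-description')
texts = [strip_tags(d) for d in descriptions]

print(empty, texts)
```

The blank-string guard is a design choice for this sketch: attributes scraped from real pages are sometimes empty, and parsing an empty string with lxml.html.fromstring raises an error.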
