beautifulsoup - Extracting data from a web page using BS4 in Python -

- February 15, 2015

I am trying to extract data from this site: http://www.afl.com.au/fixture

in such a way that I end up with a dictionary that has the date as the key and the "preview" links as values in a list, like

    dict = {"saturday, june 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}

Please help me with it. I have used the code below:

    # table_for_players, ldatelist and gdict are globals defined elsewhere
    def extractdata():
        ldateinfomatchcase = False
        # ldateinfomatchcase = []
        global gdict
        # first pass: collect the text of every date header row
        for row in table_for_players.find_all("tr"):
            for ldaterowindex in row.find_all("th", {"colspan": "4"}):
                ldatelist.append(ldaterowindex.text)

        print(ldatelist)
        # second pass: for each date, rescan the table and collect its preview links
        for index in ldatelist:
            # print(index)
            lpreviewlinklist = []
            for row in table_for_players.find_all("tr"):
                for ldaterowindex in row.find_all("th", {"colspan": "4"}):
                    if ldaterowindex.text == index:
                        ldateinfomatchcase = True
                    else:
                        ldateinfomatchcase = False
                if ldateinfomatchcase == True:
                    for linforowindex in row.find_all("td", {"class": "info"}):
                        for link in linforowindex.find_all("a", {"class": "preview"}):
                            lpreviewlinklist.append("http://www.afl.com.au/" + link.get('href'))
            print(lpreviewlinklist)
            gdict[index] = lpreviewlinklist

My main aim is to get the names of the players playing in each match, for the home and the away team, organized by date in a data structure.

I prefer using CSS selectors. Select the first table, then the rows in the tbody for ease of processing; the rows are 'grouped' by tr th rows. From there you can select all the next siblings that don't contain th headers, and scan these for preview links:

    # assumes `soup` is a BeautifulSoup object for the fixture page
    previews = {}

    table = soup.select('table.fixture')[0]
    for group_header in table.select('tbody tr th'):
        date = group_header.string
        for next_sibling in group_header.parent.find_next_siblings('tr'):
            if next_sibling.th:
                # found the next group, end the scan
                break
            for preview in next_sibling.select('a.preview'):
                previews.setdefault(date, []).append(
                    "http://www.afl.com.au" + preview.get('href'))

This builds a dictionary of lists; for the current version of the page it produces:

    {u'monday, june 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
     u'sunday, june 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
                          'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
                          'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
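To try the grouping logic without fetching the live page, here is a minimal self-contained sketch. It assumes Python 3 and an installed `bs4`. The `SAMPLE_HTML` markup is a hypothetical, heavily simplified stand-in for the real fixture table, and `collect_previews` and `BASE_URL` are illustrative names, not part of the original code. It also swaps the string concatenation for `urllib.parse.urljoin`, which avoids the doubled slash you get when the base ends in `/` and the `href` starts with one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the fixture table markup (an assumption).
SAMPLE_HTML = """
<table class="fixture">
  <tbody>
    <tr><th colspan="4">Sunday, June 08</th></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/gcfc-v-syd">Preview</a></td></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/fre-v-adel">Preview</a></td></tr>
    <tr><th colspan="4">Monday, June 09</th></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/melb-v-coll">Preview</a></td></tr>
  </tbody>
</table>
"""

BASE_URL = "http://www.afl.com.au/"


def collect_previews(html, base_url=BASE_URL):
    """Map each date header to the preview URLs found in the rows below it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select("table.fixture")[0]
    previews = {}
    for group_header in table.select("tbody tr th"):
        date = group_header.get_text(strip=True)
        # Walk the rows after the header row until the next header row appears.
        for row in group_header.parent.find_next_siblings("tr"):
            if row.th:
                break
            for link in row.select("a.preview"):
                # urljoin handles the slash between base and path correctly.
                previews.setdefault(date, []).append(
                    urljoin(base_url, link.get("href")))
    return previews


if __name__ == "__main__":
    for date, links in collect_previews(SAMPLE_HTML).items():
        print(date, links)
```

Wrapping the scan in a function that takes the HTML as a string keeps the parsing separate from the download step, so the same function can be pointed at the real page source once it has been fetched.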
