Skip to content Skip to sidebar Skip to footer

Cant Scrape Webpage With Python Requests Library

I am trying to get some info from a webpage (link below) using Requests in python; however, the HTML data that I see in my browser doesn't seem to exist when I connect via python's

Solution 1:

The element is generated using javascript, you can use selenium to get the source, to get headless browsing combine it with phantomjs:

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get(url)
_html = browser.page_source

from bs4 import BeautifulSoup

print(BeautifulSoup(_html).find("span",{"id":"ourPrice"}).text)
$50

Solution 2:

here is the code, how i scrap a table from one site. in that site, they didn't define id or class in table so you no need to put anything. if id or class there means just use html.xpath('//table[@id=id_val]/tr') instead of html.xpath('//table/tr')

from lxml import etree
import urllib
web = urllib.urlopen("http://www.yourpage.com/")
html = etree.HTML(web.read())
tr_nodes = html.xpath('//table/tr')
td_content = [tr.xpath('td') for tr in tr_nodes  if [td.text for td in tr.xpath('td')][2] == 'Chennai' or [td.text for td in tr.xpath('td')][2] == 'Across India'  or 'Chennai' in [td.text for td in tr.xpath('td')][2].split('/') ]
main_list = []
for i in td_content:
    if i[5].text == 'Freshers' or  'Freshers' in i[5].text.split('/') or  '0' in i[5].text.split(' '):
       sub_list = [td.text for td in i]
       sub_list.insert(6,'http://yourpage.com/%s'%i[6].xpath('a')[0].get('href'))
       main_list.append(sub_list)
print 'main_list',main_list

Post a Comment for "Cant Scrape Webpage With Python Requests Library"