Can't Scrape Webpage With Python Requests Library
I am trying to get some information from a webpage (link below) using Requests in Python; however, the HTML data that I see in my browser doesn't seem to exist when I connect via Python's
Solution 1:
The element is generated by JavaScript, so it is not in the HTML that Requests receives. You can use Selenium to get the rendered source; for headless browsing, combine it with PhantomJS:
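from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'

# PhantomJS runs the page's JavaScript without opening a browser window.
# Note: Selenium 4 removed PhantomJS support; newer setups use headless Chrome or Firefox instead.
browser = webdriver.PhantomJS()
browser.get(url)
_html = browser.page_source
browser.quit()

# Parse the rendered source and print the text of the price element.
print(BeautifulSoup(_html, "html.parser").find("span", {"id": "ourPrice"}).text)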
$50
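A quick way to confirm that an element is generated by JavaScript is to check whether its id appears in the raw HTML at all. The sketch below does that with the standard library's html.parser. The two HTML strings are made-up stand-ins (not the real page), and IdCollector and has_element_with_id are hypothetical helper names used only for illustration.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Records every id attribute seen while parsing."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self.ids.add(element_id)


def has_element_with_id(html, element_id):
    """Return True if any element in the HTML string has the given id."""
    collector = IdCollector()
    collector.feed(html)
    return element_id in collector.ids


# Hypothetical markup: what the server sends before any script has run...
raw_html = '<html><body><div id="content"></div><script src="app.js"></script></body></html>'
# ...and what the browser holds after the script has filled in the price.
rendered_html = '<html><body><div id="content"><span id="ourPrice">$50</span></div></body></html>'

print(has_element_with_id(raw_html, "ourPrice"))       # False
print(has_element_with_id(rendered_html, "ourPrice"))  # True
```

If the same check on the text returned by Requests comes back False while the browser shows the element, the content is most likely added by a script, and a browser-driven tool such as Selenium is needed.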
Solution 2:
Here is the code I use to scrape a table from one site. That site doesn't define an id or class on the table, so the XPath needs no attribute filter. If the table does have an id or class, use html.xpath('//table[@id="id_val"]/tr') instead of html.xpath('//table/tr'), where id_val is the table's id.
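from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen
from lxml import etree

web = urlopen("http://www.yourpage.com/")
html = etree.HTML(web.read())
tr_nodes = html.xpath('//table/tr')

main_list = []
for tr in tr_nodes:
    tds = tr.xpath('td')
    if len(tds) < 7:
        continue  # skip header or short rows that lack the columns used below
    location = tds[2].text or ''    # column 2 appears to hold the location
    experience = tds[5].text or ''  # column 5 appears to hold the experience level
    # Keep 'Across India' or any '/'-separated location list that contains 'Chennai'.
    if location != 'Across India' and 'Chennai' not in location.split('/'):
        continue
    # Keep 'Freshers' entries or experience values that contain a '0' token.
    if 'Freshers' in experience.split('/') or '0' in experience.split(' '):
        sub_list = [td.text for td in tds]
        # Column 6 holds a relative link; insert its absolute URL at index 6.
        sub_list.insert(6, 'http://yourpage.com/%s' % tds[6].xpath('a')[0].get('href'))
        main_list.append(sub_list)
print('main_list', main_list)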
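To illustrate the difference between selecting rows from every table and selecting rows from one table by id, as described in Solution 2, here is a small sketch. It swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra packages; ElementTree supports only a subset of XPath and needs well-formed markup. The page content and the id "jobs" are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with two tables; only one has an id.
page = """
<html><body>
  <table><tr><td>menu</td></tr></table>
  <table id="jobs">
    <tr><td>Developer</td><td>Chennai/Pune</td></tr>
    <tr><td>Tester</td><td>Across India</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(page)

# Without an attribute filter, rows from every table are returned.
all_rows = root.findall(".//table/tr")

# With an id filter, only rows from the matching table are returned.
job_rows = root.findall(".//table[@id='jobs']/tr")

print(len(all_rows))  # 3
print(len(job_rows))  # 2
print([td.text for td in job_rows[0].findall("td")])  # ['Developer', 'Chennai/Pune']
```

With lxml, the equivalent expressions would be '//table/tr' and '//table[@id="jobs"]/tr' passed to html.xpath(), as in the answer above.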