need to get the exact redirect link - python-2.7

I need to get the final URL that the link redirects to, but this code only gives me a link to the store's landing page.
It returns: http://www.amazon.in/electronics/b?ie=UTF8&node=976419031
What I actually need is: http://www.amazon.in/Samsung-G-550FY-On5-Pro-Gold/dp/B01FM7GGFI?tag=prdeskdetailmob-21&ascsubtag=desktop-mobile-15920-blank-27092016
import mechanize

br = mechanize.Browser()
br.open("https://priceraja.com/r/go2store.php?mpc=mobile--1178916--15920--deskdetail")
br.select_form(nr=0)  # select the first form on the interstitial page
br.submit()
x = br.geturl()
print x
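mechanize (like plain requests) only follows HTTP-level redirects; it cannot execute the JavaScript hop that priceraja appears to use for the final jump to the store. A minimal sketch to check how far HTTP-level redirects get you, using requests (assuming it is installed):

import requests

# Follow any HTTP-level redirects and inspect the chain. If the last
# hop happens in JavaScript, response.url will still be an
# intermediate page and a real browser is needed.
response = requests.get(
    "https://priceraja.com/r/go2store.php?mpc=mobile--1178916--15920--deskdetail",
    allow_redirects=True)
for hop in response.history:
    print hop.url   # each intermediate redirect
print response.url  # the last URL requests could reach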

I also tried with Selenium:
from selenium import webdriver
import time

chrome_path = r"C:\Users\Bhanwar\Desktop\price raja mobile\working\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
link = "https://priceraja.com/r/go2store.php?mpc=mobile--1185105--15236--deskdetail"
driver.get(link)
# Poll until the JavaScript redirect changes the browser's URL.
while link == driver.current_url:
    time.sleep(3)
redirected_url = driver.current_url
print redirected_url
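Rather than polling in a sleep loop, Selenium's explicit waits can block until the URL changes. A sketch reusing chrome_path and link from above (assuming Selenium 3+, where expected_conditions.url_changes is available):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(chrome_path)
driver.get(link)
# Block for up to 30 seconds until the JavaScript redirect fires.
WebDriverWait(driver, 30).until(EC.url_changes(link))
print driver.current_url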

Related

web scraping with BeautifulSoup and python

I'm trying to print out all the IP addresses from this website https://hidemy.name/es/proxy-list/#list, but nothing happens.
Code in Python 2.7:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):  # go through max_pages of the website, starting from 1
    page = 0
    value = 0
    print('proxies')
    while page <= 18:
        value += 64
        url = 'https://hidemy.name/es/proxy-list/?start=' + str(value) + '#list'  # add the page offset to the link
        source_code = requests.get(url)  # get the website's HTML
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('td', {'class': 'tdl'}):  # the cells holding the addresses
            proxy = link.string  # the text content of the cell
            print(proxy)
        page += 1

trade_spider(1)
You aren't seeing any output because there are no matching elements in your soup.
I dumped all the variables to the output stream and figured out that this website is blocking crawlers. Try printing the plain_text variable; it will probably contain nothing but a warning message like:
It seems you are bot. If so, please use separate API interface. It cheap and easy to use.
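If the block keys on the default requests User-Agent header, sending browser-like headers may be enough to get a normal page back. A minimal sketch of that guess (the site may well fingerprint more than the header):

import requests

# hidemy.name may reject the default 'python-requests/x.y' User-Agent;
# pretending to be a regular browser is a common first workaround.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
source_code = requests.get('https://hidemy.name/es/proxy-list/?start=64#list',
                           headers=headers)
print(source_code.text[:500])  # check whether the bot warning is gone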

Regular expression to find precise pdf links in a webpage

Given url='http://normanpd.normanok.gov/content/daily-activity', the website lists three types of documents: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to discover the URL strings of all the Incident PDF documents.
The PDFs are to be downloaded to a defined location.
I have gone through the link and found that the Incident PDF URLs have the form:
normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
I have written this code:
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$', text)
But the urls list comes out empty.
I am a beginner in Python 3 and regex. Can anyone help me?
This is not an advisable method. Instead, use an HTML parsing library like bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results. (Your pattern fails because the URLs contain %20 rather than literal whitespace, so \s never matches, and the $ anchor only matches at the very end of the text.)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(Incident%20Summary\.pdf)'))
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])
Output:
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf
But if you were asked to use only regexes, then try something simpler:
import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
Using BeautifulSoup, this is an easy way:
soup = BeautifulSoup(open_page, 'html.parser')  # open_page: the HTML fetched earlier
links = []
for link in soup.find_all('a'):
    current = link.get('href')
    if current.endswith('pdf') and "Incident" in current:
        links.append('{0}{1}'.format(url, current))
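To actually download the PDFs to a defined location, a minimal sketch building on the urls list from the regex-only version above (the folder name here is made up):

import os
import urllib.parse
import urllib.request

target_dir = 'incident_pdfs'  # hypothetical download folder
os.makedirs(target_dir, exist_ok=True)
for link in urls:
    full_url = 'http://normanpd.normanok.gov/' + link
    # Unquote the %20 escapes so the saved filename is readable.
    filename = urllib.parse.unquote(link.split('/')[-1])
    urllib.request.urlretrieve(full_url, os.path.join(target_dir, filename))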

With selenium I do not get the data

I have successfully navigated to an iframe with Selenium + PhantomJS, but I do not get the data.
If I open the iframe URL in the Midori browser I can see the result, but with the webdriver the page comes back without the table.
Here is my test code:
from selenium import webdriver

link = 'http://ebelediye.fatih.bel.tr/alfa/servlet/hariciprogramlar.online.rayic?caid=1449'

def get_site():
    driver = webdriver.PhantomJS()
    driver.get(link)
    driver.find_element_by_name('btnlistele').click()
    src = driver.find_element_by_tag_name('iframe').get_attribute('src')
    driver.get(src)
    print driver.page_source
This seems to be a security measure triggered by the high frequency of your requests; the page responds with a flood warning:
FloodGuard Güvenlik uyarısı !!! Bu kadar sık istek gönderemezsiniz !!!
(FloodGuard security warning: you cannot send requests this often.)
Just add some delay, as below:
import time
from selenium import webdriver

link = 'http://ebelediye.fatih.bel.tr/alfa/servlet/hariciprogramlar.online.rayic?caid=1449'

def get_site():
    driver = webdriver.PhantomJS()
    driver.get(link)
    time.sleep(1)  # slow down so the FloodGuard check is not triggered
    driver.find_element_by_name('btnlistele').click()
    src = driver.find_element_by_tag_name('iframe').get_attribute('src')
    driver.get(src.replace('ISSK_KOD=', 'ISSK_KOD=999'))
    print driver.page_source
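An alternative worth trying is to switch the driver into the iframe in place instead of loading its src as a separate page, which keeps the original session context. A sketch of that idea, reusing link from above:

import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get(link)
time.sleep(1)
driver.find_element_by_name('btnlistele').click()
# Switch the driver's context into the iframe rather than reloading its src.
driver.switch_to.frame(driver.find_element_by_tag_name('iframe'))
print driver.page_source  # now the iframe's own document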

working with hrefs extracted from Beautifulsoup

I am a Python beginner learning web crawling.
For this project, I have to retrieve some hrefs and then print out the text content within each of these href links. Here is my code so far:
import requests, bs4, os, webbrowser

url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for a in soup.select('.link'):
    links = a.find('a').attrs['href']
I tried many things with the links, but it would just say "unicode is not callable".
How can I work with these links and eventually iterate over them to extract the text within each one?
Thanks
Your code is almost done; it just needs a small change:
import requests, bs4, os, webbrowser

url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = []
for div in soup.select('.link'):
    link = div.a.get('href')
    links.append(link)
print(links)
out:
['http://www.constructeursdefrance.com/concept-habitat/',
'http://www.constructeursdefrance.com/maisons-bois-cruard/',
'http://www.constructeursdefrance.com/passiva-concept/',
'http://www.constructeursdefrance.com/les-constructions-de-la-mayenne/',
'http://www.constructeursdefrance.com/maisonsdenfrance53/',
'http://www.constructeursdefrance.com/lemasson53/',
'http://www.constructeursdefrance.com/ecb53/',
'http://www.constructeursdefrance.com/villadeale-53/',
'http://www.constructeursdefrance.com/habitat-plus-53/']
select('.link') returns a list of the div tags that have a child a tag, so you can get the a tag with div.a and then get the href with div.a.get('href').
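From there, extracting the text within each linked page (the original goal) can be done by requesting every URL in turn. A minimal sketch reusing the links list and imports from above; get_text() returns a page's visible text:

for link in links:
    page = requests.get(link)
    page_soup = bs4.BeautifulSoup(page.text, 'html.parser')
    # get_text() strips the tags and returns only the text content.
    print(page_soup.get_text(strip=True)[:200])  # first 200 chars as a preview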
Try the following:
import requests, bs4, os, webbrowser

url = 'http://www.constructeursdefrance.com/resultat/?dpt=53'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')
links = soup.findAll('a')
for link in links:
    try:
        print link.attrs['href']
    except:
        pass
Hope this helps.

saving URLs to a list

I want to make a list of the URLs I have gotten from a webpage. I know it's simple, but I can't get my head around it (I've had a headache since yesterday lunch!).
Anyway, here is my code:
from bs4 import BeautifulSoup
from urllib import request

webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")
for line in incidents.find_all('a'):
    link = line.get('href')
    Links = Links + Link
I know it isn't working, but I'm not sure how to fix it. Sorry, and thanks in advance!
Raif
Change this:
for line in incidents.find_all('a'):
    link = line.get('href')
    Links = Links + Link
to this:
Links = []
for line in incidents.find_all('a'):
    Links.append(line.get('href'))
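If any of the extracted hrefs turn out to be relative, urllib.parse.urljoin can turn them into absolute URLs before they are saved. A hedged sketch, assuming the page's own address as the base:

from urllib.parse import urljoin

base = "http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm"
Links = []
for line in incidents.find_all('a'):
    # urljoin leaves absolute URLs alone and resolves relative ones.
    Links.append(urljoin(base, line.get('href')))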