Python 2 - cookies and headers with urllib2

With Python 2, I have two different problems.
With one URL, I get this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
So I tried to set up cookielib, but then I got this error:
urllib2.HTTPError: HTTP Error 403: Forbidden
I tried to combine the two approaches, without success; it is always urllib2.HTTPError: HTTP Error 403: Forbidden that is displayed.
import urllib2, sys
from bs4 import BeautifulSoup
import cookielib
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.9 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Language': 'fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
       'Connection': 'close'}
req = urllib2.Request(row['url'], None, hdr)  # row['url'] comes from the surrounding loop in my script
cookie = cookielib.CookieJar() # CookieJar object to store cookie
handler = urllib2.HTTPCookieProcessor(cookie) # create cookie processor
opener = urllib2.build_opener(handler) # a general opener
page = opener.open(req)
pagedata = BeautifulSoup(page,"html.parser")
Or:
req = urllib2.Request(row['url'],None,headers=hdr)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
page = opener.open(req)
pagedata = BeautifulSoup(page,"html.parser")
And many other variants, always with the same result.
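For what it's worth, the 302 loop often happens when the server sets a cookie and then redirects: if the cookie is never sent back, the same redirect is issued again and again. Below is a minimal sketch that puts the browser-like headers and the cookie jar on one opener; example.com is just a placeholder for the real URL.
import urllib2
import cookielib
from bs4 import BeautifulSoup

# One opener carries both the cookie jar and the browser-like headers,
# so cookies set during the redirect chain are sent back on every hop.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/62.0.3202.9 Safari/537.36'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3'),
]

page = opener.open('http://example.com/some-page')  # placeholder for row['url']
pagedata = BeautifulSoup(page, 'html.parser')
print(pagedata.title)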

Related

403 Response when I use requests to make a post request

My core code is as follows:
import requests
url='https://www.xxxx.top' #for example
data=dict()
session = requests.session()
session.get(url)
token = session.cookies.get('csrftoken')
data['csrfmiddlewaretoken'] = token
res = session.post(url=url, data=data, headers=session.headers, cookies=session.cookies)
print(res)
# <Response [403]>
The variable url is my own website, which is based on Django. I know I can use @csrf_exempt to disable CSRF, but I don't want to do that.
However, it returns a 403 response when I use requests to make a POST request. I wish someone could tell me what is wrong with my approach.
I have solved the problem. In this case, just add a Referer to the headers:
import requests
url = 'https://www.xxxx.top'  # for example
data = dict()
session = requests.session()
session.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/51.0.2704.63 Safari/537.36',
                   'Referer': url}  # the Referer header is what fixes the 403
session.get(url)
token = session.cookies.get('csrftoken')
data['csrfmiddlewaretoken'] = token
res = session.post(url=url, data=data, headers=session.headers, cookies=session.cookies)
print(res)
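As a side note, the Referer matters here because Django performs a strict Referer check on HTTPS requests as part of its CSRF protection. If you prefer not to put the token in the form body, Django also accepts it in the X-CSRFToken header by default; a sketch follows, where the field name some_field is only a placeholder.
import requests

url = 'https://www.xxxx.top'  # placeholder URL, as in the question
session = requests.session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Referer': url,  # Django's CSRF check on HTTPS requires a matching Referer
})

session.get(url)  # this sets the csrftoken cookie
token = session.cookies.get('csrftoken')

# Alternative to csrfmiddlewaretoken in the body: send the token in the
# X-CSRFToken header (Django's default CSRF header name).
res = session.post(url, data={'some_field': 'value'}, headers={'X-CSRFToken': token})
print(res.status_code)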

When I am web scraping in python using findAll from BeautifulSoup and regex (re.compile), I cannot loop it properly using css classes

I am new to BeautifulSoup and Python. On this WordPress website there are 4 articles on the homepage, but my code only gives me 3 articles, and hence 3 images attached to them. Is there a simpler way to do this?
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
bsObj = BeautifulSoup(html, features="html5lib")
articles = bsObj.findAll("article", {"class": "post"})
print(len(articles))
for article in articles:
    image = bsObj.findAll("img", {"src": re.compile("/wp-content/uploads/.*.jpg")})
    print(image)
Now that you have figured out the article count issue, there is indeed no much simpler solution, but there are some other versions you may want to check out.
A simplified version of your code is:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")
articles = bsObj.findAll("article", {"class": "post"})
for article in articles:
    print(article.find("img").get("src"))
And there is this version, which uses a list comprehension:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")
images = [article.find("img").get("src") for article in bsObj.findAll("article", {"class": "post"})]
print(images)
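One caveat with the one-liner: if any post happens to have no image, article.find("img") returns None and .get("src") raises an AttributeError. A defensive variant (a sketch, assuming the same page layout) is:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")

# Skip any article that has no <img> instead of crashing on None.
images = [article.find("img").get("src")
          for article in bsObj.findAll("article", {"class": "post"})
          if article.find("img") is not None]
print(images)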
There is also an approach with lxml; it is not as tidy, but XPath lets you find elements easily even when they sit in awkward places:
from urllib.request import Request, urlopen
from lxml import etree
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
lxmlHtml = etree.HTMLParser()
htmlPage = etree.parse(html, lxmlHtml)
images = htmlPage.xpath("//article[contains(@class, 'post') and not(contains(@class, 'page'))]//img")
for image in images:
    print(image.attrib["src"])
I finally came up with the right solution without using regex.
Thank you Guven Degirmenci; please see the code below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, features="html5lib")
images = [article.find("img").get("src")
          for article in bsObj.findAll("article", {"class": "post"})]
for image in images:
    print(image)
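For completeness, BeautifulSoup's CSS selector support can express the same query without regex or a nested loop; a small sketch along the same lines (same URL and headers as above):
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, features="html5lib")

# "article.post img" selects every <img> inside an <article class="post">.
for image in bsObj.select("article.post img"):
    print(image.get("src"))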

Connection Error : Max retries exceeded with url:/API/Admin/login.php

I am running my Django application on localhost, and I tried with my IP address as well. I am getting a connection error. My views.py file is below:
import requests

def user_login(request):
    datas = {'log': False}
    if request.method == "POST":
        usern = request.POST.get('username')
        passw = request.POST.get('password')
        response = requests.post(url='https://www.onebookingsystem.com/API/Admin/login.php',
                                 data={"Username": usern, "password": passw}, timeout=5)
        json_data = response.json()
The error I am getting is given below.
File "C:\Users\Android V\Anaconda3\envs\djangoenv\lib\site-packages\requests\adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.onebookingsystem.com', port=443): Max retries exceeded with url: /productionApi/API/Admin/login.php (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000000054620B8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
[09/Oct/2019 18:28:21] "POST / HTTP/1.1" 500 156251
Can you try to use headers like this:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = requests.post("https://www.onebookingsystem.com/API/Admin/login.php",headers={'User-Agent': user_agent})
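The "actively refused" part of the traceback usually means nothing is listening on that host and port from where the request is made, so headers alone may not be enough; it can also help to retry transient failures and catch the exception so the view does not crash. A sketch, assuming the same endpoint and placeholder credentials:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient connection failures a few times with a small backoff before giving up.
session.mount('https://', HTTPAdapter(max_retries=Retry(total=3, backoff_factor=0.5)))

usern, passw = 'demo_user', 'demo_pass'  # placeholders for the request.POST values
try:
    response = session.post(
        'https://www.onebookingsystem.com/API/Admin/login.php',
        data={'Username': usern, 'password': passw},
        timeout=5,
    )
    json_data = response.json()
except requests.exceptions.ConnectionError as exc:
    # Raised when the target machine refuses or drops the connection (WinError 10061).
    json_data = None
    print('Login API unreachable:', exc)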

BS4 error 'NoneType' object has no attribute 'find_all'. Cannot parse html data

I get the BS4 error 'NoneType' object has no attribute 'find_all' and cannot parse the HTML data.
import requests
from bs4 import BeautifulSoup as bs

session = requests.session()

def get_sizes_in_stock():
    global session
    endpoint = 'https://www.jimmyjazz.com/mens/footwear/nike-air-max-270/AH8050-100?color=White'
    response = session.get(endpoint)
    soup = bs(response.text, 'html.parser')
    div = soup.find('div', {'class': 'box_wrapper'})
    all_sizes = div.find_all('a')
    sizes_in_stock = []
    for size in all_sizes:
        if 'piunavailable' not in size['class']:
            size_id = size['id']
            sizes_in_stock.append(size_id.split('_')[1])
    return sizes_in_stock

print(get_sizes_in_stock())
Try adding in the headers parameter. Change:
response = session.get(endpoint)
to:
response = session.get(endpoint, headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
import requests
from bs4 import BeautifulSoup as bs

session = requests.session()

def get_sizes_in_stock():
    global session
    endpoint = "https://www.sneakers76.com/en/nike/5111-nike-af1-type-ci0054-001-.html"
    response = session.get(endpoint, headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Mobile Safari/537.36'})
    soup = bs(response.text, "html.parser")
    var = soup.find("var", {"blockwishlist_viewwishlist": "View your wishlist"})
    all_sizes = var.find_all("var combinations")
    sizes_in_stock = []
    for size in all_sizes:
        if "0" not in size["quantity"]:
            size_id = size["attributes"]
            sizes_in_stock.append(size_id)
    return sizes_in_stock

print(get_sizes_in_stock())
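As a side note (not required for the fix above), the same User-Agent can also be set once on the session so that every subsequent request sends it automatically; a minimal sketch using the URL from the first snippet:
import requests
from bs4 import BeautifulSoup as bs

session = requests.session()
# Headers set on the session are merged into every later session.get/post call.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
})

response = session.get('https://www.jimmyjazz.com/mens/footwear/nike-air-max-270/AH8050-100?color=White')
soup = bs(response.text, 'html.parser')
print(response.status_code)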

Below POST Method is not working in scrapy

I have tried with headers, cookies, formdata and body too, but I got 401 and 500 status codes. On this site the first page is fetched with GET and returns an HTML response, while further pages use POST and return a JSON response. These status codes usually indicate an authorisation problem, but I have searched and I could not find any CSRF token or auth token in the page headers.
import scrapy
from SouthShore.items import Product
from scrapy.http import Request, FormRequest

class OjcommerceDSpider(scrapy.Spider):
    handle_httpstatus_list = [401, 500]
    name = "ojcommerce_d"
    allowed_domains = ["ojcommerce.com"]
    #start_urls = ['http://www.ojcommerce.com/search?k=south%20shore%20furniture']

    def start_requests(self):
        return [FormRequest('http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
                            method="POST",
                            body='''{"searchTitle" : "south shore furniture","pageIndex" : '2',"sortBy":"1"}''',
                            headers={'Content-Type': 'application/json; charset=UTF-8',
                                     'Accept': 'application/json, text/javascript, */*; q=0.01',
                                     'Cookie': '''vid=eAZZP6XwbmybjpTWQCLS+g==;
                                     _ga=GA1.2.1154881264.1480509732;
                                     ASP.NET_SessionId=rkklowbpaxzpp50btpira1yp'''},
                            callback=self.parse)]

    def parse(self, response):
        with open("ojcommerce.json", "wb") as f:
            f.write(response.body)
I got it working with the following code:
import json
from scrapy import Request, Spider

class OjcommerceDSpider(Spider):
    name = "ojcommerce"
    allowed_domains = ["ojcommerce.com"]
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'COOKIES_DEBUG': True,
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        },
    }

    def start_requests(self):
        yield Request(
            url='http://www.ojcommerce.com/search?k=furniture',
            callback=self.parse_search_page,
        )

    def parse_search_page(self, response):
        yield Request(
            url='http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
            method='POST',
            body=json.dumps({'searchTitle': 'furniture', 'pageIndex': '2', 'sortBy': '1'}),
            callback=self.parse_json_page,
            headers={
                'Content-Type': 'application/json; charset=UTF-8',
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'X-Requested-With': 'XMLHttpRequest',
            },
        )

    def parse_json_page(self, response):
        data = json.loads(response.body)
        with open('ojcommerce.json', 'w') as f:
            json.dump(data, f, indent=4)
Two observations:
a previous request to another page of the site is needed to get a "fresh" ASP.NET_SessionId cookie;
I couldn't make it work with FormRequest, so use a plain Request instead (a short illustration of the difference follows below).
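To illustrate the second observation (the requests below are only constructed, not sent): FormRequest form-encodes its payload, while this endpoint expects a raw JSON body, which a plain Request can carry verbatim.
import json
from scrapy.http import Request, FormRequest

# FormRequest would send searchTitle=furniture&pageIndex=2&sortBy=1 with
# Content-Type: application/x-www-form-urlencoded -- not what the JSON
# endpoint expects.
form_req = FormRequest(
    'http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
    formdata={'searchTitle': 'furniture', 'pageIndex': '2', 'sortBy': '1'},
)

# A plain Request sends the JSON string as-is, with a matching Content-Type.
json_req = Request(
    'http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
    method='POST',
    body=json.dumps({'searchTitle': 'furniture', 'pageIndex': '2', 'sortBy': '1'}),
    headers={'Content-Type': 'application/json; charset=UTF-8'},
)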