I have tried headers, cookies, form data and a raw body, but I still get 401 and 500 status codes. On this site the first page is fetched with GET and returns HTML, while the following pages are fetched with POST and return JSON. These status codes suggest an authorization problem, but I searched and could not find any CSRF token or auth token in the page or its headers.
import scrapy
from SouthShore.items import Product
from scrapy.http import Request, FormRequest


class OjcommerceDSpider(scrapy.Spider):
    handle_httpstatus_list = [401, 500]
    name = "ojcommerce_d"
    allowed_domains = ["ojcommerce.com"]
    #start_urls = ['http://www.ojcommerce.com/search?k=south%20shore%20furniture']

    def start_requests(self):
        return [FormRequest('http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
                            method="POST",
                            body='''{"searchTitle" : "south shore furniture","pageIndex" : '2',"sortBy":"1"}''',
                            headers={'Content-Type': 'application/json; charset=UTF-8',
                                     'Accept': 'application/json, text/javascript, */*; q=0.01',
                                     'Cookie': '''vid=eAZZP6XwbmybjpTWQCLS+g==;
_ga=GA1.2.1154881264.1480509732;
ASP.NET_SessionId=rkklowbpaxzpp50btpira1yp'''},
                            callback=self.parse)]

    def parse(self, response):
        with open("ojcommerce.json", "wb") as f:
            f.write(response.body)
I got it working with the following code:
import json
from scrapy import Request, Spider


class OjcommerceDSpider(Spider):
    name = "ojcommerce"
    allowed_domains = ["ojcommerce.com"]
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'COOKIES_DEBUG': True,
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        },
    }

    def start_requests(self):
        # Visit a regular page first so the site sets a fresh session cookie.
        yield Request(
            url='http://www.ojcommerce.com/search?k=furniture',
            callback=self.parse_search_page,
        )

    def parse_search_page(self, response):
        # The paging endpoint expects a JSON body, not form-encoded data.
        yield Request(
            url='http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
            method='POST',
            body=json.dumps({'searchTitle': 'furniture', 'pageIndex': '2', 'sortBy': '1'}),
            callback=self.parse_json_page,
            headers={
                'Content-Type': 'application/json; charset=UTF-8',
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'X-Requested-With': 'XMLHttpRequest',
            },
        )

    def parse_json_page(self, response):
        data = json.loads(response.body)
        # Text mode: json.dump writes str, so 'wb' fails on Python 3.
        with open('ojcommerce.json', 'w') as f:
            json.dump(data, f, indent=4)
Two observations:
A previous request to another page of the site is needed to get a "fresh" ASP.NET_SessionId cookie.
I couldn't make it work using FormRequest; use Request instead.
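A side note on the request body: the hand-written string in the question puts single quotes around 2, which is not valid JSON, whereas the working spider builds the body with json.dumps. Whether that alone caused the 500 response is an assumption, since the server's behaviour is not shown here, but the difference is easy to check with the standard library. The helper name is_valid_json is made up for this illustration.

```python
import json

# Body string copied from the question: note the single quotes around 2.
hand_written = '''{"searchTitle" : "south shore furniture","pageIndex" : '2',"sortBy":"1"}'''

# Body built from a dict, as in the working spider (same values as the question).
generated = json.dumps(
    {"searchTitle": "south shore furniture", "pageIndex": "2", "sortBy": "1"}
)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


if __name__ == "__main__":
    print(is_valid_json(hand_written))  # False
    print(is_valid_json(generated))     # True
```

Building the body from a dict avoids this class of quoting mistake entirely.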
With Python 2:
I have two different problems.
With one URL, I get this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
So I tried setting up cookielib, but then I got this error:
urllib2.HTTPError: HTTP Error 403: Forbidden
I tried to combine the two approaches, without success. The error displayed is always urllib2.HTTPError: HTTP Error 403: Forbidden.
import urllib2, sys
from bs4 import BeautifulSoup
import cookielib
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.9 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Language': 'fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
       'Connection': 'close'}
req = urllib2.Request(row['url'], None, hdr)
cookie = cookielib.CookieJar() # CookieJar object to store cookie
handler = urllib2.HTTPCookieProcessor(cookie) # create cookie processor
opener = urllib2.build_opener(handler) # a general opener
page = opener.open(req)
pagedata = BeautifulSoup(page,"html.parser")
Or :
req = urllib2.Request(row['url'],None,headers=hdr)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
page = opener.open(req)
pagedata = BeautifulSoup(page,"html.parser")
And many ...
When I use Superset's API to import a dashboard, the response is a login page.
I am making the request with Python.
import requests
headers = {
    'accept': 'application/json',
    'Authorization': f'Bearer {jwt_token}',
    'X-CSRFToken': csrf_token,
    'Referer': url
}
files = {
    'formData': (
        dashboard_path,
        open(dashboard_path, 'rb'),
        'application/json'
    )
}
response = requests.get(url, files=files, headers=headers)
Does anyone know how to solve this problem?
I had some trouble with the Superset API myself, mostly because I did not handle the CSRF token correctly.
It seems to be important that the retrieval of the JWT token, the retrieval of the CSRF token and the actual request all happen in the same session.
If I don't do that, I can reproduce your error and am also sent to the login page. (Also, you use a GET request in this example, but it should be a POST.)
Here is an example from my local test setup:
import requests

session = requests.session()

# 1. Log in and get the JWT token.
jwt_token = session.post(
    url='http://localhost:8088/api/v1/security/login',
    json={
        "username": "admin",
        "password": "admin",
        "refresh": False,
        "provider": "db"
    }
).json()["access_token"]

# 2. Fetch the CSRF token in the same session.
csrf_token = session.get(
    url='http://localhost:8088/api/v1/security/csrf_token/',
    headers={
        'Authorization': f'Bearer {jwt_token}',
    }
).json()["result"]

headers = {
    'accept': 'application/json',
    'Authorization': f'Bearer {jwt_token}',
    'X-CSRFToken': csrf_token,
}

# 3. Send the import request through the same session (session.post, not requests.post).
response = session.post(
    'http://localhost:8088/api/v1/dashboard/import',
    headers=headers,
    files={
        "formData": ('dashboards.json', open("dashboards.json", "rb"), 'application/json')
    },
)
session.close()
I'm simply trying to send a JSON POST request from axios to Flask, but I get 'OPTIONS' in the server console, which I understand is the preflight request. I found that if I use x-www-form-urlencoded instead of application/json in the axios headers, the browser doesn't send a preflight request, so I finally get a POST request. But the POST branch (as you can see in the code below) still doesn't get hit. I keep getting a CORS error even though I've set Access-Control-Allow-Origin on the server. What could be the problem here?
//FLASK SERVER
@bp.route("/", methods=["GET", "POST"])
def recipes():
    if request.method == "GET":
        # show all the recipes
        recipes = [
            {'name': 'BURGER', 'ingredients': ['this', 'that', 'blah']},
            {'name': 'CHICKEN'}
        ]
        return jsonify(recipes)
    elif request.method == "POST":
        # save a recipe
        print('SEE HEREEE'+ str(request.data))
        print(request.is_json)
        content = request.get_json()
        print(content)
        return jsonify(content), 201, {'Access-Control-Allow-Origin': '*', 'Access-Control-Request-Method': "*", 'Access-Control-Allow-Headers': "*"}
//FRONTEND
try {
    let response = await axios({
        method: "POST",
        url: "http://localhost:5000/recipes/",
        headers: {
            "Content-Type": "*"
        },
        data: {"hello": "HI"}
    });
    console.log("RESPONSE HERE", response)
} catch (err) {
    throw new Error("ERROR", err)
}
//Browser Console
If there is an error in the Python code, a similar error shows up on the front end. From your screenshot, I see that there is an error in LoginForm, which I think is why the front end is not working as expected.
To handle CORS errors, I use the flask_cors Flask extension. Details of the package can be found on its PyPI page.
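For context on why the browser sends OPTIONS in the first place: a cross-origin request is only sent without a preflight when it is a so-called simple request. The sketch below is a deliberately simplified version of that rule, assuming no custom headers beyond Content-Type; the function name needs_preflight is made up for this illustration and is not part of Flask, flask_cors or axios.

```python
# Content types that do not trigger a preflight on their own.
SAFELISTED_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}

# Methods that do not trigger a preflight on their own.
SIMPLE_METHODS = {"GET", "HEAD", "POST"}


def needs_preflight(method, content_type=None):
    """Simplified check: looks only at the method and the Content-Type header."""
    if method.upper() not in SIMPLE_METHODS:
        return True
    if content_type is None:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type not in SAFELISTED_CONTENT_TYPES


if __name__ == "__main__":
    print(needs_preflight("POST", "application/json"))                   # True
    print(needs_preflight("POST", "application/x-www-form-urlencoded"))  # False
```

This matches what the question observed: switching to x-www-form-urlencoded made the OPTIONS request disappear. It does not remove the need for the server to send the CORS headers on the actual response.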
I have simplified the code to test whether the CORS error occurs when there is no error in the back end. With this version I can add a new recipe using a POST request from axios.
Backend:
from flask import Flask, render_template, jsonify, request
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # adds the CORS headers to every response

@app.route("/recipes", methods = ['GET', 'POST'])
def recipes():
    # recipes
    recipes = [
        {'name': 'BURGER', 'ingredients': ['this', 'that', 'blah']},
        {'name': 'HOTDOG', 'ingredients': ['Chicken', 'Bread']}
    ]
    if request.method == "GET":
        return jsonify(recipes)
    elif request.method == "POST":
        content = request.get_json()
        recipes.append(content)
        return jsonify(recipes)
Frontend:
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
    <meta charset="utf-8">
    <title>Frontend Application</title>
    <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
</head>
<body>
    <div id="result">
    </div>
    <script type="text/javascript">
        axios.post('http://127.0.0.1:5000/recipes', {
            name: 'Sandwich',
            ingredients: ['Vegetables', 'Sliced cheese', 'Meat']
        })
        .then(function (response) {
            var result = document.getElementById("result");
            const data = response.data;
            for(var i=0; i<data.length; i++){
                const item = data[i];
                result.innerHTML+=item.name+": "+item.ingredients+"<br>";
            }
        })
        .catch(function (error) {
            console.log(error);
        });
    </script>
</body>
</html>
Output:
I am trying to get to the details page of this site.
To get there in a browser, one should: 1. click Consula Titlulo, 2. select ORO from the Minerals dropdown, 3. click Buscar, and 4. click the very first item in the list.
Dev tools and Fiddler show that I should make a POST request with the item id as the payload, and that this POST request is then redirected to the details page.
In my case I'm being redirected to the homepage. What am I missing?
Here is my Scrapy spider.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.shell import inspect_response
class CodeSpider(scrapy.Spider):
    name = "col"
    start_urls =['http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc']
    headers ={
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Origin": "http://www.cmc.gov.co:8080",
        "Upgrade-Insecure-Requests": "1",
        "DNT": "1",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1, AppleWebKit/537.36 (KHTML, like Gecko, Chrome/68.0.3440.106 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer":"http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8,uk;q=0.7",
    }

    def parse(self, response):
        inspect_response(response, self)
        payload = {'expediente': '29', 'tipoSolicitud': ''}
        url = 'http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc'
        yield scrapy.FormRequest(url, formdata = payload, headers=self.headers, callback = self.parse, dont_filter=True)
Here is the log with redirect.
2018-08-23 13:58:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> from <POST http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc>
2018-08-23 13:58:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> (referer: http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc)
From what I can see, Scrapy also sets the correct Cookie header before sending the request.
In [2]: request.headers
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8,uk;q=0.7',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'PHPSESSID=1um6r67md5qpdcqs9g2n15g605',
'Dnt': '1',
'Origin': 'http://www.cmc.gov.co:8080',
'Referer': 'http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1, AppleWebKit/537.36 (KHTML, like Gecko, Chrome/68.0.3440.106 Safari/537.36'}
What am I missing?
Moreover, if I use the Postman-generated code with GET for the details page, it works fine and returns the page.
The same code in Scrapy redirects.
In [1]: url = "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/detalleExpedienteTitulo.cmc"
...:
...: headers = {
...: 'upgrade-insecure-requests': "1",
...: 'user-agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
...: 'dnt': "1",
...: 'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
...: 'referer': "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc",
...: 'accept-encoding': "gzip, deflate",
...: 'accept-language': "en-US,en;q=0.9,ru;q=0.8,uk;q=0.7",
...: 'cookie': "PHPSESSID=2ba8dsre6l42un95qu33k09ud6",
...: 'cache-control': "no-cache",
...:
...: }
...:
In [2]: fetch(url, headers=headers)
2018-08-23 14:47:13 [scrapy.core.engine] INFO: Spider opened
2018-08-23 14:47:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> from <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/detalleExpedienteTitulo.cmc>
2018-08-23 14:47:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> (referer: http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc)
It appears that I missed a POST request at the very beginning. This POST request generates the correct session ID, which has to be new for every search.
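The pattern can be reproduced locally. The sketch below is not the real site: it is a toy server written for this illustration (the paths /search, /detail and /index and the cookie value abc123 are all made up) that hands out a session cookie on POST and redirects cookie-less requests for the details page to the index, as in the logs above. The client uses the standard library with a cookie jar as a stand-in for Scrapy's cookie handling.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The initial search POST creates the session and sets its cookie.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(200)
        self.send_header("Set-Cookie", "PHPSESSID=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        if self.path == "/detail" and "PHPSESSID=abc123" not in cookie:
            # No valid session: bounce the client to the index page.
            self.send_response(302)
            self.send_header("Location", "/index")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = self.path.encode()  # echo the path that was finally served
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # Without the initial POST, the details request is redirected.
    before = opener.open(base + "/detail").read().decode()
    # The POST stores the session cookie in the jar.
    opener.open(base + "/search", data=b"expediente=29").read()
    # With the cookie in place, the details page is served.
    after = opener.open(base + "/detail").read().decode()
    server.shutdown()
    server.server_close()
    return before, after


if __name__ == "__main__":
    print(run_demo())
```

In a spider the equivalent is to issue the search POST first and request the details page from its callback, so that the cookie middleware carries the new session ID along.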
After some hours of reading manuals and other help, I finally got my nginx + uWSGI 1.9 + Django 1.6 + Python 3.3 server running.
But now I have a problem (maybe a problem in my understanding): how do I get the POST data from a request? I mean, how do I get it correctly?
The code in django view:
def info(request):
    print(request)
    return HttpResponse(request)
request to server:
http://127.0.0.1:8000/info/
{
"test":"test"
}
and interesting part - output in uwsgi log (POST and GET dicts):
<WSGIRequest
path:/info/,
GET:<QueryDict: {}>,
POST:<QueryDict: {}>,
COOKIES:{},
META:{'CONTENT_LENGTH': '24',
'CONTENT_TYPE': 'application/json',
'CSRF_COOKIE': 'upJxA8TWO0nhKACr0dfU46Qyu0DzzUTR',
'DOCUMENT_ROOT': '/usr/share/nginx/html',
'HTTP_ACCEPT': '*/*',
'HTTP_ACCEPT_ENCODING': 'gzip,deflate,sdch',
'HTTP_ACCEPT_LANGUAGE': 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
'HTTP_CONNECTION': 'keep-alive',
'HTTP_CONTENT_LENGTH': '24',
'HTTP_CONTENT_TYPE': 'application/json',
'HTTP_COOKIE': '',
'HTTP_HOST': '127.0.0.1:8000',
'HTTP_ORIGIN': 'chrome-extension://hgmloofddffdnphfgcellkdfbfbjeloo',
'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36',
'PATH_INFO': '/info/',
'QUERY_STRING': '',
'REMOTE_ADDR': '127.0.0.1',
'REMOTE_PORT': '53315',
'REQUEST_METHOD': 'POST',
'REQUEST_URI': '/info/',
'SCRIPT_NAME': '',
'SERVER_NAME': 'django',
'SERVER_PORT': '8000',
'SERVER_PROTOCOL': 'HTTP/1.1',
'uwsgi.node': b'',
'uwsgi.version': b'1.9.18.2',
'wsgi.errors': <_io.TextIOWrapper name=2 mode='w' encoding='UTF-8'>,
'wsgi.file_wrapper': <built-in function uwsgi_sendfile>,
'wsgi.input': <uwsgi._Input object at 0x7fa4ec6b8ee8>,
'wsgi.multiprocess': True,
'wsgi.multithread': False,
'wsgi.run_once': False,
'wsgi.url_scheme': 'http',
'wsgi.version': (1, 0)}>
and response in browser (same as request to server):
{
"test":"test"
}
What am I doing wrong?
request.POST
and
request.POST.dict()
return empty dicts.
So the question is how to get the POST data in code, and why it looks so different when I look at the environment.
UPD:
return HttpResponse(str(request)) returns the WSGIRequest object to me instead of the POST data. But I still don't know how to get the POST data in code.
UPD2:
uwsgi config:
[uwsgi]
module = mysite.wsgi
master = true
processes = 5
socket = :8001
daemonize = /var/log/uwsgi/mysite.log
touch-reload = /tmp/uwsgi-touch
post-buffering = 1
UPD3:
versions of software:
Python 3.3.2
Django 1.6.2
uWSGI 1.9.18.2
UPD4:
Final code:
if request.method == "POST":
    if request.META["CONTENT_TYPE"] == "application/json":
        result = json.loads(request.body.decode())
    else:
        result = request.POST.dict()
    return HttpResponse(json.dumps(result), content_type="application/json")
That's exactly what I want.
When I send data to server with
POST
Content-Type: application/x-www-form-urlencoded
a=1
or:
POST
Content-Type: application/json
{"a":"1"}
I see the same response (and the same values in the code variables):
Content-Type: application/json
{"a": "1"}
It looks like you are POSTing JSON data, not HTML form data.
In that case you are looking for the raw POST data, which is accessed like this:
request.body
See HttpRequest.body in the documentation.
If you want to parse that JSON string, use this:
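import json
# request.body is bytes; decode it first (json.loads only accepts bytes from Python 3.6 on)
data = json.loads(request.body.decode('utf-8'))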
The request.POST dictionary is only populated when the request contains form data. This is when the Content-Type header is application/x-www-form-urlencoded or multipart/form-data.
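To make the distinction concrete outside Django, here is a small standard-library sketch of the same dispatch on Content-Type that the question's final code performs. The function name parse_post_body is made up for this illustration, and the form branch only handles application/x-www-form-urlencoded, not multipart/form-data.

```python
import json
from urllib.parse import parse_qs


def parse_post_body(content_type, body):
    """Turn a raw POST body (bytes) into a dict, based on its Content-Type."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    text = body.decode("utf-8")
    if media_type == "application/json":
        return json.loads(text)
    if media_type == "application/x-www-form-urlencoded":
        # parse_qs returns a list of values per key; keep the last one.
        return {key: values[-1] for key, values in parse_qs(text).items()}
    raise ValueError("unsupported content type: %s" % media_type)


if __name__ == "__main__":
    print(parse_post_body("application/json", b'{"a":"1"}'))
    print(parse_post_body("application/x-www-form-urlencoded", b"a=1"))
```

Both calls produce the same dictionary, which mirrors the two requests shown in the question's UPD4.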