My client-server application is mainly based on a special purpose HTTP server that communicates with the client in an Ajax-like fashion, i.e. the client GUI is refreshed upon asynchronous HTTP request/response cycles.
The evolvability of the special purpose HTTP server is limited, and as the application grows, more and more standard features are needed that are provided by Django, for instance.
Hence, I would like to add a Django application as a facade/reverse proxy in order to hide the non-standard special purpose server and be able to benefit from Django. I would like to have the Django app act as a gateway, and not use HTTP redirects, for security reasons and to hide complexity.
However, my concern is that tunneling the traffic through Django on the server might spoil performance. Is this a valid concern?
Would there be an alternative solution to the problem?
Usually in production you host Django behind a web server like Apache httpd or nginx. These have modules designed for proxying requests (e.g. proxy_pass for a location in nginx) and give you some extras out of the box, like caching, if you need them. Compared with proxying through a Django application's request pipeline, this may save you development time while delivering better performance. However, you sacrifice the power to completely manipulate the request or the proxied response with a solution like this.
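As an illustration, a minimal nginx rule for this kind of pass-through might look like the sketch below (the /backend/ prefix and the upstream address are assumptions):

location /backend/ {
    # Forward everything under /backend/ to the special purpose server.
    proxy_pass http://127.0.0.1:8081/;
    # Preserve the original Host header and client address for the upstream.
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}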
For local testing with ./manage.py runserver, I add a url pattern via urls.py in an if settings.DEBUG: ... section. Here's the view function code I use, which supports GET, PUT, and POST using the requests Python library: https://gist.github.com/JustinTArthur/5710254
I went ahead and built a simple prototype. It was relatively simple: I just had to set up a view that handles all the URLs I want to forward. The view function looks something like this:
import urllib2
from urllib import urlencode

import django.http

def redirect(request):
    # `server` holds the "host:port" of the special purpose server
    url = "http://%s%s" % (server, request.path)

    # add GET parameters
    if request.GET:
        url += '?' + urlencode(request.GET)

    # add headers of the incoming request
    # see https://docs.djangoproject.com/en/1.7/ref/request-response/#django.http.HttpRequest.META
    # for details about the request.META dict
    def convert(s):
        # 'HTTP_ACCEPT_ENCODING' -> 'ACCEPT-ENCODING'
        s = s.replace('HTTP_', '', 1)
        s = s.replace('_', '-')
        return s

    request_headers = dict((convert(k), v) for k, v in request.META.iteritems()
                           if k.startswith('HTTP_'))

    # add content-type and content-length
    request_headers['CONTENT-TYPE'] = request.META.get('CONTENT_TYPE', '')
    request_headers['CONTENT-LENGTH'] = request.META.get('CONTENT_LENGTH', '')

    # get the original request payload
    # (request.body replaced raw_post_data in Django 1.6+)
    if request.method == "GET":
        data = None
    else:
        data = request.body

    downstream_request = urllib2.Request(url, data, headers=request_headers)
    page = urllib2.urlopen(downstream_request)
    return django.http.HttpResponse(page)
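To wire the view up, a catch-all URL pattern works; here is a rough sketch assuming Django 1.8+ style URL configuration (the module layout is an assumption):

# urls.py
from django.conf.urls import url

from . import views

urlpatterns = [
    # Hand every path to the forwarding view; it rebuilds the downstream
    # URL from request.path.
    url(r'^', views.redirect),
]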
So it is actually quite simple, and the performance is good enough, in particular if the forwarded request goes to the loopback interface on the same host.
In my project, I use many FastAPI microservices to carry out specific tasks, and a Django app to handle user actions, connect with the front end, etc.
In the Django database I store information about the file paths that a microservice should obtain. I want to provide this information to one microservice and this microservice only, so I want to restrict access to the endpoint to a specific port on the local network.
Other endpoints should remain normally available to the web app. Using Django CORS headers alone will not work, since I want most of the endpoints to be accessible normally and only a tiny subset restricted to localhost.
CORS_ORIGIN_WHITELIST = [
    "http://my_frontend_app",  # most of the endpoints are available from the front end
    "http://localhost:9877",   # the microservice that I want to provide with the path
]
Another solution might be to use check_request_enabled from corsheaders.signals, but in the use cases I have seen it was usually used to broaden access, not to restrict it (the front end should not have access to this subset of endpoints). I'm not sure whether it is a good solution.
# myapp/handlers.py
from corsheaders.signals import check_request_enabled

def cors_allow_particular_urls(sender, request, **kwargs):
    return request.path.startswith('/public-api/')

check_request_enabled.connect(cors_allow_particular_urls)
Is there any way to create a "local CORS" restriction, e.g. in the form of a decorator? It would look something like:
@local_cors(["localhost:9877"])
@decorators.action(detail=False, methods=["post"])
def get_data(self, request):
    return response.Response(status=200)
where localhost:9877 is the address of the microservice.
Is such a solution good enough?
def get_data(self, request):
    request_host = request.get_host()
    data = request.data
    if request_host != MAP_UPLOAD_HOST:
        # we don't show the endpoint to the outside
        return response.Response(status=404)
    return response.Response(status=200)
Checking the Host header with get_host() may offer sufficient protection, depending on your server setup.
get_host() will tell you the value of the Host header in the request, which is data provided by the client and so can be manipulated in any way. The Host header is an integral part of HTTP/1.1, allowing multiple domains to be hosted at a single address, so you might be able to depend on your server rejecting requests that don't actually arrive from localhost with a matching header, but it's difficult to be certain.
It would likely be more reliable to check the client's network address and reject requests from all clients except those that are specifically allowed.
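A minimal sketch of that idea, assuming the microservice connects over the loopback interface (the names local_only and ALLOWED_CLIENT_IPS are my own). REMOTE_ADDR is set by the WSGI server rather than sent by the client, so it is much harder to spoof than the Host header:

from functools import wraps

from rest_framework import response

ALLOWED_CLIENT_IPS = {'127.0.0.1'}  # assumption: the microservice calls over loopback

def local_only(view_func):
    # Reject requests whose client address is not in the allow-list.
    @wraps(view_func)
    def wrapped(self, request, *args, **kwargs):
        client_ip = request.META.get('REMOTE_ADDR')
        if client_ip not in ALLOWED_CLIENT_IPS:
            # Pretend the endpoint does not exist.
            return response.Response(status=404)
        return view_func(self, request, *args, **kwargs)
    return wrapped

Note that behind a reverse proxy REMOTE_ADDR is the proxy's address, so this check is only meaningful if the microservice talks to Django directly.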
I'm developing a multitenant application using Django. Everything works fine, but I'm a little confused about ALLOWED_HOSTS: how do I manage dynamic domain names? I know I can use * to match any domain, but I don't want to use * in ALLOWED_HOSTS.
So my question is: is there any way to manage ALLOWED_HOSTS dynamically?
According to the Django docs:
Values in this list can be fully qualified names (e.g. 'www.example.com'), in which case they will be matched against the request’s Host header exactly (case-insensitive, not including port). A value beginning with a period can be used as a subdomain wildcard: '.example.com' will match example.com, www.example.com, and any other subdomain of example.com. A value of '*' will match anything; in this case you are responsible to provide your own validation of the Host header (perhaps in a middleware; if so this middleware must be listed first in MIDDLEWARE).
So, if you want to match a certain subdomain, you can use the subdomain wildcard as explained above. If you want a different validation, as the documentation says, the right way to do it would be to allow all hosts and write your own middleware.
If you don't know about middleware: it is a simple mechanism to add functionality when certain events occur. For example, you can create a middleware that executes whenever you get a request. That would be the right place to put your Host validation. You can get the host using the request object. Refer to the official docs on middleware if you want to know more.
A simple middleware class for your problem could look something like this:
from django.http import HttpResponse

class HostValidationMiddleware(object):

    def process_view(self, request, view_func, *args, **kwargs):
        host = request.get_host()
        # Replace this with your own validation logic, e.g. a database lookup
        is_host_valid = host.endswith('.example.com')

        if is_host_valid:
            # Django will continue processing the request as usual
            return None

        response = HttpResponse()
        response.status_code = 403
        return response
If you need more information, comment below and I will try to help.
The setting is loaded once on startup, and generally those settings don't change in a production server setup. You can allow multiple hosts, or allow multiple subdomains.
ALLOWED_HOSTS is a list, so you can pass several domains, and '.example.com' (with a leading period) works as a subdomain wildcard.
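For instance (the domains are placeholders):

# settings.py
# The leading dot makes '.example.com' match example.com and all of its subdomains.
ALLOWED_HOSTS = ['.example.com', 'api.partner-site.com']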
Is it possible to get the HTTP request as a bytestring, the way it was transferred over the wire, if you have a Django request object?
The plain text, of course (not the encrypted bytes if HTTPS is used).
I would like to store the bytestring to analyze it later.
Ideally I would like to access the real bytestring. Creating a bytestring from request.META, request.GET and friends will likely not be the same as the original.
Update: it seems that it is impossible to get at the original bytes. Then the question becomes: how do I construct a bytestring that roughly looks like the original?
As others pointed out, it is not possible, because Django doesn't interact with raw requests.
You could try reconstructing the request like this:
def reconstruct_request(request):
    headers = ''
    for header, value in request.META.items():
        if not header.startswith('HTTP_'):
            continue
        # 'HTTP_ACCEPT_ENCODING' -> 'Accept-Encoding'
        header = '-'.join([h.capitalize() for h in header[5:].lower().split('_')])
        headers += '{}: {}\n'.format(header, value)
    return (
        '{method} {path} HTTP/1.1\n'
        'Content-Length: {content_length}\n'
        'Content-Type: {content_type}\n'
        '{headers}\n'
        '{body}'
    ).format(
        method=request.method,
        path=request.get_full_path(),
        content_length=request.META.get('CONTENT_LENGTH', ''),
        content_type=request.META.get('CONTENT_TYPE', ''),
        headers=headers,
        body=request.body,
    )
NOTE: this is not a complete example, only a proof of concept.
The basic answer is no: Django doesn't have access to the raw request; in fact, it doesn't even have code to parse a raw HTTP request.
This is because Django's HTTP request/response handling (like that of many other Python web frameworks) is, at its core, a WSGI application (see the WSGI specification).
It's the job of the frontend/proxy server (like Apache or nginx) and application server (like uWSGI or gunicorn) to "massage" the request (like transforming and stripping headers) and convert it into an object that can be handled by Django.
As an experiment you can actually wrap Django's WSGI application yourself and see what Django gets to work with when a request comes in.
Edit your project's wsgi.py and add some extremely basic WSGI "middleware":
import os

from django.core.wsgi import get_wsgi_application

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project.settings')

class MyMiddleware:
    def __init__(self, app):
        self._app = app

    def __call__(self, environ, start_response):
        import pdb; pdb.set_trace()
        return self._app(environ, start_response)

# Wrap Django's WSGI application
application = MyMiddleware(get_wsgi_application())
Now if you start your devserver (./manage.py runserver) and send a request to your Django application, you'll drop into the debugger.
The only thing of interest here is the environ dict. Poke around in it and you'll see that it's pretty much the same as what you'll find in Django's request.META. (The contents of the environ dict are detailed in this section of the WSGI spec.)
Knowing this, the best you can get is piecing together items from the environ dict into something that remotely resembles an HTTP request.
But why? If you have an environ dict, you have all the information you need to replicate a Django request. There's no actual need to translate this back into an HTTP request.
In fact, as you now know, you don't need an HTTP request at all to call Django's WSGI application. All you need is an environ dict with the required keys and a callable so that Django can relay the response.
So, to analyze requests (and even be able to replay them) you only need to be able to recreate a valid environ dict.
To do so in Django, the easiest option would be to serialize request.META and request.body into a JSON document, for example as sketched below.
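A minimal sketch, assuming a Python 3 setup (capture_request and the latin-1 choice are my own; latin-1 maps bytes 0-255 one-to-one, so the body survives the round trip):

import json

def capture_request(request):
    # Keep only string-valued META entries; the environ also contains
    # non-serializable objects such as wsgi.input.
    meta = {k: v for k, v in request.META.items() if isinstance(v, str)}
    return json.dumps({
        'meta': meta,
        'body': request.body.decode('latin-1'),
    })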
If you really need something that resembles an HTTP request (and you are unable to go a level up, e.g. to the web server, to log this information), you'll just have to piece it together from the information available in request.META and request.body, with the caveat that this is not a faithful representation of the original HTTP request.
I have a Scrapy spider that runs well.
What I need to do is make an API call from inside the parse method and use the results from the response in the same method, with the same items. How do I do this? The only simple thing that comes to mind is to use the Python requests library, but I am not sure whether this works in Scrapy, and moreover at Scrapinghub. Is there a built-in solution?
Here is an example.
def agency(self, response):
    # inspect_response(response, self)
    agents = response.xpath('//a[contains(@class, "agency-carousel__item")]')
    Agencie_Name = response.xpath('//h1[@class = "agency-header__name"]/text()').extract_first()
    Business_Adress = response.xpath('//div[@class = "agency-header__address"]//text()').extract()
    Phone = response.xpath('//span[@class = "modal-item__text"]/text()').extract_first()
    Website = response.xpath('//span[@class = "modal-item__text"][contains(text(), "Website")]/../@href').extract_first()

    if Website:
        pass
        # 1 send request to hunter.io and get pattern. Apply to entire team. Pass as meta
        # do smth with this pattern in here using info from this page.
So here I normally extract all the info from the Scrapy response, and if the Website variable is populated, I need to send an API call to hunter.io to get the email pattern for this domain and use it to generate emails in the same method.
Hope that makes sense.
As for vanilla Scrapy on your own PC or server, there is no problem accessing third-party libraries inside a scraper. You can just do whatever you want, so something like the following is no problem at all (it fetches a mail address from an API using requests and then sends out a mail using smtplib):
import requests
import smtplib
from email.mime.text import MIMEText

[...]

if Website:
    r = requests.get('https://example.com/mail_for_site?url=%s' % Website, auth=('user', 'pass'))
    mail = r.json()['Mail']

    msg = MIMEText('This will be the perfect job offer for you. ......')
    msg['Subject'] = 'Perfect job for you!'
    msg['From'] = 'sender@example.com'
    msg['To'] = mail

    s = smtplib.SMTP('example.com')
    s.sendmail('sender@example.com', [mail], msg.as_string())
However, as for Scrapinghub, I do not know. For this I can only give you a developer's point of view, because I also develop a managed scraping platform.
I assume that sending an HTTP(S) request using requests would not be a problem at all. They do not gain security by blocking it, because HTTP(S) traffic is allowed for Scrapy anyway. So if somebody wanted to mount harmful attacks over HTTP(S) with requests, they could just send the same requests with Scrapy.
However, SMTP might be another matter; you'd have to try. It's possible that they do not allow SMTP traffic from their servers, because it is not required for scraping tasks and can be abused for sending spam. However, since there are legitimate uses for sending mails during a scraping process (e.g. error reports), it might as well be that SMTP is perfectly fine on Scrapinghub, too (with rate limiting or some other safeguard against spam).
I have a two-layer backend architecture:
a "front" server, which serves web clients. This server's codebase is shared with a 3rd party developer
a "back" server, which holds top-secret-proprietary-kick-ass-algorithms, and has a single endpoint to do its calculation
When a client sends a request to a specific endpoint in the "front" server, the server should pass the request to the "back" server. The back server then crunches some numbers, and returns the result.
One way of achieving this is to use the requests library. A simpler way would be to have the "front" server simply redirect the request to the "back" server. I'm using DRF throughout both servers.
Is redirecting an ajax request possible using DRF?
You don't even need DRF to add a redirect to the urlconf. All you need is a simple rule:
from django.conf import settings
from django.conf.urls import url, include
from django.views.generic import RedirectView

urlpatterns = [
    url(r"^secret-computation/$",
        RedirectView.as_view(url=settings.BACKEND_SECRET_COMPUTATION_URL)),
    url(r"^", include(your_drf_router.urls)),
]
Of course, you may extend this to a proper DRF view, register it with DRF's router (instead of adding the url to the urlconf directly), etc. - but there isn't much sense in doing so just to return a redirect response.
However, the code above only works for GET requests. You may subclass HttpResponseRedirect to return HTTP 307 (replacing RedirectView with your own simple view class or function), and depending on your clients, things may or may not work. If your clients are web browsers that may include IE9 (or worse), 307 won't help.
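A minimal sketch of such a subclass (the class name is my own):

from django.http import HttpResponseRedirect

class HttpResponseTemporaryRedirect(HttpResponseRedirect):
    # 307 asks the client to repeat the request against the new URL with
    # the same method and body, instead of downgrading to a GET.
    status_code = 307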
So, unless your clients are known to be all well-behaved (and on non-hostile networks without any weird, way-too-smart proxies - you'll never believe what kinds of insanity those may inflict on HTTP requests), I'd suggest actually proxying the request.
Proxying can be done either in Django - e.g. a view that forwards the call with the requests library, as sketched below - or by something in front of it, e.g. nginx or Caddy (or any other HTTP server/load balancer that you know best).
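A rough sketch of such a proxy view, using a plain APIView for brevity and assuming the back server's endpoint is configured as settings.BACKEND_SECRET_COMPUTATION_URL and speaks JSON (error handling omitted):

import requests
from django.conf import settings
from rest_framework.response import Response
from rest_framework.views import APIView

class SecretComputationProxy(APIView):
    def post(self, request):
        # Forward the payload to the back server and relay its answer.
        r = requests.post(
            settings.BACKEND_SECRET_COMPUTATION_URL,
            json=request.data,
            timeout=30,
        )
        return Response(data=r.json(), status=r.status_code)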
For production purposes, as you probably have a fronting web server anyway, I suggest using that. This saves implementation time and also a bit of server resources, as your "front" Django project won't even have to handle the request and keep a worker busy while waiting for the response.
For development purposes, your options may vary. If you use the bare runserver, then a proxy view may be your best option. If you use e.g. Docker, you may just throw an HTTP server container in front of your Django container.
For example, I currently have a two-project setup (a legacy Django 1.6 project and a newer Django 1.11 project, sharing the same database) and a Caddy server in front of them, routing on a per-URL basis. With a simple 9-line Caddyfile, things just work:
:80
tls off
log / stdout "{common}"
proxy /foo project1:8000 {
    transparent
}
proxy / project2:8000 {
    transparent
}
(This is a development-mode config.) If you can have something similar, then, I guess, that would be the simplest option.