Script for checking server-side runtime

Netcraft is able to gather statistics on millions of sites showing which web technology (PHP, ASP.NET, Java Servlets, etc.) they run on their servers. I was wondering how they do that from the outside, and whether there is a script that can be used to check the server-side technology of an arbitrary site?

They get a lot of information from the HTTP headers (a HEAD request is enough), other data from the whois registry, and the uptime (I guess) from finger. See more at https://www.owasp.org/index.php/Testing_for_Web_Application_Fingerprint_%28OWASP-IG-004%29 - Netcraft is mentioned at the very bottom of that page.
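A lot of this can be reproduced with a plain HTTP request. A minimal sketch in Python (my own helper; note that many servers strip or fake the `Server` and `X-Powered-By` headers, so treat the result as a heuristic, not proof):

```python
# Sketch: guess a site's server-side technology from its response headers.
import requests

def fingerprint(url):
    resp = requests.head(url, allow_redirects=True, timeout=10)
    return {
        "server": resp.headers.get("Server"),            # e.g. "nginx", "Microsoft-IIS/10.0"
        "powered_by": resp.headers.get("X-Powered-By"),  # e.g. "PHP/8.2.0", "ASP.NET"
        # Session cookie names are another hint: PHPSESSID -> PHP,
        # JSESSIONID -> Java servlets, ASP.NET_SessionId -> ASP.NET.
        "cookies": [c.name for c in resp.cookies],
    }

if __name__ == "__main__":
    print(fingerprint("https://example.com"))
```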


How to reduce your browser fingerprint for privacy and for web scraping

You can disable cookies and change your IP 500 times, but can't anyone just track you through fingerprinting?
You could disable Java and Flash, though that would break the page and make you stand out anyway.
You could use Tor, but I think if you use Tor you get blacklisted from some sites instantly.
What's the workaround? Using Chrome is a big no-no. Internet Explorer maybe, and Firefox perhaps…
Are there any apps that deal with this? Or do you just design a good web scraper, have an IP, and cross your fingers?
I realize the average site is not going to implement all these features, but I am asking how one would work around a site that was extremely vigilant.
There are two types of browser fingerprinting:
1. Static fingerprinting - can identify browsers (and probably operating systems) just based on details of their requests: the order and capitalization of HTTP headers, browser-specific headers, etc.
One small aspect is described here: https://gwillem.gitlab.io/2017/05/02/http-header-order-is-important/
As this can be done without any JavaScript, I guess Scrapy is identifiable this way.
How to get around this?
As mentioned in the above article, you need to exactly emulate a particular browser's fingerprint by emulating its headers' order and capitalization (and it has to match the user agent, of course).
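Most high-level HTTP libraries normalize header order and casing, so one way to get exact control is to write the request bytes yourself. A sketch (the header set below is an illustration; capture a real browser's request and copy it verbatim):

```python
# Sketch: send a request whose header order and capitalization are exactly
# as written, bypassing any HTTP library's normalization.
import socket, ssl

host = "example.com"
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0\r\n"
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
    "Accept-Language: en-US,en;q=0.5\r\n"
    "Connection: close\r\n"
    "\r\n"
)

ctx = ssl.create_default_context()
with socket.create_connection((host, 443)) as raw:
    with ctx.wrap_socket(raw, server_hostname=host) as s:
        s.sendall(request.encode())
        response = b""
        while chunk := s.recv(4096):
            response += chunk
print(response.split(b"\r\n", 1)[0])  # status line
```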
2. Dynamic fingerprinting - uses JavaScript to collect data on installed plugins, plugin versions, etc. As Granitosaurus wrote, that won't be triggered by Scrapy. But sites that use fingerprinting for scraping protection will block the scraper if it doesn't get any data from their fingerprinting module.
As this type of fingerprinting yields many more dimensions, it can be used to identify particular users with high reliability (over 90%).
You can find a good example of how this is done here: https://github.com/Valve/fingerprintjs2
How to get around this?
Use a lot of different real browsers for scraping (for example through Selenium; not PhantomJS, it can be detected).
Randomize these browsers' settings and installed plugins (ideally using different versions).
When scraping, rotate these browser instances instead of rotating IPs (each browser instance should keep its IP over its lifetime).
If one of the instances is "burnt", replace it with a new instance that has a fresh IP and a randomized browser fingerprint.
... as you'll need many browsers, this has to be done in an automated way, of course.
Resetting cookies sounds like a good idea at first, but if the fingerprinting system is worth its salt it won't need cookies to identify each of these machines reliably.
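A minimal Selenium sketch of that rotation strategy (the proxy addresses are placeholders, and a real setup would randomize far more fingerprint dimensions than shown here):

```python
# Sketch: a pool of real Firefox instances, each pinned to one proxy/IP for
# its whole lifetime; a "burnt" instance is replaced rather than reused.
import random
from selenium import webdriver

PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080"]  # placeholders for your proxy pool

def new_browser(proxy):
    host, port = proxy.split(":")
    opts = webdriver.FirefoxOptions()
    opts.set_preference("network.proxy.type", 1)
    opts.set_preference("network.proxy.http", host)
    opts.set_preference("network.proxy.http_port", int(port))
    opts.set_preference("network.proxy.ssl", host)
    opts.set_preference("network.proxy.ssl_port", int(port))
    return webdriver.Firefox(options=opts)

pool = [new_browser(p) for p in PROXIES]

def scrape(url):
    browser = random.choice(pool)  # rotate browser instances, not IPs
    browser.get(url)
    if "captcha" in browser.page_source.lower():  # crude "burnt" detection
        pool.remove(browser)
        browser.quit()
        pool.append(new_browser(random.choice(PROXIES)))  # fresh instance (ideally a fresh IP too)
        return None
    return browser.page_source
```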

Cloning a PyQt app in the Django framework

I've designed a desktop app using the PyQt GUI toolkit, and now I need to embed this app in my Django website. Do I need to clone it using Django's own logic, or is there a way to get it up on the website using some interface? I need this to work on the website the same way it works on the desktop. Do I need to find packages in Django to remake it for the web, or is there a way to simplify the task?
Please help.
I'm not aware of any libraries to port a PyQt desktop app to a Django webapp. Django certainly does nothing to enable this one way or another. I think you'll find that you have to rewrite it for the web. Django is a great framework, and depending on the complexity of your app, it might not be too difficult. If you haven't done much with web development, there is a lot to learn!
If it seemed like common sense to you that you should be able to run a desktop app as a webapp, consider this:
Almost all web communication that you likely encounter is done via HTTP. HTTP is a protocol for passing data between servers and clients (often, browsers). What this means is that any communication that takes place must be resolved into discrete chunks. Consider an example flow:
1. You go to Google in your browser.
2. Your browser hits a DNS server (or cache) that resolves the name google.com to some IP address.
3. Now your browser makes a request to that IP address and says "get me some stuff".
4. Google decides to send you back a minimal amount of HTML and lots of minified JavaScript in the page.
5. Your browser sees that there are some image links in the HTML, so it makes additional requests to Google to get each of the images so that it can display them.
6. Now all the content is loaded in your browser, so it starts to execute the JavaScript code, and that code needs more data from Google, so it starts sending requests too.
This is just a small example of how fundamentally differently a web application operates from a desktop application. In a desktop app you have the added convenience that operations don't need to be "packaged up" and sent, then have an action taken, etc. (unless you're using a messaging architecture, but that's relatively uncommon outside of enterprise apps).
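To make the contrast concrete, here is roughly what a PyQt button handler becomes in Django: a view function that receives one discrete HTTP request and must package everything into one complete response (the view and URL names are made up for illustration):

```python
# Sketch: a Django view standing in for what would be a signal/slot
# handler in a PyQt app.
from django.http import JsonResponse
from django.urls import path

def compute(request):
    # In PyQt this might be a slot connected to a button's clicked signal,
    # working directly on in-memory state. Here, all input arrives with
    # the request...
    x = int(request.GET.get("x", 0))
    # ...and all output must be serialized into the response.
    return JsonResponse({"result": x * 2})

urlpatterns = [
    path("compute/", compute),  # e.g. GET /compute/?x=21 -> {"result": 42}
]
```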

Google website scraping getting blocked after a few requests

We are developing a simple application that makes calls to one of Google's services (Reverse Image Search, http://www.google.com/insidesearch/features/images/searchbyimage.html, by uploading images by URL/image and getting the entity name for the image). Essentially, we were getting the results page (as HTML) that Google returned and scraping the results using a simple parser.
We hosted this on Google App Engine and found that after a while Google blocked our app (identified by its IP) and sent out a message saying it is to prevent bots from sending requests to its websites. Below is the message I found in the web server's logs:
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service (http://www.google.com/policies/terms/). The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. Learn more: http://support.google.com/websearch/answer/86640. Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
I wanted to check if there is a way to solve this, or any workaround, etc. Since Google doesn't expose any Reverse Image Search APIs, we don't see any other way (other than creating an HTTP request and scraping the response) to get the info we want.
Any leads will be helpful.
If you are in violation of the terms of service, that's that. Any "workaround" would be inappropriate.
This service does exactly the same thing and has an API you can legitimately use: http://services.tineye.com/TinEyeAPI
What is the TinEye API? TinEye is a reverse image search engine. You can submit an image to TinEye to find out where it came from, how it is being used, or if modified versions of the image exist. TinEye uses image recognition to perform its searches. The TinEye API allows a user to search the multi-billion-image TinEye index automatically.
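For a feel of what using such an API looks like, here is a rough sketch; the endpoint path, parameter names, and auth scheme below are placeholders, so consult the TinEye API documentation linked above for the real interface:

```python
# Rough sketch only: URL, parameters, and key handling are placeholders,
# not the actual TinEye API contract.
import requests

API_URL = "https://api.tineye.com/rest/search/"  # placeholder endpoint
API_KEY = "your-api-key"                         # issued when you sign up

def reverse_search(image_url):
    resp = requests.get(API_URL, params={"url": image_url, "key": API_KEY}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # matches: where the image appears, modified versions, etc.
```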

Application server vs. HTTP server

So I have noticed that the docs for various application servers (think Unicorn and Puma for Ruby, Warp for Haskell, etc.) always mention something like "it is optimized as an app server." Typically this is mentioned when describing the standard setup of putting an HTTP server (like Nginx) as a reverse proxy in front of the app servers.
So my question is: what exactly about the programming of a web application server makes it more performant at serving data generated by code vs. an HTTP server? Are there any particular engineering trade-offs? Or is it more the case that HTTP servers are optimized for serving files from disk, and they're merely trying to say that HTTP servers are not optimized for application code?
First, this really belongs on Server Fault or Super User.
But basically, Apache & Nginx strictly deliver static web content. Yes, you can install PHP as a module and it will parse scripts when the page is requested, but it is all on demand, meaning the program runs only when the page is requested.
In contrast, application servers run programs that are active in memory all the time, which can have engineering benefits depending on what you want your system to do. So Tomcat or Passenger (for Ruby) run Java and Ruby apps, and are optimized to do it in a production server environment.
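The "active in memory" part is the key difference. A minimal WSGI sketch: the module-level state below survives across requests because the app server keeps the process alive, something a run-the-script-per-request model can't offer (with multiple workers, each process would of course keep its own copy):

```python
# Sketch: a WSGI app held in memory by an app server (gunicorn, uWSGI, ...).
# Startup cost is paid once, and state persists between requests.
expensive_setup = {"db_pool": "connected once at startup"}
hits = 0

def application(environ, start_response):
    global hits
    hits += 1
    body = f"request #{hits}, using {expensive_setup['db_pool']}".encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# Run with, for example:  gunicorn thismodule:application
```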
Why does Apache or Nginx get attached as a front end? Because at the end of the day, Apache & Nginx are still the best tools for simply delivering web content, and they have better optimizations and security in place to do so.
So the application server focuses on making Java or Ruby run as cleanly as possible and delivering basic web content, while Apache & Nginx concentrate on the front-end side of web delivery.
As a systems administrator, I prefer to proxy via Apache or Nginx, since I already know how to configure and optimize those tools for my use. If I have to learn how to fine-tune Passenger or Tomcat, it should only be enough to get it running so I can place Apache or Nginx in front of it.

Two-way communication using AJAX from an HTML page to a C++ application running on the same server

Is it possible to communicate from a web browser (which has loaded an HTML page from the server) to an application running on the same server using AJAX? I need to send a request from the browser on a button click and update the page with responses received from another application running on the same server machine.
I am using plain HTML pages to create the website, not any server-side scripting like PHP or ASP. On the server machine, data is manipulated by a C++ application.
I think you can use any sort of JavaScript functions to do that, but you might want to use jQuery or a similar framework to make your life easier. You might need to search for "Comet programming" to learn exactly how to do two-way communication between client and server.
Updated:
Well, this kind of stuff requires you to read a lot (if you haven't already). Basically, what you need is a server that can do long-polling (or EventSource, or WebSockets). There are many open-source ones that might help you get started. I can list several good ones here; there are a lot more:
http://www.ape-project.org/
http://cometd.org/
http://socket.io/
http://code.google.com/p/erlycomet/
http://faye.jcoglan.com/
So after you have the Comet server up and running, you will need to set up the client side (probably JavaScript). Most of the projects listed above come with client-side code to interact with the server (except for erlycomet), so you can just use the examples provided and build a quick prototype. If you want to use your Raspberry Pi, you can use Node.js, which makes real-time communication very easy (socket.io, faye). And lately, there is http://www.meteor.com/ as well.
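To make "long-polling" concrete, here is a minimal sketch using only Python's standard library: instead of answering immediately, the server holds the request open until it has something to say, and the client reconnects after each response:

```python
# Sketch: a long-polling endpoint. The handler blocks until an event is
# available (or a timeout passes), then answers and lets the client reconnect.
import json, queue, threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

events = queue.Queue()  # filled by the rest of your application

class LongPollHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            event = events.get(timeout=30)  # the browser's XHR just waits
            body = json.dumps({"event": event}).encode()
        except queue.Empty:
            body = b'{"event": null}'       # timeout: client should retry
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    threading.Timer(5, lambda: events.put("hello")).start()  # demo event
    ThreadingHTTPServer(("", 8000), LongPollHandler).serve_forever()
```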
I would think of the problem this way: you want to provide a web front end to an existing C++ application. To achieve this, you need to think about how your web server communicates with your C++ application. Communication between the browser and the web server can be thought of as a separate problem - as you say, AJAX calls can be used, or maybe have a look at WebSockets.
Once you have your request in the web server, you need to communicate it to the C++ application (and/or vice versa). This can be done a number of ways, e.g. sockets or RPC. I found this question here which has some good advice.
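For the web-server-to-C++ leg, a plain TCP socket is often the simplest starting point. A sketch, assuming the C++ application listens on a local port and speaks a simple one-line-per-message protocol (the port and the protocol are assumptions for illustration):

```python
# Sketch: the AJAX endpoint relays the browser's POST body to the C++ app
# over a local TCP socket and returns the C++ app's one-line reply.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

CPP_ADDR = ("127.0.0.1", 9000)  # assumed address of the C++ application

class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)        # body of the AJAX POST
        with socket.create_connection(CPP_ADDR, timeout=10) as s:
            s.sendall(payload + b"\n")
            reply = s.makefile("rb").readline()  # one-line reply from the C++ app
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    HTTPServer(("", 8080), RelayHandler).serve_forever()
```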