Google Website Scraping getting blocked after few requests

We are developing a simple application that calls one of Google's services (Reverse Image Search, http://www.google.com/insidesearch/features/images/searchbyimage.html): we submit an image by URL or upload and get back the entity name for the image. Essentially, we fetch the results page (as HTML) that Google returns and scrape the results with a simple parser.
We hosted this on Google App Engine and found that after a while Google blocked our app (identified by its IP) and sent a message saying this is done to prevent bots from sending requests to its websites. Below is the message I found in the web server's logs:
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service (http://www.google.com/policies/terms/). The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help; a different computer using the same IP address may be responsible. Learn more (http://support.google.com/websearch/answer/86640). Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
I wanted to check whether there is a way to solve this, or any workaround. Since Google doesn't expose a Reverse Image Search API, we do not see any other way (other than creating an HTTP request and scraping the response) to get the info we want.
Any leads will be helpful.

If you are in violation of the terms of service, that's that. Any "workaround" would be inappropriate.
This service is exactly the same and has an API you can legitimately use: http://services.tineye.com/TinEyeAPI
What is TinEye API? TinEye is a reverse image search engine. You can
submit an image to TinEye to find out where it came from, how it is
being used or if modified versions of the image exist. TinEye uses
image recognition to perform its searches. The TinEye API allows a
user to search the multi billion TinEye image index automatically.

Related

Chrome hitting my Django backend but I only made an iOS app

So I have a Django backend deployed on Google App Engine. This backend supports an iOS app. In my server logs I can see all the requests coming in and where they were made from. It used to be that I would only get requests from Joon/7.** (which is the iOS app name + version). However, recently I've been getting requests from Chrome 72, which doesn't make sense because the app shouldn't be usable from Chrome. Furthermore, these requests are creating a lot of errors in my backend because they do not send an authentication token. Does anyone know what is going on here? Are my servers being hacked?

Looks like someone discovered the URL to your App Engine app. You can use Ingress controls to only allow access via Cloud Load Balancing and then Google Cloud Armor in front to protect that with rules that look like:
has(request.headers['user-agent']) && request.headers['user-agent'].contains('Godzilla')

It is quite common to see all sorts of hits (from what I call spam bots) to an App Engine App. Technically, GCP expects you to use Google Firewall rules to block these. The challenge though is that these bots usually change their IP Addresses frequently or use multiple ones. I don't have a 'perfect' solution.
a) You can try the method by @jeff-williams (I've never tried that)
b) You can also try GCP's firewall rules (I use this but I try to block a range of IPs instead of blocking them one by one)
c) Sometimes I also put my service behind a specific non-intuitive path. This way, the spam bots will only hit the default/base url and then I have a separate service which returns 404 for all calls to that base url
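Option (c) could be sketched roughly as below. This is only an illustration in plain Python WSGI, not the answerer's actual setup: the prefix `/svc-7f3a` and the name `app` are made up, and here a single app handles both cases, whereas the answer describes a separate service answering the base URL.

```python
# Hypothetical, hard-to-guess prefix under which the real service lives.
SECRET_PREFIX = "/svc-7f3a"


def app(environ, start_response):
    """Minimal WSGI app: anything outside SECRET_PREFIX gets a 404."""
    path = environ.get("PATH_INFO", "")
    if not path.startswith(SECRET_PREFIX + "/"):
        # Bots probing the default/base URL land here.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # Real request handling would go here.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def call(path):
    """Helper for trying the app without a server; returns (status, body)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return seen["status"], body


print(call("/"))
print(call(SECRET_PREFIX + "/items"))
```

Note that this only hides the service from casual probing; it does not replace authentication or firewall rules.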

How to Cast a Single website to multiple devices using WEBRTC?

I want to create a server on one device so that the changes I make on a certain website are visible on other devices in real time. I don't want to cast the entire screen, just the website.
Can anyone help me with that?

If you really want to use WebRTC to send something that appears in a web browser to another web client, you should look at Canvas to peer connection.
Anyway, if various clients should be informed about an event via web, I suggest that you see Your first Web Progressive Application

Sitecore GeoIP Service is not working with personalization

I'm working on Sitecore 7 and I have configured the Sitecore GeoIP module (Sitecore IP Geolocation Service Client 1.2 rev. 150602.zip) on our site.
Sitecore IP Geolocation Service is running on our site's App Center.
When I tried to use its functionality with personalization, it did not seem to work.
I created the following condition for a component of a page using the presentation details --> personalize
But when I access the site from the given country, the item still exists on the page (it needs to be hidden).
I did test the GeoIP module using the TestIp.aspx page and it tracks the IP data correctly.
Can someone please advise on this?
Thanks.
UPDATE
This actually works. There is an IP caching mechanism with the MaxMind service.
When the IP is cached, the change that we made from the Sitecore client does not take effect for a certain time.
Is there any config change that we can make to change or skip this caching mechanism?
Thanks.

Sitecore's GeoIP/MaxMind module does not resolve GeoIP information in real time. It does this in batch background processes - for performance reasons, no doubt.
I can show you a way to change this, but I would not recommend you do this in practice on any real site as calls to the MaxMind service can take a while and will block your page load until they complete.
You need to add a processor to your httpRequest pipeline, as early as possible, that forces a lookup for the client IP. It will then be cached for subsequent page loads.
Sitecore.Analytics.Lookups.LookupManager.GetInformationByIp(string ip)
Where the ip argument will be your request Host.
But as I said, I really would not recommend doing it like this, unless your site is very lightweight.
Instead, I suggest building something around the GeoLite database that MaxMind provides free of charge. You would then perform lookups in a local database (instead of a web service). For an example of how this could be done, look here:
http://sitecoresnippets.blogspot.dk/2011/12/sitecore-geoip-country-resolving-jump.html#.Vhdui_l_NBc

Cloning PyQt app in django framework

I've designed a desktop app using the PyQt GUI toolkit and now I need to embed this app on my Django website. Do I need to rebuild it using Django's own logic, or is there a way to get it up on the website through some interface? I need it to work on my website the same way it works on the desktop. Do I need to find Django packages to remake it for the web, or is there a way to simplify the task?
Please help.

I'm not aware of any libraries to port a PyQt desktop app to a Django web app. Django certainly does nothing to enable this one way or another. I think you'll find that you have to rewrite it for the web. Django is a great framework and, depending on the complexity of your app, it might not be too difficult. If you haven't done much with web development, there is a lot to learn!
If it seemed like common sense to you that you should be able to run a desktop app as a webapp, consider this:
Almost all web communication that you likely encounter is done via HTTP. HTTP is a protocol for passing data between servers and clients (often, browsers). What this means is that any communication that takes place must be resolved into discrete chunks. Consider an example flow:
You go to google in your browser.
Your browser then hits a DNS server (or cache) that resolves the name google.com to some IP address.
Cool, now your browser makes a request to that IP address and says "get me some stuff".
Google decides to send you back a minimal amount of HTML and lots of minified JavaScript in the page.
Your browser realizes that there are some image links in the HTML and so it makes additional requests to google to get each of the images so that it can display them.
Now all the content is loaded on your browser so it starts to execute the JavaScript code, and that code needs some more data from google so it starts sending requests to google too.
This is just a small example of how fundamentally differently a web application operates from a desktop application. In a desktop app you have the added convenience that an operation doesn't need to be "packaged up" and sent, then have an action taken, etc. (unless you're using a messaging architecture, but that's relatively uncommon outside of enterprise apps).
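To make the "discrete chunks" point concrete, here is a small Python sketch using only the standard library. It is just an illustration: the tiny local server, the `PAGES` table and the `fetch` helper are invented for the example and stand in for Google and your browser. The client first asks for the HTML page, then has to make a second, separate request for the image the page refers to.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in "website": one HTML page that references one image.
PAGES = {
    "/": (b"<html><img src='/logo.png'></html>", "text/html"),
    "/logo.png": (b"fake image bytes", "image/png"),
}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        found = self.path in PAGES
        body, ctype = PAGES.get(self.path, (b"not found", "text/plain"))
        self.send_response(200 if found else 404)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch(port, path):
    """One self-contained request/response round trip."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    return resp.status, body


# Port 0 lets the OS pick a free port for the local server.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status_page, page = fetch(port, "/")              # request 1: the HTML
status_img, image = fetch(port, "/logo.png")      # request 2: the image it links to

server.shutdown()
server.server_close()
print(status_page, page)
print(status_img, image)
```

Every piece of state the server needs has to travel inside one of these requests, which is exactly what a desktop app never has to think about.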

webservice authentication and user identity management

My team and I are currently working on quite a large project. We are working on an online game, which will be accessible (for the moment) in two ways:
-Via a web browser, as a fully client-side JavaScript application using Ajax throughout (basically meaning that the UI will be managed in JS on the client side).
-Via an iPhone application (the UI will be managed by the application itself).
Between the two applications the core logic remains the same, so I believe (I could be wrong) that the best solution would be to create a web service (if possible following a standard such as REST) capable of performing all necessary operations.
Following this logic, I have encountered a problem: the authentication and identity management of the user. This is a problem because the applications' users need to be authenticated to perform certain operations.
I’ve looked into WS-Security, but this obviously requires passwords to be stored unencrypted on the server, which is not acceptable!
I then looked into OAuth, but at first glance this seemed like a lot of work to set up and not particularly suited to my needs (I don't like the way applications have to be approved, since it will be my application and my application only using the web service, not any external application).
I’ve read and heard about a lot of other ways to do what I want, but to be honest, I’m a little confused and I don’t know what information is reliable and what isn’t.
I would like to note that I’m using Symfony2 for the backend and jQuery for the client-side JavaScript.
Furthermore, I would like a detailed, step-by-step response, because I really am confused with all that I have read and heard.
Thank you for your time, and I hope someone can help me as it’s quite urgent.
Good evening

I'm not entirely sure if this answers your request, but since the UI will always be handled on the client side, I think you could use stateless HTTP authentication:
This is the firewall in security.yml:
security:
    firewalls:
        api:
            pattern: ^/api/ # or whatever path you want
            http_basic: ~
            stateless: true
And then the idea basically is that on the server, you use your normal user providers, encoders and whatnot to achieve maximal security, and on the client, you send the HTTP authentication headers, for example, in jQuery:
$.ajax("...", {
    username: "Foo",
    password: "bar"
});
Please note that since the authentication is stateless (no cookie is ever created), the headers have to be sent with every request, but, I figure, since the application is almost entirely client-side, this isn't a problem.
You can also check the Symfony2 security manual for further information on how to set up HTTP authentication. Also be sure to force HTTPS access in your ACL, so the requests containing the credentials are secured (requires_channel: https in your ACL definitions).
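For reference, this is roughly what the HTTP Basic header that gets sent with every request looks like. The sketch below is in Python rather than jQuery and is only meant to show the encoding; `basic_auth_header` is a made-up helper name, and the credentials are the placeholder ones from the snippet above.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    # Basic auth is just base64("username:password"); it is encoded, not encrypted.
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("Foo", "bar")
print({"Authorization": header})
```

Because base64 is trivially reversible, this is also why forcing HTTPS matters here.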
