In Django, disable @login_required for search engine spiders

I'm looking for a clean way to let search engine spiders bypass @login_required and view pages that normally require a logged-in user. I could write middleware that would automatically log search engines into a dummy account, but that's not exactly what I'd call clean. Any suggestions for a better solution? Thanks.

Don't do this. This is 'cloaking', and can get you banned from Google's index.
Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.
Cloaking: http://www.google.com/support/webmasters/bin/answer.py?answer=66355
Instead, you need to implement Google's First Click Free solution. In this setup, the first click from a Google search result sees the full content, while subsequent clicks can be gated behind the login. This can be done on a referrer basis or a cookie basis (a sketch of the referrer variant follows below). You can read more about First Click Free here:
First Click Free: http://www.google.com/support/webmasters/bin/answer.py?answer=74536
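
For illustration, here is a rough sketch of the referrer-based variant in Django. The decorator name and the Google host check are assumptions of this sketch, not part of Django or of Google's programme, and a Referer header can of course be spoofed:

from functools import wraps
from urllib.parse import urlparse

from django.contrib.auth.decorators import login_required


def login_required_except_first_click(view_func):
    """Serve the full page to visits referred from Google search; require login otherwise."""
    protected = login_required(view_func)

    @wraps(view_func)
    def wrapper(request, *args, **kwargs):
        referer = request.META.get("HTTP_REFERER", "")
        host = urlparse(referer).hostname or ""
        if host == "www.google.com" or host.endswith(".google.com"):
            # "First click" from a search results page: show the full content.
            return view_func(request, *args, **kwargs)
        # Everything else goes through the normal login requirement.
        return protected(request, *args, **kwargs)

    return wrapper

A cookie-based variant would instead count the free clicks in a cookie and fall back to login_required once the allowance is used up.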

Why would you want to do this? If search engines can see the pages, then anyone can see them without being logged in, because the information would surface on the search engine's results page. In any case, the only way to identify a spider or bot is by its user agent string, which is trivial to spoof.

I don't get it: in "@login_required" you have an important word, "required". If it's required, it's for a good reason: in order to see the page, credentials are mandatory, because the content is private, secret, etc.
If you want your pages to be available via search engines, you have to make them public, and then login is no longer required. So your view should not be protected by the @login_required decorator.
Maybe your problem lies beyond the availability of your pages. Maybe your content is actually meant to be public, and your views should not be protected by this decorator. Maybe the only thing you need is to render the public part for every user (logged in or anonymous) and additionally render the private bits when the user is authenticated (a sketch follows below).
Otherwise, leaving a backdoor open for spiders is definitely a bad idea, because your private content won't be private anymore.
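
As a minimal sketch of that "public part plus private bits" pattern (the template name and the private_notes context key are made up for illustration), the view simply drops @login_required and branches on request.user:

from django.shortcuts import render


def article(request, article_id):
    """Public page for everyone; extra, private context only for authenticated users."""
    context = {"article_id": article_id}
    if request.user.is_authenticated:
        # Only logged-in users get the private bits.
        context["private_notes"] = "Anything that should stay behind the login"
    return render(request, "article.html", context)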

Related

Tracking unauthenticated users in Django

I need to track unregistered users in my Django website. This is for conversion optimization purposes (e.g. registration funnel, etc).
A method I've used so far is using IP address as a proxy for user_id. For various well-known reasons, this has led to fudged/unreliable results.
Can I sufficiently solve my problem via setting a session variable at server-side? An illustrative example would be great.
For example, I currently have a couple of approaches in mind. One is doing request.session["temp_id"] = random.randint(1, 1000000) and then tracking based on temp_id.
Another is setting a session variable every time an unauthenticated user hits my web app's landing page, like so:
if not request.session.exists(request.session.session_key):
    request.session.create()
From here on, I'll simply track them via request.session.session_key. Would this be a sound strategy? What major edge-cases (if any) do I need to be aware of?
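
A minimal sketch of that session-key idea, written as a custom middleware (the class name and the logging destination are illustrative; it must sit after SessionMiddleware and AuthenticationMiddleware in MIDDLEWARE):

import logging

logger = logging.getLogger(__name__)


class VisitorTrackingMiddleware:
    """Give every visitor a session and record the session_key for funnel analysis."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # session_key is None until the session is saved, so create one explicitly.
        if not request.session.session_key:
            request.session.create()
        if not request.user.is_authenticated:
            # Swap this for whatever analytics storage you actually use.
            logger.info("anonymous visit %s %s", request.session.session_key, request.path)
        return self.get_response(request)

One edge case to keep in mind: Django rotates the session key at login (and sessions expire), so treat the key as an anonymous-visitor id rather than a permanent identifier.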
Cookies are the simplest approach, but take into consideration that some users can have cookies turned off in their browsers.
So for those users you can use the browser's JavaScript storage to set some data (sessionStorage, for instance, is cleared once the browser is closed, which is fine for funnelling purposes). Still, other users can have JavaScript turned off.
Another approach would be to put a custom data key in every link of the page when generating the template. In other words, you would have the session id stored in the HTML page and sent through URL parameters on each click. Something similar happens with the CSRF token; look into that.

Parallel website running to my original website

We have been working on a gaming website. Recently, while making a note of the major traffic sources, I noticed a website that turned out to be a carbon copy of ours. It uses our logo and everything else is the same as ours, just under a different domain name. And it can't simply be that their domain name points at ours, because in several places the links look like ccwebsite/our-links. That website even has links to some images as ccwebsite/our-images.
What has happened? How could they have done that? What can I do to stop this?
There are a number of things they might have done to copy your site, including but not limited to:
Using a tool to scrape a complete copy of your site and place it on their server
Pointing their DNS name at your site
Manually re-creating your site as their own
Responding to requests to their site by scraping yours in real time and returning that as the response
etc.
What can I do to stop this?
Not a whole lot. You can try to prevent direct linking to your content by requiring referrer headers for your images and other resources so that requests need to come from pages you serve, but 1) those can be faked and 2) not all browsers will send those so you'd break a small percentage of legitimate users. This also won't stop anybody from copying content, just from "deep linking" to it.
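
As a rough sketch of that Referer check, assuming for illustration that the site runs Django and serves images under /media/ (the question does not say what stack is actually in use):

from urllib.parse import urlparse

from django.http import HttpResponseForbidden

# Hypothetical list of hosts allowed to embed our media.
ALLOWED_REFERRING_HOSTS = {"example.com", "www.example.com"}


class RefererCheckMiddleware:
    """Reject media requests whose Referer is not one of our own pages."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.path.startswith("/media/"):
            host = urlparse(request.META.get("HTTP_REFERER", "")).hostname
            if host not in ALLOWED_REFERRING_HOSTS:
                # Note: a missing Referer is rejected too, which is exactly the
                # trade-off described above (some legitimate browsers omit it).
                return HttpResponseForbidden("Hotlinking not allowed")
        return self.get_response(request)

In production this check would more usually live in the web server (e.g. nginx's valid_referers), since media is rarely served through the application itself.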
Ultimately, by having a website you are exposing that information to the internet. On a technical level anybody can get that information. If some information should be private you can secure that information behind a login or other authorization measures. But if the information is publicly available then anybody can copy it.
"Stopping this" is more of a legal/jurisdictional/interpersonal concern than a technical one I'm afraid. And Stack Overflow isn't in a position to offer that sort of advice.
You could run your site with some lightweight authentication. Just issue a cookie passively when they pull a page, and require the cookie to get access to resources. If a user visits your site and then the parallel site, they'll still be able to get in, but if a user only knows about the parallel site and has never visited the real site, they will just see a crap ton of broken links and images. This could be enough to discourage your doppelganger from keeping his site up.
Another (similar but more complex) option is to implement a CSRF mitigation. Even though this isn't a CSRF situation, the same mitigation will work. Essentially you'd issue a cookie as described above, but in addition insert the cookie value in the URLs for everything and require them to match. This requires a bit more work (you'll need a filter or module inserted into the pipeline) but will keep out everybody except your own users.
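
A sketch of that CSRF-style idea in Django terms (the cookie name, paths and token scheme are all assumptions of the sketch; in production the resource check would normally sit in the web server or a module in front of it, as the answer says):

import secrets

from django.http import HttpResponseForbidden

COOKIE_NAME = "site_token"  # hypothetical cookie name


class TokenMatchMiddleware:
    """Issue a per-visitor cookie and require resource URLs to carry the same value."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        token = request.COOKIES.get(COOKIE_NAME)
        if request.path.startswith(("/media/", "/static/")):
            # Resource URLs must carry ?t=<token> matching the visitor's cookie.
            if not token or request.GET.get("t") != token:
                return HttpResponseForbidden()
            return self.get_response(request)
        # Page request: make a token available to templates, which append it to
        # every resource URL (e.g. src="...?t={{ request.resource_token }}").
        request.resource_token = token or secrets.token_urlsafe(16)
        response = self.get_response(request)
        if not token:
            response.set_cookie(COOKIE_NAME, request.resource_token)
        return response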

Bypass specific URL from Akamai if certain cookie exist

I would like Akamai not to cache certain URLs if a specified cookie exists (i.e. if the user is logged in on specific pages). Is there any way we can do this with Akamai?
The good news is that I have done exactly this in the past for the Top Gear site (www.topgear.com/uk). The logic goes that if a cookie is present (in this case "TGCACHEKEY") then the Akamai cache is to be bypassed for certain URL paths. This basically turns off Akamai caching of HTML pages when logged in.
The bad news is that you require an Akamai consultant to make this change for you.
If this isn't an option for you, then Peter's suggestions are all good ones. I considered all of these before implementing the cookie based approach for Top Gear, but in the end none were feasible.
Remember also that Akamai strips cookies for cached resources by default. That may or may not affect you in your situation.
The Edge Server doesn't check for a cookie before it makes the request to your origin server, and I have never seen anything like that in any of their menus, configuration screens or documentation.
However, there are a few ways I can think of to get the effect that I think you're looking for.
You can specify in the configuration settings for the respective digital property what path(s) or URL(s) you don't want it to cache. If you're talking about a logged-in user, you might have a path that only they would get to, or you could set up such a thing server-side. E.g. for an online course you would have www.course.com/php.html that anybody could get to, whereas you might use www.course.com/student/php-lesson-1.html for the actual logged-in lesson content. Specifying that /student/* should not be cached would solve that.
If you are serving the same pages to both logged-in and not-logged-in users and can't do it that way, you could check server-side whether they're logged in and, if so, add a cache-breaker to the links, so that one is automatically appended whenever they follow a link. You could also do this client-side if you want, but it would be more secure and faster to do it server-side. As a note, the cache-breaker could be userid-random#. That would keep it unique enough, when combined with the page, that nobody else would request it and get the earlier 'cache-broken' page.
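
A small sketch of that server-side cache-breaker, written as a Django template tag purely for illustration (the tag name is made up, the site may well not be Django, and it assumes the request context processor is enabled):

import random

from django import template

register = template.Library()


@register.simple_tag(takes_context=True)
def cache_break(context, url):
    """Append a userid-random# cache-breaker to a URL when the visitor is logged in."""
    request = context["request"]
    if request.user.is_authenticated:
        sep = "&" if "?" in url else "?"
        return f"{url}{sep}cb={request.user.id}-{random.randint(0, 999999)}"
    return url

In a template (after loading the tag library) links would become <a href="{% cache_break '/content/page.html' %}">...</a>, so anonymous users keep hitting the cached URL while logged-in users bypass it.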
If neither of the above is workable, there is one other way I can think of, which is a bit unconventional to say the least, but it would work. Create a symbolically linked directory in your document root with another name so that you can apply the first option and exempt it from caching. Then you check whether the user is logged in and, if so, prepend the extra directory to the links. From Akamai's point of view www.mysite.com/logged-on/page.html can be exempt from cache whereas www.mysite.com/content/page.html is cached. On your server, if /logged-on/ symbolically links over to /content/ then you're all set.
When they log in you could send them to a subdomain which is set up as a ServerAlias, so on your side it's the same site, but on Akamai it has different cache handling rules.
Following the same approach as @llevera's answer, you can use cookies on CloudFlare without needing engineers to make the change for you.
Having that sort of cookie to bypass caching is a technique that is becoming more popular with time, and even big companies like Magento are using it for the Magento 2 platform.
But the solutions above are still valid. Maybe Akamai even supports this already now, we are in 2017!

Managing multiple accounts in one session with multiple tabs open

Scenario:
I have an administration application which manages the user accounts for another application. Now I would like to place a user-specific link (e.g. "Click here to log in as user1") in the administration application, allowing the admin to log in directly as that user in a separate browser window or tab (target="_blank").
Problem:
When the admin clicks two or more links and opens two tabs with tab1=user1 and tab2=user2, the last clicked tab overwrites the session variables of all the other tabs. Sure... that's how sessions work, but I wonder if there is a way to let the admin manage multiple user interfaces with one session in multiple tabs? I don't see a way to identify a specific tab in the browser so that I could say "user1 is logged in in tab1 and user2 in tab2".
Question:
Has anyone done something similar and would like to share the basic idea of solving this?
EDIT:
One possible solution could be to add a parameter to the URL with the userid and hand it through to every page, right?
As your edit points out, the way to do this is with a url variable that specifies who the user should be.
There are a number of security issues with this approach, though.
I'm assuming that your initial link is doing some sort of security check to make sure that the initial "log in" of the user is an authorized request. You'll need to do a similar thing for this method. If your initial request is something like http://example.com/page.cfm?userid={id}&authtoken={encryptedtoken} I would then put that userid into the session scope as a valid userid that the admin can impersonate. The more links they click on the more users they can impersonate. On subsequent requests you check the requested userid against the allowed list in the session and either allow or deny the impersonation.
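
A sketch of that allowed-list check, written in Django terms to match the rest of this page (the middleware name, parameter names and the use of django.core.signing for the authtoken are all assumptions of the sketch; the admin application would sign the userid with the same key when it builds the initial link):

from django.core import signing
from django.http import HttpResponseForbidden


def verify_authtoken(request):
    """True if the link carries an authtoken that is a valid signature of the userid."""
    token = request.GET.get("authtoken")
    if not token:
        return False
    try:
        return signing.Signer().unsign(token) == request.GET.get("userid")
    except signing.BadSignature:
        return False


class ImpersonationMiddleware:
    """Track which userids the admin is allowed to impersonate in this session."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        user_id = request.GET.get("userid")
        if user_id:
            allowed = request.session.get("impersonation_allowed", [])
            if verify_authtoken(request):
                # Initial, signed link: remember this userid for the session.
                if user_id not in allowed:
                    allowed.append(user_id)
                    request.session["impersonation_allowed"] = allowed
            elif user_id not in allowed:
                return HttpResponseForbidden("Not authorized to impersonate this user")
            # Downstream code can branch on this instead of the real session user.
            request.impersonated_user_id = user_id
        return self.get_response(request)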
You'll also need to update all the links on the site so that they include the userid in them. The easier way to do this is to cheat and use jQuery or the like to rewrite all internal URLs with the userid appended. You would conditionally include that JavaScript based on the above check.
Lastly, you'll likely want to prevent these URLs that include the userid from appearing in search engines, if it's not a fully locked-down site. You'll either need to use canonical URLs to remove the userid, or set X-Robots-Tag headers to tell search engines not to index the URLs where the userid has been specified; or both.
That's the most primitive method of getting different "sessions" for multiple users in the same browser. However you'll then bump into issues if you're using the session scope for anything meaningful, because each tab will try overwriting the other. You'll need to overwrite the normal site session variables on each request, or you'll need to create different structures in the session scope for each userid that is used. How much of a problem this is depends on your application.
It's a do-able thing, but probably a lot more work than you were hoping for.
The other option is to get the admins to use Google Chrome with multiple profiles and copy and paste the login url into different profile windows. A slight inconvenience for them, but a lot less work for you.

Is it dangerous to leave your Django admin directory under the default url of admin?

Is it dangerous to have the admin interface of a Django app accessible at just the plain old admin URL? For security, should it be hidden behind an obfuscated URL, something like a long random UUID?
Also, if you create such an obfuscated link to your admin interface, how can you avoid anyone finding out where it is? Does Googlebot know how to find that URL if there is no link to it anywhere on your site or on the internet?
You might want to watch out for dictionary attacks. The safest thing to do is to IP-restrict access to that URL using your web server configuration. You could also rate-limit access to that URL - I posted an article about this last week.
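
If you would rather keep it in Django than in the web server, a sketch of the IP restriction could look like this (the allow-list is hypothetical, and REMOTE_ADDR is only trustworthy if you are not behind a proxy):

from django.http import HttpResponseNotFound

# Hypothetical office/VPN addresses that may see the admin.
ADMIN_ALLOWED_IPS = {"203.0.113.10", "203.0.113.11"}


class AdminIPRestrictMiddleware:
    """Hide the admin from any IP address that is not on the allow-list."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.path.startswith("/admin/"):
            if request.META.get("REMOTE_ADDR") not in ADMIN_ALLOWED_IPS:
                # Pretend the page does not exist rather than advertising it.
                return HttpResponseNotFound()
        return self.get_response(request)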
If a URL appears nowhere on the internet, "the Googlebot" can't know about it... unless somebody tells it. Unfortunately many users have toolbars installed in their browser which submit every URL the browser visits to various servers (e.g. Alexa, Google).
So keeping a URL secret will not work in the long run.
Also, a UUID is hard to remember and to type, leading to additional support requests ("What was the URL again?").
But I still strongly suggest changing the URL (e.g. to /myadmin/). This will foil automated scanning and attack tools, so if one day a "great Django worm" hits the internet, you have a much lower chance of being hit.
People using phpMyAdmin have had this experience for the last few years: changing the default URL avoids most attacks.
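
Changing the URL is a one-line change in your URLconf. A minimal sketch on a recent Django version ("myadmin/" is just an example path; older versions use url()/include() instead of path()):

# urls.py
from django.contrib import admin
from django.urls import path

urlpatterns = [
    # Mount the admin somewhere other than the default "admin/".
    path("myadmin/", admin.site.urls),
]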
Whilst there is no harm in adding an extra layer of protection (an obfuscated URL), enforcing good password choice (checking password strength and checking that it's not in a large list of common passwords) would be a much better use of your time.
Assuming you've picked a good password, no, it's not dangerous. People may see the page, but they won't be able to get in anyway.
If you don't want Google to index a directory, you can use a robots.txt file to control that.
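
A minimal robots.txt sketch, assuming the admin still lives under /admin/ (note that disallowing an obfuscated path would also announce it to anyone who reads the file):

User-agent: *
Disallow: /admin/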