Google Crawler Time Restriction

Does anyone know whether it is possible to set up some property that tells Googlebot to come and crawl the site only during a specific day or time period (e.g. during the weekend only)?
Thanks,

You can use an XML sitemap to give a hint about the appropriate crawling frequency, but it's only a hint, and requesting specific days is not possible.

You can advise Googlebot that you prefer a slower crawl rate (if your site is being crawled faster than the lowest rate), but this setting only takes effect for 90 days (see http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=48620).
Changing the robots.txt could be problematic as that is cached by Google, so disallowing crawling could result in the site not being crawled for far longer than intended.
Google has more than one bot type so you may be able to be selective over which parts of the site are appropriate for each of them to crawl, using robots.txt as it is intended. See http://www.google.com/support/webmasters/bin/answer.py?answer=40360.

I don't think so, but Googlebot does re-read your robots.txt quite frequently, so it might work to swap in an alternative robots.txt at those times, e.g. with a scheduled script.
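As a rough illustration of that idea (the file names and document root below are made up, and as noted above Google caches robots.txt, so the timing is not guaranteed):

    # Hypothetical cron-driven swap of robots.txt; file names and paths are placeholders.
    # Note: Google caches robots.txt, so changes may not take effect immediately.
    import shutil
    from datetime import date

    def swap_robots(docroot="/var/www/html"):
        # weekday(): Monday is 0, so Saturday is 5 and Sunday is 6.
        if date.today().weekday() >= 5:
            source = f"{docroot}/robots.allow.txt"      # permissive rules for the weekend
        else:
            source = f"{docroot}/robots.disallow.txt"   # restrictive rules for weekdays
        shutil.copyfile(source, f"{docroot}/robots.txt")

    if __name__ == "__main__":
        swap_robots()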

Related

S3 CloudFront expiry on images - performance very slow

I recently started serving my website images through the CloudFront CDN instead of S3, thinking it would be much faster. It is not; in fact it is much, much slower. After a lot of investigation I've been given hints that setting an expiry date on image objects is the key, as CloudFront will then know how long to keep cached static content. Makes sense. But this is poorly documented by AWS and I can't figure out how to change the expiry date. People have said "you can change this in the AWS console", but I cannot see where, and I've been at it for several hours. Needless to say I'm quite frustrated fumbling around on this. Any hints, however small, would be terrific. I like AWS, and what CloudFront promised, but so far it's not what it seems.
EDIT (ADDITIONAL DETAIL):
Added expiry date headers per the answer. In my case I had no headers at all. My hypothesis was that my slow CloudFront performance serving images was due to having NO expiry in the header. Having set an expiry date as shown in the screenshot and described in the answer, I'm seeing no noticeable difference in performance (going from no headers to adding an expiry date only). My site takes on average 7s to load with 10 core images (each <60 KB). Those 10 images (served via CloudFront) account for 60-80% of the load-time latency, depending on the performance tool used. Obviously something is wrong, given that serving the files from my VPS is faster. I hate to conclude that CloudFront is the problem given that so many people use it, and I'd hate to break away from EC2 and S3, but right now testing MaxCDN is showing better results. I'll keep testing over the next 24 hours, but my conclusion so far is that the expiry date header is just a confusing detail with no favorable impact. Hopefully I'm wrong, because I'd like to keep it all in the AWS family. Perhaps I'm barking up the wrong tree with the expiry date?
You will need to set it in the meta-data while uploading the file into S3. This article describes how you can achieve this.
The format for the expiry date is an RFC 1123 date, which looks like this:
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Setting this to a far-future date allows caches like CloudFront to hold the file for a long time, which speeds up delivery because the individual edge locations (servers all over the world delivering content for CloudFront) already have the data and don't need to fetch it again and again.
Even with a far-future Expires header, the first request for an object will still be slow, as the edge location has to fetch the object once before it can serve it from its cache.
Alternatively, you can omit the Expires header and use Cache-Control instead. CloudFront understands that one too, and it gives you more flexibility over the expiry. For example, you can state that the object should be held for one day from the first time the edge location requests it:
Cache-Control: public, max-age=86400
In that case the time is given in seconds rather than as a fixed date.
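As a minimal sketch of how such headers can be attached at upload time (using boto3; the bucket name, object key, and local file path are placeholders, and the article referenced above may use a different tool):

    # Sketch: upload an image to S3 with caching headers so CloudFront can cache it.
    # Bucket name, object key, and local file path below are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="local/images/logo.jpg",
        Bucket="my-bucket",
        Key="images/logo.jpg",
        ExtraArgs={
            "ContentType": "image/jpeg",
            "CacheControl": "public, max-age=86400",  # let edge locations keep it for one day
            "ACL": "public-read",
        },
    )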
While setting Cache-Control or Expires headers will improve the cache-ability of your objects, it won't improve your 60k/sec download speed. This requires some help from AWS. Therefore, I would recommend posting to the AWS CloudFront Forums with some sample response headers, traceroutes and resolver information.

Using JSON file to improve caching - good idea?

I'm a beginner to caching. I'm currently working on a small project with Django and will be implementing caching later via memcached.
I have a page with a video on it and the video has a bunch of comments. The only content on the page that is likely to change regularly is the comments and the "You are logged in as.../You are not logged in..." message.
I was thinking I could create a JSON file that serves the username and most recent comments, including it in the head with <script src="videojson.js"></script>. That way I could populate the HTML via JavaScript instead of caching the whole page on a per-user basis.
Is this a suitable approach, or is the caching system smarter than I give it credit for?
How is the JavaScript going to get the JSON object? Are you going to serve it from a Django view that the JavaScript calls? And in that view you will just pull it out of memcached if available and the DB if not?
That seems reasonable assuming your JSON isn't very big. If your comments change a lot and you have to spend a lot of time querying the DB, building the JSON object and saving it to memcached every time a new comment is written, it won't work well. But if you only fill the cache when your JSON expires, and you don't care about having the latest and greatest comments on there instantly, it should work.
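A rough sketch of that cache-aside pattern against a recent Django version (the view, model, and cache key below are hypothetical, not from the original post):

    # Sketch of a JSON endpoint that serves recent comments from memcached when
    # possible and falls back to the database. Names below are hypothetical.
    from django.core.cache import cache
    from django.http import JsonResponse

    from myapp.models import Comment  # hypothetical Comment model

    def recent_comments(request, video_id):
        key = "video-comments-%s" % video_id
        data = cache.get(key)
        if data is None:
            # Cache miss: build the payload from the database and store it briefly.
            comments = list(
                Comment.objects.filter(video_id=video_id)
                .order_by("-created")
                .values("author", "text", "created")[:20]
            )
            data = {"comments": comments}
            cache.set(key, data, 60)  # refill at most once a minute
        # The per-user part stays out of the shared cache entry.
        data["username"] = request.user.username if request.user.is_authenticated else None
        return JsonResponse(data)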
One thing to point out is that if you aren't getting that much traffic now, you might be adding a level of complexity that won't give you much return on your time spent. But if you are using this to learn how to do caching then it is a good exercise.
Hope that helps

How does a tool like SEOMoz Rank Checker work?

It seems there are a number of tools that allow you to check a site's position in search results for long lists of keywords. I'd like to integrate a feature like that in an analytics project I'm working on, but I cannot think of a way to run queries at such high volumes (1000s per hour) without violating the Google TOS and potentially running afoul of their automatic query detection system (the one that institutes a CAPTCHA if search volume at your IP gets too high).
Is there an alternative method for running these automated searches, or is the only way forward to scrape search result pages?
Use a third party to scrape it if you're scared of Google's TOS.
Google is very hot on temporarily banning/blocking IP addresses that appear to be sending automated queries. And yes, of course, this is against their TOS.
It's also quite difficult to know exactly how they detect this, but the main signal is definitely repeated identical keyword searches from the same IP address.
The short answer is basically: Get a lot of proxies
Some more tips:
Don't search further than you need to (e.g. the first 10 pages)
Wait around 4-5 seconds between queries for the same keyword
Make sure you use real browser headers and not something like "CURL..."
Stop scraping with an IP when you hit the roadblocks and wait a few days before using the same proxy again.
Try to make your program act like a real user would and you won't have too many issues.
You can scrape Google quite easily, but doing it at very high volume will be challenging.
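As a minimal illustration of those tips (realistic browser headers, a pause between queries, an optional proxy), with the caveat from above that automated querying is against Google's TOS; the proxy address and User-Agent string are placeholders:

    # Illustration only: realistic headers, a delay between requests, and a
    # placeholder proxy. Doing this at volume is against Google's TOS.
    import time
    import requests

    HEADERS = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    PROXIES = {"https": "http://user:pass@proxy.example.com:8080"}  # placeholder proxy

    def fetch_result_pages(keyword, pages=1):
        html = []
        for page in range(pages):  # don't go deeper than you need to
            resp = requests.get(
                "https://www.google.com/search",
                params={"q": keyword, "start": page * 10},
                headers=HEADERS,
                proxies=PROXIES,
                timeout=10,
            )
            resp.raise_for_status()
            html.append(resp.text)
            time.sleep(5)  # wait a few seconds between queries for the same keyword
        return html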

How do pagerank checking services work?

How do pagerank checking services work?
There's a PHP script here which should return the PageRank for you: http://www.pagerankcode.com/download-script.html
Almost all of those services hit the same service that the Google Toolbar uses. However, people at Google have said over and over not to focus on PageRank, and that it's only a small part of ranking.
That said, you can grab someone's (open source) SEO toolbar (just search for it) and open up the javascript to see how they're doing it.
Most services just copy what the Google toolbar shows. But PageRank is usually not the important thing; the important thing is to get quality backlinks with relevant anchor text.
Nick is right - Google PageRank is really not what you should be looking at. In fact, it might be going away. Instead, I would look at SEOmoz.org's metrics from their SEO toolbar. They use metrics called Page Authority (the general power of the site out of 100, most comparable to PageRank * 10), mozRank (how popular a site is, i.e. how many links it has and how good those links are), and mozTrust (how trustworthy the site is considered; for example, if a site is in a "bad neighborhood" and is linking to/linked to by a lot of spammy sites, it would have a low mozTrust). MozRank and mozTrust are out of 10.
The script at http://www.pagerankcode.com/download-script.html does not work on most well-known hosting providers, although it runs perfectly if you install a small Apache server on your own PC (XAMPP or similar).
I think the only way is to wait until Google releases a web service API capable of returning this rank (incredibly, there are APIs to query almost every Google service, except PageRank).

Django Performance

I am using Django with Apache mod_wsgi. My site has dynamic data on every page, and all of the media (CSS, images, JS) is stored in Amazon S3 buckets, linked via "http://bucket.domain.com/images/*.jpg" inside the markup. My question is: can Varnish still help me speed up my web server?
I am trying to knock down all the stumbling blocks here. Is there something else I should be looking at? I have run a query profiler on my code and each page renders in about 0.120 CPU seconds, which seems quick enough, but when I use ab -c 5 -n 100 http://mysite.com/ the result is only Requests per second: 12.70 [#/sec] (mean).
I realize that there are lots of variables at play, but I am looking for some guidance on things I can do and thought Varnish might be the answer.
UPDATE
here is a screenshot of my profiler
The only way you can improve your performance is to measure what is slowing you down. Though it's not the best profiler in the world, Django has good integration with the hotshot profiler (described here), and you can figure out what is taking those 0.120 CPU seconds.
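As an illustration of the general idea, here is a quick way to profile a single view from a Django shell, using Python's built-in cProfile rather than hotshot (which is long deprecated); the imported view is hypothetical:

    # Profile one view call to see where the 0.120 CPU seconds go.
    # Uses the standard-library cProfile; the imported view is hypothetical.
    import cProfile
    import pstats

    from django.test import RequestFactory
    from myapp.views import front_page  # hypothetical view

    request = RequestFactory().get("/")
    profiler = cProfile.Profile()
    profiler.runcall(front_page, request)

    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative").print_stats(20)  # top 20 calls by cumulative time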
Are you using 2 CPUs? If that's the case then perhaps the limitation is in the database when you use ab. I only say that because 0.120 * 12.70 is about 1.5, which means that of the 2 CPU-seconds available each second, roughly 0.5 are spent waiting for something. This could also be IO or something similar.
Adding another layer, such as Varnish, for no reason is generally not a good idea. The only case where something like Varnish would help is if you have slow clients with poor connections holding onto threads, but the ab test does not hit that condition, and frankly it's not a large enough issue to warrant the extra layer.
Now, the next topic is caching, which Varnish can help with. Are your pages customized for each user, or can they be static for long periods of time? Often pages are static except for a simple login status indicator; in that case, consider offloading the login status to JavaScript with cookies. If you are able to cache entire pages then they will be extremely fast in ab. However, the next problem is that ab is not really a good benchmark of your site, since users aren't going to just sit at one page and hit F5 repeatedly.
A few things to think about before you go installing varnish:
First off, have you enabled the page caching middleware within Django?
Are etags set up and working properly?
Is your database server performing optimally?
Have you considered setting up caching using memcached within your code for common results (particularly front pages and static pages displayed to non-logged-in users)? A minimal sketch follows this list.
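Here is a minimal sketch of those last points, using a memcached backend plus per-view caching; the backend class name assumes a recent Django version, and the location, timeout, and view are placeholders:

    # settings.py -- memcached-backed cache (location is a placeholder).
    # The backend class name assumes a recent Django release.
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # views.py -- cache a page that looks the same for all anonymous users.
    from django.shortcuts import render
    from django.views.decorators.cache import cache_page

    @cache_page(60 * 15)  # keep the rendered response for 15 minutes
    def front_page(request):
        return render(request, "front_page.html")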
Except for query-heavy dynamic pages that absolutely must display substantially different data for each user, 0.12 seconds seems like a long time to serve a page. Look at how you can work caching into your app to improve that performance. If you have a page that is largely static other than a username or something similar, cache the computed part of the page.
When Django's caching is working properly, ab on public pages should be lightning fast. If you aren't using any of the other features of Apache, consider using something lighter and faster like lighttpd or nginx.
Eric Florenzano has put together a pretty useful project called django-newcache that implements some advanced caching behavior. If you're running into the limitations of the built-in caching, consider it. http://github.com/ericflo/django-newcache