Unreachable robots.txt in Django app

Received a notice from Google Webmaster Tools that Google's crawler has stopped crawling one particular site because of an "Unreachable robots.txt" error. Unfortunately, Google doesn't give any additional details about the crawler error beyond that.
I have
<meta name="robots" content="index, follow">
included as one of the meta tags in my base.html template, which I do for every Django app, and I'm not having this problem with any of my other sites. Correct me if I'm wrong, but I also thought a robots.txt file isn't necessary for Google to index a site.
I tried to resolve this by installing and configuring django-robots (https://github.com/jezdez/django-robots) and adding this to my URL conf:
(r'^robots\.txt$', include('robots.urls')),
My latest Google crawler fetch (after pushing django-robots to prod) is still returning the same error, though.
I don't have any special crawl rules and would be fine without a robots.txt file at all, so that Google indexes the entire site. Anyone have thoughts on a quick fix before I go experiment with the other two methods mentioned here: http://fredericiana.com/2010/06/09/three-ways-to-add-a-robots-txt-to-your-django-project/?

I tried removing the robots.txt line from urls.py completely and fetching as Google, but that didn't resolve the issue:
(r'^robots\.txt$', include('robots.urls')),
I fixed this by modifying my root URLconf slightly:
from django.http import HttpResponse
(r'^robots\.txt$', lambda r: HttpResponse("User-agent: *\nDisallow: /*", mimetype="text/plain")),
Now Googlebot is crawling it OK. I wish I understood better why this specific solution was effective for me, but it works.
Thanks to Ludwik for the assistance.
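As an aside, on modern Django the same idea looks like this (a minimal sketch: the mimetype argument was later replaced by content_type, and note that an empty Disallow: allows all crawling, whereas Disallow: /* is read as "block everything" by wildcard-aware crawlers such as Googlebot):
from django.http import HttpResponse
from django.urls import path

def robots_txt(request):
    # An empty Disallow line permits crawlers to fetch the whole site
    return HttpResponse("User-agent: *\nDisallow:\n", content_type="text/plain")

urlpatterns = [
    path("robots.txt", robots_txt),
]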

If you have permission, add an alias to your virtual host (in the Apache config file):
Alias /robots.txt /var/www/---your path ---/PyBot/robots.txt
Similarly for the favicon:
Alias /favicon.ico /var/www/aktel/workspace1/PyBot/PyBot/static/favicon.ico

Related

Django admin Resource Policy COEP ERR_BLOCKED_BY_RESPONSE

The static files of my Django admin site are on an S3 bucket (DigitalOcean Spaces, actually), and in the Console I get ERR_BLOCKED_BY_RESPONSE.NotSameOriginAfterDefaultedToSameOriginByCoep 200.
In the Network panel, all the static files are considered third party and blocked for this reason (not same origin).
The response to any one of these files contains a "not set cross-origin-resource-policy" error which says:
To use this resource from a different origin, the server needs to specify a cross-origin resource policy in the response headers.
What I tried:
Following the error message, I tried to set a response header on the resources, something like Cross-Origin-Resource-Policy: cross-origin. But in DigitalOcean Spaces I cannot set headers other than Content-Type, Cache-Control, Content-Encoding, Content-Disposition, and custom x-amz-meta- headers.
I tried to extend the Django admin/base.html template, duplicate a few link tags, and manually set a crossorigin attribute on them. This way the resources are queried twice: one query is blocked as before and the other one works. The only difference in the headers is that Origin is set. Is there a way to tell Django to add a crossorigin attribute to all link, script, and img tags of the Django admin templates?
I tried to remove the Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers on the ingress load balancer, which I guess cause the blocking, by setting them to unsafe-none. Even though I think it should work with that policy, the change had no effect on the problem, which I don't understand.
What I didn't try:
I found a tutorial explaining how to set custom headers on S3 responses. The idea is to have a Lambda function in front that rewrites an x-amz- header to a standard header. I'm not sure I can easily replicate this with DigitalOcean Functions.
My workaround:
The ugly hack is to duplicate all Django admin templates and manually add a crossorigin attribute where needed.
I don't know where this comes from; a few weeks ago it was all good. Any help appreciated.
With HTTP/2 and later it's more efficient to serve assets from the same domain, since they can be served over a single connection. Most sites should be doing this. WhiteNoise is a popular solution for doing so with minimal configuration: https://whitenoise.evans.io/en/stable/
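A minimal sketch of that setup, following the WhiteNoise docs (only the relevant settings are shown; BASE_DIR is assumed to be the usual pathlib Path from a standard settings.py):
MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",  # directly after SecurityMiddleware
    # ... the rest of your middleware ...
]

STATIC_URL = "/static/"
STATIC_ROOT = BASE_DIR / "staticfiles"  # where collectstatic gathers files

# Optional: compression and hashed filenames for far-future caching
STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"
Serving the admin's static files same-origin sidesteps the Cross-Origin-Resource-Policy requirement entirely.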
This solves the problem.
Thanks to Adam Johnson from the Django project.

Nuxt SSG app not routing correctly on google cloud bucket, goes to dir/index.html on reload

I followed this tutorial on how to host statically generated sites on Google Cloud buckets. The issue I am having is that the Nuxt app works fine when routing internally, but when I reload a route, the site is redirected to whatever route plus /index.html at the end.
This causes things to break, as it conflicts with Nuxt routing. I set index.html to be my entry point with
gsutil web set -m index.html -e 404.html gs://my-static-assets
but it seems to always assume that to be the case. I don't have this problem using Netlify.
According to the tutorial you're following, you're redirected to /route/index.html because of your MainPageSuffix of index.html: when you try to access, for example, https://example.com/route, the service looks for https://example.com/route/index.html.
To fix this, I suggest that you include an index.html file in the subdirectory of each route.
For further info, I recommend checking this post, which covers a similar issue.
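For reference, the same website configuration that the gsutil command applies can also be set with the google-cloud-storage Python client (a sketch, assuming application default credentials; the bucket name is the one from the question):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-static-assets")
# Equivalent to: gsutil web set -m index.html -e 404.html gs://my-static-assets
bucket.configure_website(main_page_suffix="index.html", not_found_page="404.html")
bucket.patch()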

Subdirectory pages not found static site hosted on Google Cloud Storage Bucket

I'm setting up a static site on a Google Cloud Storage bucket behind a load balancer. The site is generated with Gridsome, and the dist folder is saved in the bucket.
I have set the index and error pages with gsutil as in the documentation: https://cloud.google.com/storage/docs/gsutil/commands/web
Now I am facing a problem: every URL for accessing subdirectories gets redirected to dir/index.html. This is the desired behavior, and the dir/index.html page even exists in the bucket. But I still get a 404 - not found.
If I curl the URL subdir/index.html, I get the HTML.
I do not know exactly how you are testing your subfolder, but I think this link can help you with your issue: Error 404 when loading subfolder on GCS. In addition, you may want to take a look at How subdirectories work.
Based on How subdirectories work on GCS, when the browser requests the URL http://www.example.com/dir, it is redirected (301) and the object http://www.example.com/dir/index.html is served.
My assumption is that there is no route http://www.example.com/dir/index.html in Vue (vue-router), so it falls through to the Not Found 404 page.
The simple solution is to change all subdirectory links from
http://www.example.com/dir, http://www.example.com/about, etc., to
http://www.example.com/dir/, http://www.example.com/about/
Then it will not land on the 404 page when you request a subdirectory URL or reload the browser. But we all know that's not best practice.
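To see the redirect behavior described above for yourself, here is a quick check with the requests library (a sketch; www.example.com stands in for the real host):
import requests

# Without a trailing slash, GCS website serving answers with a 301 redirect
resp = requests.get("http://www.example.com/dir", allow_redirects=False)
print(resp.status_code, resp.headers.get("Location"))

# With a trailing slash, the dir/index.html object is served directly
resp = requests.get("http://www.example.com/dir/", allow_redirects=False)
print(resp.status_code)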

Google couldn't crawl your site because we were unable to access your site's robots.txt file

I verified my site using Google Webmaster Tools. I made my website in Django, and I also added a robots.txt.
Now Google is showing a green tick mark, which I take to mean DNS and server connectivity are good, but a red warning mark on the robots.txt fetch.
My robots.txt looks like this:
User-agent: *
Disallow:
Does Google take time to crawl a site, or do I have errors in my robots.txt or its settings?
When I open robots.txt from my site, like mysite.com/robots.txt, I can see the robots.txt file.
Also, when I run the robots.txt test in Webmaster Tools, it gives an "allowed" result.
My site is not even showing up in Google search.
So why isn't Google crawling my site?
Google generally caches robots.txt and usually tries to recrawl it within 24 hours. If you are sure there is nothing wrong with your robots.txt, you probably just have to wait.
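One way to double-check the file exactly as a crawler would read it is Python's built-in robots.txt parser (a sketch; mysite.com is a placeholder):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://mysite.com/robots.txt")
rp.read()
# With "User-agent: *" and an empty Disallow, everything is allowed
print(rp.can_fetch("Googlebot", "https://mysite.com/"))  # expected: True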

Microsoft Azure appending extra query string to urls with query strings

In deploying a version of the Django website I'm working on to Microsoft's Azure service, I added a page which takes a query string like
http://<my_site_name>.azurewebsites.net/security/user/?username=<some_username>&password=<some_password>
However, I was getting 404 responses to this URL. So I turned on Django's DEBUG flag, and the page returned said:
Page not found (404)
Request Method: GET
Request URL: http://<my_site_name>.azurewebsites.net/security/user/?username=<some_username>&password=<some_password>?username=<some_username>&password=<some_password>
Using the URLconf defined in <my_project_name>.urls, Django tried these URL patterns, in this order:
^$
^security/ ^user/$
^account/
^admin/
^api/
The current URL, security/user/?username=<some_username>&password=<some_password>, didn't match any of these.
So it seems to be appending the query string onto the end of a URL that already has the same query string. I have the site running on my local machine and on an IIS server on my internal network, which I use for staging before pushing to Azure. Neither of those deployments does this, so it seems to be something specific to Azure.
Is there something I need to set in the Azure website management interface to prevent it from modifying URLs with query strings? Or is there something I'm doing wrong with regard to using query strings with Azure?
In speaking with the providers of wfastcgi.py, they told me it may be an issue in wfastcgi.py itself that is causing this problem. While they look into it, they gave me a workaround that fixes the issue.
Download the latest copy of wfastcgi.py from http://pytools.codeplex.com/releases
In that file, find this part of the code:
if 'HTTP_X_ORIGINAL_URL' in record.params:
    # We've been re-written for shared FastCGI hosting, send the original URL as the PATH_INFO.
    record.params['PATH_INFO'] = record.params['HTTP_X_ORIGINAL_URL']
And add right below it (still part of the if block):
    # PATH_INFO is not supposed to include the query parameters, so remove them
    record.params['PATH_INFO'] = record.params['PATH_INFO'].split('?')[0]
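In isolation, the effect of that added line looks like this (a toy illustration with made-up values):
# Everything from the first '?' onward is dropped
path_info = '/security/user/?username=u&password=p?username=u&password=p'
print(path_info.split('?')[0])  # prints: /security/user/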
Then upload/deploy this modified file to the Azure site (either use FTP to put it somewhere or add it to your site deployment; I'm deploying it so that if I need to modify it further, it's versioned and backed up).
In the Azure management page for the site, go to the site's Configure page, change the handler mapping to point to the modified wfastcgi.py file, and save the configuration.
For example, my handler used to be the default D:\python27\scripts\wfastcgi.py. Since I deployed my modified file, the handler path is now D:\home\site\wwwroot\wfastcgi.py.
I also restarted the site, but you may not have to.
This modified script now strips the query string from PATH_INFO, and URLs with query strings work. I'll be using this until I hear from the wfastcgi.py devs that the default wfastcgi.py file in the Python27 install has been fixed or replaced.