I saw on this page that it's possible to lower the limit on your sitemap so that it is paginated differently:
Caching sitemaps in django
But when I try to generate my sitemap, it hangs and hangs, and never comes up.
Eventually, if I wait long enough, I get this error in Firefox:
XML Parsing Error: no element found
Location: http://sitename.com/sitemap.xml
Line Number 1, Column 1:
My site has about 70K pages at present, so I'm using the index generator in urls.py. For some reason, though, it's not working. I'm guessing my server lacks the power to generate sitemaps containing 70K links, but I'm not at all sure.
Does anybody have any insight?
One thing you could do is to split your huge sitemap into several files.
Each file could cover a different content area of your site, and you could then cache them individually, since the sections probably don't all update at the same frequency.
http://docs.djangoproject.com/en/dev/ref/contrib/sitemaps/#creating-a-sitemap-index
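For illustration, a rough sketch of what that split setup can look like in urls.py, following the docs' index/section pattern (the sitemap class names and the app path are made up here; limit is the per-file cap each class can set):

from django.conf.urls import url
from django.contrib.sitemaps import views as sitemap_views
from myapp.sitemaps import ArticleSitemap, StaticPagesSitemap  # hypothetical classes

sitemaps = {
    'articles': ArticleSitemap,   # e.g. set limit = 5000 on the class
    'static': StaticPagesSitemap,
}

urlpatterns = [
    # /sitemap.xml is just a small index pointing at the per-section files
    url(r'^sitemap\.xml$', sitemap_views.index, {'sitemaps': sitemaps}),
    # /sitemap-articles.xml etc., each paginated according to `limit`
    url(r'^sitemap-(?P<section>.+)\.xml$', sitemap_views.sitemap,
        {'sitemaps': sitemaps}, name='django.contrib.sitemaps.views.sitemap'),
]

You could also wrap the section view in cache_page if some sections rarely change.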
I finally figured this out. Turns out I had a misconfiguration in my urls.py. Ugh.
Big sitemaps? Try django-fastsitemaps
I got a site, say it's "www.site.com". I need to serve a text file at the root, so that "www.site.com/text.txt" will show up.
I read through this answer: Django download a file, and I used the "download" function.
However I am at a loss how to configure the URL. Can someone help me out?
Say I put this into my urls.py:
url('^(?P<path>.*)/$', san.views.FileDownload.as_view()),
This then supersedes all my other URL patterns and renders them useless. How do I make this work? Thanks!
Put it as the last urlpattern in your urls.py to ensure it doesn't scoop up everything. It should not have the trailing / either, i.e.
url('^(?P<path>.*)/?$', san.views.FileDownload.as_view()),
This will match the request "/YOUR_FILE.txt", and is also case-insensitive.
url(r'^(?i)YOUR_FILE.txt$', san.views.FileDownload.as_view()),
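To make the ordering concrete, here is a minimal sketch of a urls.py (HomeView is just a placeholder for your existing patterns, not something from the question):

from django.conf.urls import url
from san import views as san_views

urlpatterns = [
    url(r'^$', san_views.HomeView.as_view()),            # your normal patterns first
    url(r'^(?i)YOUR_FILE\.txt$', san_views.FileDownload.as_view()),
    # Or keep a catch-all, but only as the very last entry so it can't
    # shadow anything above it:
    url(r'^(?P<path>.*)/?$', san_views.FileDownload.as_view()),
]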
Thanks to everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for the Chinese stock market.
When I tried to get the first number, "42177", just under the banner of this page (the number you see on that webpage may not match the one in the picture shown here, because it represents the number of times this article has been read and is updated in real time), I always get empty content. I am aware that this might be a dynamic-content issue, but I don't yet have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[#id='zwmbti']/div[#id='zwmbtilr']/span[#class='tc1']/text()").extract()
I think the XPath is set correctly, and I have checked the return value of this response; it indeed told me that there is nothing under that node. The result is: 'read': [u'<div id="zwmbtilr"></div>']
If the content were loaded, there should be something between <div id="zwmbtilr"> and </div>.
I'd really appreciate it if you could share any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There is nothing inside the <div id='zwmbtilr'></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic content issue.
Your first option is to try to identify the request generated by the JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the usual next option is a package that provides JavaScript/browser emulation, something like ScrapyJS or Scrapy + Selenium.
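For the first option, the flow is roughly: open the page with the browser's dev tools, watch the Network tab for the request that returns the read count, then replay it from the spider. A hedged sketch of that idea (the API URL and JSON key below are placeholders, not the board's real endpoint):

import json
import scrapy

class ReadCountSpider(scrapy.Spider):
    name = "read_count"
    start_urls = ["http://example.com/some-article.html"]  # placeholder

    def parse(self, response):
        # The <div id="zwmbtilr"> is empty in the raw HTML, so call whatever
        # endpoint the page's JavaScript uses to fill it in.
        api_url = "http://example.com/api/read_count?article_id=123"  # placeholder
        yield scrapy.Request(api_url, callback=self.parse_count)

    def parse_count(self, response):
        data = json.loads(response.body)
        yield {"read": data.get("count")}  # the key name is a guess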
When I process 56 files with Solr it says 'numDoc: 74'. I have no clue as to why more indexes would exist than files processed, but one explanation I came up with is that the indexes of a couple of the processed files are too big, so they are split up into multiple indexes (I use rich content extraction on all processed files). It was just a thought, so I don't want to take it as true right off the bat. Can anyone give an alternate explanation or confirm this one?
I'm using Django + Haystack + Solr.
Many thanks
Your terminology is unfortunately all incorrect, but the troubleshooting process should be simple enough. Solr comes with an admin console, usually at http://[localhost or domain]:8983/solr/. Go there, find your collection in the drop-down (I am assuming Solr 4) and run the default query on the Query screen. That should give you all your documents, and you can see what the extras are.
I suspect you may have some issues with your unique ids and/or reindexing. But with the small number of documents you can really just review what you are actually storing in Solr and figure out what is not correct.
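If you prefer the command line to the admin UI, here is a quick Python 2 check along the same lines (the host, port and core name "collection1" are assumptions, so adjust them to your Solr/Haystack settings):

import json
import urllib
import urllib2

params = urllib.urlencode({"q": "*:*", "fl": "id", "rows": 100, "wt": "json"})
resp = json.load(urllib2.urlopen("http://localhost:8983/solr/collection1/select?" + params))

print "numFound:", resp["response"]["numFound"]
for doc in resp["response"]["docs"]:
    # Repeated or unexpected ids usually point at a unique-key or reindexing problem.
    print doc["id"]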
I'm using Selenium to scrape / parse an awful web site (if it wasn't awful, I might not use Selenium, and yes, respecting robots.txt).
I'm reading a set of links from a table of unknown size, with sequential element ids, using find_element_by_id(). I'm catching NoSuchElementException to tell me that I'm at the end of the table and there are no other elements to pick up.
This smoothly walks through the elements that exist, but takes about 30 seconds to throw the error when I request the non-existent element that tells me I'm at the end of the table.
The file is not that huge: the HTML dump from DOM Inspector is an 81 KB file. The last link in the table (which Selenium finds quickly) is seven-eighths of the way through the file, so (assuming Selenium parses sequentially) file size alone doesn't seem to explain this.
Can I speed up the failure of finding the missing element? Or is there a more elegant way to know I am in the last row of the table with content?
You might want to do this using css selectors instead.
driver.find_elements_by_css_selector('[id^=id_name]')
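In Python that would look something like the sketch below; because find_elements (plural) returns a plain list, reaching the end of the table just means running out of list items instead of waiting for NoSuchElementException (the "link_" prefix and process() are placeholders):

# "link_" stands in for whatever prefix your sequential ids share.
links = driver.find_elements_by_css_selector("[id^=link_]")
for link in links:
    process(link)  # hypothetical per-row handler
# No exception needed: when the list is exhausted, you are past the last row.

If you stay with the one-id-at-a-time lookups, the 30-second stall is typically the driver's implicit wait timing out before the exception is raised.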
I have a website (Django) on a Linux server, but Google isn't finding the site at all. I know that I don't have a robots.txt file on the server. Can someone tell me how to create one, what to write inside it, and where to place it? That would be a great help!
robots.txt is not what makes Google find your site. I think you need to register your site with Google and also add a sitemap.xml.
Webmaster Tools - Crawl URL ->
https://www.google.com/webmasters/tools/submit-url?continue=/addurl&pli=1
Also see this for robots.txt:
Three ways to add a robots.txt to your Django project | fredericiana
-> http://fredericiana.com/2010/06/09/three-ways-to-add-a-robots-txt-to-your-django-project/
What is robots.txt?
It is great when search engines frequently visit your site and index your content but often there are cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the printing version excluded from crawling, otherwise you risk being imposed a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will also prefer that search engines do not index these pages (although in this case the only sure way for not indexing sensitive data is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and javascript from indexing, you also need a way to tell spiders to keep away from these items.
One way to tell search engines which files and folders on your Web site to avoid is with the use of the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed. A better way to inform search engines about your will is to use a robots.txt file.
from What is Robots.txt -> http://www.webconfs.com/what-is-robots-txt-article-12.php
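One of the simpler approaches along the lines of that article is to serve robots.txt straight from urls.py with a template view; a minimal sketch (it assumes you add a templates/robots.txt file on your template path):

from django.conf.urls import url
from django.views.generic import TemplateView

urlpatterns = [
    url(r'^robots\.txt$',
        TemplateView.as_view(template_name='robots.txt',
                             content_type='text/plain')),
]

A robots.txt that allows everything is just two lines: "User-agent: *" followed by "Disallow:".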
robots.txt files are used to tell search engines which content should or should not be indexed. A robots.txt file is in no way required for a search engine to index your site.
There are a number of things to note about being indexed by search engines:
There is no guarantee you will ever be indexed.
Indexing takes time: a month, two months, six months.
To get indexed quicker, try sharing a link to your site through blog comments etc. to increase the chances of being found.
Submit your site through the http://google.com/webmasters site; this will also give you hints and tips to make your site better, as well as crawling stats.
Put robots.txt in the same directory as views.py, and use this code.
In the view:
def robots(request):
    import os.path
    from django.http import HttpResponse
    BASE = os.path.dirname(os.path.abspath(__file__))
    # Read the file before returning it; closing it first (as in the original
    # snippet) leaves the response empty.
    with open(os.path.join(BASE, 'robots.txt')) as robots_file:
        content = robots_file.read()
    return HttpResponse(content, content_type='text/plain')
In urls.py:
(r'^robots\.txt$', 'aktel.views.robots'),