Is it possible to list sitemaps for different domains in the same robots.txt file? - sitecore

We have multiple websites served from the same Sitecore instance and same production web server. Each website has its own primary and Google-news sitemap, and up to now we have included a sitemap specification for each in the .NET site's single robots.txt file.
Our SEO expert has raised the presence of different domains in the same robots.txt as a possible issue, and I can't find any documentation definitively stating one way or the other. Thank you.

This should be OK for Google at least. It may not work for other search engines such as Bing, however.
According to https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt:
sitemap: [absoluteURL]
[absoluteURL] points to a Sitemap, Sitemap Index file or equivalent URL. The URL does not have to be on the same host as the robots.txt file. Multiple sitemap entries may exist. As non-group-member records, these are not tied to any specific user-agents and may be followed by all crawlers, provided it is not disallowed.
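
To make the format concrete, here is a minimal sketch that parses a robots.txt listing sitemaps on more than one host. The host names and file names are made up for illustration, and it uses Python's standard-library parser (`site_maps()` needs Python 3.8+). It only shows that the entries are plain absolute URLs read independently of any user-agent group; it says nothing about how a particular search engine treats them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served by one host, listing sitemaps on several hosts.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.site-a.example/sitemap.xml
Sitemap: https://www.site-a.example/news-sitemap.xml
Sitemap: https://www.site-b.example/sitemap.xml
"""


def list_sitemaps(robots_txt: str) -> list:
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines were found.
    return parser.site_maps() or []


for url in list_sitemaps(ROBOTS_TXT):
    print(url)
```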

The best way to achieve this is to handle robots.txt from the Sitecore content tree.
We have a similar structure, where we deliver multiple websites from a single Sitecore instance.
I have written a blog post about this; please find it below. It is exactly what you want.
http://darjimaulik.wordpress.com/2013/03/06/how-to-create-handler-in-sitecore/

Related

How can I find Drupal pages not listed in XML Sitemap module?

I recently inherited a Drupal 8/9 site. I've got the XML Sitemap module deployed. Is there a way for me to find pages that are NOT included in the sitemap? I'm concerned there's a lot of ROT that I can't see.
Thanks!
I can't think of a way to find pages that are not listed in the sitemap.xml.
Maybe, what you can do is just enable sitemap config on every content type, vocabulary, etc. Then you'd queue and generate sitemap again.

Decoupled CMS, Selective crawling of the db server

We are running a decoupled CMS (using WordPress as our db) and want to prevent search engines from crawling our posts from this server. We have post templates on that server so writers can preview their posts, and Google has found them.
Am I able to detect that a crawler is trying to access these pages in my .htaccess file and redirect it to the www server? Is redirecting the wrong solution here? Can robots.txt block a generic pattern such as category/post-title?
Three things still need to happen:
We still need to be able to access db.site.com/wp-admin.
Writers still need to preview their posts, which means they cannot be redirected.
db.site.com/wp-content/uploads needs to be accessible so social sites can pull images.
Here is how the posts are set up. Basically, I want to block or redirect posts from db.site.com:
db.site.com/category/post-title
www.site.com/category/post-title

Sitecore - multisite - displaying page from wrong site

We have a multisite Sitecore setup with 2 sites within the same .NET solution.
This works by setting the rootPath property on a Site Definition in web.config to limit the site to part of the Sitecore folder structure.
This works well, except that when a page is created with the same name as a page in the other site, content from the other site is served! We have inherited a fair bit of custom code in this solution from the other site, so this may be the cause, but I don't know what I'm looking for ...
Thanks
How are you referencing the sites? Do they each have their own host name? Do you have the "hostName" property set for the site node in the Site Definition?
I will assume that you are not referring to them this way and instead, the sites are using the "virtualFolder" property. If both sites have the same default value of "/" for virtualFolder, attempting to get to either site will result in Sitecore rendering the first site that it matches on, which would be the site listed first.
Try putting the actual site name for "virtualFolder" and "physicalFolder" (e.g. "Site1" and "Site2", respectively). Then you can address your sites as http://yourserver.com/Site1 and http://yourserver.com/Site2. The "virtualFolder" will match first and render the correct site.
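
The first-match behaviour described above can be sketched roughly as follows. This is not Sitecore's code, just an illustration in Python of why two sites that both use "/" collide; the `resolve_site` helper and the dictionary layout are invented for the example.

```python
from typing import Optional

# Hypothetical site definitions, in the order they would be listed in config.
SITES = [
    {"name": "Site1", "virtualFolder": "/Site1"},
    {"name": "Site2", "virtualFolder": "/Site2"},
]


def resolve_site(path: str, sites: list) -> Optional[str]:
    """Return the name of the first site whose virtualFolder prefixes path."""
    for site in sites:
        prefix = site["virtualFolder"].rstrip("/")
        # A virtualFolder of "/" becomes "" here and so matches every path.
        if path == prefix or path.startswith(prefix + "/"):
            return site["name"]
    return None


# With distinct prefixes each request lands on its own site.
print(resolve_site("/Site2/about", SITES))

# With both sites on "/", the site listed first always wins.
both_root = [
    {"name": "Site1", "virtualFolder": "/"},
    {"name": "Site2", "virtualFolder": "/"},
]
print(resolve_site("/about", both_root))
```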
See Configuring Sites in the web.config File on SDN for additional information.
Hope this helps.
It turns out this is happening because a System alias is redirecting a subset of pages.

Sitecore Multisite using querystring instead of domain/subdomain?

Is there a way to set up multiple sites to run using querystrings rather than domains/subdomains?
I am developing a site that has a Global site and multiple country-specific sites (exact list of countries to be confirmed later). For development I have a Global and a Local site created and running on a temporary subdomain. If this works correctly we may run the entire application this way rather than on separate domains (similar to how apple.com appears to work).
I have successfully got the sites running locally as:
global.domain.com
a.domain.com
b.domain.com
but would like them to be able to run as:
www.domain.com/global
www.domain.com/a
www.domain.com/b
We will be implementing multiple languages on certain country sites as well, so locale will need to remain independent.
Could this be done using some sort of URL mapping rather than multiple sites or something? Where can I find information about URL mapping?
There are settings for using virtual folders (see web.config under sites node)
virtualFolder: The prefix to match for incoming URL's.
This value will be removed from the URL and the remainder will be treated as the item path.
How that works in practice I'm not sure - it's on a domain by domain basis, and all your sites will be operating from the same domain.
But I think you might want to reconsider your approach. Sub domains have several advantages. They're simple to configure in the web.config (just add a domain and point it at the right bit of the content tree).
They simplify search engine optimisation - e.g. telling google to target a specific subdomain to a geographical area in Google webmaster tools.
They're simple for visitors to understand.
Bear in mind that if you're going to use multiple languages per site then you will probably want to keep the language parameter in the URL as part of the (virtual) filepath (e.g. www.mysite.com/en-GB/products)
If you use both language and locale in the URL in that way you end up with something like www.mysite.com/UK/en-GB/products

How to make Django url dispatcher use subdomain?

I have a vague idea on how to solve this, but really need a push :)
I have a Django app running with apache (mod_wsgi). Today urls look like this:
http://site.com/category/A/product/B/
What I would like to do is this:
http://A.site.com/product/B
This means that the url dispatcher some how needs to pick up the value found in the subdomain and understand the context of this instead of only looking at the path. I see two approaches:
Use .htaccess and rewrites so that a.site.com is a rewrite. I'm not sure this does the trick, since I don't fully understand what the Django URL dispatcher will see in that case.
Once I understand how the URL dispatcher works, write a filter that looks at valid subdomains and passes them to the URL dispatcher code in a rewritten format.
Any hints or solutions are very much appreciated! Thanks.
Have you looked at django.contrib.sites? I think a combination of that, setting SITE_ID in your settings.py, and having one WSGI file per "site" can take care of things.
EDIT: -v set.
django.contrib.sites is meant to let you run multiple sites from the same Django project and database. It adds a table (django.contrib.sites.models.Site) that has domain and name fields. From what I can tell, the name can mean whatever you want it to, but it's usually the English name for the site. The domain is what should show up in the host part of the URL.
SITE_ID is set in settings.py to the id of the site being served. In the initial settings.py file, it is set to 1 (with no comments). You can replace this with whatever code you need to set it to the right value.
The obvious thing to do would be to check an environment variable, and look up that in the name or domain field in the Site table, but I'm not sure that will work from within the settings.py file, since that file sets up the database connection parameters (circular dependency?). So you'll probably have to settle for something like:
SITE_ID = int(os.environ.get('SITE_ID', 1))
Then in your WSGI file, you do something like:
os.environ['SITE_ID'] = '2'  # environment values must be strings
and set that last number to the appropriate value. You'll need one WSGI file per site, or maybe there's a way to set SITE_ID from within the Apache setup. Which path to choose depends on the site setup in question.
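
Putting the two halves together, a minimal sketch might look like this. The `read_site_id` helper is made up for the example, and in a real project the two parts would live in separate files (the WSGI file and settings.py). Note that `os.environ` only accepts strings, so the value is stored as text and converted back with `int()`.

```python
import os


def read_site_id(environ, default=1):
    """Return the site id from an environment mapping, or the default."""
    return int(environ.get("SITE_ID", default))


# In a per-site WSGI file, before Django loads its settings:
os.environ["SITE_ID"] = "2"  # os.environ values must be strings

# In settings.py:
SITE_ID = read_site_id(os.environ)
print(SITE_ID)
```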
The sites framework is most powerful where you use Site as the target of a ForeignKey or ManyToManyField so that you can link your model instances (i.e. records) to specific sites.
Mike's solution is correct if you want to have multiple sites with the same apps but different content (sites module) on multiple domains or subdomains, but it has the drawback that you need to run multiple instances of the Django process.
A better solution to the main problem of multiple domains or subdomains is a simple middleware that handles incoming requests in process_request() and sets the documented urlconf attribute of the request object to the URLconf you want to use.
More details and an example of the per-request or per-domain URL dispatcher can be found at:
http://gw.tnode.com/0483-Django/
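
As a rough sketch of that idea (not the code from the linked page), a middleware can look at the host and set `request.urlconf`, which Django uses in place of `ROOT_URLCONF` when it is present. The mapping and the `myproject.urls_*` module names are hypothetical. It is written in the newer callable style; older Django versions would put the same logic in `process_request()`.

```python
# Hypothetical mapping from subdomain to URLconf module path.
SUBDOMAIN_URLCONFS = {
    "a": "myproject.urls_a",
    "b": "myproject.urls_b",
}


class SubdomainURLconfMiddleware:
    """Pick a URLconf per request based on the left-most host label."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        host = request.get_host().split(":")[0].lower()  # drop any port
        subdomain = host.split(".")[0]
        urlconf = SUBDOMAIN_URLCONFS.get(subdomain)
        if urlconf is not None:
            # When request.urlconf is set, Django resolves URLs against it
            # instead of settings.ROOT_URLCONF.
            request.urlconf = urlconf
        return self.get_response(request)
```

The class would then be listed in the MIDDLEWARE setting. Hosts that are not in the mapping fall through to the default URLconf unchanged.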
Try adding a wildcard subdomain: usually *.