robots.txt handling a # in a URL - regex

Given the following URLs:
example.com/products
example.com/products#/page-2
example.com/products#/page-3
...
By using the robots.txt file, the first URL (example.com/products) is supposed to be indexed, every other one should be blocked from being indexed. How can this be done?
None of the following attempts work in the desired manner:
Noindex: /products#/page-*
Noindex: /products\#/page-*
Noindex: /*/page-*
Noindex: /*#/page-*
Noindex: /*\#/page-*

/products#/page is not a unique page. The actual url is simply /products.
# is abused to hook into JavaScript frameworks that dynamically load other pages, but normally /products#/page means that your /products page has an element such as <a name="/page">, and you can't block specific elements.
SPA's break the web. You're better off creating real, independent pages.

Everything after # is called the "anchor" (or fragment). It is NOT sent to the server, so you cannot read it from PHP or any other language that is executed on the server side.
As @Evert outlines, the anchor is commonly abused with JavaScript, because it can be modified WITHOUT an actual redirect, which makes it possible to generate deep links for dynamic content. (They work because client-side JavaScript uses AJAX to load content dynamically based on the anchor.)
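To illustrate the point that the part after # never reaches the server, here is a small sketch using Python's standard urllib.parse. The helper name requested_url is made up for this example; it just splits the fragment off, the way a client does before sending the request.

```python
from urllib.parse import urldefrag

urls = [
    "http://example.com/products",
    "http://example.com/products#/page-2",
    "http://example.com/products#/page-3",
]

def requested_url(url):
    """Return the URL without its fragment, i.e. the part a server (or crawler) is asked for."""
    return urldefrag(url).url

for u in urls:
    # All three lines end in the same address: the fragment is dropped.
    print(u, "->", requested_url(u))
```

Since all three addresses collapse to the same request, a path rule in robots.txt has nothing to distinguish them by.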

Related

How to prevent XSS attack in django

I'm trying to prevent XSS attack in Django, is it ok to use {{ body|escape }} in my HTML file only? How can I do some filtration in the backend?
Not every case is the same security-wise, so it's hard to give complete advice without seeing your application, its use cases and the version of Django you use.
If you use Django's template system and make sure that auto-escaping is enabled (it is enabled by default in recent versions), you're 9x% safe. Django provides an auto-escaping mechanism for stopping XSS: it automatically escapes data that is dynamically inserted into the template. You still have to be aware of some issues:
All the attributes where dynamic data is inserted. Do <img alt="{{somevar}}"> instead of <img alt={{somevar}}>. Django's auto-escaping will not cover your unquoted attribute values.
For data inserted into CSS (style tags and attributes) or JavaScript (script blocks, event handlers, and onclick attributes), you must manually escape the data using escaping rules that are appropriate for CSS or JavaScript (probably using a custom filter on the Python side).
For data inserted into a URL attribute (href, img src), you must manually validate the URL to make sure it is safe by checking the protocol against a whitelist of allowed protocols (e.g. https:, mailto:, ... but never javascript:).
Avoid setting html attributes from user input.
If you use mark_safe, make sure you know what you are doing and the data is really "safe".
There is always more but this covers the most known issues. Always make sure to refer to OWASP to understand the different XSS attacks and how they apply to your specific application:
https://owasp.org/www-community/attacks/xss/
https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html
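The protocol whitelist check for URL attributes mentioned above could look roughly like this on the Python side. This is only a sketch: the name is_safe_url and the exact set of allowed schemes are assumptions for illustration, and it deliberately rejects relative URLs (no scheme) to keep the rule simple.

```python
from urllib.parse import urlparse

# Assumed whitelist for this example; adjust it to what your application needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

def is_safe_url(url):
    """Return True only if the URL has an explicitly whitelisted scheme."""
    scheme = urlparse(url.strip()).scheme.lower()
    return scheme in ALLOWED_SCHEMES

print(is_safe_url("https://example.com/profile"))
print(is_safe_url("javascript:alert(1)"))
```

You would call something like this in the view or form validation before the value ever reaches an href or src attribute; the template auto-escaping alone does not stop a javascript: URL.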

How to store a dynamic site-wide variable

I have an HTML file that is the base, which other HTML documents extend. It's a static page, but I want to have a variable in the menu. I don't think it's wise to create a view for it, since I don't intend to let users visit the base alone. So where in my project can I store site-wide dynamic variables that can be used on any page without explicitly passing them in each view?
Thank you in advance.
For user specific variables, use session.
For global constants (not variables!), use settings.py.
For global variables, consider storing them in the database so that they are safe across threads and processes.
I looked around and saw different approaches, but the one that compromises the DRY philosophy the least for me is registering a tag in your project and then using it in the base template. It's neater. See https://stackoverflow.com/a/21062774/6629594 for an example.
The values can be stored in any number of places. I put mine in a stats model in the db so you get all the goodness of that (and it is easy to access in views).
I then have a context processor written as so:
# context_processors.py:
def my_custom_context_processor(request):
    return {'custom_context_variable1': 'foo', 'custom_context_variable2': 'bar'}
Add this to your context processors in settings.py:
TEMPLATE_CONTEXT_PROCESSORS = (
    ...
    "my_app.context_processors.my_custom_context_processor",
)
Provided you use render() to render your templates, you can then just use:
{{ custom_context_variable1 }}
to output 'foo' in your template. Obviously returning strings is for example only; you can use anything you like so long as your context processor returns a dict.
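Note that the TEMPLATE_CONTEXT_PROCESSORS setting was deprecated in Django 1.8 and removed in 1.10. On newer versions the processor is listed under the context_processors option of the TEMPLATES setting instead. A sketch of that part of settings.py, assuming the app is called my_app and the processor is named my_custom_context_processor:

```python
# settings.py (sketch; only the template-related part is shown)
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,
        "OPTIONS": {
            "context_processors": [
                # Built-in processor that exposes `request` to templates.
                "django.template.context_processors.request",
                # The custom processor; app and function names are assumptions.
                "my_app.context_processors.my_custom_context_processor",
            ],
        },
    },
]
```

The template side stays the same: {{ custom_context_variable1 }} is available in every template rendered with a request.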
You can also try using PHP pages.
Then access the variable on each page with an include 'file containing the var.php' on every page.
None of this will be visible in the source HTML as it is only processed on the server side.
If you would like to try this, mail me and I will send you some sample code.

Django documentation, part3 understanding problems

I read the documentation of Django but now I am at a point where I need some explanation. It is on this site and I understand the views but I really don't get how the urls work. It looks pretty cryptic and confusing to me. Can anybody explain to me how the urls work and what their purpose is?
The URLs in your urls.py file are virtual. They do it this way so you don't need to worry about a static URL such as http://yoursite.com/polls/34. By capturing this number with a regular expression such as (\d+) you can keep it dynamic, so one URL pattern with this regular expression can match millions of different polls.
When the URL is requested, the number matched by that regular expression (whether it's 1 or 13352) is sent to the view, which then says: I need to query the database for a Poll that has a primary key (PK) of whatever this number is. If it's found, the Poll object is sent to the template by the view. The template then displays all the data in the Poll object.
The bottom line is using something like this you can have one line for a url which is essentially millions of different urls. I use this same format for a movies website I'm creating www.noobmovies.com. I follow the same structure for Stars, Movies and blogs. Essentially three lines of code has created urls for 10,000 pages or so.
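To make the "one line, millions of URLs" idea concrete, here is a small sketch using only Python's re module, without Django. The pattern and the name poll_id are assumptions modeled on the polls tutorial; Django does essentially this match and then passes the captured group to the view.

```python
import re

# One pattern that stands for every poll detail URL.
POLL_URL = re.compile(r'^polls/(?P<poll_id>\d+)/$')

def resolve(path):
    """Return the captured poll id as a string, or None if the path does not match."""
    match = POLL_URL.match(path)
    return match.group('poll_id') if match else None

for path in ['polls/34/', 'polls/13352/', 'polls/abc/']:
    print(path, '->', resolve(path))
```

The view then receives that captured value as an argument and uses it to look up the Poll in the database.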
There is a dedicated Django documentation page for that: https://docs.djangoproject.com/en/1.6/topics/http/urls/
Maybe it will help you?

Removing query string from url in django while keeping GET information

I am working on a Django setup where I can receive a URL containing a query string as part of a GET. I would like to be able to process the data provided in the query string and return a page that is adjusted for that data but does not contain the query string in the URL.
Ordinarily I would just use reverse(), but I am not sure how to apply it in this case. Here are the details of the situation:
Example URL: .../test/123/?list_options=1&list_options=2&list_options=3
urls.py
urlpatterns = patterns('',
    url(r'test/(?P<testrun_id>\d+)/', views.testrun, name='testrun'),
)
views.py
def testrun(request, testrun_id):
    if 'list_options' in request.GET.keys():
        lopt = request.GET.getlist('list_options')
        :
        :
        [process lopt list]
        :
        :
    :
    :
    [other processing]
    :
    :
    context = { ...stuff... }
    return render(request, 'test_tracker/testview.html', context)
When the example URL is processed, Django will return the page I want but with the URL still containing the query string on the end. The standard way of stripping off the unwanted query string would be to end the testrun function with return HttpResponseRedirect(reverse('testrun', args=(testrun_id,))). However, if I do that here then I'm going to get an infinite loop through the testrun function. Furthermore, I am unsure if the list_options data that was on the original request will still be available after the redirect given that it has been removed from the URL.
How should I work around this? I can see that it might make sense to move the parsing of the list_options variable out into a separate function to avoid the infinite recursion, but I'm afraid that I will lose the list_options data from the request if I do it that way. Is there a neat way of simultaneously lopping the query string off the end of the URL and returning the page I want in one place, so I can avoid having to separate things out into multiple functions?
EDIT: A little bit of extra background, since there have been a couple of "Why would you want to do this?" queries.
The website I'm designing is to report on the results of various tests of the software I'm working on. This particular page is for reporting on the results of a single test, and often I will link to it from a bigger list of tests.
The list_options array is a way of specifying the other tests in the list I have just come from. This allows me to populate a drop-down menu with other relevant tests to allow me to easily switch between them.
As such, I could easily end up passing in 15-20 different values and creating huge URLs, which I'd like to avoid. The page is designed to have a default set of other tests to fill in the menu in question if I don't suggest any others in the URL, so it's not a big deal if I remove the list_options. If the user wishes to come back to the page directly he won't care about the other tests in the list, so it's not a problem if that information is not available.
First a word of caution. This is probably not a good idea to do for various reasons:
Bookmarking. Imagine that .../link?q=bar&order=foo filters some search results and also sorts them in a particular order. If you automatically strip out the querystring, you effectively prevent users from bookmarking specific search queries.
Tests. Any time you add any automation, things can and probably will go wrong in ways you never imagined. It is always better to stick with simple yet effective approaches, since they are widely used and thus less error-prone. I'll give an example of this below.
Maintenance. This is not a standard behavior model, so it will make maintenance harder for future developers, since they will first have to understand what is going on.
If you still want to achieve this, one of the simplest methods is to use sessions. The idea is that when there is a querystring, you save its contents into a session and then you retrieve it later on when there is no querystring. For example:
from django.http import HttpResponseRedirect
from django.shortcuts import render
# reverse lives in django.urls (django.core.urlresolvers before Django 1.10)
from django.urls import reverse

def testrun(request, testrun_id):
    # save the get data
    if request.META['QUERY_STRING']:
        # store a plain dict of lists: it keeps every list_options value
        # and is serializable by the session backend
        request.session['testrun_get'] = dict(request.GET.lists())
        # the following will not have querystring hence no infinite loop
        return HttpResponseRedirect(reverse('testrun', args=(testrun_id,)))
    # there is no querystring so retrieve it from session
    # however someone could visit the url without the querystring
    # without visiting the querystring version first hence
    # you have to test for it
    get_data = request.session.get('testrun_get', None)
    if get_data:
        if 'list_options' in get_data.keys():
            ...
    else:
        # do some default option
        ...
    context = { ...stuff... }
    return render(request, 'test_tracker/testview.html', context)
That should work, however it can break rather easily and there is no easy way to fix it. This illustrates the second point above. For example, imagine a user wants to compare two search queries side by side, so they visit .../link?q=bar&order=foo and .../link?q=cat&order=dog in different tabs of the same browser. So far so good, because each page will open with the correct results. However, as soon as the user refreshes the first tab, they will get the results from the second tab, since that is what is currently stored in the session and the browser has a single session token for both tabs.
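The two-tab problem can be reproduced without Django at all. In this sketch a plain dict stands in for the single server-side session shared by both tabs, and visit() is a made-up stand-in for the view: with a query it stores the data (the redirect is omitted), without one it reads whatever was stored last.

```python
# Stand-in for the one session that both browser tabs share.
session = {}

def visit(query=None):
    """Simulate the view: store the query if given, then return what the page would use."""
    if query is not None:
        session['testrun_get'] = query
    return session.get('testrun_get')

tab1 = visit({'q': 'bar', 'order': 'foo'})   # first tab opens with its own query
tab2 = visit({'q': 'cat', 'order': 'dog'})   # second tab overwrites the session data
tab1_after_refresh = visit()                 # first tab refreshes the clean URL

print(tab1_after_refresh)  # the first tab now shows the second tab's query
```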
Even if you will find some other method to achieve what you want without using sessions, I imagine that you will encounter similar issues because HTTP is stateless hence you will have to store state on the server.
There is actually a way to do this without breaking much of the functionality - store state on client instead of server-side. So you will have a url without a querystring and then let javascript query some API for whatever you will need to display on that page. That however will force you to make some sort of API and use some javascript which does not exactly fall into the scope of your question. So it is possible to do cleanly however that will involve more than just using Django.

Recursive URL Patterns CMS Style

Whenever I learn a new language/framework, I always make a content management system...
I'm learning Python & Django and I'm stuck with making a URL pattern that will pick the right page.
For example, for a single-level URL pattern, I have:
url(r'^(?P<segment>[-\w]+)/$', views.page_by_slug, name='pg_slug'),
Which works great for urls like:
http://localhost:8000/page/
Now, I'm not sure if I can get Django's URL system to bring back a list of slugs ala:
http://localhost:8000/parent/child/grandchild/
would return parent, child, grandchild.
So is this something that Django does already? Or do I modify my original URL pattern to allow slashes and extract the URL data there?
Thanks for the help in advance.
That's because your regular expression does not allow '/' characters in the middle. A recursive definition of the URL segment pattern may be possible, but either way it would be passed to your view function as a single chunk.
Try this
url(r'^(?P<segments>[-/\w]+)/$', views.page_by_slug, name='pg_slug'),
and split the segments argument passed to page_by_slug() by '/'; then you will get ['parent', 'child', 'grandchild']. I'm not sure how you've organized the page model, but if it is not very sophisticated, consider using or improving the flatpages package that is already included in Django.
Note that if you have other kinds of URLs that do not point to user-generated pages but to the system's own pages, you should put them before the pattern you listed, because Django's URL matching mechanism follows the given order.
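The matching and splitting step can be tried outside Django with the re module. This is a sketch: split_segments is a hypothetical helper showing what page_by_slug() could do with its segments argument, and the path is given without the leading slash, as Django passes it to the URL patterns.

```python
import re

# Same pattern as in the url() line above.
PAGE_URL = re.compile(r'^(?P<segments>[-/\w]+)/$')

def split_segments(path):
    """Return the list of slugs in the path, or an empty list if it does not match."""
    match = PAGE_URL.match(path)
    if match is None:
        return []
    # Drop empty strings in case the path contains doubled slashes.
    return [s for s in match.group('segments').split('/') if s]

print(split_segments('parent/child/grandchild/'))
print(split_segments('page/'))
```

Inside the view you would then walk this list from parent to child to find the right page.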